Skip to main content

Command Palette

Search for a command to run...

Cross-Validation, Pipelines & Hyperparameter Tuning with Scikit-learn

Updated
26 min read
Cross-Validation, Pipelines & Hyperparameter Tuning with Scikit-learn
H

Data Scientist, write about: Tech & Business & Lifeskills

When building a machine learning model, it's tempting to train it on your data, check the score, and call it done. But that score is misleading — the model has already "seen" the data it's being evaluated on, so strong performance doesn't mean it will hold up on new, unseen examples. This is the problem of overfitting: a model that memorizes the training data rather than learning generalizable patterns.

The standard remedy is to hold out a portion of your data for testing. But a single train-test split introduces its own problem — your evaluation depends heavily on which rows happened to land in the test set. A lucky split can make a weak model look strong, and an unlucky one can do the opposite.

Cross-validation solves this by rotating the holdout set systematically. Instead of one fixed split, you divide the data into k folds and train k separate models, each time holding out a different fold for evaluation. The scores are then averaged to get a more reliable estimate of how the model will generalize.

In this tutorial, we'll build up cross-validation step by step using scikit-learn and the Boston Housing dataset. We'll start with manual K-Fold splitting, then streamline the workflow with Pipeline and cross_val_predict, and finally tackle hyperparameter tuning using GridSearchCV — covering both Lasso and Ridge regression along the way.

1. The Boston Housing Dataset

We'll work with the Boston Housing dataset stored as a pickle file — a convenient format for saving and loading Python objects directly. The file contains a dictionary with two keys: 'dataframe', which holds the actual data as a pandas DataFrame, and 'description', which provides a text summary of each feature.

import numpy as np
import pickle
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.metrics import r2_score
from sklearn.pipeline import Pipeline

boston = pickle.load(open('boston_housing_clean.pickle', 'rb'))

boston_data = boston['dataframe']
boston_description = boston['description']

boston_data.head()

Output:

CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 36.2

The DataFrame contains 13 input features — things like crime rate (CRIM), average number of rooms (RM), and pupil-teacher ratio (PTRATIO) — along with the target variable MEDV, which represents the median home value in thousands of dollars. Our goal throughout this tutorial is to predict MEDV as accurately as possible, while ensuring that accuracy holds up on data the model has never seen.

2. K-Fold Cross-Validation from Scratch

Before using any of scikit-learn's convenience functions, it's worth building K-Fold cross-validation manually so the mechanics are clear.

We start by separating features from the target:

# Separate features and target variable
X = boston_data.drop('MEDV', axis=1)
y = boston_data.MEDV

Then we create a KFold object with 3 splits:

# shuffle=True randomly assigns rows to folds instead of taking them in order
# random_state ensures reproducibility
kf = KFold(shuffle=True, random_state=72018, n_splits=3)

Setting shuffle=True means rows are randomly assigned to folds rather than taken in order — important when the data has any inherent ordering. We can inspect what the splits look like:

# Print the first 10 indices and total size of each train/test split
for train_index, test_index in kf.split(X):
    print("Train index:", train_index[:10], len(train_index))
    print("Test index:", test_index[:10], len(test_index))
    print('')

Output:

Train index: [ 1  3  4  5  7  8 10 11 12 13] 337
Test index: [ 0  2  6  9 15 17 19 23 25 26] 169

Train index: [ 0  2  6  9 10 11 12 13 15 17] 337
Test index: [ 1  3  4  5  7  8 14 16 22 27] 169

Train index: [0 1 2 3 4 5 6 7 8 9] 338
Test index: [10 11 12 13 18 20 21 24 28 31] 168

With 506 rows and 3 folds, each test set contains roughly 169 rows (1/3 of the data), and the training set contains the remaining ~337 rows (2/3). Crucially, the test indices across all three folds are completely non-overlapping — every row gets used for testing exactly once. The training sets, by contrast, can and do share rows.

Now we loop through each fold, fit a LinearRegression model on the training portion, predict on the test portion, and collect the R² score:

scores = []
lr = LinearRegression()

for train_index, test_index in kf.split(X):
    # Slice features and target into train/test sets for this fold
    X_train, X_test, y_train, y_test = (X.iloc[train_index, :],
                                        X.iloc[test_index, :],
                                        y[train_index],
                                        y[test_index])
    # Fit the model on the training portion
    lr.fit(X_train, y_train)

    # Predict on the held-out test portion
    y_pred = lr.predict(X_test)

    # Evaluate and store the R² score for this fold
    score = r2_score(y_test.values, y_pred)
    scores.append(score)

scores

Output:

[0.6719348798472737, 0.748502005921238, 0.6976807323597767]

You'll notice the three scores are noticeably different from one another. This is precisely why a single train-test split can be misleading — the result depends heavily on which rows end up in the test set. Averaging the scores across all folds gives a much more stable and trustworthy estimate of model performance.

3. Adding Preprocessing with StandardScaler

In practice, you'll rarely feed raw data directly into a model. A common preprocessing step is feature scaling — transforming each feature to have zero mean and unit variance using StandardScaler. This is especially important for regularized models like Lasso and Ridge, which we'll introduce later.

The key rule when scaling inside a cross-validation loop is: fit the scaler only on the training fold, then apply it to the test fold. Fitting on the full dataset before splitting would leak information from the test set into the scaler, giving an overly optimistic evaluation.

scores = []

lr = LinearRegression()
s = StandardScaler()

for train_index, test_index in kf.split(X):
    X_train, X_test, y_train, y_test = (X.iloc[train_index, :],
                                        X.iloc[test_index, :],
                                        y[train_index],
                                        y[test_index])

    # Fit the scaler on training data only, then transform it
    X_train_s = s.fit_transform(X_train)

    lr.fit(X_train_s, y_train)

    # Apply the same scaling (learned from train) to the test set
    X_test_s = s.transform(X_test)

    y_pred = lr.predict(X_test_s)
    score = r2_score(y_test.values, y_pred)
    scores.append(score)

scores

Output:

[0.6719348798472715, 0.748502005921238, 0.6976807323597745]

The scores will be identical to Section 2 — and that's expected. For vanilla linear regression without regularization, scaling doesn't affect predictive performance. However, once we introduce Lasso or Ridge, scaling becomes essential for the model to work correctly.

That said, manually managing fit_transform on train and transform on test for every fold is tedious and error-prone. As we add more preprocessing steps, this pattern gets unwieldy fast. Fortunately, scikit-learn has an elegant solution.

4. Streamlining with Pipeline

A Pipeline chains multiple processing steps into a single object. Each step except the last must implement both fit and transform; the final step only needs fit. The output of each step becomes the input to the next, so you can stack as many transformations as needed.

s = StandardScaler()
lr = LinearRegression()

# Define the pipeline: scale first, then fit linear regression
estimator = Pipeline([
    ("scaler", s),
    ("regression", lr)
])

Each step is defined as a tuple of ("name", object). The name is a label you choose — it's used to identify the step and, as we'll see later, to reference hyperparameters in grid search.

Once the pipeline is defined, it behaves just like any other scikit-learn estimator:

# Fitting the pipeline automatically applies fit_transform on train data
estimator.fit(X_train, y_train)

# Predicting automatically applies transform before passing to the model
estimator.predict(X_test)

Output:

array([19.44230308, 22.8687781 , 20.72201438, 20.19685225, 16.02553328,
       13.19670707, 18.48717304, 16.41260539, 20.08040483, 19.16027923,
       23.03341265, 23.52809897, 24.64808538, 23.84554003, 22.37992895,
       18.54423788,  9.23453025, 24.86310535, 27.35965358, 29.95093715,
       21.71928179, 18.63649236, 19.67735609, 29.98064881, 20.78659698,
       17.29484103, 20.88767674, 23.59352298, 23.01287114, 24.82516061,
       28.1550891 , 26.59019417, 28.40170962, 28.61954803, 28.99730977,
       25.46130359, 20.83346878, 21.39384122, 21.12590183, 24.80991192,
       22.64896843, 20.63582629, 20.04498128, 20.28679691, 16.04272023,
       15.75494076, 12.78067624, 11.43558055, 21.98898014, 23.31730716,
       24.28692882, 21.28789025, 14.88646576, 31.00959189, 35.37282463,
       37.57270926, 23.58942636, 24.75302163, 30.63215847, 34.85770942,
       32.94090699, 30.0536187 , 39.86200512, 29.05010705, 35.53401305,
       41.89294309, 23.54417467, 23.61752172,  8.51966664, 24.3803473 ,
       31.93213161, 35.04907462, 33.74198666, 30.95975523, 38.16572411,
       34.04572219, 25.50003895, 26.9008276 , 26.81331202, 17.52450192,
       23.0561211 , 23.35940105, 27.41662679, 37.95283941, 35.39695531,
       37.15560829, 30.16078745, 31.7987809 , 22.17460829, 36.87065984,
       29.95618041, 35.45099005, 38.96516444, 19.04113986, 27.03723525,
       33.28276705, 19.71614458, 30.7356145 , 35.74753519, 33.31789906,
       29.82085216, 19.17675208, 26.73030054, 18.85119288, 25.10080811,
       25.2310572 , 23.35918589, 20.55011533, 23.44439899, 21.43873939,
       22.28755319, 21.55265486, 22.1886005 , 30.13309578, 27.73873544,
       27.83343144, 21.22476747, 20.29426743, 16.43151094, 24.1989263 ,
       23.86222031, 20.4568203 , 22.28180574, 36.99716281,  6.35326377,
       25.54462242, 15.91166325, 14.2172422 ,  8.43052398, 21.03393529,
       19.10628664, 19.46324302, 12.20521866, 18.2420435 , 18.49378935,
       21.90422664, 13.82169579, 19.42918743,  8.44262399,  7.195685  ,
       20.36534151, 19.00376186, 17.79256504, 17.87715422, 21.34551178,
        8.19204123,  3.32788461, 11.70535034, 11.96121376, 18.89797015,
       22.06307051, 14.60280556, 23.32232899, 14.26431606, 20.57202772,
       19.80582497, 27.50273944, 19.08219115, 21.53979588, 13.11731152,
       15.37086566, 20.96805073, 20.21749107, 19.62802415, 21.55434999,
       20.98843917, 23.32746873, 26.67017418])

5. Cross-Validation with cross_val_predict

The manual loop in Section 3 works, but it's verbose. cross_val_predict condenses the entire process — splitting, fitting, and predicting across all folds — into a single function call.

Instead of returning one score per fold, cross_val_predict returns a prediction for every row in the dataset. Each prediction is made by a model that was trained on the folds that did not include that row, so no row is ever evaluated on data it was trained on. After all folds complete, you have out-of-sample predictions for the entire dataset.

# Reinitiate the KFold object and pipeline
kf = KFold(shuffle=True, random_state=72018, n_splits=3)

s = StandardScaler()
lr = LinearRegression()

estimator = Pipeline([
    ("scaler", s),
    ("regression", lr)
])

# cross_val_predict handles the fold loop internally
# cv accepts either a KFold object or a plain integer
predictions = cross_val_predict(estimator, X, y, cv=kf)

Passing the KFold object to cv gives you full control over how the data is split — including the shuffle and random state we defined earlier. If you pass a plain integer instead (e.g. cv=3), scikit-learn will still split into 3 folds but without shuffling.

The output predictions is an array with the same length as the full dataset:

# Confirm the predictions cover all 506 rows
len(predictions)

Output:

506

We can then compute a single R² score across all out-of-sample predictions:

# Evaluate overall cross-validated R² score
r2_score(y, predictions)

Output:

0.7063531064161561

One important caveat: after cross_val_predict finishes, the estimator is not in a fitted state. It internally trains multiple models across the folds but does not retain any of them. If you need to make predictions on new data afterward, you'll have to call estimator.fit(X, y) explicitly. We'll see how GridSearchCV handles this more cleanly in Section 10.

6. Hyperparameter Tuning — Lasso Regression

Plain linear regression has no hyperparameters to tune — it simply finds the best-fit coefficients directly from the data. But this also means it has no mechanism to prevent overfitting, especially when the number of features is large. Lasso regression addresses this by adding a penalty to the loss function that discourages large coefficients, controlled by a hyperparameter called alpha.

  • A low alpha → weak penalty → model behaves close to plain linear regression

  • A high alpha → strong penalty → coefficients are aggressively shrunk, some driven to exactly zero

Since alpha is something we choose rather than something the model learns, we need to search for the value that generalizes best — which is where cross-validation comes in.

To search across a wide range of values, we use np.geomspace, which generates numbers spaced evenly on a logarithmic scale:

# Generate 10 alpha values between 1e-9 and 1, spaced on a log scale
alphas = np.geomspace(1e-9, 1, num=10)
alphas

Output:

array([1.e-09, 1.e-08, 1.e-07, 1.e-06, 1.e-05, 1.e-04, 1.e-03, 1.e-02, 1.e-01, 1.e+00])

This gives us values from very small (nearly identical to linear regression) to 1 (strong regularization). We then loop over each alpha, build a pipeline, and evaluate it with cross_val_predict:

scores = []

for alpha in alphas:
    # Initiate Lasso with the current alpha value
    lasso = Lasso(alpha=alpha)

    # Always scale data before Lasso — it is sensitive to feature magnitude
    estimator = Pipeline([
        ("scaler", StandardScaler()),
        ("lasso_regression", lasso)
    ])

    # Get cross-validated predictions and compute R²
    predictions = cross_val_predict(estimator, X, y, cv=kf)
    score = r2_score(y, predictions)
    scores.append(score)

# Plot alpha values vs. cross-validated R² scores
plt.plot(alphas, scores)
plt.xscale('log')
plt.xlabel('Alpha')
plt.ylabel('R² Score')
plt.title('Lasso: Alpha vs. Cross-Validated R²')
plt.show()

Output:

Note that scaling is mandatory here — unlike plain linear regression, Lasso is sensitive to the magnitude of features. Without scaling, features with larger numeric ranges would be penalized disproportionately, and the model would not work correctly.

Looking at the plot, the R² score stays relatively flat across most alpha values before dropping off sharply at high alphas. This tells us that for this dataset without polynomial features, regularization doesn't improve much over plain linear regression — the model isn't complex enough to need strong penalization. We'll see a much more pronounced effect once we add polynomial features in Section 7.

To understand why alpha affects performance, we can inspect the coefficients at different alpha levels:

# Low alpha — all features retain non-zero coefficients
lasso_low = Lasso(alpha=1e-9)
lasso_low.fit(StandardScaler().fit_transform(X), y)
print(lasso_low.coef_)

# High alpha — many coefficients are driven to exactly zero
lasso_high = Lasso(alpha=1)
lasso_high.fit(StandardScaler().fit_transform(X), y)
print(lasso_high.coef_)

Output:

[-0.92041112  1.08098057  0.14296711  0.68220346 -2.06009246  2.67064142
  0.02112063 -3.10444805  2.65878652 -2.07589812 -2.06215592  0.85664043
 -3.74867982]
[-0.          0.         -0.          0.         -0.          2.71335334
 -0.         -0.         -0.         -0.         -1.3435484   0.18095537
 -3.54338288]

At low alpha, all 13 features contribute to the prediction. At high alpha, Lasso zeroes out many of them entirely — effectively performing automatic feature selection by removing the least informative predictors.

7. Adding PolynomialFeatures to the Pipeline

With only 13 original features, plain Lasso has limited complexity to penalize — which is why the alpha curve in Section 6 was relatively flat. To give regularization more meaningful work to do, we can expand the feature space using polynomial features.

PolynomialFeatures with degree=2 generates two types of new features from the originals: each feature squared (e.g. RM²), and all pairwise interaction terms (e.g. RM × LSTAT). For 13 original features, this expands the dataset from 13 columns to 104.

In practice, you wouldn't blindly apply polynomial expansion to every feature — some combinations carry genuine business meaning while others don't. For example:

  • RM² (rooms squared) makes intuitive sense — the premium on larger homes tends to grow non-linearly. Going from 4 to 5 rooms adds less value than going from 7 to 8 rooms.

  • RM × LSTAT (rooms × % lower-status population) captures an interaction a linear model misses — a large home in a low-status neighborhood may not command the same price as the same home in a high-status one.

  • On the other hand, expanding a binary feature like CHAS (whether the tract borders the Charles River, 0 or 1) makes no sense — squaring a 0/1 column simply reproduces the same column.

In this tutorial we apply PolynomialFeatures to all features for simplicity, and let Lasso handle the cleanup — since it will automatically zero out interaction terms that don't contribute meaningfully to predicting MEDV. This is a common and practical strategy: expand aggressively, then regularize to prune.

That said, this approach doesn't eliminate the need for business logic entirely. There are still a few areas where domain knowledge matters:

  • Feature selection upfront — including completely irrelevant features (e.g. a random ID column) just adds noise that regularization has to clean up. It's worth filtering those out before expansion.

  • Skewed features — heavily skewed features like CRIM (crime rate) may need log transformation before expansion, otherwise squaring them produces extreme values that can destabilize the model.

  • Meaningless polynomial terms — squaring a binary feature like CHAS (0 or 1) simply reproduces the same column. Business logic helps you avoid generating nonsensical terms.

  • Interpretability — if stakeholders need to understand why the model makes a prediction, a 104-feature expanded model is harder to explain than one built around meaningful, hand-crafted features.

So polynomial expansion + regularization reduces how much you need to think about feature engineering, but it doesn't eliminate the need entirely.

The new features generated by PolynomialFeatures follow a simple naming convention:

  • Names without any operator (e.g. DIS, RAD) are the original features retained at degree 1

  • Names with ^2 (e.g. RM^2, LSTAT^2) are a feature multiplied by itself

  • Names with a space between two feature names (e.g. RM TAX, CHAS RM) are interaction terms — the product of two different original features

With 104 features, overfitting becomes a real risk, and regularization becomes genuinely important. The correct pipeline order is: polynomial expansion first, then scaling. If you scale first, features are centered around zero, which changes the meaning of their squared and interaction terms. Scaling after expansion ensures all 104 features end up on the same scale for the regularization penalty to be applied fairly.

scores = []

# PolynomialFeatures with degree=2 generates squared and interaction terms
pf = PolynomialFeatures(degree=2, include_bias=False)

for alpha in alphas:
    # max_iter increased to ensure Lasso converges with more features
    lasso = Lasso(alpha=alpha, max_iter=10000)

    # Pipeline order: expand features → scale → fit Lasso
    estimator = Pipeline([
        ("polynomial_features", pf),
        ("scaler", StandardScaler()),
        ("lasso_regression", lasso)
    ])

    predictions = cross_val_predict(estimator, X, y, cv=kf)
    score = r2_score(y, predictions)
    scores.append(score)

# Plot alpha values vs. cross-validated R² scores
plt.plot(alphas, scores)
plt.xscale('log')
plt.xlabel('Alpha')
plt.ylabel('R² Score')
plt.title('Lasso + Polynomial Features: Alpha vs. Cross-Validated R²')
plt.show()

Output:

Note the max_iter=10000 argument — Lasso uses an iterative optimization algorithm internally, and with 104 features the default number of iterations may not be enough to reach convergence. Increasing it ensures the model finds the true optimal coefficients.

Looking at the plot, the curve now has a clear peak, unlike the flat curve we saw without polynomial features. The R² score rises as alpha increases from very low values, reaches a maximum around alpha=0.01, then drops off as the model becomes too heavily penalized. This is the bias-variance trade-off in action: too little regularization leads to overfitting on the expanded feature space, while too much strips out too many useful features.

The best alpha from this search is 0.01, which we'll use to train the final Lasso model in the next section.

8. Ridge Regression

Ridge regression is the other major form of linear regularization. Like Lasso, it adds a penalty controlled by alpha to prevent overfitting. The key difference lies in how the penalty works:

  • Lasso penalizes the sum of the absolute values of the coefficients, which drives some of them to exactly zero — performing feature selection

  • Ridge penalizes the sum of the squared values of the coefficients, which shrinks all of them toward zero but rarely eliminates any completely

In practice, Ridge tends to work better when most features are genuinely useful and you want to retain all of them with reduced magnitude. Lasso is preferable when you suspect many features are irrelevant and want automatic selection.

We run the same pipeline search as before, just swapping Lasso for Ridge:

scores = []

pf = PolynomialFeatures(degree=2, include_bias=False)

for alpha in alphas:
    # Ridge uses the same alpha parameter as Lasso
    ridge = Ridge(alpha=alpha)

    estimator = Pipeline([
        ("polynomial_features", pf),
        ("scaler", StandardScaler()),
        ("ridge_regression", ridge)
    ])

    predictions = cross_val_predict(estimator, X, y, cv=kf)
    score = r2_score(y, predictions)
    scores.append(score)

# Plot alpha values vs. cross-validated R² scores
plt.plot(alphas, scores)
plt.xscale('log')
plt.xlabel('Alpha')
plt.ylabel('R² Score')
plt.title('Ridge + Polynomial Features: Alpha vs. Cross-Validated R²')
plt.show()

Output:

Unlike Lasso, Ridge doesn't require a high max_iter — its optimization has a closed-form solution and converges reliably regardless of the number of features.

The Ridge curve behaves quite differently from the Lasso one. Rather than peaking and dropping back down, R² rises monotonically across the entire alpha range — still climbing at alpha=1, the upper bound of our search. This tells us two things: first, Ridge clearly benefits from regularization on this expanded feature space; second, the optimal alpha likely sits even higher than 1, beyond what our current search covers. This is exactly the kind of insight that motivates using GridSearchCV in Section 10 — it can search a much wider and finer grid of hyperparameters simultaneously, rather than relying on a manually defined range.

9. Feature Importance & Interpretability

Beyond prediction, we often want to understand which features are driving the model's output. For linear models, the coefficients serve as a natural measure of feature importance — but only if all features are on the same scale. A large coefficient on a feature measured in thousands means something very different from a large coefficient on a feature measured between 0 and 1.

Since we've already scaled our features via StandardScaler, each coefficient now represents: for one standard deviation change in this feature, how much does the predicted home value change? This puts all features on an equal footing for comparison.

We fit the best Lasso model (from Section 7, alpha=0.01 with polynomial features) on the entire dataset to get the most stable coefficient estimates:

s = StandardScaler()

# Build the best pipeline using the optimal alpha found in Section 7
best_estimator = Pipeline([
    ("polynomial_features", PolynomialFeatures(degree=2, include_bias=False)),
    ("scaler", s),
    ("lasso_regression", Lasso(alpha=0.01, max_iter=10000))
])

# Fit on the entire dataset to get coefficients across all data
best_estimator.fit(X, y)

To extract feature names and their corresponding coefficients, we access each step of the pipeline by name using .named_steps:

# Get the feature names generated by PolynomialFeatures
feature_names = best_estimator.named_steps["polynomial_features"].get_feature_names_out()

# Get the coefficients from the fitted Lasso model
coefficients = best_estimator.named_steps["lasso_regression"].coef_

# Combine into a DataFrame for easy sorting and inspection
df_importances = pd.DataFrame(zip(feature_names, coefficients), columns=["feature", "coefficient"])

# Sort by coefficient magnitude to see the most influential features
df_importances.sort_values(by="coefficient")

Output:

feature  coefficient
7            DIS    -7.228785
72        RM TAX    -6.677045
51       CHAS RM    -4.684862
93     RAD LSTAT    -4.545681
71        RM RAD    -4.430641
..           ...          ...
3           CHAS     4.611276
95   TAX PTRATIO     5.129595
103      LSTAT^2     5.635229
68          RM^2     8.594069
8            RAD     9.749688

[104 rows x 2 columns]

When you use PolynomialFeatures(degree=2, include_bias=False), it generates all of the following at once and passes them as separate columns into the model:

  • All 13 original features (degree 1): CRIM, ZN, DIS, RAD, etc.

  • All 13 squared features (degree 2): CRIM^2, ZN^2, DIS^2, etc.

  • All pairwise interactions (degree 2): CRIM ZN, CRIM INDUS, RM TAX, etc.

That's how we get from 13 to 104 features. The model receives all 104 simultaneously and learns a coefficient for each one. Lasso then shrinks many of those coefficients to zero, keeping only the most useful ones — which is why the final table shows a mix of degree 1 terms like DIS and degree 2 terms like RM^2 and RM TAX side by side. They all started in the model together; Lasso just decided which ones to keep.

The feature names follow PolynomialFeatures' naming convention. All 104 features are generated from the original 13, but they appear in three forms:

  • Names without any operator (e.g. DIS, RAD) are the original features at degree 1 — meaning Lasso kept the linear term but zeroed out their squared and interaction versions

  • Names with ^2 (e.g. RM^2, LSTAT^2) are a feature multiplied by itself

  • Names with a space between two feature names (e.g. RM TAX, CHAS RM) are interaction terms — the product of two different original features

The features with the largest positive coefficients push predicted home values up, while those with the largest negative coefficients pull them down. Because Lasso zeroed out many coefficients entirely, only a subset of the 104 features will appear as genuinely influential.

From the output, a few patterns stand out. RM² (rooms squared) consistently carries a large positive coefficient — confirming the non-linear premium on larger homes we anticipated earlier. On the negative side, interaction terms involving LSTAT (lower-status population percentage) tend to have strong negative coefficients, reflecting how neighborhood socioeconomic status compounds the effect of other features on home prices. Features that Lasso zeroed out can be interpreted as contributing little incremental information once the others are accounted for.

10. Grid Search with GridSearchCV

In Sections 6–8, we searched for the best alpha by manually looping over candidate values and evaluating each one with cross_val_predict. This works, but it has two limitations: it only searches one hyperparameter at a time, and it leaves the estimator unfitted at the end. GridSearchCV solves both problems.

It takes a pipeline and a dictionary of hyperparameter grids, then exhaustively evaluates every combination using cross-validation. Once the best combination is found, it automatically retrains on the entire dataset using those hyperparameters — giving you a production-ready model in one step.

We search over both degree (for PolynomialFeatures) and alpha (for Ridge) simultaneously:

from sklearn.model_selection import GridSearchCV

# Same pipeline structure as before
estimator = Pipeline([
    ("polynomial_features", PolynomialFeatures()),
    ("scaler", StandardScaler()),
    ("ridge_regression", Ridge())
])

# Define the hyperparameter grid to search over
# Use double underscore to reference hyperparameters inside a named step
params = {
    'polynomial_features__degree': [1, 2, 3],
    'ridge_regression__alpha': np.geomspace(4, 20, 20)
}

# Pass in the KFold object to control how splits are made
grid = GridSearchCV(estimator, params, cv=kf)

# Fit runs cross-validation across all 60 combinations (3 degrees × 20 alphas)
# then retrains on the full dataset using the best hyperparameters
grid.fit(X, y)

Output:

GridSearchCV(cv=KFold(n_splits=3, random_state=72018, shuffle=True),
             estimator=Pipeline(steps=[('polynomial_features',
                                        PolynomialFeatures()),
                                       ('scaler', StandardScaler()),
                                       ('ridge_regression', Ridge())]),
             param_grid={'polynomial_features__degree': [1, 2, 3],
                         'ridge_regression__alpha': array([ 4.        ,  4.3535936 ,  4.73844431,  5.15731521,  5.61321363,
        6.10941274,  6.64947505,  7.23727802,  7.87704182,  8.57335972,
        9.331231  , 10.1560969 , 11.05387963, 12.03102491, 13.09454827,
       14.25208539, 15.51194695, 16.88317825, 18.37562421, 20.        ])})

The double-underscore syntax (polynomial_features__degree) is how you reference a hyperparameter inside a named pipeline step — the step name on the left, the parameter name on the right.

After fitting, we can inspect the best result:

# Best cross-validated score and the hyperparameters that produced it
print(grid.best_score_, grid.best_params_)

Output:

0.8546333782070409 {'polynomial_features__degree': 2, 'ridge_regression__alpha': 4.0}

The best cross-validated R² score is 0.855, achieved with degree=2 and alpha=4.0. This is a meaningful improvement over the plain linear regression score of 0.706 from Section 5, demonstrating that the combination of polynomial features and Ridge regularization — tuned via cross-validation — generalizes substantially better to unseen data.

It's also worth noting that the optimal alpha of 4.0 sits at the lower end of our search range (4 to 20), which contrasts with the Ridge curve in Section 8 that was still rising at alpha=1. This is because GridSearchCV searched a different and wider alpha range here, and also jointly optimized degree at the same time — finding that degree=2 with moderate regularization outperforms the combinations we tested manually.

We can then use grid.predict() directly on new data, since GridSearchCV has already refitted the best model on the full dataset:

# Predict on the full dataset (in-sample, for demonstration)
y_predict = grid.predict(X)
r2_score(y, y_predict)

Output:

0.8964707714558118

The in-sample R² of 0.896 is higher than the cross-validated score of 0.855 — and that's expected. When predicting on the same data the model was trained on, it will always score higher than on truly unseen data. The cross-validated score of 0.855 is the more honest estimate of how the model will perform in production. The gap between the two (0.896 vs 0.855) is relatively small, which is a good sign that the model is not severely overfitting.

For a clearer view of how every hyperparameter combination performed, we can plot the mean cross-validated R² score for each degree across all alpha values:

cv_results = pd.DataFrame(grid.cv_results_)

# Plot one line per degree
for degree in [1, 2, 3]:
    mask = cv_results['param_polynomial_features__degree'] == degree
    plt.plot(
        cv_results[mask]['param_ridge_regression__alpha'],
        cv_results[mask]['mean_test_score'],
        label=f'degree={degree}'
    )

plt.xscale('log')
plt.xlabel('Alpha')
plt.ylabel('Mean CV R² Score')
plt.title('GridSearchCV: Degree vs. Alpha vs. R²')
plt.legend()
plt.show()

Output:

The plot makes the comparison across all three degrees immediately clear. degree=1 (blue) is flat and far below the others throughout — polynomial features are clearly essential for this dataset. degree=2 (orange) starts highest at low alpha (~0.855) but drops steadily as alpha increases, meaning too much regularization strips out the useful polynomial terms. degree=3 (green) starts slightly lower but remains more stable across the alpha range, eventually overtaking degree=2 at higher alpha values. The winner is degree=2 at the lowest alpha in the search range (4.0), where it achieves the best balance between complexity and generalization.

Conclusion

In this tutorial we built up the full cross-validation workflow from scratch. We started with a manual KFold loop to understand the mechanics, then added preprocessing with StandardScaler and saw how Pipeline eliminates error-prone boilerplate by chaining steps safely. We used cross_val_predict to get reliable out-of-sample scores in a single call, then introduced Lasso and Ridge regression to show how regularization controlled by alpha can meaningfully improve generalization — especially once the feature space was expanded with PolynomialFeatures. Finally, GridSearchCV brought everything together, searching over multiple hyperparameters simultaneously and delivering a fitted, production-ready model.

The progression reflects how these tools are used in practice: cross-validation as an honest evaluation mechanism, pipelines as a safeguard against data leakage, and grid search as the engine that turns hyperparameter selection from guesswork into a principled, reproducible process.