Ridge, Lasso, and PCR
Table of contents
- Getting Started
- Creating Data Pipelines
- Print Function
- Ridge Regression
- Lasso Regression
- Principal Component Regression
In this tutorial, we will be exploring two regularized linear regression models (ridge regression and lasso regression) and a regression analysis technique known as principal component regression (PCR).
Much of this code was provided by Professor Kucheryavyy; I have broken the code down into a few smaller pieces and added some comments and explanations that should help your understanding. You can view the code for this tutorial here.
We’ll be using the same dataset as before, so please refer to the previous tutorial if you’d like to read about the dataset once more.
Getting Started
Importing Libraries
import itertools
import pandas as pd
import numpy as np
import copy
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import train_test_split, GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Markdown, display
Plot and Output Settings
We’ll also introduce a few extra settings just to make the output of each of our cells a bit nicer:
# Reset all styles to the default:
plt.rcParams.update(plt.rcParamsDefault)
# Then make graphs inline:
%matplotlib inline
# Useful function for Jupyter to display text in bold:
def displaybd(text):
    display(Markdown("**" + text + "**"))
If you would like your plots to be a bit larger, please use the following code:
plt.rcParams['figure.figsize'] = (7, 6)
plt.rcParams['font.size'] = 24
plt.rcParams['legend.fontsize'] = 'large'
plt.rcParams['figure.titlesize'] = 'large'
plt.rcParams['lines.markersize'] = 10
Loading Data
As a reminder, we can load our data and set our variables like so:
# LOAD DATA:
hittersDF = pd.read_csv('Hitters.csv', na_values=[""])
# The first column has no name in the csv file:
hittersDF.rename(columns={hittersDF.columns[0] : "Name"}, inplace=True, copy=False)
hittersDF.set_index('Name', inplace=True)
hittersDF.dropna(inplace=True)
# CREATE X and y:
# Convert categorical variables into dummies:
dummies = pd.get_dummies(hittersDF[['League', 'Division', 'NewLeague']])
X = hittersDF.drop(['Salary', 'League', 'Division', 'NewLeague'], axis=1)
cat_features = ['League_N', 'Division_W', 'NewLeague_N']
num_features = list(X.columns)
X = pd.concat([X, dummies[cat_features]], axis=1).astype('float64')
y = hittersDF.Salary
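Before moving on, it can be worth a quick sanity check that the data loaded as expected. The exact output will depend on your copy of Hitters.csv, but something like the following is a handy habit:
# Quick sanity check of the design matrix and the response:
print(X.shape)                 # (number of players, number of features)
print(X.columns.tolist())      # feature names, including the dummy columns
print(y.describe())            # summary statistics of Salary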
Creating Data Pipelines
Using scikit-learn, we can create data pipelines to clean, scale, and streamline our data:
# From here: https://kiwidamien.github.io/introducing-the-column-transformer.html
# and from "Hands-On Machine Learning with Scikit-Learn & TensorFlow" by Aurélien Géron.
# Feature scaling is super important for Ridge and Lasso.
# Pipeline for numerical columns:
num_pipeline = Pipeline([
# Replace missing numerical values with their median values.
# Actually, our dataset does not have missing values (we dropped them all),
# but in case we decide not to drop the missing values, we can do the imputation.
('impute', SimpleImputer(strategy='median')),
('std_scaler', StandardScaler()),
])
# Pipeline for categorical columns:
cat_pipeline = Pipeline([
# Do not scale data. Replace missing values with the most frequent values:
('impute', SimpleImputer(strategy='most_frequent'))
])
full_preprocess_pipeline = ColumnTransformer([
('numerical', num_pipeline, num_features),
('categorical', cat_pipeline, cat_features)
])
train_preprocess_pipeline = copy.copy(full_preprocess_pipeline)
X_processed = full_preprocess_pipeline.fit_transform(X)
#-- SPLIT INTO TRAIN AND TEST DATASETS:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1)
X_train_processed = train_preprocess_pipeline.fit_transform(X_train)
# Note that we scale test data using the train data's mean and variance. This is important.
X_test_processed = train_preprocess_pipeline.transform(X_test)
print(X_train_processed[:4])
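To make the note above concrete, here is a small sketch that reproduces the scaling of the numerical test columns by hand from the StandardScaler fitted on the training data. This is purely illustrative; the pipeline already does this for us:
# Extract the StandardScaler that was fitted on the *training* data:
fitted_train_scaler = train_preprocess_pipeline.named_transformers_['numerical'].named_steps['std_scaler']
# Standardize the numerical columns of X_test with the training mean and scale:
X_test_num_manual = (X_test[num_features].values - fitted_train_scaler.mean_) / fitted_train_scaler.scale_
# This should match the numerical part of X_test_processed:
print(np.allclose(X_test_num_manual, X_test_processed[:, :len(num_features)]))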
Print Function
Now, let’s write a function printCoefs to print the results of the ridge and lasso regressions that we will be doing. This function prints coefficients for both scaled and unscaled data:
def printCoefs(regr_res, coef_names, num_features_idx, mean, var):
    scaled_coefs = np.hstack((regr_res.intercept_, regr_res.coef_))
    # Only numerical features are scaled. So, we need to extract them, unscale
    # the coefficients, then merge the coefficients for numerical and categorical
    # features.
    cat_features_idx = [x for x in range(len(regr_res.coef_))
                        if x not in num_features_idx]
    unscaled_num_coefs = regr_res.coef_[num_features_idx] / np.sqrt(var)
    unscaled_intercept = regr_res.intercept_ - np.sum(unscaled_num_coefs * mean)
    unscaled_coefs = np.array([None] * (len(regr_res.coef_) + 1))
    # Intercept goes first, then the rest of the coefficients:
    unscaled_coefs[0] = unscaled_intercept
    unscaled_coefs[[i + 1 for i in num_features_idx]] = unscaled_num_coefs
    # Categorical features are not scaled, so their coefficients carry over as is.
    # (Use regr_res here, not a specific fitted model, so that the function works
    # for both the ridge and the lasso results.)
    unscaled_coefs[[i + 1 for i in cat_features_idx]] = regr_res.coef_[cat_features_idx]
    results = pd.DataFrame(data={'Scaled': scaled_coefs,
                                 'Unscaled': unscaled_coefs},
                           index=['Intercept'] + coef_names)
    print(results)
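The unscaling in printCoefs is just a change of variables: if a numerical feature was standardized as z = (x - mean) / std, then a coefficient b on the scaled feature corresponds to b / std on the raw feature, and the intercept absorbs -b * mean / std. A tiny, purely hypothetical one-feature example of that identity:
# One standardized feature: z = (x - mean) / std, so
# intercept + b*z == (intercept - b*mean/std) + (b/std)*x
mean, std, b, intercept, x = 100.0, 20.0, 5.0, 50.0, 130.0
z = (x - mean) / std
print(intercept + b * z)                             # prediction using the scaled feature
print((intercept - b * mean / std) + (b / std) * x)  # same prediction using the raw feature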
Ridge Regression
Ridge regression minimizes the following objective function: \[||y - Xw||^2_2 + \alpha||w||^2_2 \]
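Before turning to scikit-learn, note that this objective has a closed-form minimizer, \(w = (X^TX + \alpha I)^{-1}X^Ty\) (ignoring the intercept for simplicity). Here is a minimal sketch on a tiny made-up dataset checking that the formula agrees with Ridge; the data are synthetic and purely for illustration:
# Synthetic data: 50 observations, 3 features:
rng = np.random.RandomState(0)
X_demo = rng.randn(50, 3)
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.randn(50)
a_demo = 1.0
# Closed-form ridge solution (no intercept, for simplicity):
w_closed = np.linalg.solve(X_demo.T @ X_demo + a_demo * np.eye(3), X_demo.T @ y_demo)
w_ridge = Ridge(alpha=a_demo, fit_intercept=False).fit(X_demo, y_demo).coef_
print(np.allclose(w_closed, w_ridge))   # True: both minimize the same objective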
Modeling
Conveniently, scikit-learn provides us with a model for ridge regression already. Using that, let’s fit the ridge regression model, store its coefficients, and then create a plot of alpha values versus weights.
# NB: glmnet uses (1/(2N))*||y - Xw||^2_2 + (lambda/2)*||w||^2_2, while
# Ridge uses ||y - Xw||^2_2 + alpha * ||w||^2_2. So, there is a difference.
alphas = 10**np.linspace(5, -2, 300)
# NB: Do not use the normalize option of Ridge, use scale or StandardScaler from sklearn.preprocessing
# See here: https://stackoverflow.com/questions/24098212/what-does-the-option-normalize-true-in-lasso-sklearn-do
ridge = Ridge(fit_intercept=True, normalize=False, tol=1e-12)
coefs = []
for a in alphas:
    ridge.set_params(alpha=a)
    ridge.fit(X_processed, y)
    coefs.append(ridge.coef_)
ax = plt.gca()
ax.plot(alphas, coefs)
ax.set_xscale('log')
ax.set_xlim(ax.get_xlim()[::-1]) # reverse axis
plt.axis('tight')
plt.xlabel('alpha')
plt.ylabel('weights')
plt.title('Ridge coefficients as a function of the regularization');
We will now also fit a ridge regression model for \(\alpha = 4\) and show prediction error on test data:
ridge2 = Ridge(alpha=4, normalize=False, tol=1e-12)
ridge2.fit(X_train_processed, y_train)
yhat = ridge2.predict(X_test_processed)
print("Ridge regression with alpha =", 4)
print("MSE:", mean_squared_error(y_test, yhat))
Ridge regression with alpha = 4
MSE: 102084.02878693413
Choosing an Optimal \(\alpha\)
Now, we will choose the optimal value for \(\alpha\) using cross-validation. We first create a pipeline and then use GridSearchCV to get the optimal value:
# NB: Don't use 'RidgeCV'!
# See here: https://stackoverflow.com/questions/39466671/use-of-scaler-with-lassocv-ridgecv
ridge_pipeline = Pipeline([
('scaler', train_preprocess_pipeline),
('regressor', Ridge(fit_intercept=True, normalize=False, tol=1e-12))
])
# We need to supply parameter alpha to Ridge. Note the convention adopted to
# name the parameters: name of the pipeline step, followed by a double underscore (__),
# then finally the name of the parameter within the step
pipeline_params = {'regressor__alpha': alphas}
# Using our pipeline with GridSearchCV allows us to scale the train part of X_train in each fold
# and then apply that scaling to the test part of X_train in each fold. This is the correct way to
# do K-fold cross-validation: this way there is no information leakage from the train part of a fold
# into the test part of a fold.
kf = KFold(n_splits=10, shuffle=True, random_state=1)
gridsearch = GridSearchCV(ridge_pipeline, param_grid=pipeline_params, cv=kf,
scoring='neg_mean_squared_error',
return_train_score=False, # We don't need them
n_jobs=6, # Use 6 CPU cores
verbose=1, iid=False).fit(X_train, y_train)
best_ridge_alpha = gridsearch.best_params_['regressor__alpha']
print("Best alpha:", best_ridge_alpha)
print('MSE on the test dataset is: ', -gridsearch.score(X_test, y_test))
Best alpha: 69.09790545715646
MSE on the test dataset is: 101735.58726985169
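If you would like to see how the cross-validated error varies with \(\alpha\) rather than only its best value, GridSearchCV keeps the per-parameter scores in its cv_results_ attribute. A quick sketch, reusing the gridsearch object fitted above:
# Cross-validated MSE for each alpha in the grid (same order as `alphas`):
cv_mse = -gridsearch.cv_results_['mean_test_score']
plt.plot(alphas, cv_mse)
plt.xscale('log')
plt.xlabel('alpha')
plt.ylabel('CV MSE')
plt.title('Cross-validated MSE for ridge');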
Printing Results
Let’s now use our printCoefs function to print the regression coefficients of the optimal ridge model:
ridge.set_params(alpha=best_ridge_alpha)
# Refit on the full (processed) dataset with the optimal alpha before printing:
ridge.fit(X_processed, y)
fitted_scaler = full_preprocess_pipeline.transformers_[0][1].steps[1][1]
displaybd("Regression coefficients of the optimal ridge regression on the full dataset:")
printCoefs(ridge, num_features + cat_features,
list(range(len(num_features))),
fitted_scaler.mean_, fitted_scaler.var_)
Regression coefficients of the optimal ridge regression on the full dataset:
Scaled Unscaled
Intercept 577.485786 163.13
AtBat -290.926712 -1.97873
Hits 337.235023 7.48755
HmRun 37.494045 4.28972
Runs -60.017331 -2.35443
RBI -26.662596 -1.0321
Walks 134.927434 6.22453
Years -17.051308 -3.56387
CAtBat -388.771534 -0.170347
CHits 88.555476 0.136878
CHmRun -12.903281 -0.157278
CRuns 477.616374 1.44483
CRBI 258.394540 0.800597
CWalks -213.378882 -0.809623
PutOuts 78.761990 0.281895
Assists 53.657052 0.370548
Errors -22.191244 -3.36537
League_N 62.571806 62.5718
Division_W -116.894468 -116.894
NewLeague_N -24.797541 -24.7975
Lasso Regression
Lasso (least absolute shrinkage and selection operator) regression is another technique for variable selection and regularization. Lasso regression minimizes the following objective: \[\frac{1}{2 n_{\text{samples}}} ||y - Xw||^2_2 + \alpha||w||_1\]
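The key practical difference from ridge is that the \(\ell_1\) penalty can set coefficients exactly to zero, so lasso also performs variable selection. A minimal sketch on made-up synthetic data (purely for illustration):
# Synthetic data where only the first two of ten features matter:
rng = np.random.RandomState(0)
X_demo = rng.randn(100, 10)
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + 0.1 * rng.randn(100)
lasso_demo = Lasso(alpha=0.5).fit(X_demo, y_demo)
print(lasso_demo.coef_)                                    # most entries are exactly 0
print(np.sum(lasso_demo.coef_ != 0), "non-zero coefficients")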
Modeling
Scikit-learn also provides us with a lasso regression model. Like we did in the previous section, let’s use this to fit a model, record our coefficients, and plot a graph of alpha values versus weights:
lasso = Lasso(fit_intercept=True, normalize=False, tol=1e-12, max_iter=100000)
coefs = []
for a in alphas:
    lasso.set_params(alpha=a)
    lasso.fit(X_processed, y)
    coefs.append(lasso.coef_)
ax = plt.gca()
ax.plot(alphas, coefs)
ax.set_xscale('log')
ax.set_xlim(ax.get_xlim()[::-1]) # reverse axis
plt.axis('tight')
plt.xlabel('alpha')
plt.ylabel('weights')
plt.title('Lasso coefficients as a function of the regularization');
Choosing an Optimal \(\alpha\)
Again, we will create a pipeline, use it to perform cross-validation, and then choose an optimal \(\alpha\):
# NB: Again, don't use 'LassoCV'!
# See here: https://stackoverflow.com/questions/39466671/use-of-scaler-with-lassocv-ridgecv
lasso_pipeline = Pipeline([
('scaler', train_preprocess_pipeline),
('regressor', Lasso(fit_intercept=True, normalize=False, tol=1e-12, max_iter=10000))
])
# We need a much lower alpha for Lasso (it means a different thing in Lasso versus Ridge).
# I just divide it by 10:
pipeline_params = {'regressor__alpha': alphas / 10}
# Using our pipeline with GridSearchCV allows us to scale the train part of X_train in each fold
# and then apply that scaling to the test part of X_train in each fold. This is the correct way to
# do K-fold cross-validation: this way there is no information leakage from the train part of a fold
# into the test part of a fold.
kf = KFold(n_splits=10, shuffle=True, random_state=1)
gridsearch = GridSearchCV(lasso_pipeline, param_grid=pipeline_params, cv=kf,
scoring='neg_mean_squared_error',
return_train_score=False, # We don't need them
n_jobs=6, # Use 6 CPU cores
verbose=1, iid=False).fit(X_train, y_train)
best_lasso_alpha = gridsearch.best_params_['regressor__alpha']
print("Best alpha:", best_lasso_alpha)
print('MSE on the test dataset is: ', -gridsearch.score(X_test, y_test))
Best alpha: 0.7181039092537357
MSE on the test dataset is: 107596.2015393248
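Since GridSearchCV refits the best pipeline on the whole training set and stores it in best_estimator_, we can also check how many coefficients the selected lasso model set exactly to zero (a quick sketch):
best_lasso = gridsearch.best_estimator_.named_steps['regressor']
print(np.sum(best_lasso.coef_ == 0), "of", len(best_lasso.coef_),
      "coefficients are exactly zero")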
Printing Results
Let’s now use our printCoefs function to print the regression coefficients of the optimal lasso model:
lasso.set_params(alpha=best_lasso_alpha)
# Refit on the full (processed) dataset with the optimal alpha before printing:
lasso.fit(X_processed, y)
fitted_scaler = full_preprocess_pipeline.transformers_[0][1].steps[1][1]
displaybd("Regression coefficients of the optimal lasso regression on the full dataset:")
printCoefs(lasso, num_features + cat_features,
list(range(len(num_features))),
fitted_scaler.mean_, fitted_scaler.var_)
Regression coefficients of the optimal lasso regression on the full dataset:
Scaled Unscaled
Intercept 577.417128 162.994
AtBat -291.156068 -1.98029
Hits 337.534768 7.4942
HmRun 37.410636 4.28018
Runs -60.160402 -2.36004
RBI -26.527714 -1.02687
Walks 134.920241 6.2242
Years -16.900832 -3.53242
CAtBat -388.183560 -0.170089
CHits 86.005263 0.132936
CHmRun -13.326362 -0.162435
CRuns 479.406526 1.45025
CRBI 259.079389 0.802719
CWalks -213.675997 -0.81075
PutOuts 78.775224 0.281942
Assists 53.613395 0.370247
Errors -22.122089 -3.35488
League_N 62.292836 62.5718
Division_W -116.810487 -116.894
NewLeague_N -24.458230 -24.7975
Principal Component Regression
Principal component regression (PCR) is a regression technique based on principal component analysis (PCA): rather than regressing the response directly on the original predictors, we regress it on a small number of their principal components and use the result to estimate the regression coefficients.
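In other words, PCR first replaces the predictors with their first k principal components and then runs an ordinary least squares regression on those component scores. Conceptually, the pipelines we build below are equivalent to the following two-step sketch (k = 5 here is an arbitrary choice, purely for illustration):
# Two-step view of PCR with k = 5 components:
pca_demo = PCA(n_components=5)
Z = pca_demo.fit_transform(X_processed)     # scores on the first 5 principal components
pcr_demo = LinearRegression().fit(Z, y)     # OLS on the component scores
print(pcr_demo.coef_)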
Modeling
Here, we’ll use scikit-learn’s PCA class to fit our processed X data.
pca_nonscaled = PCA()
pca_scaled = PCA()
# In ridge and lasso we did not scale the categorical variables.
# Similarly, we can do PCA without scaling the categorical variables.
# Moreover, it is debatable whether it is appropriate at all to
# include binary variables in PCA, but it is probably OK.
# See here: https://stats.stackexchange.com/questions/16331/doing-principal-component-analysis-or-factor-analysis-on-binary-data
X_reduced_nonscaled = pca_nonscaled.fit_transform(X_processed)
scaler = StandardScaler()
X_reduced_scaled = pca_scaled.fit_transform(scaler.fit_transform(X))
#-- PLOT THE SHARE OF EXPLAINED VARIANCE:
ax = plt.gca()
nonscaled_cat_line, = \
ax.plot(range(1, X_reduced_nonscaled.shape[1] + 1),
np.cumsum(pca_nonscaled.explained_variance_ratio_),
'-o', markersize=4, color='blue')
scaled_cat_line, = \
ax.plot(range(1, X_reduced_scaled.shape[1] + 1),
np.cumsum(pca_scaled.explained_variance_ratio_),
'-o', markersize=4, color='red')
ax.set_xticks(range(1, X_reduced_scaled.shape[1] + 1, 2));
plt.xlabel('Number of components');
plt.ylabel('Share of explained variance');
plt.legend([nonscaled_cat_line, scaled_cat_line],
['Categorical scaled', 'Categorical unscaled'],
fontsize=18);
NOTE: In the code above, the legend labels are accidentally swapped: the blue line actually shows the results with unscaled categorical variables and the red line the results with scaled categorical variables. Please keep this in mind when reading the plot and the decision below to use scaled variables.
Cross-Validated Test Score on the Full Dataset
This section is similar to how we chose the optimal \(\alpha\) in the ridge and lasso sections. We create pipelines for both unscaled and scaled categorical variables and use GridSearchCV just like before:
n_features_to_test = np.arange(1, X.shape[1] + 1)
pipeline_params = {'reduce_dim__n_components': n_features_to_test}
kf = KFold(n_splits=10, shuffle=True, random_state=90)
# Pipeline for unscaled categorical variables:
pca_regr_pipe_unscaled = Pipeline([
('scaler', full_preprocess_pipeline),
('reduce_dim', PCA()),
('regressor', LinearRegression())
])
# Pipeline for scaled categorical variables:
pca_regr_pipe_scaled = Pipeline([
('scaler', StandardScaler()),
('reduce_dim', PCA()),
('regressor', LinearRegression())
])
gridsearch_unscaled = GridSearchCV(pca_regr_pipe_unscaled,
param_grid=pipeline_params, cv=kf,
scoring='neg_mean_squared_error',
return_train_score=False,
n_jobs=6, # Use 6 CPU cores
verbose=1, iid=False).fit(X, y)
gridsearch_scaled = GridSearchCV(pca_regr_pipe_scaled,
param_grid=pipeline_params, cv=kf,
scoring='neg_mean_squared_error',
return_train_score=False,
n_jobs=6, # Use 6 CPU cores
verbose=1, iid=False).fit(X, y)
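As with ridge and lasso, the fitted grid search objects also report directly which number of components minimized the cross-validated error, for example:
print("Best number of components (unscaled categorical):",
      gridsearch_unscaled.best_params_['reduce_dim__n_components'])
print("Best number of components (scaled categorical):",
      gridsearch_scaled.best_params_['reduce_dim__n_components'])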
Now, let’s make a plot of the cross-validated root mean squared error:
rmse_unscaled = np.sqrt(-gridsearch_unscaled.cv_results_["mean_test_score"])
rmse_scaled = np.sqrt(-gridsearch_scaled.cv_results_["mean_test_score"])
ax = plt.gca()
nonscaled_cat_line, = \
ax.plot(range(1, len(rmse_unscaled) + 1), rmse_unscaled,
'-o', markersize=4, color='blue')
scaled_cat_line, = \
ax.plot(range(1, len(rmse_scaled) + 1), rmse_scaled,
'-o', markersize=4, color='red')
ax.set_xticks(range(1, len(rmse_unscaled) + 1, 2));
plt.xlabel('Number of components');
plt.ylabel('Root of CV MSE');
plt.legend([nonscaled_cat_line, scaled_cat_line],
['Categorical scaled', 'Categorical unscaled']);
NOTE: As in the previous plot, the legend labels in the code above are swapped (blue is the unscaled categorical variables, red the scaled ones), so please keep that in mind when reading the plot and the decision below.
Cross-Validated Test Score on the Train Dataset
We’ll now consider the cross-validated test score for the train dataset as well. The results above indicate that scaling the categorical variables gives better performance, so we’ll work solely with scaled categorical variables from now on and create a plot again:
gridsearch_scaled = GridSearchCV(pca_regr_pipe_scaled,
param_grid=pipeline_params, cv=kf,
scoring='neg_mean_squared_error',
return_train_score=False,
n_jobs=6, # Use 6 CPU cores
verbose=1, iid=False).fit(X_train, y_train)
#-- PLOT CROSS-VALIDATED (ROOT OF) MEAN SQUARED ERROR:
rmse_scaled = np.sqrt(-gridsearch_scaled.cv_results_["mean_test_score"])
ax = plt.gca()
ax.plot(range(1, len(rmse_scaled) + 1), rmse_scaled,
'-o', markersize=4, color='red')
ax.set_xticks(range(1, len(rmse_scaled) + 1, 2));
plt.xlabel('Number of components');
plt.ylabel('Root of CV MSE');
plt.title('Scaled categorical variables')
Test MSE
Now, still using the pipeline for scaled categorical variables, we’ll calculate \(\hat{y}\) and use it to compute our test mean squared error:
pca_regr_pipe_scaled.set_params(reduce_dim__n_components=7)
pca_regr_pipe_scaled.fit(X_train, y_train)
yhat = pca_regr_pipe_scaled.predict(X_test)
print("PCA regression with 7 components")
print("Test MSE:", mean_squared_error(y_test, yhat))
PCA regression with 7 components
Test MSE: 108266.95903472023
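Finally, since we have now computed a test MSE for each approach, it can be handy to collect the numbers reported above in one place (rounded; your exact values will depend on random seeds and library versions):
summary = pd.DataFrame(
    {'Test MSE': [102084.03, 101735.59, 107596.20, 108266.96]},
    index=['Ridge (alpha = 4)', 'Ridge (CV-chosen alpha)',
           'Lasso (CV-chosen alpha)', 'PCR (7 components)'])
print(summary)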