# Subset Selection Methods

In this tutorial, we will write several functions to select and evaluate subsets of features of baseball players used to predict salary.

Much of this code was provided by Professor Kucheryavyy; I have broken the code down into a few smaller pieces and added some comments and explanations that should help your understanding. You can view the code for this tutorial here.

## Importing Libraries

We will be using several libaries to aid our modeling and graphing:

import itertools
import pandas as pd
import numpy as np
import copy

import statsmodels.api as sm
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import train_test_split, GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Markdown, display

# Reset all styles to the default:
plt.rcParams.update(plt.rcParamsDefault)
# Then make graphs inline:
%matplotlib inline

# Set custom style settings:
# NB: We need to separate "matplotlib inline" call and these settings into different
# cells, otherwise the parameters are not set. This is a bug somewhere in Jupyter
plt.rcParams['figure.figsize'] = (7, 6)
plt.rcParams['font.size'] = 24
plt.rcParams['legend.fontsize'] = 'large'
plt.rcParams['figure.titlesize'] = 'large'
plt.rcParams['lines.markersize'] = 10


We won’t necessarily use all of these in this specific tutorial, but please import the above so that you have a wide array of tools at your disposal. You can feel free to use this libraries in your homework as well!

## Our Dataset

In this tutorial, we will be taking a look at Hitters.csv, a dataset that describes several attributes of major league players, such as salary, number of homeruns, etc.

First, let’s load our data with the following:

hittersDF = pd.read_csv('Hitters.csv', na_values=[""])


If you look at the first few rows of the data (hittersDF.head()), you may notice that the first column has no name. We’ll give it a name using the following:

hittersDF.rename(columns={hittersDF.columns : "Name"}, inplace=True, copy=False)
hittersDF.set_index('Name', inplace=True)


We will also remove any missing values with the following:

hittersDF.dropna(inplace=True)


Note: inplace=True means that we are modifying the original dataframe itself. We are not creating a copy with our new changes.

We want to convert categorial variables into dummy variables. Conversion into dummy variables allows us to perform regression for quanlitative variables. Let’s run the following:

dummies = pd.get_dummies(hittersDF[['League', 'Division', 'NewLeague']])
X = hittersDF.drop(['Salary', 'League', 'Division', 'NewLeague'], axis=1)
cat_features = ['League_N', 'Division_W', 'NewLeague_N']
num_features = list(X.columns)


Now, we can better define our X and y variables like so:

X = pd.concat([X, dummies[cat_features]], axis=1).astype('float64')
y = hittersDF.Salary


Let’s check out the first ten rows for our features:

display(X[0:10])

NameAtBatHitsHmRunRunsRBIWalksYearsCAtBatCHitsCHmRunCRunsCRBICWalksPutOutsAssistsErrorsLeague_NDivision_WNewLeague_N
-Alan Ashby315.081.07.024.038.039.014.03449.0835.069.0321.0414.0375.0632.043.010.01.01.01.0
-Alvin Davis479.0130.018.066.072.076.03.01624.0457.063.0224.0266.0263.0880.082.014.00.01.00.0
-Andre Dawson496.0141.020.065.078.037.011.05628.01575.0225.0828.0838.0354.0200.011.03.01.00.01.0
-Andres Galarraga321.087.010.039.042.030.02.0396.0101.012.048.046.033.0805.040.04.01.00.01.0
-Alfredo Griffin594.0169.04.074.051.035.011.04408.01133.019.0501.0336.0194.0282.0421.025.00.01.00.0
-Al Newman185.037.01.023.08.021.02.0214.042.01.030.09.024.076.0127.07.01.00.00.0
-Argenis Salazar298.073.00.024.024.07.03.0509.0108.00.041.037.012.0121.0283.09.00.01.00.0
-Andres Thomas323.081.06.026.032.08.02.0341.086.06.032.034.08.0143.0290.019.01.01.01.0
-Andre Thornton401.092.017.049.066.065.013.05206.01332.0253.0784.0890.0866.00.00.00.00.00.00.0
-Alan Trammell574.0159.021.0107.075.059.010.04631.01300.090.0702.0504.0488.0238.0445.022.00.00.00.0

## Best Subset Selection

### Best Subset Selection Functions

Now, let’s write a couple functions that will help us select the best subset of our features.

We will first write a function that allows us to input our set of features (X), the variable we want to predict (Y, or salary in this case), and define a specific size for our subset (subset_size). To do this, we will loop through all possible combinations of our features from X that are of size subset_size. In each iteration of our loop, we fit a linear regression model and calculate a mean squared error. We keep track of the best mean squared error through compariosn, which allows us to select the best subset in the end:

def findBestSubsetFixedSize(X, y, subset_size):
features_nmb = X.shape
best_subset = []
best_mse = -1
for idx_set in itertools.combinations(range(features_nmb), subset_size):
X_subset = X.iloc[:, list(idx_set)]
lin_reg = LinearRegression(fit_intercept=True, normalize=False)
lin_reg.fit(X_subset, y)
yhat = lin_reg.predict(X_subset)
mse_resid = mean_squared_error(y, yhat)
if best_mse < 0 or mse_resid < best_mse:
best_subset = list(idx_set)
best_mse = mse_resid
return([best_subset, best_mse])


We also want to write a function that simply allows us to set an upper limit on the size of the subset. To do this, we will write a loop that tests all possible values for our subset size and inputs them into the function we just wrote, findBestSubsetFixedSize():

def findBestSubset(X, y, max_subset_size):
best_subsets = [None] * max_subset_size
best_mses = [None] * max_subset_size
for subset_size in range(1, max_subset_size + 1):
best_subsets[subset_size-1], best_mses[subset_size-1] =\
findBestSubsetFixedSize(X, y, subset_size)

return([best_subsets, best_mses])


### Executing Best Subset Selection

We’ll now use this function to perform our subset selection. At the very end, we will produce some metrics that will help us evaluate our selections:

• adjr2s = Adjusted R-Squared
• bics = Bayesian information Criteria
• aics = Akaike’s Information Criteria
# Since the exhaustive search takes a really long time to complete,
# I recorded the result of this procedure
run_exhaustive_search = False
if run_exhaustive_search:
# This procedure takes really long time to complete!
best_subsets, best_mses = findBestSubset(X, y, 3)
else:
best_subsets = [,
[1, 11],
[1, 11, 13],
[1, 11, 13, 17],
[0, 1, 11, 13, 17],
[0, 1, 5, 11, 13, 17],
[1, 5, 7, 8, 9, 13, 17],
[0, 1, 5, 9, 10, 12, 13, 17],
[0, 1, 5, 7, 10, 11, 12, 13, 17],
[0, 1, 5, 7, 10, 11, 12, 13, 14, 17],
[0, 1, 5, 7, 10, 11, 12, 13, 14, 16, 17],
[0, 1, 3, 5, 7, 10, 11, 12, 13, 14, 16, 17],
[0, 1, 3, 5, 7, 10, 11, 12, 13, 14, 15, 16, 17],
[0, 1, 2, 3, 5, 7, 10, 11, 12, 13, 14, 15, 16, 17],
[0, 1, 2, 3, 5, 7, 8, 10, 11, 12, 13, 14, 15, 16, 17],
[0, 1, 2, 3, 4, 5, 7, 8, 10, 11, 12, 13, 14, 15, 16, 17],
[0, 1, 2, 3, 4, 5, 7, 8, 10, 11, 12, 13, 14, 15, 16, 17, 18],
[0, 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 15, 16, 17, 18],
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]]
best_mses = [137565.32036137575, 116526.84368963055, 111214.05648618752, 106353.04872933947,
103231.5567757093,   99600.39516195898,  98503.98289210549,  95577.68037627422,
94350.00527219362,  93157.42029558783,  92727.54772410596,  92521.79611890586,
92354.1742898915,   92200.22963038784,  92148.96332783563,  92088.88772977113,
92051.12835223945,  92022.19527998423,  92017.86901772919]

bics = [None] * len(best_subsets)
aics = [None] * len(best_subsets)
for idx_set in range(len(best_subsets)):
X_subset = X.iloc[:, best_subsets[idx_set]].values
result = sm.OLS(y, X_subset).fit()
bics[idx_set] = result.bic
aics[idx_set] = result.aic


### Plotting

As always, we want to create some plots so that we can better visaulize and understand what’s going on. To do this, we’ll first write a function that takes in our metrics:

def makePlotsForBestSubsets(adjr2s, bics, aics, mses):
markerSize = 8
titlesFontSize = 18
axisLabelFontSize = 16

xvals = range(1, subsetsNmb + 1)

fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(10, 8))

ax2.plot(xvals, bics, '-o', markersize=markerSize)
ax2.set_ylabel('BIC', fontsize=titlesFontSize)

ax3.plot(xvals, aics, '-o', markersize=markerSize)
ax3.set_ylabel('AIC', fontsize=titlesFontSize)

ax4.plot(xvals, mses, '-o', markersize=markerSize)
ax4.set_ylabel('MSE', fontsize=titlesFontSize)

for ax in fig.axes:
ax.set_xlabel('Number of variables', fontsize=titlesFontSize)
ax.set_xlim(0.5, subsetsNmb + 1)
ax.set_xticks(range(2, subsetsNmb + 1, 2));
for tick in ax.xaxis.get_major_ticks():
tick.label.set_fontsize(axisLabelFontSize)
for tick in ax.yaxis.get_major_ticks():
tick.label.set_fontsize(axisLabelFontSize) ## Forward and Backward Stepwise Selection

Here, we’ll write functions for forward and backward stepwise selection. Forward and backward stepwise selection follows a very similar logic to our best subsect selection, but looks at a more restrictive set of models.

### Backward Selection

def doBwdSelectStep(X, y, cur_subset):
best_subset = []
best_mse = -1
for feature_idx in cur_subset:
reduced_subset = list(set(cur_subset) - {feature_idx})
X_subset = X.iloc[:, reduced_subset]
lin_reg = LinearRegression(fit_intercept=True, normalize=False)
lin_reg.fit(X_subset, y)
yhat = lin_reg.predict(X_subset)
mse_resid = mean_squared_error(y, yhat)
if best_mse < 0 or mse_resid < best_mse:
best_subset = reduced_subset
best_mse = mse_resid
return([best_subset, best_mse])

def doBwdStepwiseSelect(X, y, starting_set):
steps_nmb = len(starting_set)
best_subsets = [None] * steps_nmb
best_mses = [None] * steps_nmb

X_subset = X.iloc[:, starting_set]
lin_reg = LinearRegression(fit_intercept=True, normalize=False)
lin_reg.fit(X_subset, y)
yhat = lin_reg.predict(X_subset)
mse_resid = mean_squared_error(y, yhat)

best_subsets = starting_set
best_mses = mse_resid

for step in range(steps_nmb - 1):
best_subsets[step+1], best_mses[step+1] = doBwdSelectStep(X, y, best_subsets[step])
return([best_subsets, best_mses])


### Forward Selection

def doFwdSelectStep(X, y, cur_subset):
features_nmb = X.shape
new_features = set(range(features_nmb)) - set(cur_subset)

best_subset = []
best_mse = -1
for feature_idx in new_features:
increased_subset = cur_subset + [feature_idx]
X_subset = X.iloc[:, increased_subset]
lin_reg = LinearRegression(fit_intercept=True, normalize=False)
lin_reg.fit(X_subset, y)
yhat = lin_reg.predict(X_subset)
mse_resid = mean_squared_error(y, yhat)
if best_mse < 0 or mse_resid < best_mse:
best_subset = increased_subset
best_mse = mse_resid
return([best_subset, best_mse])

def doFwdStepwiseSelect(X, y, starting_set):
features_nmb = X.shape
steps_nmb = features_nmb - len(starting_set)
best_subsets = [None] * steps_nmb
best_mses = [None] * steps_nmb
prev_subset = starting_set
for step in range(steps_nmb):
best_subsets[step], best_mses[step] = doFwdSelectStep(X, y, prev_subset)
prev_subset = best_subsets[step]
return([best_subsets, best_mses])


### Executing Forward and Backward Stepwise Selection

best_bwd_subsets, best_bwd_mses = doBwdStepwiseSelect(X, y, list(range(19)))
# Reverse the lists:
best_bwd_subsets = best_bwd_subsets[::-1]
best_bwd_mses = best_bwd_mses[::-1]
bwd_sets_nmb = len(best_bwd_subsets)
bwd_bics = [None] * bwd_sets_nmb
bwd_aics = [None] * bwd_sets_nmb
for idx_set in range(bwd_sets_nmb):
X_subset = X.iloc[:, best_bwd_subsets[idx_set]].values
result = sm.OLS(y, X_subset).fit()
bwd_bics[idx_set] = result.bic
bwd_aics[idx_set] = result.aic

best_fwd_subsets, best_fwd_mses = doFwdStepwiseSelect(X, y, [])
fwd_sets_nmb = len(best_fwd_subsets)
fwd_bics = [None] * fwd_sets_nmb
fwd_aics = [None] * fwd_sets_nmb
for idx_set in range(fwd_sets_nmb):
X_subset = X.iloc[:, best_fwd_subsets[idx_set]].values
result = sm.OLS(y, X_subset).fit()
fwd_bics[idx_set] = result.bic
fwd_aics[idx_set] = result.aic


### Plotting

Again, let’s make plots for our subsets produced here, using the function we wrote earlier.

#### Backward Subsets

makePlotsForBestSubsets(bwd_adjr2s, bwd_bics, bwd_aics, best_bwd_mses) #### Forward Subsets

makePlotsForBestSubsets(fwd_adjr2s, fwd_bics, fwd_aics, best_fwd_mses) 