
Subset Selection Methods

Table of contents

  1. Importing Libraries
  2. Our Dataset
  3. Best Subset Selection
    1. Best Subset Selection Functions
    2. Executing Best Subset Selection
    3. Plotting
  4. Forward and Backward Stepwise Selection
    1. Backward Selection
    2. Forward Selection
    3. Executing Forward and Backward Stepwise Selection
    4. Plotting
      1. Backward Subsets
      2. Forward Subsets

In this tutorial, we will write several functions to select and evaluate subsets of the baseball-player features used to predict salary.

Much of this code was provided by Professor Kucheryavyy; I have broken the code down into a few smaller pieces and added some comments and explanations that should help your understanding. You can view the code for this tutorial here.

Importing Libraries

We will be using several libraries to aid our modeling and graphing:

import itertools
import pandas as pd
import numpy as np
import copy

import statsmodels.api as sm
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import train_test_split, GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Markdown, display

# Reset all styles to the default:
plt.rcParams.update(plt.rcParamsDefault)
# Then make graphs inline:
%matplotlib inline

# Set custom style settings:
# NB: We need to separate "matplotlib inline" call and these settings into different
# cells, otherwise the parameters are not set. This is a bug somewhere in Jupyter
plt.rcParams['figure.figsize'] = (7, 6)
plt.rcParams['font.size'] = 24
plt.rcParams['legend.fontsize'] = 'large'
plt.rcParams['figure.titlesize'] = 'large'
plt.rcParams['lines.markersize'] = 10

We won’t necessarily use all of these in this specific tutorial, but please import the above so that you have a wide array of tools at your disposal. Feel free to use these libraries in your homework as well!

Our Dataset

In this tutorial, we will be taking a look at Hitters.csv, a dataset that describes several attributes of Major League Baseball players, such as salary and number of home runs.

First, let’s load our data with the following:

hittersDF = pd.read_csv('Hitters.csv', na_values=[""])

If you look at the first few rows of the data (hittersDF.head()), you may notice that the first column has no name. We’ll give it a name using the following:

hittersDF.rename(columns={hittersDF.columns[0] : "Name"}, inplace=True, copy=False)
hittersDF.set_index('Name', inplace=True)

We will also remove any missing values with the following:

hittersDF.dropna(inplace=True)

Note: inplace=True means that we modify the original dataframe itself rather than creating a modified copy.
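If you would rather keep the original dataframe untouched, the non-inplace equivalent simply assigns the returned copy to a new variable. A minimal sketch (the variable name hitters_no_na is just for illustration):

# dropna() without inplace=True returns a new dataframe and leaves hittersDF unchanged
hitters_no_na = hittersDF.dropna()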

We want to convert categorical variables into dummy variables, which allows us to include qualitative variables in a regression. Let’s run the following:

dummies = pd.get_dummies(hittersDF[['League', 'Division', 'NewLeague']])
X = hittersDF.drop(['Salary', 'League', 'Division', 'NewLeague'], axis=1)
cat_features = ['League_N', 'Division_W', 'NewLeague_N']
num_features = list(X.columns)
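If you want to see exactly what get_dummies produced, you can inspect its columns. With pandas’ default settings, each categorical column should expand into one indicator column per level, which is why we then pick out League_N, Division_W, and NewLeague_N:

# Quick check of the generated indicator columns; the comment below shows what we expect to see
print(dummies.columns.tolist())
# ['League_A', 'League_N', 'Division_E', 'Division_W', 'NewLeague_A', 'NewLeague_N']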

Now, we can better define our X and y variables like so:

X = pd.concat([X, dummies[cat_features]], axis=1).astype('float64')
y = hittersDF.Salary
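As a quick sanity check, after dropping missing values the Hitters data should contain 263 players, and X should have 19 columns (16 numeric features plus the 3 dummy variables):

print(X.shape)   # expected: (263, 19)
print(y.shape)   # expected: (263,)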

Let’s check out the first ten rows for our features:

display(X[0:10])

| Name | AtBat | Hits | HmRun | Runs | RBI | Walks | Years | CAtBat | CHits | CHmRun | CRuns | CRBI | CWalks | PutOuts | Assists | Errors | League_N | Division_W | NewLeague_N |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Alan Ashby | 315.0 | 81.0 | 7.0 | 24.0 | 38.0 | 39.0 | 14.0 | 3449.0 | 835.0 | 69.0 | 321.0 | 414.0 | 375.0 | 632.0 | 43.0 | 10.0 | 1.0 | 1.0 | 1.0 |
| Alvin Davis | 479.0 | 130.0 | 18.0 | 66.0 | 72.0 | 76.0 | 3.0 | 1624.0 | 457.0 | 63.0 | 224.0 | 266.0 | 263.0 | 880.0 | 82.0 | 14.0 | 0.0 | 1.0 | 0.0 |
| Andre Dawson | 496.0 | 141.0 | 20.0 | 65.0 | 78.0 | 37.0 | 11.0 | 5628.0 | 1575.0 | 225.0 | 828.0 | 838.0 | 354.0 | 200.0 | 11.0 | 3.0 | 1.0 | 0.0 | 1.0 |
| Andres Galarraga | 321.0 | 87.0 | 10.0 | 39.0 | 42.0 | 30.0 | 2.0 | 396.0 | 101.0 | 12.0 | 48.0 | 46.0 | 33.0 | 805.0 | 40.0 | 4.0 | 1.0 | 0.0 | 1.0 |
| Alfredo Griffin | 594.0 | 169.0 | 4.0 | 74.0 | 51.0 | 35.0 | 11.0 | 4408.0 | 1133.0 | 19.0 | 501.0 | 336.0 | 194.0 | 282.0 | 421.0 | 25.0 | 0.0 | 1.0 | 0.0 |
| Al Newman | 185.0 | 37.0 | 1.0 | 23.0 | 8.0 | 21.0 | 2.0 | 214.0 | 42.0 | 1.0 | 30.0 | 9.0 | 24.0 | 76.0 | 127.0 | 7.0 | 1.0 | 0.0 | 0.0 |
| Argenis Salazar | 298.0 | 73.0 | 0.0 | 24.0 | 24.0 | 7.0 | 3.0 | 509.0 | 108.0 | 0.0 | 41.0 | 37.0 | 12.0 | 121.0 | 283.0 | 9.0 | 0.0 | 1.0 | 0.0 |
| Andres Thomas | 323.0 | 81.0 | 6.0 | 26.0 | 32.0 | 8.0 | 2.0 | 341.0 | 86.0 | 6.0 | 32.0 | 34.0 | 8.0 | 143.0 | 290.0 | 19.0 | 1.0 | 1.0 | 1.0 |
| Andre Thornton | 401.0 | 92.0 | 17.0 | 49.0 | 66.0 | 65.0 | 13.0 | 5206.0 | 1332.0 | 253.0 | 784.0 | 890.0 | 866.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Alan Trammell | 574.0 | 159.0 | 21.0 | 107.0 | 75.0 | 59.0 | 10.0 | 4631.0 | 1300.0 | 90.0 | 702.0 | 504.0 | 488.0 | 238.0 | 445.0 | 22.0 | 0.0 | 0.0 | 0.0 |

Best Subset Selection

Best Subset Selection Functions

Now, let’s write a couple functions that will help us select the best subset of our features.

We will first write a function that takes our set of features (X), the variable we want to predict (y, salary in this case), and a specific size for our subset (subset_size). It loops through all possible combinations of subset_size features from X; in each iteration, it fits a linear regression model and computes the mean squared error. By keeping track of the lowest mean squared error seen so far, we can select the best subset at the end:

def findBestSubsetFixedSize(X, y, subset_size):
    features_nmb = X.shape[1]
    best_subset = []
    best_mse = -1
    for idx_set in itertools.combinations(range(features_nmb), subset_size):
        X_subset = X.iloc[:, list(idx_set)]
        lin_reg = LinearRegression(fit_intercept=True)
        lin_reg.fit(X_subset, y)
        yhat = lin_reg.predict(X_subset)
        mse_resid = mean_squared_error(y, yhat)
        if best_mse < 0 or mse_resid < best_mse:
            best_subset = list(idx_set)
            best_mse = mse_resid
    return([best_subset, best_mse])
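As a usage example, the call below searches for the best pair of predictors. The returned indices refer to column positions in X; with this dataset they should come out as [1, 11], i.e. Hits and CRBI, matching the recorded results further down:

best_pair, best_pair_mse = findBestSubsetFixedSize(X, y, 2)
print([X.columns[i] for i in best_pair], best_pair_mse)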

We also want a function that lets us set an upper limit on the subset size. It simply loops over every subset size from 1 up to that limit and passes each one to the function we just wrote, findBestSubsetFixedSize():

def findBestSubset(X, y, max_subset_size):
    best_subsets = [None] * max_subset_size
    best_mses = [None] * max_subset_size
    for subset_size in range(1, max_subset_size + 1):
        best_subsets[subset_size-1], best_mses[subset_size-1] =\
            findBestSubsetFixedSize(X, y, subset_size)

    return([best_subsets, best_mses])
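To see why this exhaustive search is slow, count the models it fits: for p features and all subset sizes from 1 to p there are 2^p - 1 candidate subsets, and each one requires fitting a regression. A quick back-of-the-envelope check using the standard library’s math.comb:

from math import comb

p = 19
print(sum(comb(p, k) for k in range(1, p + 1)))  # 524287 models for the full search
print(sum(comb(p, k) for k in range(1, 4)))      # 1159 models when max_subset_size = 3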

Executing Best Subset Selection

We’ll now use this function to perform our subset selection. At the very end, we will produce some metrics that will help us evaluate our selections:

  • adjr2s = adjusted R-squared
  • bics = Bayesian Information Criterion (BIC)
  • aics = Akaike Information Criterion (AIC)
# Since the exhaustive search takes a really long time to complete,
# I recorded the result of this procedure
run_exhaustive_search = False
if run_exhaustive_search:
    # This procedure takes really long time to complete!
    best_subsets, best_mses = findBestSubset(X, y, 3)
else:
    best_subsets = [[11],
                    [1, 11],
                    [1, 11, 13],
                    [1, 11, 13, 17],
                    [0, 1, 11, 13, 17],
                    [0, 1, 5, 11, 13, 17],
                    [1, 5, 7, 8, 9, 13, 17],
                    [0, 1, 5, 9, 10, 12, 13, 17],
                    [0, 1, 5, 7, 10, 11, 12, 13, 17],
                    [0, 1, 5, 7, 10, 11, 12, 13, 14, 17],
                    [0, 1, 5, 7, 10, 11, 12, 13, 14, 16, 17],
                    [0, 1, 3, 5, 7, 10, 11, 12, 13, 14, 16, 17],
                    [0, 1, 3, 5, 7, 10, 11, 12, 13, 14, 15, 16, 17],
                    [0, 1, 2, 3, 5, 7, 10, 11, 12, 13, 14, 15, 16, 17],
                    [0, 1, 2, 3, 5, 7, 8, 10, 11, 12, 13, 14, 15, 16, 17],
                    [0, 1, 2, 3, 4, 5, 7, 8, 10, 11, 12, 13, 14, 15, 16, 17],
                    [0, 1, 2, 3, 4, 5, 7, 8, 10, 11, 12, 13, 14, 15, 16, 17, 18],
                    [0, 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 15, 16, 17, 18],
                    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]]
    best_mses = [137565.32036137575, 116526.84368963055, 111214.05648618752, 106353.04872933947,
                 103231.5567757093,   99600.39516195898,  98503.98289210549,  95577.68037627422,
                  94350.00527219362,  93157.42029558783,  92727.54772410596,  92521.79611890586,
                  92354.1742898915,   92200.22963038784,  92148.96332783563,  92088.88772977113,
                  92051.12835223945,  92022.19527998423,  92017.86901772919]
    
adjr2s = [None] * len(best_subsets)
bics = [None] * len(best_subsets)
aics = [None] * len(best_subsets)
for idx_set in range(len(best_subsets)):
    X_subset = X.iloc[:, best_subsets[idx_set]].values
    X_subset = sm.tools.tools.add_constant(X_subset)
    result = sm.OLS(y, X_subset).fit()
    adjr2s[idx_set] = result.rsquared_adj
    bics[idx_set] = result.bic
    aics[idx_set] = result.aic
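Each criterion can favor a different model size: adjusted R-squared is maximized, while AIC and BIC are minimized. A small snippet like the following reports the subset size preferred by each criterion (the + 1 converts a 0-based list index into a subset size):

print('Best size by adjusted R2:', int(np.argmax(adjr2s)) + 1)
print('Best size by BIC:', int(np.argmin(bics)) + 1)
print('Best size by AIC:', int(np.argmin(aics)) + 1)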

Plotting

As always, we want to create some plots so that we can better visualize and understand what’s going on. To do this, we’ll first write a function that takes in our metrics:

def makePlotsForBestSubsets(adjr2s, bics, aics, mses):
    markerSize = 8
    titlesFontSize = 18
    axisLabelFontSize = 16
    
    subsetsNmb = len(adjr2s)
    xvals = range(1, subsetsNmb + 1)
    
    fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(10, 8))
    plt.subplots_adjust(wspace=0.45, hspace=0.35) 

    ax1.plot(xvals, adjr2s, '-o', markersize=markerSize)
    ax1.set_ylabel('Adjusted R2', fontsize=titlesFontSize)

    ax2.plot(xvals, bics, '-o', markersize=markerSize)
    ax2.set_ylabel('BIC', fontsize=titlesFontSize)

    ax3.plot(xvals, aics, '-o', markersize=markerSize)
    ax3.set_ylabel('AIC', fontsize=titlesFontSize)

    ax4.plot(xvals, mses, '-o', markersize=markerSize)
    ax4.set_ylabel('MSE', fontsize=titlesFontSize)

    for ax in fig.axes:
        ax.set_xlabel('Number of variables', fontsize=titlesFontSize)
        ax.set_xlim(0.5, subsetsNmb + 1)
        ax.set_xticks(range(2, subsetsNmb + 1, 2))
        # Set the tick-label size on both axes (the old tick.label attribute is deprecated in newer matplotlib)
        ax.tick_params(axis='both', labelsize=axisLabelFontSize)
            
makePlotsForBestSubsets(adjr2s, bics, aics, best_mses)

[Figure: adjusted R-squared, BIC, AIC, and MSE plotted against the number of variables for the best subsets.]

Forward and Backward Stepwise Selection

Here, we’ll write functions for forward and backward stepwise selection. These follow very similar logic to best subset selection but consider a much more restricted set of models.

In forward stepwise selection, we start with no predictors and add the single most useful predictor at each step. In backward stepwise selection, we start with all the predictors and remove the least useful one at each step.
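Because each step adds or drops only one feature, these procedures fit far fewer models than the exhaustive search: on the order of p(p+1)/2 fits instead of 2^p - 1. For our 19 features, a rough count (a sketch for intuition only):

p = 19
print(2 ** p - 1)        # exhaustive best subset: 524287 model fits
print(p * (p + 1) // 2)  # forward/backward stepwise: roughly 190 model fits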

Backward Selection

def doBwdSelectStep(X, y, cur_subset):
    best_subset = []
    best_mse = -1
    for feature_idx in cur_subset:
        reduced_subset = list(set(cur_subset) - {feature_idx})
        X_subset = X.iloc[:, reduced_subset]
        lin_reg = LinearRegression(fit_intercept=True)
        lin_reg.fit(X_subset, y)
        yhat = lin_reg.predict(X_subset)
        mse_resid = mean_squared_error(y, yhat)
        if best_mse < 0 or mse_resid < best_mse:
            best_subset = reduced_subset
            best_mse = mse_resid
    return([best_subset, best_mse])

def doBwdStepwiseSelect(X, y, starting_set):
    steps_nmb = len(starting_set)
    best_subsets = [None] * steps_nmb
    best_mses = [None] * steps_nmb

    X_subset = X.iloc[:, starting_set]
    lin_reg = LinearRegression(fit_intercept=True)
    lin_reg.fit(X_subset, y)
    yhat = lin_reg.predict(X_subset)
    mse_resid = mean_squared_error(y, yhat)

    best_subsets[0] = starting_set
    best_mses[0] = mse_resid

    for step in range(steps_nmb - 1):
        best_subsets[step+1], best_mses[step+1] = doBwdSelectStep(X, y, best_subsets[step])
    return([best_subsets, best_mses])

Forward Selection

def doFwdSelectStep(X, y, cur_subset):
    features_nmb = X.shape[1]
    new_features = set(range(features_nmb)) - set(cur_subset)

    best_subset = []
    best_mse = -1
    for feature_idx in new_features:
        increased_subset = cur_subset + [feature_idx]
        X_subset = X.iloc[:, increased_subset]
        lin_reg = LinearRegression(fit_intercept=True)
        lin_reg.fit(X_subset, y)
        yhat = lin_reg.predict(X_subset)
        mse_resid = mean_squared_error(y, yhat)
        if best_mse < 0 or mse_resid < best_mse:
            best_subset = increased_subset
            best_mse = mse_resid
    return([best_subset, best_mse])

def doFwdStepwiseSelect(X, y, starting_set):
    features_nmb = X.shape[1]
    steps_nmb = features_nmb - len(starting_set)
    best_subsets = [None] * steps_nmb
    best_mses = [None] * steps_nmb
    prev_subset = starting_set
    for step in range(steps_nmb):
        best_subsets[step], best_mses[step] = doFwdSelectStep(X, y, prev_subset)
        prev_subset = best_subsets[step]
    return([best_subsets, best_mses])

Executing Forward and Backward Stepwise Selection

best_bwd_subsets, best_bwd_mses = doBwdStepwiseSelect(X, y, list(range(19)))
# Reverse the lists:
best_bwd_subsets = best_bwd_subsets[::-1]
best_bwd_mses = best_bwd_mses[::-1]
bwd_sets_nmb = len(best_bwd_subsets)
bwd_adjr2s = [None] * bwd_sets_nmb
bwd_bics = [None] * bwd_sets_nmb
bwd_aics = [None] * bwd_sets_nmb
for idx_set in range(bwd_sets_nmb):
    X_subset = X.iloc[:, best_bwd_subsets[idx_set]].values
    X_subset = sm.tools.tools.add_constant(X_subset)
    result = sm.OLS(y, X_subset).fit()
    bwd_adjr2s[idx_set] = result.rsquared_adj
    bwd_bics[idx_set] = result.bic
    bwd_aics[idx_set] = result.aic

best_fwd_subsets, best_fwd_mses = doFwdStepwiseSelect(X, y, [])
fwd_sets_nmb = len(best_fwd_subsets)
fwd_adjr2s = [None] * fwd_sets_nmb
fwd_bics = [None] * fwd_sets_nmb
fwd_aics = [None] * fwd_sets_nmb
for idx_set in range(fwd_sets_nmb):
    X_subset = X.iloc[:, best_fwd_subsets[idx_set]].values
    X_subset = sm.tools.tools.add_constant(X_subset)
    result = sm.OLS(y, X_subset).fit()
    fwd_adjr2s[idx_set] = result.rsquared_adj
    fwd_bics[idx_set] = result.bic
    fwd_aics[idx_set] = result.aic
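Because forward and backward stepwise explore different sequences of models, they are not guaranteed to agree. After reversing the backward results above, both lists are indexed by subset size minus one, so a quick comparison looks like this:

for size in range(1, len(best_fwd_subsets) + 1):
    fwd = set(best_fwd_subsets[size - 1])
    bwd = set(best_bwd_subsets[size - 1])
    if fwd != bwd:
        print(f'size {size}: forward and backward chose different subsets')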

Plotting

Again, let’s make plots for the subsets produced here, using the function we wrote earlier.

Backward Subsets

makePlotsForBestSubsets(bwd_adjr2s, bwd_bics, bwd_aics, best_bwd_mses)

[Figure: adjusted R-squared, BIC, AIC, and MSE for the backward stepwise subsets.]

Forward Subsets

makePlotsForBestSubsets(fwd_adjr2s, fwd_bics, fwd_aics, best_fwd_mses)

[Figure: adjusted R-squared, BIC, AIC, and MSE for the forward stepwise subsets.]