Bootstrapping for Linear Regression
Table of contents
- Importing Libraries
- Our Data Set
- Visualizing Our Data
- Fitting the Model
- Using Bootstrap
- Constructing a Confidence Interval
You can view the code for this tutorial here.
In Homework 02, you fitted a linear regression model for your data, which may have looked something like this:
\[f_{\hat{\theta}}(x) = \hat{\theta_0} + \hat{\theta_1}x_1 + \cdots + \hat{\theta_p}x_p\]
Now, we would like to construct confidence intervals for the estimated coefficients \(\hat{\theta_0}, \hat{\theta_1}, \cdots, \hat{\theta_p}\) and perhaps infer the true coefficients of our model. The bootstrap is a computational method that allows us to calculate standard errors and confidence intervals for our parameters.
Importing Libraries
As usual, we will want to use `numpy`, `pandas`, `matplotlib`, and `seaborn` to help us manipulate and visualize our data.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Our Data Set
In this example, I’ll be estimating the price of houses in King County, Washington.
You can obtain the dataset here.
There are 21 columns in this dataset, but for the purposes of this tutorial, we will just look at a couple. Let's look at the following parameters to estimate price:
- `sqft_living`: the size, in square feet, of the living area of the house
- `yr_built`: the year in which the house was built
After downloading, read our data using something like the following:
df = pd.read_csv("kc_house_data.csv", usecols=["price", "sqft_living", "yr_built"])
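As a quick, optional sanity check that the file loaded as expected, we can look at the dimensions of the resulting dataframe:

# Number of (rows, columns) we just loaded
print(df.shape)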
Visualizing Our Data
Before doing anything, let's get a little more familiar with our data. As we learned in Tutorial 02, first use `head()` to see the first five rows of our dataframe:
df.head()
|   | price    | sqft_living | yr_built |
|---|----------|-------------|----------|
| 0 | 221900.0 | 1180        | 1955     |
| 1 | 538000.0 | 2570        | 1951     |
| 2 | 180000.0 | 770         | 1933     |
| 3 | 604000.0 | 1960        | 1965     |
| 4 | 510000.0 | 1680        | 1987     |
Instead of looking at the year built, we'll look at the age of the house (the dataset covers sales through 2015, so we use 2015 as the reference year):
df["age of house"] = 2015 - df["yr_built"]
|   | price    | sqft_living | yr_built | age of house |
|---|----------|-------------|----------|--------------|
| 0 | 221900.0 | 1180        | 1955     | 60           |
| 1 | 538000.0 | 2570        | 1951     | 64           |
| 2 | 180000.0 | 770         | 1933     | 82           |
| 3 | 604000.0 | 1960        | 1965     | 50           |
| 4 | 510000.0 | 1680        | 1987     | 28           |
In reality, there's also a column `yr_renovated` in the original dataset, but for simplicity, we will ignore it in this tutorial.
Let’s also create scatter plots to visualize the relationships between our variables:
plt.xlabel("sqft_living")
plt.ylabel("price")
plt.scatter(x=df["sqft_living"], y=df["price"])
plt.xlabel("age of house")
plt.ylabel("price")
plt.scatter(x=df["age of house"], y=df["price"])
Fitting the Model
First, let’s organize our data into X and y, what we’re trying to predict (price, in our case).
X = df.loc[:, ["sqft_living", "age of house"]]
y = df.loc[:, "price"]
Note: Visually, we can see that the age of the home does not seem to be strongly correlated with the price of the home. In a real data analysis, we may want to investigate further and/or remove this from our regression.
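One quick, optional way to check this is pandas' built-in correlation matrix; the off-diagonal entries give the pairwise correlations (run it to see the values for this dataset):

# Optional check: pairwise correlations between price and our two features
df[["price", "sqft_living", "age of house"]].corr()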
Now, let’s use scikit-learn
to perform linear regression (there’s no need to write the code for this on our own):
import sklearn.linear_model as lm
linear_model = lm.LinearRegression()
linear_model.fit(X, y)
print("""
intercept: %.2f
sqft_living: %.2f
age of house: %.2f
""" % (tuple([linear_model.intercept_]) + tuple(linear_model.coef_)))
intercept: -196929.07
sqft_living: 304.57
age of house: 2353.73
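As a quick sanity check, we can use the fitted model to predict the price of a hypothetical house; the 2000 sqft and 40-year figures below are made-up illustrative inputs, not values from the dataset:

# Predict the price of a hypothetical house: 2000 sqft of living area, 40 years old
new_house = pd.DataFrame({"sqft_living": [2000], "age of house": [40]})
print("predicted price: %.2f" % linear_model.predict(new_house)[0])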
Above are the (estimates of) coefficients we get from our linear regression model. Now, we want to bootstrap our observations.
Using Bootstrap
For the purposes of this tutorial, we can write a simple resampling function using random integers generated by NumPy (note that these are pseudorandom, not truly random, numbers):

def simple_resample(n):
    # Draw n indices uniformly at random from {0, ..., n-1}, with replacement
    return np.random.randint(low=0, high=n, size=n)
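For intuition: each call returns n indices drawn with replacement, so some rows will appear multiple times in a resample and others not at all. For example (actual values vary from run to run):

# Each of the 5 indices is drawn independently from {0, ..., 4};
# repeats are expected since we sample with replacement
simple_resample(5)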
Now, let’s write out a very general bootstrap function that takes in a bootstrap population (not the true population), some statistic, a resampling function (set default to our simple_resample
), and the amount of replicates (set default to 10000):
def bootstrap(boot_pop, statistic, resample=simple_resample, replicates=10000):
    n = len(boot_pop)
    # Compute the statistic on `replicates` resamples of the bootstrap population
    resample_estimates = np.array([statistic(boot_pop[resample(n)]) for _ in range(replicates)])
    return resample_estimates
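Since the resamples are pseudorandom, your numbers will differ slightly from the ones below on each run; if you want reproducible results, you can optionally seed NumPy's generator first:

# Optional: fix the seed so the bootstrap replicates (and the interval below) are reproducible
np.random.seed(42)  # any integer works; 42 is an arbitrary choice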
Let’s just focus on the coefficient for sqft_living
. We will sample with replacement from our bootstrap sample (our original data), and fit a new linear regression model to the sampled data. The coefficient for sqft_living
will be used as our bootstrap statistic:
def sqft_coeff(data_array):
    # Column 0 is price (y); columns 1 and 2 are age of house and sqft_living (X)
    X = data_array[:, 1:]
    y = data_array[:, 0]

    linear_model = lm.LinearRegression()
    model = linear_model.fit(X, y)

    theta_sqft = model.coef_[1]  # second feature column is sqft_living
    return theta_sqft
data_array = df.loc[:, ["price", "age of house", "sqft_living"]].values
theta_hat_sampling = bootstrap(data_array, sqft_coeff)
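The introduction mentioned standard errors: the standard deviation of these bootstrap estimates is the bootstrap estimate of the coefficient's standard error, so we can read it off directly (your values will vary with the resamples):

# The spread of the bootstrap distribution estimates the coefficient's standard error
print("bootstrap mean: %.2f" % theta_hat_sampling.mean())
print("bootstrap SE: %.2f" % theta_hat_sampling.std())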
Constructing a Confidence Interval
First, let’s construct a histogram of our sampling distribution to get a visual approximation:
plt.hist(theta_hat_sampling, bins = 30, density = True)
plt.xlabel("$\\tilde{\\theta}_{sqft}$ Values")
plt.ylabel("Proportion per Unit")
plt.title("Bootstrap Sampling Distribution of $\\tilde{\\theta}_{sqft}$ (Nonparametric)");
plt.show()
Though we cannot directly measure the true coefficient, we can construct a confidence interval to account for the variability in the sample coefficient. For a 95% confidence interval, we take the 2.5th and 97.5th percentiles of the bootstrap estimates as our endpoints:
np.percentile(theta_hat_sampling, 2.5), np.percentile(theta_hat_sampling, 97.5)
(293.0643343501463, 316.5597894314288)
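Equivalently, `np.percentile` accepts a list of percentiles, so both endpoints can be computed in one call (a stylistic alternative, not a different method):

# Both endpoints at once: the 2.5th and 97.5th percentiles of the bootstrap estimates
ci_lower, ci_upper = np.percentile(theta_hat_sampling, [2.5, 97.5])
print("95%% CI for the sqft_living coefficient: (%.2f, %.2f)" % (ci_lower, ci_upper))

In other words, under this model, each additional square foot of living area is associated with roughly $293 to $317 more in sale price, holding the age of the house fixed.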