Bootstrapping for Linear Regression

Table of contents

  1. Importing Libraries
  2. Our Data Set
  3. Visualizing Our Data
  4. Fitting the Model
  5. Using Bootstrap
  6. Constructing a Confidence Interval

You can view the code for this tutorial here.

In Homework 02, you fitted a linear regression model for your data, which may have looked something like this:

\[f_{\hat{\theta}}(x) = \hat{\theta_0} + \hat{\theta_1}x_1 + \cdots + \hat{\theta_p}x_p\]

Now, we would like to construct confidence intervals for the estimated coefficients \(\hat{\theta_0}, \hat{\theta_1}, \cdots, \hat{\theta_p} \) and perhaps infer the true coefficients of our model. The bootstrap is a computational method that allows us to calculate standard errors and confidence intervals for our parameters without making strong distributional assumptions.

Importing Libraries

As usual, we will want to use numpy, pandas, matplotlib, and seaborn to help us manipulate and visualize our data.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Our Data Set

In this example, I’ll be estimating the price of houses in King County, Washington.

You can obtain the dataset here.

There are 21 columns in this dataset, but for the purposes of this tutorial, we will just look at a couple. Let’s look at the following parameters to estimate price:

  • sqft_living: the size, in square feet, of the living area of the house
  • yr_built: the year in which the house was built

After downloading, read our data using something like the following:

df = pd.read_csv("kc_house_data.csv", usecols=["price", "sqft_living", "yr_built"])

Visualizing Our Data

Before doing anything, let’s get a little bit more familiar with our data. As we learned in Tutorial 02, first use head() to see the first five rows of our dataframe:

df.head()
      price  sqft_living  yr_built
0  221900.0         1180      1955
1  538000.0         2570      1951
2  180000.0          770      1933
3  604000.0         1960      1965
4  510000.0         1680      1987

Instead of looking at the year built, we’ll look at the age of the house:

df["age of house"] = 2015 - df["yr_built"]
      price  sqft_living  yr_built  age of house
0  221900.0         1180      1955            60
1  538000.0         2570      1951            64
2  180000.0          770      1933            82
3  604000.0         1960      1965            50
4  510000.0         1680      1987            28

In reality, there’s also a column yr_renovated in the original dataset, but for simplicity, we will ignore this in this tutorial.

Let’s also create scatter plots to visualize the relationships between our variables:

plt.xlabel("sqft_living")
plt.ylabel("price")
plt.scatter(x=df["sqft_living"], y=df["price"])

[Figure: scatter plot of price vs. sqft_living]

plt.xlabel("age of house")
plt.ylabel("price")
plt.scatter(x=df["age of house"], y=df["price"])

[Figure: scatter plot of price vs. age of house]

Fitting the Model

First, let’s organize our data into X, our feature matrix, and y, the quantity we’re trying to predict (price, in our case).

X = df.loc[:, ["sqft_living", "age of house"]]
y = df.loc[:, "price"]

Note: Visually, we can see that the age of the home does not seem to be strongly correlated with the price of the home. In a real data analysis, we may want to investigate further and/or remove this from our regression.
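To go beyond eyeballing the scatter plots, we can quantify this with pairwise correlations. The sketch below uses only the five preview rows from the table above as a hypothetical stand-in; the real check would run df.corr() on the full dataframe:

```python
import pandas as pd

# Hypothetical stand-in for df, using just the five preview rows shown earlier
df = pd.DataFrame({
    "price": [221900.0, 538000.0, 180000.0, 604000.0, 510000.0],
    "sqft_living": [1180, 2570, 770, 1960, 1680],
    "age of house": [60, 64, 82, 50, 28],
})

# Pairwise Pearson correlations between all numeric columns
corr = df.corr()
print(corr.loc["price", "sqft_living"])
print(corr.loc["price", "age of house"])
```

On the full dataset, a coefficient near zero for price vs. age of house would support removing that feature from the regression.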

Now, let’s use scikit-learn to perform linear regression (there’s no need to write the code for this on our own):

import sklearn.linear_model as lm

linear_model = lm.LinearRegression()
linear_model.fit(X, y)

print("""
intercept: %.2f
sqft_living: %.2f
age of house: %.2f
""" % (tuple([linear_model.intercept_]) + tuple(linear_model.coef_)))

intercept: -196929.07

sqft_living: 304.57

age of house: 2353.73

Above are the (estimates of) coefficients we get from our linear regression model. Now, we want to bootstrap our observations.
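As a quick aside, we can sanity-check what fit is doing: on synthetic data with known coefficients (hypothetical numbers, not the housing data), ordinary least squares should recover those coefficients closely:

```python
import numpy as np
import sklearn.linear_model as lm

# Synthetic data with known coefficients (hypothetical, not the housing data)
rng = np.random.default_rng(1)
X = rng.uniform(0, 100, size=(1000, 2))
y = -5.0 + 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(0, 1.0, size=1000)

# Fit and compare the estimates against the true values (-5.0, 3.0, 2.0)
model = lm.LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)
```

With the housing data, we have no such ground truth, which is exactly why we need the bootstrap to gauge the uncertainty in our estimates.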

Using Bootstrap

For the purposes of this tutorial, we can write a simple resampling function using random integers generated by NumPy (note that these are pseudorandom, not truly random, numbers):

def simple_resample(n):
    return np.random.randint(low=0, high=n, size=n)

Now, let’s write a very general bootstrap function that takes in a bootstrap population (not the true population), some statistic, a resampling function (defaulting to our simple_resample), and the number of replicates (defaulting to 10000):

def bootstrap(boot_pop, statistic, resample = simple_resample, replicates = 10000):
    n = len(boot_pop)
    resample_estimates = np.array([statistic(boot_pop[resample(n)]) for _ in range(replicates)])
    return resample_estimates
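Before applying this to the regression coefficients, we can check the function on a toy array by bootstrapping the mean (a hypothetical sanity check, not part of the housing analysis):

```python
import numpy as np

def simple_resample(n):
    return np.random.randint(low=0, high=n, size=n)

def bootstrap(boot_pop, statistic, resample=simple_resample, replicates=10000):
    n = len(boot_pop)
    return np.array([statistic(boot_pop[resample(n)]) for _ in range(replicates)])

# Toy check: bootstrap the mean of 0..99 (the sample mean is 49.5)
np.random.seed(0)
toy = np.arange(100, dtype=float)
boot_means = bootstrap(toy, np.mean, replicates=1000)
print(boot_means.mean())  # should land near 49.5
```

Each replicate draws n indices with replacement, so the bootstrap means scatter around the original sample mean, which is the behavior we rely on below.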

Let’s focus on just the coefficient for sqft_living. We will sample with replacement from our bootstrap population (our original sample) and fit a new linear regression model to each resample. The coefficient for sqft_living from each fit will be our bootstrap statistic:

def sqft_coeff(data_array):
    X = data_array[:, 1:]  # feature columns: age of house, sqft_living
    y = data_array[:, 0]   # target column: price

    linear_model = lm.LinearRegression()
    model = linear_model.fit(X, y)
    theta_sqft = model.coef_[1]  # sqft_living is the second feature column

    return theta_sqft

data_array = df.loc[:, ["price", "age of house", "sqft_living"]].values

theta_hat_sampling = bootstrap(data_array, sqft_coeff)

Constructing a Confidence Interval

First, let’s construct a histogram of our sampling distribution to get a visual approximation:

plt.hist(theta_hat_sampling, bins = 30, density = True)
plt.xlabel("$\\tilde{\\theta}_{sqft}$ Values")
plt.ylabel("Proportion per Unit")
plt.title("Bootstrap Sampling Distribution of $\\tilde{\\theta}_{sqft}$ (Nonparametric)");
plt.show()

[Figure: histogram of the bootstrap sampling distribution of the sqft_living coefficient]

Though we cannot directly measure the true coefficient, we can construct a confidence interval to account for variability in the sample coefficient. For a 95% confidence interval, we take the 2.5th and 97.5th percentiles of the bootstrap distribution as our endpoints:

np.percentile(theta_hat_sampling, 2.5), np.percentile(theta_hat_sampling, 97.5)

(293.0643343501463, 316.5597894314288)
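An alternative to the percentile interval is a normal-approximation interval built from the bootstrap standard error. The sketch below uses a synthetic stand-in for theta_hat_sampling (hypothetical normal draws centered near our estimate), since the real array comes from the bootstrap run above:

```python
import numpy as np

# Synthetic stand-in for theta_hat_sampling (hypothetical values)
rng = np.random.default_rng(0)
theta_hat_sampling = rng.normal(loc=304.57, scale=6.0, size=10000)

# Bootstrap standard error and a normal-approximation 95% interval
se = np.std(theta_hat_sampling)
center = np.mean(theta_hat_sampling)
normal_ci = (center - 1.96 * se, center + 1.96 * se)

# Percentile interval, as in the tutorial
percentile_ci = (np.percentile(theta_hat_sampling, 2.5),
                 np.percentile(theta_hat_sampling, 97.5))
```

When the bootstrap distribution is roughly symmetric and bell-shaped, as the histogram above suggests, the two intervals nearly coincide; when it is skewed, the percentile interval is usually the safer choice.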