Classification
Table of contents
- Getting Started
- Logit
- Linear Discriminant Analysis
- Quadratic Discriminant Analysis
- k-Nearest Neighbors
- An Application to Caravan Insurance Data
- More Iris Classification
In this tutorial, we will be exploring several classification techniques.
The code in sections 1-6 was provided by Professor Kucheryavyy; I have broken the code down into a few smaller pieces and added some comments and explanations that should help your understanding. Sections 1-5 provide in-depth examples of several new classification techniques applied to a binary classification problem (one involving just two classes). Section 6 provides a few more examples.
Section 7 is a continuation of my previous tutorial on k-nearest neighbors classification; you can refer to this section for simple examples of the new techniques we learn here, applied to a classification problem with multiple classes (in this case, three classes).
You can view the code for this tutorial here.
Getting Started
Importing Libraries
import itertools
import pandas as pd
import numpy as np
import copy
import statsmodels.api as sm
import statsmodels.formula.api as smf  # formula interface, used for smf.glm below
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import train_test_split, GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn import neighbors        # k-nearest neighbors classifier
from sklearn import preprocessing    # used to standardize the Caravan data
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Markdown, display
Plot and Output Settings
We’ll also introduce a few extra settings just to make the output of each of our cells a bit nicer:
# Reset all styles to the default:
plt.rcParams.update(plt.rcParamsDefault)
# Then make graphs inline:
%matplotlib inline
# Useful function for Jupyter to display text in bold:
def displaybd(text):
    display(Markdown("**" + text + "**"))
If you would like your plots to be a bit larger, please use the following code:
plt.rcParams['figure.figsize'] = (7, 6)
plt.rcParams['font.size'] = 24
plt.rcParams['legend.fontsize'] = 'large'
plt.rcParams['figure.titlesize'] = 'large'
plt.rcParams['lines.markersize'] = 10
Our Dataset
In this tutorial, we will be using a dataset on the stock market, which can be downloaded here. This dataset is from An Introduction to Statistical Learning, with Applications in R (Springer, 2013).
As usual, we can use `read_csv` to create a pandas dataframe:
smarket = pd.read_csv('Smarket.csv', parse_dates=False)
Note that this dataset contains a column `Direction`, which takes on two different values, either `Up` or `Down`. To make this column easier to work with in our regressions, we want to represent these values numerically. Let’s have `Up` be `1` and `Down` be `0`. To do this, we can use `np.where`:
smarket["DirectionCode"] = np.where(smarket["Direction"].str.contains("Up"), 1, 0)
Now, let’s get a bit more familiar with our data:
display(smarket[1:10])
Year | Lag1 | Lag2 | Lag3 | Lag4 | Lag5 | Volume | Today | Direction | DirectionCode | |
---|---|---|---|---|---|---|---|---|---|---|
1 | 2001 | 0.959 | 0.381 | -0.192 | -2.624 | -1.055 | 1.2965 | 1.032 | Up | 1 |
2 | 2001 | 1.032 | 0.959 | 0.381 | -0.192 | -2.624 | 1.4112 | -0.623 | Down | 0 |
3 | 2001 | -0.623 | 1.032 | 0.959 | 0.381 | -0.192 | 1.2760 | 0.614 | Up | 1 |
4 | 2001 | 0.614 | -0.623 | 1.032 | 0.959 | 0.381 | 1.2057 | 0.213 | Up | 1 |
5 | 2001 | 0.213 | 0.614 | -0.623 | 1.032 | 0.959 | 1.3491 | 1.392 | Up | 1 |
6 | 2001 | 1.392 | 0.213 | 0.614 | -0.623 | 1.032 | 1.4450 | -0.403 | Down | 0 |
7 | 2001 | -0.403 | 1.392 | 0.213 | 0.614 | -0.623 | 1.4078 | 0.027 | Up | 1 |
8 | 2001 | 0.027 | -0.403 | 1.392 | 0.213 | 0.614 | 1.1640 | 1.303 | Up | 1 |
9 | 2001 | 1.303 | 0.027 | -0.403 | 1.392 | 0.213 | 1.2326 | 0.287 | Up | 1 |
display(smarket.describe())
Year | Lag1 | Lag2 | Lag3 | Lag4 | Lag5 | Volume | Today | DirectionCode | |
---|---|---|---|---|---|---|---|---|---|
count | 1250.000000 | 1250.000000 | 1250.000000 | 1250.000000 | 1250.000000 | 1250.00000 | 1250.000000 | 1250.000000 | 1250.000000 |
mean | 2003.016000 | 0.003834 | 0.003919 | 0.001716 | 0.001636 | 0.00561 | 1.478305 | 0.003138 | 0.518400 |
std | 1.409018 | 1.136299 | 1.136280 | 1.138703 | 1.138774 | 1.14755 | 0.360357 | 1.136334 | 0.499861 |
min | 2001.000000 | -4.922000 | -4.922000 | -4.922000 | -4.922000 | -4.92200 | 0.356070 | -4.922000 | 0.000000 |
25% | 2002.000000 | -0.639500 | -0.639500 | -0.640000 | -0.640000 | -0.64000 | 1.257400 | -0.639500 | 0.000000 |
50% | 2003.000000 | 0.039000 | 0.039000 | 0.038500 | 0.038500 | 0.03850 | 1.422950 | 0.038500 | 1.000000 |
75% | 2004.000000 | 0.596750 | 0.596750 | 0.596750 | 0.596750 | 0.59700 | 1.641675 | 0.596750 | 1.000000 |
max | 2005.000000 | 5.733000 | 5.733000 | 5.733000 | 5.733000 | 5.73300 | 3.152470 | 5.733000 | 1.000000 |
displaybd("Correlations matrix:")
display(smarket.corr())
Correlation matrix:
Year | Lag1 | Lag2 | Lag3 | Lag4 | Lag5 | Volume | Today | DirectionCode | |
---|---|---|---|---|---|---|---|---|---|
Year | 1.000000 | 0.029700 | 0.030596 | 0.033195 | 0.035689 | 0.029788 | 0.539006 | 0.030095 | 0.074608 |
Lag1 | 0.029700 | 1.000000 | -0.026294 | -0.010803 | -0.002986 | -0.005675 | 0.040910 | -0.026155 | -0.039757 |
Lag2 | 0.030596 | -0.026294 | 1.000000 | -0.025897 | -0.010854 | -0.003558 | -0.043383 | -0.010250 | -0.024081 |
Lag3 | 0.033195 | -0.010803 | -0.025897 | 1.000000 | -0.024051 | -0.018808 | -0.041824 | -0.002448 | 0.006132 |
Lag4 | 0.035689 | -0.002986 | -0.010854 | -0.024051 | 1.000000 | -0.027084 | -0.048414 | -0.006900 | 0.004215 |
Lag5 | 0.029788 | -0.005675 | -0.003558 | -0.018808 | -0.027084 | 1.000000 | -0.022002 | -0.034860 | 0.005423 |
Volume | 0.539006 | 0.040910 | -0.043383 | -0.041824 | -0.048414 | -0.022002 | 1.000000 | 0.014592 | 0.022951 |
Today | 0.030095 | -0.026155 | -0.010250 | -0.002448 | -0.006900 | -0.034860 | 0.014592 | 1.000000 | 0.730563 |
DirectionCode | 0.074608 | -0.039757 | -0.024081 | 0.006132 | 0.004215 | 0.005423 | 0.022951 | 0.730563 | 1.000000 |
smarket["Volume"].plot()
plt.xlabel("Day");
plt.ylabel("Volume");
Logit
Running Logit via GLM
A generalized linear model usually refers to a model in which the dependent variable \(y\) follows some non-normal distribution whose mean \(\mu\) is assumed to be some (often nonlinear) function of the independent variables \(x\). Note that generalized linear models are different from general linear models.
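In our case, the logit model assumes that the probability of an upward market movement is a logistic function of a linear combination of the predictors:
\[
\Pr(\text{Direction} = \text{Up} \mid X) = \frac{e^{\beta_0 + \beta_1 \mathrm{Lag1} + \dots + \beta_5 \mathrm{Lag5} + \beta_6 \mathrm{Volume}}}{1 + e^{\beta_0 + \beta_1 \mathrm{Lag1} + \dots + \beta_5 \mathrm{Lag5} + \beta_6 \mathrm{Volume}}},
\]
so the log-odds \(\log\bigl(p/(1-p)\bigr)\) are linear in the predictors; this is the “logit” link function reported in the summary below. We will use the generalized linear models from the statsmodels package to run logit: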
model = smf.glm("DirectionCode~Lag1+Lag2+Lag3+Lag4+Lag5+Volume", data=smarket,
family=sm.families.Binomial())
res = model.fit()
display(res.summary())
Dep. Variable: | DirectionCode | No. Observations: | 1250 |
---|---|---|---|
Model: | GLM | Df Residuals: | 1243 |
Model Family: | Binomial | Df Model: | 6 |
Link Function: | logit | Scale: | 1.0000 |
Method: | IRLS | Log-Likelihood: | -863.79 |
Date: | Sat, 27 Jun 2020 | Deviance: | 1727.6 |
Time: | 20:46:14 | Pearson chi2: | 1.25e+03 |
No. Iterations: | 4 | ||
Covariance Type: | nonrobust |
coef | std err | z | P>|z| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | -0.1260 | 0.241 | -0.523 | 0.601 | -0.598 | 0.346 |
Lag1 | -0.0731 | 0.050 | -1.457 | 0.145 | -0.171 | 0.025 |
Lag2 | -0.0423 | 0.050 | -0.845 | 0.398 | -0.140 | 0.056 |
Lag3 | 0.0111 | 0.050 | 0.222 | 0.824 | -0.087 | 0.109 |
Lag4 | 0.0094 | 0.050 | 0.187 | 0.851 | -0.089 | 0.107 |
Lag5 | 0.0103 | 0.050 | 0.208 | 0.835 | -0.087 | 0.107 |
Volume | 0.1354 | 0.158 | 0.855 | 0.392 | -0.175 | 0.446 |
Predicted Probabilities and Confusion Matrix
displaybd("Predicted probabilities for the first observations:")
DirectionProbs = res.predict()
print(DirectionProbs[0:10])
DirectionHat = np.where(DirectionProbs > 0.5, "Up", "Down")
confusionDF = pd.crosstab(DirectionHat, smarket["Direction"],
rownames=['Predicted'], colnames=['Actual'],
margins=True)
display(Markdown("***"))
displaybd("Confusion matrix:")
display(confusionDF)
displaybd("Share of correctly predicted market movements:")
print(np.mean(smarket['Direction'] == DirectionHat))
Predicted probabilities for the first observations:
[0.50708413 0.48146788 0.48113883 0.51522236 0.51078116 0.50695646
0.49265087 0.50922916 0.51761353 0.48883778]
Confusion matrix:
Actual | Down | Up | All |
---|---|---|---|
Predicted | |||
Down | 145 | 141 | 286 |
Up | 457 | 507 | 964 |
All | 602 | 648 | 1250 |
Share of correctly predicted market movements:
0.5216
Estimation of Test Error
Here, we’ll first train a model on the data from before 2005, and then test it on the data from 2005.
train = (smarket['Year'] < 2005)
smarket2005 = smarket[~train]
displaybd("Dimensions of the validation set:")
print(smarket2005.shape)
model = smf.glm("DirectionCode~Lag1+Lag2+Lag3+Lag4+Lag5+Volume", data=smarket,
family=sm.families.Binomial(), subset=train)
res = model.fit()
DirectionProbsTest = res.predict(smarket2005)
DirectionTestHat = np.where(DirectionProbsTest > 0.5, "Up", "Down")
displaybd("Share of correctly predicted market movements in 2005:")
print(np.mean(smarket2005['Direction'] == DirectionTestHat))
Dimensions of the validation set:
(252, 10)
Share of correctly predicted market movements in 2005:
0.4801587301587302
Linear Discriminant Analysis
Linear discriminant analysis is a robust classification method that relies on the following assumptions:
- the class conditional distributions are Gaussian
- these Gaussians have the same covariance matrix (i.e., we assume homoskedasticity)
Even without these assumptions, linear discriminant analysis can be used as a form of dimensionality reduction, so it is especially well suited to high-dimensional data. Thus, we would want to use linear discriminant analysis when we want to reduce the number of features (reduce the dimensionality) while preserving the distinction between our classes, as the sketch below illustrates.
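As a quick illustration of this second use, here is a minimal sketch of LDA as a dimensionality reducer. It is not part of the stock-market analysis; it uses scikit-learn’s built-in iris data purely to show the transform step:
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# A small three-class dataset, used here only for illustration
X_demo, y_demo = load_iris(return_X_y=True)

# Project the four original features onto at most (number of classes - 1) = 2
# linear discriminants while preserving the separation between the classes
lda_reducer = LinearDiscriminantAnalysis(n_components=2)
X_demo_reduced = lda_reducer.fit_transform(X_demo, y_demo)

print(X_demo.shape)          # (150, 4)
print(X_demo_reduced.shape)  # (150, 2)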
Custom Output Functions
Before getting started with linear discriminant analysis, we’ll write a couple of our own functions that’ll help display some of our calculations nicely:
def printPriorProbabilities(ldaClasses, ldaPriors):
    priorsDF = pd.DataFrame()
    for cIdx, cName in enumerate(ldaClasses):
        priorsDF[cName] = [ldaPriors[cIdx]]
    displaybd('Prior probabilities of groups:')
    display(Markdown(priorsDF.to_html(index=False)))

def printGroupMeans(ldaClasses, featuresNames, ldaGroupMeans):
    displaybd("Group means:")
    groupMeansDF = pd.DataFrame(index=ldaClasses)
    for fIdx, fName in enumerate(featuresNames):
        groupMeansDF[fName] = ldaGroupMeans[:, fIdx]
    display(groupMeansDF)

def printLDACoeffs(featuresNames, ldaCoeffs):
    coeffDF = pd.DataFrame(index=featuresNames)
    for cIdx in range(ldaCoeffs.shape[0]):
        colName = "LDA" + str(cIdx + 1)
        coeffDF[colName] = ldaCoeffs[cIdx]
    displaybd("Coefficients of linear discriminants:")
    display(coeffDF)
Fitting an LDA Model
Here, we’ll be using scikit-learn’s `LinearDiscriminantAnalysis` class to fit our model:
outcomeName = 'Direction'
featuresNames = ['Lag1', 'Lag2'];
X_train = smarket.loc[train, featuresNames]
y_train = smarket.loc[train, outcomeName]
lda = LinearDiscriminantAnalysis()
ldaFit = lda.fit(X_train, y_train);
printPriorProbabilities(ldaFit.classes_, ldaFit.priors_)
printGroupMeans(ldaFit.classes_, featuresNames, ldaFit.means_)
printLDACoeffs(featuresNames, ldaFit.coef_)
# Coefficients calculated by Python's LDA are different from R's LDA
# But they are proportional:
printLDACoeffs(featuresNames, 11.580267503964166 * ldaFit.coef_)
# See this: https://stats.stackexchange.com/questions/87479/what-are-coefficients-of-linear-discriminants-in-lda
Prior probabilities of groups:
Down | Up |
---|---|
0.491984 | 0.508016 |
Group means:
Lag1 | Lag2 | |
---|---|---|
Down | 0.042790 | 0.033894 |
Up | -0.039546 | -0.031325 |
Coefficients of linear discriminants:
LDA1 | |
---|---|
Lag1 | -0.055441 |
Lag2 | -0.044345 |
Coefficients of linear discriminants:
LDA1 | |
---|---|
Lag1 | -0.642019 |
Lag2 | -0.513529 |
LDA Predictions
X_test = smarket.loc[~train, featuresNames]
y_test = smarket.loc[~train, outcomeName]
y_hat = ldaFit.predict(X_test)
confusionDF = pd.crosstab(y_hat, y_test,
rownames=['Predicted'], colnames=['Actual'],
margins=True)
displaybd("Confusion matrix:")
display(confusionDF)
displaybd("Share of correctly predicted market movements:")
print(np.mean(y_test == y_hat))
Confusion matrix:
Actual | Down | Up | All |
---|---|---|---|
Predicted | |||
Down | 35 | 35 | 70 |
Up | 76 | 106 | 182 |
All | 111 | 141 | 252 |
Share of correctly predicted market movements:
0.5595238095238095
Posterior Probabilities
Here, we’ll estimate posterior propbabilities, using scikit-learn’s predict_proba
function:
pred_p = lda.predict_proba(X_test)
# pred_p is an array of shape (number of observations) x (number of classes)
upNmb = np.sum(pred_p[:, 1] > 0.5)
displaybd("Number of upward movements with threshold 0.5: " + str(upNmb))
upNmb = np.sum(pred_p[:, 1] > 0.9)
displaybd("Number of upward movements with threshold 0.9: " + str(upNmb))
Number of upward movements with threshold 0.5: 182
Number of upward movements with threshold 0.9: 0
Quadratic Discriminant Analysis
Quadratic discriminant analysis is a generalization of linear discriminant analysis as a classifier, but it does not make the same covariance assumption.
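For reference, QDA estimates a separate covariance matrix \(\Sigma_k\) for each class \(k\), which gives the discriminant function
\[
\delta_k(x) = -\tfrac{1}{2}\log\lvert\Sigma_k\rvert - \tfrac{1}{2}(x - \mu_k)^\top \Sigma_k^{-1}(x - \mu_k) + \log\pi_k,
\]
and each observation is assigned to the class with the largest \(\delta_k(x)\). Because the quadratic term in \(x\) no longer cancels across classes, the decision boundaries are quadratic rather than linear; if all the \(\Sigma_k\) were equal, we would recover LDA.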
Fitting a QDA Model
Here, we’ll be using scikit-learn’s `QuadraticDiscriminantAnalysis` class to fit our model:
qda = QuadraticDiscriminantAnalysis()
qdaFit = qda.fit(X_train, y_train);
printPriorProbabilities(qdaFit.classes_, qdaFit.priors_)
printGroupMeans(qdaFit.classes_, featuresNames, qdaFit.means_)
Prior probabilities of groups:
Down | Up |
---|---|
0.491984 | 0.508016 |
Group means:
Lag1 | Lag2 | |
---|---|---|
Down | 0.042790 | 0.033894 |
Up | -0.039546 | -0.031325 |
QDA Predictions
y_hat = qdaFit.predict(X_test)
confusionDF = pd.crosstab(y_hat, y_test,
rownames=['Predicted'], colnames=['Actual'],
margins=True)
displaybd("Confusion matrix:")
display(confusionDF)
displaybd("Share of correctly predicted market movements:")
print(np.mean(y_test == y_hat))
Confusion matrix:
Actual | Down | Up | All |
---|---|---|---|
Predicted | |||
Down | 30 | 20 | 50 |
Up | 81 | 121 | 202 |
All | 111 | 141 | 252 |
Share of correctly predicted market movements:
0.5992063492063492
k-Nearest Neighbors
Here, we’ll be looking at k-nearest neighbors, which we talked about in lecture 02 of this course. Tutorial 02 was also on k-nearest neighbors classification, so please refer to that tutorial for additional examples and explanations.
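Below, we fit models with one neighbor and with three neighbors. If you would like to compare more values of \(k\) at once, a small loop works; the sketch below simply reuses the X_train, y_train, X_test, and y_test objects defined in the LDA section above:
from sklearn import neighbors

# Compare test accuracy for a few illustrative choices of k
for k in [1, 3, 5, 10]:
    knn = neighbors.KNeighborsClassifier(n_neighbors=k)
    y_hat = knn.fit(X_train, y_train).predict(X_test)
    print("k =", k, "accuracy:", np.mean(y_test == y_hat))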
One Neighbor
knn = neighbors.KNeighborsClassifier(n_neighbors=1)
y_hat = knn.fit(X_train, y_train).predict(X_test)
confusionDF = pd.crosstab(y_hat, y_test,
rownames=['Predicted'], colnames=['Actual'],
margins=True)
displaybd("Confusion matrix:")
display(confusionDF)
displaybd("Share of correctly predicted market movements:")
print(np.mean(y_test == y_hat))
Confusion matrix:
Actual | Down | Up | All |
---|---|---|---|
Predicted | |||
Down | 43 | 58 | 101 |
Up | 68 | 83 | 151 |
All | 111 | 141 | 252 |
Share of correctly predicted market movements:
0.5
Three Neighbors
knn = neighbors.KNeighborsClassifier(n_neighbors=3)
y_hat = knn.fit(X_train, y_train).predict(X_test)
confusionDF = pd.crosstab(y_hat, y_test,
rownames=['Predicted'], colnames=['Actual'],
margins=True)
displaybd("Confusion matrix:")
display(confusionDF)
displaybd("Share of correctly predicted market movements:")
print(np.mean(y_test == y_hat))
Confusion matrix:
Actual | Down | Up | All |
---|---|---|---|
Predicted | |||
Down | 48 | 55 | 103 |
Up | 63 | 86 | 149 |
All | 111 | 141 | 252 |
Share of correctly predicted market movements:
0.5317460317460317
An Application to Caravan Insurance Data
This section will demonstrate the use of two techniques we learned above, KNN and logit.
A New Dataset
We’ll be using a new dataset that contains information on customers of an insurance company. You can see a detailed description of this dataset here.
Loading Our Dataset
caravan = pd.read_csv('Caravan.csv', index_col=0)
display(caravan.describe())
display(caravan.describe(include=[object]))
MOSTYPE | MAANTHUI | MGEMOMV | MGEMLEEF | MOSHOOFD | MGODRK | MGODPR | MGODOV | MGODGE | MRELGE | ... | ALEVEN | APERSONG | AGEZONG | AWAOREG | ABRAND | AZEILPL | APLEZIER | AFIETS | AINBOED | ABYSTAND | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 5822.000000 | 5822.000000 | 5822.000000 | 5822.000000 | 5822.000000 | 5822.000000 | 5822.000000 | 5822.000000 | 5822.000000 | 5822.000000 | ... | 5822.000000 | 5822.000000 | 5822.000000 | 5822.000000 | 5822.000000 | 5822.000000 | 5822.000000 | 5822.000000 | 5822.000000 | 5822.000000 |
mean | 24.253349 | 1.110615 | 2.678805 | 2.991240 | 5.773617 | 0.696496 | 4.626932 | 1.069907 | 3.258502 | 6.183442 | ... | 0.076606 | 0.005325 | 0.006527 | 0.004638 | 0.570079 | 0.000515 | 0.006012 | 0.031776 | 0.007901 | 0.014256 |
std | 12.846706 | 0.405842 | 0.789835 | 0.814589 | 2.856760 | 1.003234 | 1.715843 | 1.017503 | 1.597647 | 1.909482 | ... | 0.377569 | 0.072782 | 0.080532 | 0.077403 | 0.562058 | 0.022696 | 0.081632 | 0.210986 | 0.090463 | 0.119996 |
min | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 10.000000 | 1.000000 | 2.000000 | 2.000000 | 3.000000 | 0.000000 | 4.000000 | 0.000000 | 2.000000 | 5.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
50% | 30.000000 | 1.000000 | 3.000000 | 3.000000 | 7.000000 | 0.000000 | 5.000000 | 1.000000 | 3.000000 | 6.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
75% | 35.000000 | 1.000000 | 3.000000 | 3.000000 | 8.000000 | 1.000000 | 6.000000 | 2.000000 | 4.000000 | 7.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
max | 41.000000 | 10.000000 | 5.000000 | 6.000000 | 10.000000 | 9.000000 | 9.000000 | 5.000000 | 9.000000 | 9.000000 | ... | 8.000000 | 1.000000 | 1.000000 | 2.000000 | 7.000000 | 1.000000 | 2.000000 | 3.000000 | 2.000000 | 2.000000 |
8 rows × 85 columns
Purchase | |
---|---|
count | 5822 |
unique | 2 |
top | No |
freq | 5474 |
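Note that only 348 of the 5,822 customers (about 6%) purchased caravan insurance, so a classifier that always predicts No would already be right about 94% of the time. Keep this imbalance in mind when interpreting the accuracy numbers below.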
Standardizing Our Data
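Because KNN classifies an observation according to its distance from other observations, variables measured on large scales would dominate variables measured on small scales. To put all of the predictors on a comparable footing, we standardize each column to have mean zero and standard deviation one using preprocessing.scale: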
y = caravan.Purchase
X = caravan.drop('Purchase', axis=1).astype('float64')
X_scaled = preprocessing.scale(X)
Splitting Data into Train and Test Data
X_train = X_scaled[1000:,:]
y_train = y[1000:]
X_test = X_scaled[:1000,:]
y_test = y[:1000]
Using KNN for Prediction
knn = neighbors.KNeighborsClassifier(n_neighbors=1)
y_hat = knn.fit(X_train, y_train).predict(X_test)
confusionDF = pd.crosstab(y_hat, y_test,
rownames=['Predicted'], colnames=['Actual'],
margins=True)
displaybd("Confusion matrix:")
display(confusionDF)
displaybd("Share of correctly predicted purchases:")
print(np.mean(y_test == y_hat))
Confusion matrix:
Actual | No | Yes | All |
---|---|---|---|
Predicted | |||
No | 873 | 50 | 923 |
Yes | 68 | 9 | 77 |
All | 941 | 59 | 1000 |
Share of correctly predicted purchases:
0.882
Logit
X_train_w_constant = sm.add_constant(X_train)
X_test_w_constant = sm.add_constant(X_test, has_constant='add')
y_train_code = np.where(y_train == "No", 0, 1)
res = sm.GLM(y_train_code, X_train_w_constant, family=sm.families.Binomial()).fit()
y_hat_code = res.predict(X_test_w_constant)
PurchaseHat = np.where(y_hat_code > 0.25, "Yes", "No")
confusionDF = pd.crosstab(PurchaseHat, y_test,
rownames=['Predicted'], colnames=['Actual'],
margins=True)
displaybd("Confusion matrix:")
display(confusionDF)
Confusion matrix:
Actual | No | Yes | All |
---|---|---|---|
Predicted | |||
No | 919 | 48 | 967 |
Yes | 22 | 11 | 33 |
All | 941 | 59 | 1000 |
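Here we used a probability cutoff of 0.25 rather than 0.5: because so few customers actually buy insurance, very few predicted probabilities exceed 0.5, and lowering the threshold lets the model flag a meaningful group of likely buyers. Among the 33 customers predicted to purchase, 11 (about 33%) actually do, which is much better than the roughly 6% success rate we would expect from soliciting customers at random. For instance, you can compute this success rate directly from the objects defined above:
displaybd("Share of predicted buyers who actually purchase:")
print(np.mean(y_test[PurchaseHat == "Yes"] == "Yes"))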
More Iris Classification
Here, we will apply some of the new techniques we learned above to the iris classification problem we explored using k-nearest neighbors in Tutorial 02.
Our Dataset
I’ve included some of the important descriptions from Tutorial 02 in this tutorial as well, but please review tutorial 02 for more details on how we initially set up and process our dataset.
As a reminder, we are using the iris data set from the University of California, Irvine and are attempting to classify types of irises using the following four attributes:
- Sepal length
- Sepal width
- Petal length
- Petal width
There are three types of irises:
- Iris Setosa
- Iris Versicolor
- Iris Virginica
Importing Our Dataset
Let’s import the data set as a `pandas` dataframe:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'type']
iris_df = pd.read_csv(url, names=names)
Splitting Data into Train and Test Data
Let’s first define our \(X\) and \(y\) variables:
X = iris_df.iloc[:, :-1]  # attributes; iloc[:, :-1] selects every column except the last
y = iris_df['type']       # labels
Now, let’s split our data into 80% training data and 20% testing data. We can do this using `train_test_split` and its `train_size` parameter:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.80)
Feature Scaling
Now, we want to perform some feature scaling to normalize the range of our independent variables.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
Logit
Let’s first take a look at how we might apply logit to our iris classification problem. You could apply logit using the techniques we learned above (using GLM), but I will show you one other method we can employ using scikit-learn’s `LogisticRegression` class, since we can consider logit and logistic regression to be the same thing.
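One practical difference to be aware of: by default, scikit-learn’s LogisticRegression applies L2 regularization to the coefficients (controlled by its penalty and C parameters), whereas the statsmodels GLM fit we used earlier is unpenalized, so the two approaches can produce slightly different coefficient estimates.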
Fitting Our Model
Let’s import the `LogisticRegression` class and fit our model as follows:
from sklearn.linear_model import LogisticRegression
logit_model = LogisticRegression()
logit_model.fit(X_train, y_train)
Making Predictions
Then, we’ll make some predictions and store them in a variable called `y_pred`:
y_pred = logit_model.predict(X_test)
Evaluating Our Predictions
Like we did in Tutorial 02, let’s make a classification report and confusion matrix.
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
precision | recall | f1-score | support | |
---|---|---|---|---|
Iris-setosa | 1.00 | 1.00 | 1.00 | 9 |
Iris-versicolor | 1.00 | 0.70 | 0.82 | 10 |
Iris-virginica | 0.79 | 1.00 | 0.88 | 11 |
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
cm_df = pd.DataFrame(cm,
index = ['setosa','versicolor','virginica'],
columns = ['setosa','versicolor','virginica'])
sns.heatmap(cm_df, annot=True)
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()
Using this heat map, we can make the following observations:
- All setosa flowers were correctly classified by our model.
- Seven versicolor flowers were correctly classified, and three versicolor flowers were incorrectly classified as virginica flowers.
- All virginica flowers were correctly classified by our model.
Again, you may not get the same exact classification report or confusion matrix, but this is normal, as your results will vary each time you run your model.
Linear Discriminant Analysis
Let’s now try using linear discriminant analysis for our classification.
Fitting Our Model
Again, let’s use the `LinearDiscriminantAnalysis` class to fit our model:
lda_model = LinearDiscriminantAnalysis()
lda_model.fit(X_train, y_train)
Making Predictions
Then, we’ll make some predictions and store them in a variable called `y_pred`:
y_pred = lda_model.predict(X_test)
Evaluating Our Predictions
Like we did in Tutorial 02, let’s make a classification report and confusion matrix. If you want, you can also use the functions `printPriorProbabilities()`, `printGroupMeans()`, and `printLDACoeffs()` that we wrote earlier, but here I’ll keep it simple and just look at our classification report and heatmap as we did above.
print(classification_report(y_test, y_pred))
precision | recall | f1-score | support | |
---|---|---|---|---|
Iris-setosa | 1.00 | 1.00 | 1.00 | 9 |
Iris-versicolor | 1.00 | 1.00 | 1.00 | 10 |
Iris-virginica | 1.00 | 1.00 | 1.00 | 11 |
In this case, we can see our model did very well. Let’s also take a look at the heatmap to see that a little bit more easily:
cm = confusion_matrix(y_test, y_pred)
cm_df = pd.DataFrame(cm,
index = ['setosa','versicolor','virginica'],
columns = ['setosa','versicolor','virginica'])
sns.heatmap(cm_df, annot=True)
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()
Using this heat map, we can make the following observations:
- All setosa flowers were correctly classified by our model.
- All versicolors were correctly classified by our model.
- All virginica flowers were correctly classified by our model.
Again, you may not get the same exact classification report or confusion matrix, but this is normal, as your results will vary each time you run your model.
Quadratic Discriminant Analysis
Let’s now try using quadratic discriminant analysis for our classification.
Fitting Our Model
Again, let’s fit our model, this time using the `QuadraticDiscriminantAnalysis` class:
qda_model = QuadraticDiscriminantAnalysis()
qda_model.fit(X_train, y_train)
Making Predictions
Then, we’ll make some predictions and store them in a variable called `y_pred`:
y_pred = qda_model.predict(X_test)
Evaluating Our Predictions
Like we did in Tutorial 02, let’s make a classification report and confusion matrix. If you want, you can also use the functions `printPriorProbabilities()` and `printGroupMeans()` that we wrote earlier, but here I’ll keep it simple and just look at our classification report and heatmap as we did above.
print(classification_report(y_test, y_pred))
precision | recall | f1-score | support | |
---|---|---|---|---|
Iris-setosa | 1.00 | 1.00 | 1.00 | 9 |
Iris-versicolor | 1.00 | 0.80 | 0.89 | 10 |
Iris-virginica | 0.85 | 1.00 | 0.92 | 11 |
cm = confusion_matrix(y_test, y_pred)
cm_df = pd.DataFrame(cm,
index = ['setosa','versicolor','virginica'],
columns = ['setosa','versicolor','virginica'])
sns.heatmap(cm_df, annot=True)
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()
Using this heat map, we can make the following observations:
- All setosa flowers were correctly classified by our model.
- Eight versicolor flowers were correctly classified, and two versicolor flowers were incorrectly classified as virginica flowers.
- All virginica flowers were correctly classified by our model.
Again, you may not get the same exact classification report or confusion matrix, but this is normal, as your results will vary each time you run your model.