Classification
Table of contents
- Getting Started
- Logit
- Linear Discriminant Analysis
- Quadratic Discriminant Analysis
- k-Nearest Neighbors
- An Application to Caravan Insurance Data
- More Iris Classification
In this tutorial, we will be exploring several classification techniques.
The code in sections 1-6 was provided by Professor Kucheryavyy; I have broken the code down into a few smaller pieces and added some comments and explanations that should help your understanding. Sections 1-5 provide in-depth examples of several new classification techniques applied to a binary classification problem (one involving just two classes). Section 6 provides a few more examples.
Section 7 is a continuation of my previous tutorial on k-nearest neighbors classification; you can refer to this section for simple examples of the new techniques we learn here, applied to a classification problem with multiple classes (in this case, three classes).
You can view the code for this tutorial here.
Getting Started
Importing Libraries
import itertools
import pandas as pd
import numpy as np
import copy
import statsmodels.api as sm
import statsmodels.formula.api as smf  # formula interface, used for smf.glm below
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import train_test_split, GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn import neighbors        # k-nearest neighbors classifier
from sklearn import preprocessing    # used to standardize the Caravan data
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Markdown, display
Plot and Output Settings
We’ll also introduce a few extra settings just to make the output of each of our cells a bit nicer:
# Reset all styles to the default:
plt.rcParams.update(plt.rcParamsDefault)
# Then make graphs inline:
%matplotlib inline
# Useful function for Jupyter to display text in bold:
def displaybd(text):
    display(Markdown("**" + text + "**"))
If you would like your plots to be a bit larger, please use the following code:
plt.rcParams['figure.figsize'] = (7, 6)
plt.rcParams['font.size'] = 24
plt.rcParams['legend.fontsize'] = 'large'
plt.rcParams['figure.titlesize'] = 'large'
plt.rcParams['lines.markersize'] = 10
Our Dataset
In this tutorial, we will be using a dataset on the stock market, which can be downloaded here. This dataset is from An Introduction to Statistical Learning, with Applications in R (Springer, 2013).
As usual, we can use `read_csv` to create a pandas dataframe:
smarket = pd.read_csv('Smarket.csv', parse_dates=False)
Note that this dataset contains a column `Direction`, which takes on two different values, either `Up` or `Down`. To make this column easier to work with in our regressions, we want to represent these values numerically. Let’s have `Up` be `1` and `Down` be `0`. To do this, we can use `np.where`:
smarket["DirectionCode"] = np.where(smarket["Direction"].str.contains("Up"), 1, 0)
Now, let’s get a bit more familiar with our data:
display(smarket[1:10])
Year | Lag1 | Lag2 | Lag3 | Lag4 | Lag5 | Volume | Today | Direction | DirectionCode | |
---|---|---|---|---|---|---|---|---|---|---|
1 | 2001 | 0.959 | 0.381 | -0.192 | -2.624 | -1.055 | 1.2965 | 1.032 | Up | 1 |
2 | 2001 | 1.032 | 0.959 | 0.381 | -0.192 | -2.624 | 1.4112 | -0.623 | Down | 0 |
3 | 2001 | -0.623 | 1.032 | 0.959 | 0.381 | -0.192 | 1.2760 | 0.614 | Up | 1 |
4 | 2001 | 0.614 | -0.623 | 1.032 | 0.959 | 0.381 | 1.2057 | 0.213 | Up | 1 |
5 | 2001 | 0.213 | 0.614 | -0.623 | 1.032 | 0.959 | 1.3491 | 1.392 | Up | 1 |
6 | 2001 | 1.392 | 0.213 | 0.614 | -0.623 | 1.032 | 1.4450 | -0.403 | Down | 0 |
7 | 2001 | -0.403 | 1.392 | 0.213 | 0.614 | -0.623 | 1.4078 | 0.027 | Up | 1 |
8 | 2001 | 0.027 | -0.403 | 1.392 | 0.213 | 0.614 | 1.1640 | 1.303 | Up | 1 |
9 | 2001 | 1.303 | 0.027 | -0.403 | 1.392 | 0.213 | 1.2326 | 0.287 | Up | 1 |
display(smarket.describe())
Year | Lag1 | Lag2 | Lag3 | Lag4 | Lag5 | Volume | Today | DirectionCode | |
---|---|---|---|---|---|---|---|---|---|
count | 1250.000000 | 1250.000000 | 1250.000000 | 1250.000000 | 1250.000000 | 1250.00000 | 1250.000000 | 1250.000000 | 1250.000000 |
mean | 2003.016000 | 0.003834 | 0.003919 | 0.001716 | 0.001636 | 0.00561 | 1.478305 | 0.003138 | 0.518400 |
std | 1.409018 | 1.136299 | 1.136280 | 1.138703 | 1.138774 | 1.14755 | 0.360357 | 1.136334 | 0.499861 |
min | 2001.000000 | -4.922000 | -4.922000 | -4.922000 | -4.922000 | -4.92200 | 0.356070 | -4.922000 | 0.000000 |
25% | 2002.000000 | -0.639500 | -0.639500 | -0.640000 | -0.640000 | -0.64000 | 1.257400 | -0.639500 | 0.000000 |
50% | 2003.000000 | 0.039000 | 0.039000 | 0.038500 | 0.038500 | 0.03850 | 1.422950 | 0.038500 | 1.000000 |
75% | 2004.000000 | 0.596750 | 0.596750 | 0.596750 | 0.596750 | 0.59700 | 1.641675 | 0.596750 | 1.000000 |
max | 2005.000000 | 5.733000 | 5.733000 | 5.733000 | 5.733000 | 5.73300 | 3.152470 | 5.733000 | 1.000000 |
displaybd("Correlations matrix:")
display(smarket.corr())
Correlation matrix:
Year | Lag1 | Lag2 | Lag3 | Lag4 | Lag5 | Volume | Today | DirectionCode | |
---|---|---|---|---|---|---|---|---|---|
Year | 1.000000 | 0.029700 | 0.030596 | 0.033195 | 0.035689 | 0.029788 | 0.539006 | 0.030095 | 0.074608 |
Lag1 | 0.029700 | 1.000000 | -0.026294 | -0.010803 | -0.002986 | -0.005675 | 0.040910 | -0.026155 | -0.039757 |
Lag2 | 0.030596 | -0.026294 | 1.000000 | -0.025897 | -0.010854 | -0.003558 | -0.043383 | -0.010250 | -0.024081 |
Lag3 | 0.033195 | -0.010803 | -0.025897 | 1.000000 | -0.024051 | -0.018808 | -0.041824 | -0.002448 | 0.006132 |
Lag4 | 0.035689 | -0.002986 | -0.010854 | -0.024051 | 1.000000 | -0.027084 | -0.048414 | -0.006900 | 0.004215 |
Lag5 | 0.029788 | -0.005675 | -0.003558 | -0.018808 | -0.027084 | 1.000000 | -0.022002 | -0.034860 | 0.005423 |
Volume | 0.539006 | 0.040910 | -0.043383 | -0.041824 | -0.048414 | -0.022002 | 1.000000 | 0.014592 | 0.022951 |
Today | 0.030095 | -0.026155 | -0.010250 | -0.002448 | -0.006900 | -0.034860 | 0.014592 | 1.000000 | 0.730563 |
DirectionCode | 0.074608 | -0.039757 | -0.024081 | 0.006132 | 0.004215 | 0.005423 | 0.022951 | 0.730563 | 1.000000 |
smarket["Volume"].plot()
plt.xlabel("Day");
plt.ylabel("Volume");
Logit
Running Logit via GLM
A generalized linear model usually refers to a model in which the dependent variable \(y\) follows some non-normal distribution whose mean \(\mu\) is assumed to be some (often nonlinear) function of the independent variables \(x\). Note that generalized linear models are different from general linear models.
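In our case, the logit model assumes that the probability of an upward market movement is a logistic function of a linear combination of the predictors:
\[
\Pr(\text{Direction} = \text{Up} \mid X) = \frac{e^{\beta_0 + \beta_1 \mathrm{Lag1} + \dots + \beta_5 \mathrm{Lag5} + \beta_6 \mathrm{Volume}}}{1 + e^{\beta_0 + \beta_1 \mathrm{Lag1} + \dots + \beta_5 \mathrm{Lag5} + \beta_6 \mathrm{Volume}}},
\]
so the log-odds \(\log\bigl(p/(1-p)\bigr)\) are linear in the predictors; this is the “logit” link function reported in the summary below. We will use the generalized linear models from the statsmodels package to run logit: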
model = smf.glm("DirectionCode~Lag1+Lag2+Lag3+Lag4+Lag5+Volume", data=smarket,
family=sm.families.Binomial())
res = model.fit()
display(res.summary())
Dep. Variable: | DirectionCode | No. Observations: | 1250 |
---|---|---|---|
Model: | GLM | Df Residuals: | 1243 |
Model Family: | Binomial | Df Model: | 6 |
Link Function: | logit | Scale: | 1.0000 |
Method: | IRLS | Log-Likelihood: | -863.79 |
Date: | Sat, 27 Jun 2020 | Deviance: | 1727.6 |
Time: | 20:46:14 | Pearson chi2: | 1.25e+03 |
No. Iterations: | 4 | ||
Covariance Type: | nonrobust |
coef | std err | z | P>|z| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | -0.1260 | 0.241 | -0.523 | 0.601 | -0.598 | 0.346 |
Lag1 | -0.0731 | 0.050 | -1.457 | 0.145 | -0.171 | 0.025 |
Lag2 | -0.0423 | 0.050 | -0.845 | 0.398 | -0.140 | 0.056 |
Lag3 | 0.0111 | 0.050 | 0.222 | 0.824 | -0.087 | 0.109 |
Lag4 | 0.0094 | 0.050 | 0.187 | 0.851 | -0.089 | 0.107 |
Lag5 | 0.0103 | 0.050 | 0.208 | 0.835 | -0.087 | 0.107 |
Volume | 0.1354 | 0.158 | 0.855 | 0.392 | -0.175 | 0.446 |
Predicted Probabilities and Confusion Matrix
displaybd("Predicted probabilities for the first observations:")
DirectionProbs = res.predict()
print(DirectionProbs[0:10])
DirectionHat = np.where(DirectionProbs > 0.5, "Up", "Down")
confusionDF = pd.crosstab(DirectionHat, smarket["Direction"],
rownames=['Predicted'], colnames=['Actual'],
margins=True)
display(Markdown("***"))
displaybd("Confusion matrix:")
display(confusionDF)
displaybd("Share of correctly predicted market movements:")
print(np.mean(smarket['Direction'] == DirectionHat))
Predicted probabilities for the first observations:
[0.50708413 0.48146788 0.48113883 0.51522236 0.51078116 0.50695646
0.49265087 0.50922916 0.51761353 0.48883778]
Confusion matrix:
Actual | Down | Up | All |
---|---|---|---|
Predicted | |||
Down | 145 | 141 | 286 |
Up | 457 | 507 | 964 |
All | 602 | 648 | 1250 |
Share of correctly predicted market movements:
0.5216
Estimation of Test Error
Here, we’ll first train a model on the data from before 2005, and then test it on the data from 2005.
train = (smarket['Year'] < 2005)
smarket2005 = smarket[~train]
displaybd("Dimensions of the validation set:")
print(smarket2005.shape)
model = smf.glm("DirectionCode~Lag1+Lag2+Lag3+Lag4+Lag5+Volume", data=smarket,
family=sm.families.Binomial(), subset=train)
res = model.fit()
DirectionProbsTest = res.predict(smarket2005)
DirectionTestHat = np.where(DirectionProbsTest > 0.5, "Up", "Down")
displaybd("Share of correctly predicted market movements in 2005:")
print(np.mean(smarket2005['Direction'] == DirectionTestHat))
Dimensions of the validation set:
(252, 10)
Share of correctly predicted market movements in 2005:
0.4801587301587302
Linear Discriminant Analysis
Linear discriminant analysis is a robust classification method that relies on the following assumptions:
- the class conditional distributions are Gaussian
- these Gaussians have the same covariance matrix (i.e., we assume homoskedasticity)
Even without these assumptions, linear discriminant analysis can be used as a form of dimensionality reduction, so it is especially well suited to high-dimensional data. Thus, we would want to use linear discriminant analysis when we want to reduce the number of features (reduce the dimensionality) while preserving the distinction between our classes, as the sketch below illustrates.
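As a quick illustration of this second use, here is a minimal sketch of LDA as a dimensionality reducer. It is not part of the stock-market analysis; it uses scikit-learn’s built-in iris data purely to show the transform step:
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# A small three-class dataset, used here only for illustration
X_demo, y_demo = load_iris(return_X_y=True)

# Project the four original features onto at most (number of classes - 1) = 2
# linear discriminants while preserving the separation between the classes
lda_reducer = LinearDiscriminantAnalysis(n_components=2)
X_demo_reduced = lda_reducer.fit_transform(X_demo, y_demo)

print(X_demo.shape)          # (150, 4)
print(X_demo_reduced.shape)  # (150, 2)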
Custom Output Functions
Before getting started with linear discriminant analysis, we’ll write a couple of our own functions that’ll help display some of our calculations nicely:
def printPriorProbabilities(ldaClasses, ldaPriors):
    priorsDF = pd.DataFrame()
    for cIdx, cName in enumerate(ldaClasses):
        priorsDF[cName] = [ldaPriors[cIdx]]
    displaybd('Prior probabilities of groups:')
    display(Markdown(priorsDF.to_html(index=False)))

def printGroupMeans(ldaClasses, featuresNames, ldaGroupMeans):
    displaybd("Group means:")
    groupMeansDF = pd.DataFrame(index=ldaClasses)
    for fIdx, fName in enumerate(featuresNames):
        groupMeansDF[fName] = ldaGroupMeans[:, fIdx]
    display(groupMeansDF)

def printLDACoeffs(featuresNames, ldaCoeffs):
    coeffDF = pd.DataFrame(index=featuresNames)
    for cIdx in range(ldaCoeffs.shape[0]):
        colName = "LDA" + str(cIdx + 1)
        coeffDF[colName] = ldaCoeffs[cIdx]
    displaybd("Coefficients of linear discriminants:")
    display(coeffDF)
Fitting an LDA Model
Here, we’ll be using scikit-learn’s `LinearDiscriminantAnalysis` class to fit our model:
outcomeName = 'Direction'
featuresNames = ['Lag1', 'Lag2'];
X_train = smarket.loc[train, featuresNames]
y_train = smarket.loc[train, outcomeName]
lda = LinearDiscriminantAnalysis()
ldaFit = lda.fit(X_train, y_train);
printPriorProbabilities(ldaFit.classes_, ldaFit.priors_)
printGroupMeans(ldaFit.classes_, featuresNames, ldaFit.means_)
printLDACoeffs(featuresNames, ldaFit.coef_)
# Coefficients calculated by Python's LDA are different from R's LDA
# But they are proportional:
printLDACoeffs(featuresNames, 11.580267503964166 * ldaFit.coef_)
# See this: https://stats.stackexchange.com/questions/87479/what-are-coefficients-of-linear-discriminants-in-lda
Prior probabilities of groups:
Down | Up |
---|---|
0.491984 | 0.508016 |
Group means:
Lag1 | Lag2 | |
---|---|---|
Down | 0.042790 | 0.033894 |
Up | -0.039546 | -0.031325 |
Coefficients of linear discriminants:
LDA1 | |
---|---|
Lag1 | -0.055441 |
Lag2 | -0.044345 |
Coefficients of linear discriminants:
LDA1 | |
---|---|
Lag1 | -0.642019 |
Lag2 | -0.513529 |
LDA Predictions
X_test = smarket.loc[~train, featuresNames]
y_test = smarket.loc[~train, outcomeName]
y_hat = ldaFit.predict(X_test)
confusionDF = pd.crosstab(y_hat, y_test,
rownames=['Predicted'], colnames=['Actual'],
margins=True)
displaybd("Confusion matrix:")
display(confusionDF)
displaybd("Share of correctly predicted market movements:")
print(np.mean(y_test == y_hat))
Confusion matrix:
Actual | Down | Up | All |
---|---|---|---|
Predicted | |||
Down | 35 | 35 | 70 |
Up | 76 | 106 | 182 |
All | 111 | 141 | 252 |
Share of correctly predicted market movements:
0.5595238095238095
Posterior Probabilities
Here, we’ll estimate posterior propbabilities, using scikit-learn’s predict_proba
function:
pred_p = lda.predict_proba(X_test)
# pred_p is an array of shape (number of observations) x (number of classes)
upNmb = np.sum(pred_p[:, 1] > 0.5)
displaybd("Number of upward movements with threshold 0.5: " + str(upNmb))
upNmb = np.sum(pred_p[:, 1] > 0.9)
displaybd("Number of upward movements with threshold 0.9: " + str(upNmb))
Number of upward movements with threshold 0.5: 182
Number of upward movements with threshold 0.9: 0
Quadratic Discriminant Analysis
Quadratic discriminant analysis is a generalization of linear discriminant analysis as a classifier, but it does not make the same covariance assumption.
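For reference, QDA estimates a separate covariance matrix \(\Sigma_k\) for each class \(k\), which gives the discriminant function
\[
\delta_k(x) = -\tfrac{1}{2}\log\lvert\Sigma_k\rvert - \tfrac{1}{2}(x - \mu_k)^\top \Sigma_k^{-1}(x - \mu_k) + \log\pi_k,
\]
and each observation is assigned to the class with the largest \(\delta_k(x)\). Because the quadratic term in \(x\) no longer cancels across classes, the decision boundaries are quadratic rather than linear; if all the \(\Sigma_k\) were equal, we would recover LDA.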
Fitting a QDA Model
Here, we’ll be using scikit-learn’s `QuadraticDiscriminantAnalysis` class to fit our model:
qda = QuadraticDiscriminantAnalysis()
qdaFit = qda.fit(X_train, y_train);
printPriorProbabilities(qdaFit.classes_, qdaFit.priors_)
printGroupMeans(qdaFit.classes_, featuresNames, qdaFit.means_)
Prior probabilities of groups:
Down | Up |
---|---|
0.491984 | 0.508016 |
Group means:
Lag1 | Lag2 | |
---|---|---|
Down | 0.042790 | 0.033894 |
Up | -0.039546 | -0.031325 |
QDA Predictions
y_hat = qdaFit.predict(X_test)
confusionDF = pd.crosstab(y_hat, y_test,
rownames=['Predicted'], colnames=['Actual'],
margins=True)
displaybd("Confusion matrix:")
display(confusionDF)
displaybd("Share of correctly predicted market movements:")
print(np.mean(y_test == y_hat))
Confusion matrix:
Actual | Down | Up | All |
---|---|---|---|
Predicted | |||
Down | 30 | 20 | 50 |
Up | 81 | 121 | 202 |
All | 111 | 141 | 252 |
Share of correctly predicted market movements:
0.5992063492063492
k-Nearest Neighbors
Here, we’ll be looking at k-nearest neighbors, which we talked about in lecture 02 of this course. Tutorial 02 was also on k-nearest neighbors classification, so please refer to that tutorial for additional examples and explanations.
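Below, we fit models with one neighbor and with three neighbors. If you would like to compare more values of \(k\) at once, a small loop works; the sketch below simply reuses the X_train, y_train, X_test, and y_test objects defined in the LDA section above:
from sklearn import neighbors

# Compare test accuracy for a few illustrative choices of k
for k in [1, 3, 5, 10]:
    knn = neighbors.KNeighborsClassifier(n_neighbors=k)
    y_hat = knn.fit(X_train, y_train).predict(X_test)
    print("k =", k, "accuracy:", np.mean(y_test == y_hat))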
One Neighbor
knn = neighbors.KNeighborsClassifier(n_neighbors=1)
y_hat = knn.fit(X_train, y_train).predict(X_test)
confusionDF = pd.crosstab(y_hat, y_test,
rownames=['Predicted'], colnames=['Actual'],
margins=True)
displaybd("Confusion matrix:")
display(confusionDF)
displaybd("Share of correctly predicted market movements:")
print(np.mean(y_test == y_hat))
Confusion matrix:
Actual | Down | Up | All |
---|---|---|---|
Predicted | |||
Down | 43 | 58 | 101 |
Up | 68 | 83 | 151 |
All | 111 | 141 | 252 |
Share of correctly predicted market movements:
0.5
Three Neighbors
knn = neighbors.KNeighborsClassifier(n_neighbors=3)
y_hat = knn.fit(X_train, y_train).predict(X_test)
confusionDF = pd.crosstab(y_hat, y_test,
rownames=['Predicted'], colnames=['Actual'],
margins=True)
displaybd("Confusion matrix:")
display(confusionDF)
displaybd("Share of correctly predicted market movements:")
print(np.mean(y_test == y_hat))
Confusion matrix:
Actual | Down | Up | All |
---|---|---|---|
Predicted | |||
Down | 48 | 55 | 103 |
Up | 63 | 86 | 149 |
All | 111 | 141 | 252 |
Share of correctly predicted market movements:
0.5317460317460317
An Application to Caravan Insurance Data
This section will demonstrate the use of two techniques we learned above, KNN and logit.
A New Dataset
We’ll be using a new dataset that contains information on customers of an insurance company. You can see a detailed description of this dataset here.
Loading Our Dataset
caravan = pd.read_csv('Caravan.csv', index_col=0)
display(caravan.describe())
display(caravan.describe(include=[object]))
MOSTYPE | MAANTHUI | MGEMOMV | MGEMLEEF | MOSHOOFD | MGODRK | MGODPR | MGODOV | MGODGE | MRELGE | ... | ALEVEN | APERSONG | AGEZONG | AWAOREG | ABRAND | AZEILPL | APLEZIER | AFIETS | AINBOED | ABYSTAND | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 5822.000000 | 5822.000000 | 5822.000000 | 5822.000000 | 5822.000000 | 5822.000000 | 5822.000000 | 5822.000000 | 5822.000000 | 5822.000000 | ... | 5822.000000 | 5822.000000 | 5822.000000 | 5822.000000 | 5822.000000 | 5822.000000 | 5822.000000 | 5822.000000 | 5822.000000 | 5822.000000 |
mean | 24.253349 | 1.110615 | 2.678805 | 2.991240 | 5.773617 | 0.696496 | 4.626932 | 1.069907 | 3.258502 | 6.183442 | ... | 0.076606 | 0.005325 | 0.006527 | 0.004638 | 0.570079 | 0.000515 | 0.006012 | 0.031776 | 0.007901 | 0.014256 |
std | 12.846706 | 0.405842 | 0.789835 | 0.814589 | 2.856760 | 1.003234 | 1.715843 | 1.017503 | 1.597647 | 1.909482 | ... | 0.377569 | 0.072782 | 0.080532 | 0.077403 | 0.562058 | 0.022696 | 0.081632 | 0.210986 | 0.090463 | 0.119996 |
min | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 10.000000 | 1.000000 | 2.000000 | 2.000000 | 3.000000 | 0.000000 | 4.000000 | 0.000000 | 2.000000 | 5.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
50% | 30.000000 | 1.000000 | 3.000000 | 3.000000 | 7.000000 | 0.000000 | 5.000000 | 1.000000 | 3.000000 | 6.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
75% | 35.000000 | 1.000000 | 3.000000 | 3.000000 | 8.000000 | 1.000000 | 6.000000 | 2.000000 | 4.000000 | 7.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
max | 41.000000 | 10.000000 | 5.000000 | 6.000000 | 10.000000 | 9.000000 | 9.000000 | 5.000000 | 9.000000 | 9.000000 | ... | 8.000000 | 1.000000 | 1.000000 | 2.000000 | 7.000000 | 1.000000 | 2.000000 | 3.000000 | 2.000000 | 2.000000 |
8 rows × 85 columns
Purchase | |
---|---|
count | 5822 |
unique | 2 |
top | No |
freq | 5474 |
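Note that only 348 of the 5,822 customers (about 6%) purchased caravan insurance, so a classifier that always predicts No would already be right about 94% of the time. Keep this imbalance in mind when interpreting the accuracy numbers below.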
Standardizing Our Data
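Because KNN classifies an observation according to its distance from other observations, variables measured on large scales would dominate variables measured on small scales. To put all of the predictors on a comparable footing, we standardize each column to have mean zero and standard deviation one using preprocessing.scale: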
y = caravan.Purchase
X = caravan.drop('Purchase', axis=1).astype('float64')
X_scaled = preprocessing.scale(X)
Splitting Data into Train and Test Data
X_train = X_scaled[1000:,:]
y_train = y[1000:]
X_test = X_scaled[:1000,:]
y_test = y[:1000]
Using KNN for Prediction
knn = neighbors.KNeighborsClassifier(n_neighbors=1)
y_hat = knn.fit(X_train, y_train).predict(X_test)
confusionDF = pd.crosstab(y_hat, y_test,
rownames=['Predicted'], colnames=['Actual'],
margins=True)
displaybd("Confusion matrix:")
display(confusionDF)
displaybd("Share of correctly predicted purchases:")
print(np.mean(y_test == y_hat))
Confusion matrix:
Actual | No | Yes | All |
---|---|---|---|
Predicted | |||
No | 873 | 50 | 923 |
Yes | 68 | 9 | 77 |
All | 941 | 59 | 1000 |
Share of correctly predicted purchases:
0.882
Logit
X_train_w_constant = sm.add_constant(X_train)
X_test_w_constant = sm.add_constant(X_test, has_constant='add')
y_train_code = np.where(y_train == "No", 0, 1)
res = sm.GLM(y_train_code, X_train_w_constant, family=sm.families.Binomial()).fit()
y_hat_code = res.predict(X_test_w_constant)
PurchaseHat = np.where(y_hat_code > 0.25, "Yes", "No")
confusionDF = pd.crosstab(PurchaseHat, y_test,
rownames=['Predicted'], colnames=['Actual'],
margins=True)
displaybd("Confusion matrix:")
display(confusionDF)
Confusion matrix:
Actual | No | Yes | All |
---|---|---|---|
Predicted | |||
No | 919 | 48 | 967 |
Yes | 22 | 11 | 33 |
All | 941 | 59 | 1000 |
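Here we used a probability cutoff of 0.25 rather than 0.5: because so few customers actually buy insurance, very few predicted probabilities exceed 0.5, and lowering the threshold lets the model flag a meaningful group of likely buyers. Among the 33 customers predicted to purchase, 11 (about 33%) actually do, which is much better than the roughly 6% success rate we would expect from soliciting customers at random. For instance, you can compute this success rate directly from the objects defined above:
displaybd("Share of predicted buyers who actually purchase:")
print(np.mean(y_test[PurchaseHat == "Yes"] == "Yes"))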
More Iris Classification
Here, we will apply some of the new techniques we learned above to the iris classification problem we explored using k-nearest neighbors in Tutorial 02.
Our Dataset
I’ve included some of the important descriptions from Tutorial 02 in this tutorial as well, but please review tutorial 02 for more details on how we initially set up and process our dataset.
As a reminder, we are using the iris data set from the University of California, Irvine and are attempting to classify types of irises using the following four attributes:
- Sepal length
- Sepal width
- Petal length
- Petal width
There are three types of irises:
- Iris Setosa
- Iris Versicolor
- Iris Virginica
Importing Our Dataset
Let’s import the data set as a `pandas` dataframe:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'type']
iris_df = pd.read_csv(url, names=names)
Splitting Data into Train and Test Data
Let’s first define our \(X\) and \(y\) variables:
X = iris_df.iloc[:, :-1]  # attributes; iloc[:, :-1] selects every column except the last
y = iris_df['type']       # labels
Now, let’s split our data into 80% training data and 20% testing data. We can do this using `train_test_split` and its `train_size` parameter:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.80)
Feature Scaling
Now, we want to perform some feature scaling to normalize the range of our independent variables.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
Logit
Let’s first take a look at how we might apply logit to our iris classification problem. You could apply logit using the techniques we learned above (using GLM), but I will show you one other method we can employ using scikit-learn’s `LogisticRegression` class, since we can consider logit and logistic regression to be the same thing.
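One practical difference to be aware of: by default, scikit-learn’s LogisticRegression applies L2 regularization to the coefficients (controlled by its penalty and C parameters), whereas the statsmodels GLM fit we used earlier is unpenalized, so the two approaches can produce slightly different coefficient estimates.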
Fitting Our Model
Let’s import the `LogisticRegression` class and fit our model as follows:
from sklearn.linear_model import LogisticRegression
logit_model = LogisticRegression()
logit_model.fit(X_train, y_train)
Making Predictions
Then, we’ll make some predictions and store them in a variable called `y_pred`:
y_pred = logit_model.predict(X_test)
Evaluating Our Predictions
Like we did in Tutorial 02, let’s make a classification report and confusion matrix.
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
precision | recall | f1-score | support | |
---|---|---|---|---|
Iris-setosa | 1.00 | 1.00 | 1.00 | 9 |
Iris-versicolor | 1.00 | 0.70 | 0.82 | 10 |
Iris-virginica | 0.79 | 1.00 | 0.88 | 11 |
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
cm_df = pd.DataFrame(cm,
index = ['setosa','versicolor','virginica'],
columns = ['setosa','versicolor','virginica'])
sns.heatmap(cm_df, annot=True)
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()
Using this heat map, we can make the following observations:
- All setosa flowers were correctly classified by our model.
- Seven versicolor flowers were correctly classified, and three versicolor flowers were incorrectly classified as virginica flowers.
- All virginica flowers were correctly classified by our model.
Again, you may not get the same exact classification report or confusion matrix, but this is normal, as your results will vary each time you run your model.
Linear Discriminant Analysis
Let’s now try using linear discriminant analysis for our classification.
Fitting Our Model
Again, let’s use the `LinearDiscriminantAnalysis` class to fit our model:
lda_model = LinearDiscriminantAnalysis()
lda_model.fit(X_train, y_train)
Making Predictions
Then, we’ll make some predictions and store them in a variable called `y_pred`:
y_pred = lda_model.predict(X_test)
Evaluating Our Predictions
Like we did in Tutorial 02, let’s make a classification report and confusion matrix. If you want, you can also use the functions `printPriorProbabilities()`, `printGroupMeans()`, and `printLDACoeffs()` that we wrote earlier, but here I’ll keep it simple and just look at our classification report and heatmap as we did above.
print(classification_report(y_test, y_pred))
precision | recall | f1-score | support | |
---|---|---|---|---|
Iris-setosa | 1.00 | 1.00 | 1.00 | 9 |
Iris-versicolor | 1.00 | 1.00 | 1.00 | 10 |
Iris-virginica | 1.00 | 1.00 | 1.00 | 11 |
In this case, we can see our model did very well. Let’s also take a look at the heatmap to see that a little bit more easily:
cm = confusion_matrix(y_test, y_pred)
cm_df = pd.DataFrame(cm,
index = ['setosa','versicolor','virginica'],
columns = ['setosa','versicolor','virginica'])
sns.heatmap(cm_df, annot=True)
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()
Using this heat map, we can make the following observations:
- All setosa flowers were correctly classified by our model.
- All versicolors were correctly classified by our model.
- All virginica flowers were correctly classified by our model.
Again, you may not get the same exact classification report or confusion matrix, but this is normal, as your results will vary each time you run your model.
Quadratic Discriminant Analysis
Let’s now try using quadratic discriminant analysis for our classification.
Fitting Our Model
Again, let’s fit our model, this time using the `QuadraticDiscriminantAnalysis` class:
qda_model = QuadraticDiscriminantAnalysis()
qda_model.fit(X_train, y_train)
Making Predictions
Then, we’ll make some predictions and store them in a variable called `y_pred`:
y_pred = qda_model.predict(X_test)
Evaluating Our Predictions
Like we did in Tutorial 02, let’s make a classification report and confusion matrix. If you want, you can also use the functions `printPriorProbabilities()` and `printGroupMeans()` that we wrote earlier, but here I’ll keep it simple and just look at our classification report and heatmap as we did above.
print(classification_report(y_test, y_pred))
precision | recall | f1-score | support | |
---|---|---|---|---|
Iris-setosa | 1.00 | 1.00 | 1.00 | 9 |
Iris-versicolor | 1.00 | 0.80 | 0.89 | 10 |
Iris-virginica | 0.85 | 1.00 | 0.92 | 11 |
cm = confusion_matrix(y_test, y_pred)
cm_df = pd.DataFrame(cm,
index = ['setosa','versicolor','virginica'],
columns = ['setosa','versicolor','virginica'])
sns.heatmap(cm_df, annot=True)
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()
Using this heat map, we can make the following observations:
- All setosa flowers were correctly classified by our model.
- Eight versicolor flowers were correctly classified, and two versicolor flowers were incorrectly classified as virginica flowers.
- All virginica flowers were correctly classified by our model.
Again, you may not get the same exact classification report or confusion matrix, but this is normal, as your results will vary each time you run your model.