k-Nearest Neighbors Classification
Table of contents
- Importing Libraries
- Importing Our Data Set
- Preprocessing
- Fitting Data and Predicting Data
- Evaluating our Predictions
You can view the code for this tutorial here.
Importing Libraries
As usual, we will want to use `numpy`, `pandas`, `matplotlib`, and `seaborn` to help us manipulate and visualize our data.
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```
Additionally, we will want to use `scikit-learn` to apply the KNN algorithm to our data set. This is the beauty of libraries such as `scikit-learn`: we do not have to worry about the details of the algorithm’s implementation and can simply use the functions the library provides. Later in this tutorial, we’ll import functions from `scikit-learn` as we need them.

We’ll be able to use `scikit-learn` to help us with both our classification problems and our regression problems.
Importing Our Data Set
The iris data set from the University of California, Irvine (UCI) Machine Learning Repository is often used in examples of KNN. In this tutorial, we will be attempting to classify types of irises using the following four attributes:
- Sepal length
- Sepal width
- Petal length
- Petal width
There are three types of irises:
- Iris Setosa
- Iris Versicolor
- Iris Virginica
Let’s import the data set as a `pandas` dataframe:

```python
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data")
```
Let’s take a quick look at what our data set looks like using `df.head()`, which shows us the first five rows:
|   | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| --- | --- | --- | --- | --- | --- |
| 0 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 2 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 3 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| 4 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
Preprocessing
Notice that our data set does not have proper column names; `read_csv` treated the first row of data as the header. We need to add the column names ourselves so that our dataframe is easier to read and work with. Let’s try importing our data set one more time, using the `names` parameter of `read_csv`:

```python
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'type'])
```
To make our code a bit easier to read, let’s write the following instead:
```python
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'type']
df = pd.read_csv(url, names=names)
```
Now, let’s take a look at our dataframe using `df.head()` again:
|   | sepal-length | sepal-width | petal-length | petal-width | type |
| --- | --- | --- | --- | --- | --- |
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
Defining Attributes and Labels
Now, we need to split our data set into its attributes and labels, using `iloc`:

```python
X = df.iloc[:, :-1]  # attributes: all rows, every column except the last
y = df['type']       # labels
```
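As a quick sanity check, we can confirm the shapes of `X` and `y`. The iris data set has 150 rows, so we should see something like the following:

```python
# 150 rows: 4 attribute columns in X, one label per row in y
print(X.shape)  # (150, 4)
print(y.shape)  # (150,)
```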
Splitting Training Data and Testing Data
Let’s split our data into 80% training data and 20% testing data. We can do this using `train_test_split` and its `train_size` parameter:

```python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.80)
```
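Note that `train_test_split` shuffles the data randomly, so every run produces a different split. If you want a reproducible split while following along, you can pass the optional `random_state` parameter; the seed value below is arbitrary:

```python
# Optional: fix the random seed so the split is identical on every run
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.80, random_state=42)
```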
Feature Scaling
Now, we want to perform some feature scaling to normalize the range of our independent variables. Notice that we fit the scaler on the training data only and then use it to transform both sets; this keeps information about the test set from leaking into our preprocessing.
```python
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
```
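`StandardScaler` standardizes each feature as z = (x − mean) / standard deviation, using statistics learned from the training data. If you’re curious, you can inspect those statistics through the `mean_` and `scale_` attributes of scikit-learn’s `StandardScaler`:

```python
# One mean and one standard deviation per feature, learned from X_train
print(scaler.mean_)   # per-feature means
print(scaler.scale_)  # per-feature standard deviations
```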
Fitting Data and Predicting Data
Now, we’re finally ready to import the KNN classifier algorithm from `scikit-learn`:
```python
from sklearn.neighbors import KNeighborsClassifier
```
Now, we need to choose a K value for our classifier. As we learned in class, there are pros and cons to choosing a higher or lower K value. For now, let’s start out with 5, as this is a common initial value to work with (at the end of this tutorial, we’ll sketch one way to compare different values of K):
```python
classifier = KNeighborsClassifier(n_neighbors=5)
```
Finally, we can fit our model using our training data, and then make our first predictions using this model:
```python
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
```
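As a quick first check, the classifier’s built-in `score` method reports the fraction of test samples that were predicted correctly:

```python
# Mean accuracy on the held-out test set
print(classifier.score(X_test, y_test))
```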
Now, our model can take attributes (sepal-length, sepal-width, petal-length, and petal-width) and predict which type of iris it is. We are doing this using our test data, `X_test`.
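We can also classify a single new flower. The measurements below are made up for illustration; note that any new input must be transformed with the same scaler before we predict on it:

```python
# Hypothetical measurements: sepal-length, sepal-width, petal-length, petal-width
sample = [[5.0, 3.4, 1.5, 0.2]]
print(classifier.predict(scaler.transform(sample)))
```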
Evaluating our Predictions
We do not have too many data points, so we can first just compare our predictions with our test data visually:
print("predicted: ", y_pred)
print("actual: ", y_test)
| Predicted | Actual |
| --- | --- |
Iris-versicolor | Iris-versicolor |
Iris-versicolor | Iris-versicolor |
Iris-virginica | Iris-virginica |
Iris-versicolor | Iris-virginica |
Iris-virginica | Iris-virginica |
Iris-versicolor | Iris-virginica |
Iris-versicolor | Iris-versicolor |
Iris-virginica | Iris-virginica |
Iris-setosa | Iris-setosa |
Iris-setosa | Iris-setosa |
Iris-versicolor | Iris-versicolor |
Iris-setosa | Iris-setosa |
Iris-virginica | Iris-virginica |
Iris-virginica | Iris-virginica |
Iris-setosa | Iris-setosa |
Iris-versicolor | Iris-versicolor |
Iris-virginica | Iris-virginica |
Iris-versicolor | Iris-versicolor |
Iris-virginica | Iris-virginica |
Iris-versicolor | Iris-versicolor |
Iris-setosa | Iris-setosa |
Iris-virginica | Iris-virginica |
Iris-setosa | Iris-setosa |
Iris-setosa | Iris-setosa |
Iris-setosa | Iris-setosa |
Iris-setosa | Iris-setosa |
Iris-setosa | Iris-setosa |
Iris-setosa | Iris-setosa |
Iris-virginica | Iris-virginica |
Iris-setosa | Iris-setosa |
Because the way we split our data will be different each time (unless you fixed `random_state` above), you may get different results; the above table is shown just to give you an idea of how well our classification algorithm works.
Generating a Classification Report
We still would like to evaluate our algorithm numerically, rather than just visually; for larger data sets especially, reading through a table like the one above becomes impractical. Let’s generate a classification report using the following code (we’ll visualize the confusion matrix in the next section):
```python
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
```
We then get the following classification report:
|   | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| Iris-setosa | 1.00 | 1.00 | 1.00 | 12 |
| Iris-versicolor | 0.78 | 1.00 | 0.88 | 7 |
| Iris-virginica | 1.00 | 0.82 | 0.90 | 11 |
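To interpret these numbers: precision is the fraction of predictions for a class that were correct, and recall is the fraction of actual members of a class that were found. In the run above, 9 flowers were predicted to be Iris-versicolor but only 7 actually were, giving a precision of 7/9 ≈ 0.78, while all 7 true versicolor flowers were found, giving a recall of 7/7 = 1.00. The f1-score is the harmonic mean of precision and recall, and support is the number of test samples in each class.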
Visualizing our Predictions
I won’t go through the details of the following code, but you should understand that it produces a heat map of the confusion matrix:
```python
from sklearn.metrics import confusion_matrix

# Rows of the confusion matrix are actual classes; columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
cm_df = pd.DataFrame(cm,
                     index = ['setosa','versicolor','virginica'],
                     columns = ['setosa','versicolor','virginica'])

# Draw the matrix as an annotated heat map
sns.heatmap(cm_df, annot=True)
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()
```
Using the above code, we get the following heat map:
Using this heat map, we can make the following observations:
- All setosa flowers were correctly classified by our model.
- All versicolor flowers were correctly classified by our model.
- Nine virginica flowers were correctly classified, and two virginica flowers were incorrectly classified as versicolor flowers.
Again, your results will vary slightly depending on how you split your training and test data.
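As a possible next step, you can examine how the choice of K affects performance by training a classifier for several values of K and plotting the error rate on the test set. The following is a rough sketch (using the variables defined earlier, with an arbitrary range of K from 1 to 25):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

k_values = range(1, 26)  # arbitrary range of K values to try
error_rates = []

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    pred = knn.predict(X_test)
    error_rates.append(np.mean(pred != y_test))  # fraction misclassified

plt.plot(k_values, error_rates, marker='o')
plt.xlabel('K')
plt.ylabel('Test error rate')
plt.show()
```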