k-Nearest Neighbors Classification

Table of contents

  1. Importing Libraries
  2. Importing Our Data Set
  3. Preprocessing
    1. Defining Attributes and Labels
    2. Splitting Training Data and Testing Data
    3. Feature Scaling
  4. Fitting Data and Predicting Data
  5. Evaluating our Predictions
    1. Generating a Classification Report
    2. Visualizing our Predictions

You can view the code for this tutorial here.

Importing Libraries

As usual, we will want to use numpy, pandas, matplotlib, and seaborn to help us manipulate and visualize our data.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Additionally, we will want to use scikit-learn to apply the KNN algorithm to our data set. This is the beauty of libraries such as scikit-learn: we do not have to worry about the details of the algorithm’s implementation and can simply use the functions the library provides. Later in this tutorial, we’ll import functions from scikit-learn as we need them.

We’ll be able to use scikit-learn to help us with both our classification problems and our regression problems.
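For instance, the same scikit-learn module that houses the KNN classifier also provides a KNN regressor. A minimal sketch, shown only for orientation and not used in this tutorial:

# Not used in this tutorial; shown only to illustrate that scikit-learn
# also provides a KNN regressor alongside the classifier we use below.
from sklearn.neighbors import KNeighborsRegressor

regressor = KNeighborsRegressor(n_neighbors=5)  # takes the same n_neighbors parameter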

Importing Our Data Set

The iris data set from the University of California, Irvine is often used for examples of KNN.

In this tutorial, we will be attempting to classify types of irises using the following four attributes:

  1. Sepal length
  2. Sepal width
  3. Petal length
  4. Petal width

There are three types of irises:

  1. Iris Setosa
  2. Iris Versicolor
  3. Iris Virginica

Let’s import the data set as a pandas dataframe:

df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data")

Let’s take a quick look at our data set using df.head(), which shows us the first five rows:

   5.1  3.5  1.4  0.2  Iris-setosa
0  4.9  3.0  1.4  0.2  Iris-setosa
1  4.7  3.2  1.3  0.2  Iris-setosa
2  4.6  3.1  1.5  0.2  Iris-setosa
3  5.0  3.6  1.4  0.2  Iris-setosa
4  5.4  3.9  1.7  0.4  Iris-setosa

Preprocessing

Notice that our data set does not have proper column names: the first row of data was read in as the header. We need to add the column names ourselves so that our dataframe is easier to read and work with. Let’s import our data set one more time, this time passing the names parameter to read_csv:

df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'type'])

To make our code a bit easier to read, let’s write the following instead:

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'type']
df = pd.read_csv(url, names=names)

Now, let’s take a look at our dataframe using df.head() again:

   sepal-length  sepal-width  petal-length  petal-width         type
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa
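
As an optional sanity check, we can also confirm how the rows are distributed across the three classes using value_counts:

print(df['type'].value_counts())  # number of rows per iris type

Each of the three iris types appears 50 times, for 150 rows in total.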

Defining Attributes and Labels

Now, we need to split our data set into its attributes and labels, using iloc:

X = df.iloc[:, :-1]  # attributes: every column except the last
y = df['type']       # labels: the iris type
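
As a quick sanity check, X should contain one column per attribute and y should contain one label per row:

print(X.shape)  # (150, 4): 150 samples, 4 attributes
print(y.shape)  # (150,): one label per sample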

Splitting Training Data and Testing Data

Let’s split our data into 80% training data and 20% testing data. We can do this using train_test_split and its train_size parameter:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.80)
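
Note that train_test_split shuffles the data randomly, so each run produces a different split. If you want a reproducible split while following along, you can pass a fixed random_state (the value 42 below is an arbitrary choice):

# Fixing random_state makes the split reproducible across runs
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.80, random_state=42)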

Feature Scaling

Now, we want to perform some feature scaling to normalize the range of our independent variables. This matters for KNN in particular: the algorithm classifies points based on distances, so without scaling, attributes with larger numeric ranges would dominate the distance calculation.

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
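
StandardScaler standardizes each feature to zero mean and unit variance. Importantly, it computes its statistics from the training data only, which is why we fit on X_train but transform both sets. Roughly, it is equivalent to this manual sketch (applied to the raw, unscaled splits):

# What StandardScaler does, roughly, starting from the raw (unscaled) splits:
# subtract the per-feature training mean and divide by the training standard deviation
means = X_train.mean(axis=0)
stds = X_train.std(axis=0)
X_train_scaled = (X_train - means) / stds
X_test_scaled = (X_test - means) / stds  # the test set uses the *training* statistics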

Fitting Data and Predicting Data

Now, we’re finally ready to import the KNN classifier algorithm from scikit-learn:

from sklearn.neighbors import KNeighborsClassifier

Now, we need to choose a K value for our classifier. As we learned in class, there are pros and cons to choosing a higher or lower K value. For now, let’s start out with 5, as this is a common initial value to work with:

classifier = KNeighborsClassifier(n_neighbors=5)
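
If you later want to be more systematic about choosing K, one common (if quick-and-dirty) approach is to fit a classifier for a range of K values and compare their error rates on the test set. A minimal sketch:

# Optional: compare test-set error rates across a range of K values
error_rates = []
for k in range(1, 26):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    pred_k = knn.predict(X_test)
    error_rates.append(np.mean(pred_k != y_test))  # fraction misclassified

Plotting error_rates against K (with plt.plot, for example) makes the trade-off easy to see.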

Finally, we can fit our model using our training data, and then make our first predictions using this model:

classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

Our model can now take a set of attributes (sepal-length, sepal-width, petal-length, and petal-width) and predict which type of iris they describe. Here, we are making predictions for our test data, X_test.
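
The same model can classify a single new measurement as well. Note that any new input must be scaled with the same scaler we fit earlier; the measurements below are made-up values, purely for illustration:

# Classifying one hypothetical flower (invented measurements, for illustration only)
new_flower = [[5.0, 3.4, 1.5, 0.2]]  # sepal-length, sepal-width, petal-length, petal-width
print(classifier.predict(scaler.transform(new_flower)))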

Evaluating our Predictions

We do not have too many data points, so we can first just compare our predictions with our test data visually:

print("predicted: ", y_pred)
print("actual: ", y_test)
Predicted          Actual
Iris-versicolor    Iris-versicolor
Iris-versicolor    Iris-versicolor
Iris-virginica     Iris-virginica
Iris-versicolor    Iris-virginica
Iris-virginica     Iris-virginica
Iris-versicolor    Iris-virginica
Iris-versicolor    Iris-versicolor
Iris-virginica     Iris-virginica
Iris-setosa        Iris-setosa
Iris-setosa        Iris-setosa
Iris-versicolor    Iris-versicolor
Iris-setosa        Iris-setosa
Iris-virginica     Iris-virginica
Iris-virginica     Iris-virginica
Iris-setosa        Iris-setosa
Iris-versicolor    Iris-versicolor
Iris-virginica     Iris-virginica
Iris-versicolor    Iris-versicolor
Iris-virginica     Iris-virginica
Iris-versicolor    Iris-versicolor
Iris-setosa        Iris-setosa
Iris-virginica     Iris-virginica
Iris-setosa        Iris-setosa
Iris-setosa        Iris-setosa
Iris-setosa        Iris-setosa
Iris-setosa        Iris-setosa
Iris-setosa        Iris-setosa
Iris-setosa        Iris-setosa
Iris-virginica     Iris-virginica
Iris-setosa        Iris-setosa

Because the way we split our data will be different each time, you may get different results, but the above table is shown just to give you an idea of how well our classification algorithm works.
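
A quick numerical summary of the table above is the overall accuracy, the fraction of test samples classified correctly:

from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))  # fraction of correct predictions on the test set

For the split shown above, 28 of the 30 predictions are correct, an accuracy of about 0.93.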

Generating a Classification Report

We would still like to evaluate our algorithm numerically rather than just visually; for larger data sets especially, scanning a table like the one above becomes impractical. Let’s generate a classification report using the following code:

from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

We then get the following classification report:

                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        12
Iris-versicolor       0.78      1.00      0.88         7
 Iris-virginica       1.00      0.82      0.90        11
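
Precision is the fraction of predictions of a class that were correct, while recall is the fraction of actual members of a class that were found; support is simply the number of test samples in each class. For example, our model predicted Iris-versicolor nine times (the seven true versicolors plus the two misclassified virginicas), and seven of those predictions were correct, giving a precision of 7/9 ≈ 0.78. Likewise, it found 9 of the 11 true virginicas, for a recall of 9/11 ≈ 0.82.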

Visualizing our Predictions

I won’t go through the details of the following code, but at a high level it computes the confusion matrix for our predictions and renders it as a heat map.

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
cm_df = pd.DataFrame(cm,
                     index = ['setosa','versicolor','virginica'], 
                     columns = ['setosa','versicolor','virginica'])

sns.heatmap(cm_df, annot=True)
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()

Using the above code, we get the following heat map:

[Heat map of the confusion matrix, with actual types on the y-axis and predicted types on the x-axis]

Using this heat map, we can make the following observations:

  1. All setosa flowers were correctly classified by our model.
  2. All versicolor flowers were correctly classified by our model.
  3. Nine virginica flowers were correctly classified, and two virginica flowers were incorrectly classified as versicolor flowers.

Again, your results will vary slightly depending on how your training and test data were split.