k-Nearest Neighbors Classification
Table of contents
- Importing Libraries
- Importing Our Data Set
- Preprocessing
- Fitting Data and Predicting Data
- Evaluating our Predictions
You can view the code for this tutorial here.
Importing Libraries
As usual, we will want to use `numpy`, `pandas`, `matplotlib`, and `seaborn` to help us manipulate and visualize our data.
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```
Additionally, we will want to use `scikit-learn` to apply the KNN algorithm to our data set. This is the beauty of libraries such as `scikit-learn`: we do not have to worry about the details of the algorithm’s implementation and can simply use the functions the library provides. Later in this tutorial, we’ll import functions from `scikit-learn` as we need them.

We’ll be able to use `scikit-learn` to help us with both our classification problems and our regression problems.
Importing Our Data Set
The iris data set from the University of California, Irvine (UCI) Machine Learning Repository is often used in examples of KNN. In this tutorial, we will be attempting to classify types of irises using the following four attributes:
- Sepal length
- Sepal width
- Petal length
- Petal width
There are three types of irises:
- Iris Setosa
- Iris Versicolor
- Iris Virginica
Let’s import the data set as a `pandas` dataframe:

```python
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data")
```
Let’s take a quick look at what our data set looks like using `df.head()`, which shows us the first five rows:
|   | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| --- | --- | --- | --- | --- | --- |
| 0 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 2 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 3 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| 4 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
Preprocessing
Notice that our data set does not have proper column names; `read_csv` treated the first row of data as the header. We need to add the column names ourselves so that our dataframe is easier to read and work with. Let’s try importing our data set one more time, using the `names` parameter of `read_csv`:

```python
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'type'])
```
To make our code a bit easier to read, let’s write the following instead:
```python
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'type']
df = pd.read_csv(url, names=names)
```
Now, let’s take a look at our dataframe using `df.head()` again:
|   | sepal-length | sepal-width | petal-length | petal-width | type |
| --- | --- | --- | --- | --- | --- |
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
Defining Attributes and Labels
Now, we need to split our data set into its attributes and labels, using `iloc`:

```python
X = df.iloc[:, :-1]  # attributes: all rows, every column except the last
y = df['type']       # labels
```
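As a quick sanity check, we can confirm the shapes of `X` and `y`. The iris data set has 150 rows, so we should see something like the following:

```python
# 150 rows: 4 attribute columns in X, one label per row in y
print(X.shape)  # (150, 4)
print(y.shape)  # (150,)
```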
Splitting Training Data and Testing Data
Let’s split our data into 80% training data and 20% testing data. We can do this using `train_test_split` and its `train_size` parameter:

```python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.80)
```
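Note that `train_test_split` shuffles the data randomly, so every run produces a different split. If you want a reproducible split while following along, you can pass the optional `random_state` parameter; the seed value below is arbitrary:

```python
# Optional: fix the random seed so the split is identical on every run
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.80, random_state=42)
```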
Feature Scaling
Now, we want to perform some feature scaling to normalize the range of our independent variables. Notice that we fit the scaler on the training data only and then use it to transform both sets; this keeps information about the test set from leaking into our preprocessing.
```python
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
```
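`StandardScaler` standardizes each feature as z = (x − mean) / standard deviation, using statistics learned from the training data. If you’re curious, you can inspect those statistics through the `mean_` and `scale_` attributes of scikit-learn’s `StandardScaler`:

```python
# One mean and one standard deviation per feature, learned from X_train
print(scaler.mean_)   # per-feature means
print(scaler.scale_)  # per-feature standard deviations
```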
Fitting Data and Predicting Data
Now, we’re finally ready to import the KNN classifier algorithm from `scikit-learn`:
```python
from sklearn.neighbors import KNeighborsClassifier
```
Now, we need to choose a K value for our classifier. As we learned in class, there are pros and cons to choosing a higher or lower K value. For now, let’s start out with 5, as this is a common initial value to work with (at the end of this tutorial, we’ll sketch one way to compare different values of K):
```python
classifier = KNeighborsClassifier(n_neighbors=5)
```
Finally, we can fit our model using our training data, and then make our first predictions using this model:
```python
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
```
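As a quick first check, the classifier’s built-in `score` method reports the fraction of test samples that were predicted correctly:

```python
# Mean accuracy on the held-out test set
print(classifier.score(X_test, y_test))
```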
Now, our model can take attributes (sepal-length, sepal-width, petal-length, and petal-width) and predict which type of iris it is. We are doing this using our test data, `X_test`.
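We can also classify a single new flower. The measurements below are made up for illustration; note that any new input must be transformed with the same scaler before we predict on it:

```python
# Hypothetical measurements: sepal-length, sepal-width, petal-length, petal-width
sample = [[5.0, 3.4, 1.5, 0.2]]
print(classifier.predict(scaler.transform(sample)))
```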
Evaluating our Predictions
We do not have too many data points, so we can first just compare our predictions with our test data visually:
print("predicted: ", y_pred)
print("actual: ", y_test)
| Predicted | Actual |
| --- | --- |
Iris-versicolor | Iris-versicolor |
Iris-versicolor | Iris-versicolor |
Iris-virginica | Iris-virginica |
Iris-versicolor | Iris-virginica |
Iris-virginica | Iris-virginica |
Iris-versicolor | Iris-virginica |
Iris-versicolor | Iris-versicolor |
Iris-virginica | Iris-virginica |
Iris-setosa | Iris-setosa |
Iris-setosa | Iris-setosa |
Iris-versicolor | Iris-versicolor |
Iris-setosa | Iris-setosa |
Iris-virginica | Iris-virginica |
Iris-virginica | Iris-virginica |
Iris-setosa | Iris-setosa |
Iris-versicolor | Iris-versicolor |
Iris-virginica | Iris-virginica |
Iris-versicolor | Iris-versicolor |
Iris-virginica | Iris-virginica |
Iris-versicolor | Iris-versicolor |
Iris-setosa | Iris-setosa |
Iris-virginica | Iris-virginica |
Iris-setosa | Iris-setosa |
Iris-setosa | Iris-setosa |
Iris-setosa | Iris-setosa |
Iris-setosa | Iris-setosa |
Iris-setosa | Iris-setosa |
Iris-setosa | Iris-setosa |
Iris-virginica | Iris-virginica |
Iris-setosa | Iris-setosa |
Because the way we split our data will be different each time (unless you fixed `random_state` above), you may get different results; the above table is shown just to give you an idea of how well our classification algorithm works.
Generating a Classification Report
We still would like to evaluate our algorithm numerically, rather than just visually; for larger data sets especially, reading through a table like the one above becomes impractical. Let’s generate a classification report using the following code (we’ll visualize the confusion matrix in the next section):
```python
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
```
We then get the following classification report:
|   | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| Iris-setosa | 1.00 | 1.00 | 1.00 | 12 |
| Iris-versicolor | 0.78 | 1.00 | 0.88 | 7 |
| Iris-virginica | 1.00 | 0.82 | 0.90 | 11 |
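To interpret these numbers: precision is the fraction of predictions for a class that were correct, and recall is the fraction of actual members of a class that were found. In the run above, 9 flowers were predicted to be Iris-versicolor but only 7 actually were, giving a precision of 7/9 ≈ 0.78, while all 7 true versicolor flowers were found, giving a recall of 7/7 = 1.00. The f1-score is the harmonic mean of precision and recall, and support is the number of test samples in each class.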
Visualizing our Predictions
I won’t go through the details of the following code, but you should understand that it produces a heat map of the confusion matrix:
```python
from sklearn.metrics import confusion_matrix

# Rows of the confusion matrix are actual classes; columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
cm_df = pd.DataFrame(cm,
                     index = ['setosa','versicolor','virginica'],
                     columns = ['setosa','versicolor','virginica'])

# Draw the matrix as an annotated heat map
sns.heatmap(cm_df, annot=True)
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()
```
Using the above code, we get the following heat map:
Using this heat map, we can make the following observations:
- All setosa flowers were correctly classified by our model.
- All versicolor flowers were correctly classified by our model.
- Nine virginica flowers were correctly classified, and two virginica flowers were incorrectly classified as versicolor flowers.
Again, your results will vary slightly depending on how you split your training and test data.
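As a possible next step, you can examine how the choice of K affects performance by training a classifier for several values of K and plotting the error rate on the test set. The following is a rough sketch (using the variables defined earlier, with an arbitrary range of K from 1 to 25):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

k_values = range(1, 26)  # arbitrary range of K values to try
error_rates = []

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    pred = knn.predict(X_test)
    error_rates.append(np.mean(pred != y_test))  # fraction misclassified

plt.plot(k_values, error_rates, marker='o')
plt.xlabel('K')
plt.ylabel('Test error rate')
plt.show()
```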