Note: This post assumes you have read my previous post on data visualisation in Python, as it covers important background such as the matplotlib.pyplot plotting tool and an introduction to the Iris dataset, which is what we will train our model on. The link is given below. The code in this post requires the modules scikit-learn, scipy and numpy to be installed.

What is a k-NN classifier?

k-NN stands for k-Nearest Neighbours. It is one of the simplest machine learning algorithms: to classify a new sample, it looks at the k samples in the dataset closest to it and assigns the class that occurs most frequently among those neighbours. Let us try to illustrate this with a diagram:

In this example, let us assume we need to classify the black dot as one of the red, green or blue dots, which we shall assume correspond to the species setosa, versicolor and virginica of the Iris dataset. If we set the number of neighbours, k, to 1, the classifier looks for its single nearest neighbour and, seeing that it is a red dot, classifies the sample as setosa. If we set k to 3, it expands its search to the next two nearest neighbours, which happen to be green. Since the green dots now outnumber the red one, the sample is classified as green, or versicolor. If we further increase k to 7, it considers the next four nearest neighbours as well. Since the number of blue dots (3) is higher than that of either red (2) or green (2), the sample is assigned the class of the blue dots, virginica.
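
Before moving on to scikit-learn, here is a minimal from-scratch sketch of the vote described above. The coordinates and colour labels are made up purely to mimic the diagram – one red dot closest to the black dot, two green dots slightly further away and three blue dots beyond those – so only the majority-vote logic matters:

import numpy
from collections import Counter

#toy 2-D points and their colours, mimicking the dots around the black dot
points = numpy.array([[0.5, 0.0],                            #red, the single nearest neighbour
                      [0.0, 1.0], [1.0, 1.0],                #green
                      [0.0, -2.0], [2.0, 0.0], [-2.0, 0.0],  #blue
                      [3.0, 3.0]])                           #red, far away
labels = numpy.array(["red", "green", "green", "blue", "blue", "blue", "red"])
black_dot = numpy.array([0.0, 0.0])

def knn_vote(sample, points, labels, k):
    #distance from the sample to every labelled point
    distances = numpy.linalg.norm(points - sample, axis=1)
    #indices of the k closest points
    nearest = numpy.argsort(distances)[:k]
    #majority vote among those k neighbours
    return Counter(labels[nearest]).most_common(1)[0][0]

for k in (1, 3, 7):
    print("With k =", k, "the black dot is classified as", knn_vote(black_dot, points, labels, k))

Running it prints red for k = 1, green for k = 3 and blue for k = 7, matching the classifications above.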

Building and Training a k-NN Classifier in Python Using scikit-learn

To build a k-NN classifier in Python, we import the KNeighborsClassifier class from the sklearn.neighbors module. We then load the Iris dataset and split it into two parts – training and testing data (3:1 by default). Splitting the dataset lets us use some of the data to test and measure the accuracy of the classifier. After splitting, we set the number of neighbours to consider and fit the classifier to the training data. We can then make predictions on new data and score the classifier. Scoring the classifier tells us the percentage of the testing data it classified correctly. Since we already know the correct classes and supply them to the model during training, k-NN is an example of a supervised machine learning algorithm. The code to train and predict using k-NN is given below:

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
import numpy
from sklearn.model_selection import train_test_split

#loading the Iris dataset and splitting it into training and testing data (3:1 by default)
irisDataset = load_iris()
Data_train, Data_test, Target_train, Target_test = train_test_split(irisDataset["data"],
                                                        irisDataset["target"], random_state=0)

#a new sample (sepal length, sepal width, petal length, petal width) to classify
data_to_predict = numpy.array([[5, 2.9, 1, 0.2]])

#fitting the classifier to the training data using 7 neighbours
knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(Data_train, Target_train)

#predicting the species of the sample and scoring the classifier on the testing data
predicted_species = knn.predict(data_to_predict)
print("The most probable species the sample will be is " + irisDataset["target_names"][predicted_species][0] + ". " +
      "The machine was tested and got {:.2f}% of its predictions right.".format(knn.score(Data_test, Target_test) * 100))
The output obtained upon running the above code. It classifies our sample as setosa and scores the classifier at 0.9737…, which we convert to a percentage and round to the nearest hundredth, giving an accuracy of 97.37%.

Also try changing the n_neighbors parameter to values such as 19, 25, 31, 43 and so on. What happens to the accuracy then?

How to Find the Optimal Value of k?

As we assign different values to k, we notice that each value gives a different accuracy upon scoring. So, how do we find the optimal value of k?

One way to do this is to use a for loop that sets k to 1, 2, 3, …, n in turn and scores the classifier for each value. We can then compare the accuracies and choose the value of k we want. Run the following code to do so:

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
import numpy
from sklearn.model_selection import train_test_split

irisDataset = load_iris()
Data_train, Data_test, Target_train, Target_test = train_test_split(irisDataset["data"], irisDataset["target"],
                                                                    random_state=0)
scores = []
val_k = []

#scoring the classifier for every value of k from 1 to 99
for k in range(1, 100):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(Data_train, Target_train)
    accuracy = knn.score(Data_test, Target_test) * 100
    val_k.append(k)
    scores.append(accuracy)
    print("When k is", k, "the accuracy is", accuracy)

Hard to read through the output, isn’t it? A smarter way to view the data would be to represent it in a graph. Here’s where data visualisation comes in handy. We use the matplotlib.pyplot.plot() method to create a line graph showing the relation between the value of k and the accuracy of the model. The following code does everything we have discussed in this post – fit, predict, score and plot the graph:

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
import numpy
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plot

irisDataset = load_iris()

#splitting the dataset
Data_train, Data_test, Target_train, Target_test = train_test_split(irisDataset["data"], irisDataset["target"],
                                                                    random_state=0)
x_coords = []
y_coords = []

data_to_predict = numpy.array([[5, 2.9, 1, 0.2]])

#checking accuracy for various values of k
for k in range(1, 100):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(Data_train, Target_train)
    x_coords.append(k)
    y_coords.append(knn.score(Data_test, Target_test)*100)

#training a knn and making a prediction
knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(Data_train, Target_train)

predicted_species = knn.predict(data_to_predict)
print("The most probable species the sample will be is " + irisDataset["target_names"][predicted_species][0] + ". " +
      "The machine was tested and got {:.2f}% of its predictions right.".format(knn.score(Data_test, Target_test) * 100))

#plotting the graph
plot.figure()
plot.plot(x_coords, y_coords)
plot.xlabel("Value of k")
plot.ylabel("Accuracy (in %)")
plot.ylim(top = 100)
plot.title("Accuracy for different values of k")
plot.show()
The graph obtained upon running the above code. The model also prints out the same result as the first code snippet in this post.

From the graph, we can see that the accuracy remains pretty much the same for k values 1 through 23, but beyond that it becomes erratic and drops significantly.
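
A short follow-up sketch can also pick that value out programmatically. It reuses the x_coords (k values) and y_coords (accuracies) lists built in the code above; note that choosing k purely by test-set accuracy is a rough heuristic, and when several values of k tie, this simply returns the smallest of them:

#picking the k with the highest test accuracy from the lists built above
best_index = y_coords.index(max(y_coords))
print("The best value of k is", x_coords[best_index],
      "with an accuracy of {:.2f}%".format(y_coords[best_index]))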

Underfitting and Overfitting

For a k-NN model, choosing the right value of k – neither too big nor too small – is extremely important. If we choose a value of k that is far too small, the model follows the training data too closely and is said to be overfit: the imaginary "line" or "area" in the graph associated with each class (called the decision boundary) shows large, irregular variations, and the model's performance on the data it was trained on looks too good to be true, in a manner of speaking, while its predictions on new data suffer. Underfitting is caused by choosing a value of k that is too large – it goes against the basic principle of a k-NN classifier, as we start to count neighbours that lie significantly far away from the data to predict, and the decision boundaries show little to no variation. These phenomena are most noticeable in larger datasets with fewer features. We can notice the phenomenon of underfitting in the above graph. An underfit model has almost straight-line decision boundaries, while an overfit model has irregularly shaped decision boundaries. The ideal decision boundaries are mostly uniform but still follow the trends in the data.
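
One way to see overfitting and underfitting numerically, rather than through decision boundaries, is to compare a classifier's accuracy on the training data with its accuracy on the testing data. The following is a minimal sketch along the lines of the earlier code: with k = 1 the training accuracy is typically perfect while the testing accuracy is lower (overfitting), whereas with a very large k such as 101 both accuracies fall (underfitting):

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

irisDataset = load_iris()
Data_train, Data_test, Target_train, Target_test = train_test_split(irisDataset["data"],
                                                                    irisDataset["target"], random_state=0)

#comparing training and testing accuracy for a very small, a moderate and a very large k
for k in (1, 7, 101):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(Data_train, Target_train)
    print("When k is", k,
          "the training accuracy is {:.2f}%".format(knn.score(Data_train, Target_train) * 100),
          "and the testing accuracy is {:.2f}%".format(knn.score(Data_test, Target_test) * 100))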

To illustrate how the decision boundaries change with the value of k, we shall make use of the scatterplot of sepal length against sepal width. We shall train a k-NN classifier on these two features alone and visualise the decision boundaries using a colormap, available to us in the matplotlib.colors module. Run the following code to produce two plots – one showing the change in accuracy with changing k values and the other showing the decision boundaries. Also, note how the accuracy of the classifier becomes far lower when it is fitted on only these two features than when it is fitted on the complete Iris dataset, even though the data is split the same way.

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
import numpy
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plot
from matplotlib.ticker import FuncFormatter
from matplotlib.colors import ListedColormap

irisDataset = load_iris()
sepals = []
sepal_length = irisDataset.data[:, 0]
sepal_width = irisDataset.data[:, 1]
x_coords = []
y_coords = []
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])


#combining the sepal length and width columns into a single list of [length, width] pairs
def listify(list1, list2):
    for i in range(0, len(list1)):
        temp = [list1[i], list2[i]]
        sepals.append(temp)

listify(sepal_length, sepal_width)
sepals = numpy.array(sepals)

Data_train, Data_test, Target_train, Target_test = train_test_split(sepals,
                                                                    irisDataset["target"], random_state=0)

#formatter used to label the colorbar ticks with the species names
formatterplot = FuncFormatter(lambda index, *args: irisDataset.target_names[int(index)])

#checking accuracy for various values of k using only the two sepal features
for k in range(1, 45):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(Data_train, Target_train)
    x_coords.append(k)
    y_coords.append(knn.score(Data_test, Target_test)*100)

plot.figure(1)
plot.plot(x_coords,y_coords)
plot.ylim(top = 100)
plot.xlabel("Value of k")
plot.ylabel("Accuracy")
plot.title("Accuracy for different values of k")

#calculating min and max lengths/widths of the features
length_min, length_max = min(sepal_length) - 0.1, max(sepal_length) + 0.1
width_min, width_max = min(sepal_width) - 0.1, max(sepal_width) + 0.1
#numpy.meshgrid() used to return coordinate matrices
xx, yy = numpy.meshgrid(numpy.linspace(length_min, length_max, 100), numpy.linspace(width_min, width_max, 100))
#training the classifier on the two sepal features. change the n_neighbors value to view the decision boundaries change
knn_plotter = KNeighborsClassifier(n_neighbors=8)
knn_plotter.fit(Data_train, Target_train)
#predicting the class of every grid point to colour the decision regions
Z = knn_plotter.predict(numpy.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plot.figure(2)
plot.pcolormesh(xx, yy, Z, cmap = cmap_light)
plot.scatter(sepal_length, sepal_width, c=irisDataset.target, cmap=cmap_bold)
plot.xlabel("sepal length (cm)")
plot.ylabel("sepal width (cm)")
plot.colorbar(ticks=[0, 1, 2], format=formatterplot)
plot.show()
The first graph obtained as output.

In the above plots, if the data to be predicted falls in the red region, it is assigned setosa. Green corresponds to versicolor and blue corresponds to virginica. Note that these are not the decision boundaries of a k-NN classifier fitted to the entire Iris dataset, as those would have to be drawn in a four-dimensional graph, one dimension for each feature, making them impossible for us to visualise.
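
To check which region a given point falls into without reading it off the plot, we can ask the two-feature classifier directly. The sepal length and width below are a made-up sample used only for illustration; the snippet reuses the knn_plotter classifier and irisDataset from the code above:

#a made-up sepal length and width, used only to illustrate the region lookup
sepal_sample = numpy.array([[5.0, 2.9]])
region = knn_plotter.predict(sepal_sample)
print("A flower with these sepal measurements falls in the " + irisDataset.target_names[region][0] + " region.")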


The GitHub links for the above programs are:

https://github.com/adityapentyala/Python/blob/master/KNN.py

https://github.com/adityapentyala/Python/blob/master/decisionboundaries.py


Also view Saarang’s diabetes prediction model using the kNN algorithm:
