In this post, I’ll show you how to build a diabetes classifier: a model that predicts whether or not a person has diabetes. We’ll classify first with the KNN model and then with the Decision Tree model in sklearn. If you do not know how KNN works, you can look here for more details.

Note: This implementation needs the Python modules sklearn, pandas, matplotlib and numpy, which may not come with your default installation of Python. Please make sure they are installed before running the code.

For this model, we will be using an external diabetes data set rather than the inbuilt data set present in sklearn. The data set is a CSV file (it can be opened in Excel) and is available here. Below is a screenshot of what our data set looks like:

We begin our code by importing the needed modules and reading our training and testing data using pandas.

#Importing the usual modules of numpy, matplotlib, pandas etc.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
#Used for splitting data into training and testing
from sklearn.neighbors import KNeighborsClassifier
#KNN Model
from sklearn.tree import DecisionTreeClassifier
#Decision Tree Model
diabetes = pd.read_csv(r"C:\Users\shriya-student\Documents\machinelearning\diabetes.csv")
#Reading the dataset. Insert your own address here.

Now that we have read our data set, we have to split the data (768 rows in total) into training and testing data. train_test_split from sklearn helps us do this.

X_train, X_test, y_train, y_test = train_test_split(diabetes.loc[:, diabetes.columns != "Outcome"], diabetes["Outcome"], stratify=diabetes["Outcome"], random_state=66)
#.loc selects columns of the data frame diabetes. Every column except "Outcome" goes to X_train and X_test,
#while the "Outcome" column goes to y_train and y_test.
#random_state is only a seed for the shuffle: the same value gives the same split on every run.
#It does not set the split size. Since no test_size is given, sklearn's default applies: 75% of the data for training and 25% for testing.
#stratify keeps the proportion of each "Outcome" class the same in the training and testing sets.
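
To see for yourself what these arguments do, here is a small sketch on toy data. The arrays and the helper name split_summary are made up for illustration and are not part of the diabetes code. It suggests that changing random_state changes only which rows land where, while the 75/25 sizes and the class balance stay the same.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (an assumption for illustration): 100 samples, 40% of them labelled 1.
X = np.arange(200).reshape(100, 2)
y = np.array([1] * 40 + [0] * 60)

def split_summary(seed):
    """Split like the post does and report set sizes and the share of positives."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=seed)
    return len(X_tr), len(X_te), float(y_tr.mean()), float(y_te.mean())

for seed in (0, 66):
    n_train, n_test, train_pos, test_pos = split_summary(seed)
    print(seed, n_train, n_test, round(train_pos, 2), round(test_pos, 2))
```

If you want a different split, such as 2/3 for training and 1/3 for testing, pass test_size (for example test_size=1/3) to train_test_split.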

Now we will find a good value of K by trying every K from 1 to 199 and seeing which one gives the best accuracy.

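#finding best k
training_accuracy = []
test_accuracy = []
neighbors_settings = range(1,200)
for n_neighbors in neighbors_settings:
    knn = KNeighborsClassifier(n_neighbors=n_neighbors)
    knn.fit(X_train,y_train)
    #appending accuracy of the model on both training and testing data.
    training_accuracy.append(knn.score(X_train,y_train))
    test_accuracy.append(knn.score(X_test,y_test))

plt.plot(neighbors_settings, training_accuracy, label="training accuracy")
plt.plot(neighbors_settings, test_accuracy, label="testing accuracy")
plt.legend()
plt.show()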

Below is the output of this snippet:

From the graph, it is visible that the best K is between 5 and 20. On diving in deeper, a good value of K is found to be either 9 or 19.
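
Instead of reading the best K off the graph, you can also pick it in code. Below is a minimal sketch. The helper name best_k is my own, and the accuracy numbers are made up only to show the call; they are not results from this data set.

```python
def best_k(neighbors_settings, test_accuracy):
    """Return the (k, accuracy) pair with the highest accuracy; ties go to the smallest k."""
    best_index = max(range(len(test_accuracy)), key=lambda i: test_accuracy[i])
    return neighbors_settings[best_index], test_accuracy[best_index]

# Made-up accuracies for k = 1..5, only to demonstrate the helper.
toy_ks = range(1, 6)
toy_accuracy = [0.70, 0.72, 0.75, 0.75, 0.71]
print(best_k(toy_ks, toy_accuracy))
```

With the lists from the loop above, the call would be best_k(neighbors_settings, test_accuracy).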

Now we create our KNN model, train it on our training data and check its accuracy on both the training and the testing data.

#Best is k=9
knn = KNeighborsClassifier(n_neighbors=9)
#(9,10) or (18,19) seems most appropriate.
knn.fit(X_train,y_train)
#Training model
print("Training Accuracy: {:.2f}".format(knn.score(X_train,y_train)))
print("Testing Accuracy: {:.2f}".format(knn.score(X_test,y_test)))
#Printing training and testing accuracy.

The output of the above code is given below:

Training Accuracy: 0.79
Testing Accuracy: 0.78

It is important to note that KNN is a purely distance-based model. It does not learn which features are most important; it simply classifies a new point by looking at the labels of the closest training points, with every feature counting equally in the distance. Hence, we will now solve the same classification problem using the Decision Tree model.
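
To make the idea concrete, here is a rough from-scratch sketch of the KNN vote in plain Python. This is not how sklearn implements it internally, and the function name knn_predict and the (glucose, BMI) points are invented for illustration.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k):
    """Classify query by majority vote among its k nearest training points."""
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Made-up (glucose, BMI) points: label 1 = diabetic, 0 = not diabetic.
train_points = [(85, 22.0), (90, 24.5), (95, 23.0), (150, 33.0), (160, 35.5), (170, 31.0)]
train_labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_points, train_labels, (155, 34.0), k=3))  # prints 1
print(knn_predict(train_points, train_labels, (88, 23.0), k=3))   # prints 0
```

Notice that nothing in the function knows which feature is which. Both columns simply feed into one distance.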

Working of Decision Tree Model:

The decision tree model is one of the simplest models and can be used for classification and regression problems alike. Its setup is similar to a tree in graph theory. Using the training data, the model builds a tree in which each internal node tests one feature. By making a series of simple decisions, the model traverses the tree to reach a classification.

For example, say I have an animal classification problem and I need to classify images into lions and monkeys. The decision tree could work as follows: the root node first checks for some feature, say whether the image has whiskers or not. If it does, the image goes along one path of the tree, otherwise it goes along another. This way each testing image proceeds until it reaches a leaf node, where it is finally classified.

In practice, an ensemble of many decision trees, known as the Random Forest model, is often preferred because it tends to generalize better than a single tree. Since the decision tree model works directly on the features of the data, we can also find out which features are used most heavily for the classification.
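
The traversal described above can be sketched with a small hand-written tree. Everything here (the dictionary layout, the feature names has_whiskers and has_long_tail, the function classify) is made up for illustration. A real DecisionTreeClassifier learns its tests and thresholds from the training data instead.

```python
# Internal nodes ask a yes/no question about one feature; leaves hold the final class.
toy_tree = {
    "feature": "has_whiskers",
    "yes": {"label": "lion"},
    "no": {
        "feature": "has_long_tail",
        "yes": {"label": "monkey"},
        "no": {"label": "lion"},
    },
}

def classify(node, sample):
    """Walk from the root to a leaf, following the branch chosen by each feature test."""
    while "label" not in node:
        branch = "yes" if sample[node["feature"]] else "no"
        node = node[branch]
    return node["label"]

print(classify(toy_tree, {"has_whiskers": True, "has_long_tail": True}))   # prints lion
print(classify(toy_tree, {"has_whiskers": False, "has_long_tail": True}))  # prints monkey
```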


Note: This code is a continuation of the previous code. The same modules and variables defined earlier will be used.

Now that we have learnt what the decision tree model is, we will use it for the same diabetes classification problem. The code for this task is below:

tree = DecisionTreeClassifier(random_state=0)
#creating model. You can also initialize values such as max_depth etc.
tree.fit(X_train,y_train)
#training the model
print("Training Accuracy: "+ " " + str(tree.score(X_train,y_train)))
print("Testing Accuracy: "+ " " + str(tree.score(X_test,y_test)))

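Running the code, we find the accuracy to be 1.0 for the training data and 0.7135416666666666 for the testing data, which is noticeably lower than KNN's 0.78. A perfect training score next to a much lower testing score is a sign that the tree has overfit the training data. Switching to a Random Forest Classifier usually raises the testing accuracy and can put the model on par with KNN.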
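
If you want to try the Random Forest yourself, the sketch below shows one way to set it up next to a single tree. It runs on synthetic stand-in data from make_classification so that it is self-contained; the variable names and n_estimators=100 are my choices, and the numbers it prints say nothing about the diabetes data set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption): 768 samples with 8 features.
X, y = make_classification(n_samples=768, n_features=8, n_informative=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=66)

# One fully grown tree versus an ensemble of 100 trees.
single_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

tree_test_accuracy = single_tree.score(X_te, y_te)
forest_test_accuracy = forest.score(X_te, y_te)
print("Single tree testing accuracy: {:.2f}".format(tree_test_accuracy))
print("Random forest testing accuracy: {:.2f}".format(forest_test_accuracy))
```

On the real data you would replace X_tr, X_te, y_tr and y_te with the X_train, X_test, y_train and y_test from earlier.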

Since we have used decision tree classifier, we can find out which features are most important in classification. We plot a bar graph of these features using tree.feature_importances_ from DecisionTreeClassifier.

plt.figure(figsize=(8,6))
#making graph of size 8x6 (width x height, in inches).
n_features = 8 #no. of features
plt.barh(range(n_features), tree.feature_importances_,align='center')
#plotting importances of all features
plt.xlabel("Feature Importance")
plt.ylabel("Feature Index")
plt.ylim(-1, n_features)
plt.show()
#showing graph.

The output for the above snippet is below:

The features are indexed by their column number in the data, starting from 0. From the graph, we can infer that features 1, 5 and 6 are the most important for classifying diabetes. In the data set, 1 stands for glucose, 5 for BMI and 6 for Diabetes Pedigree. This agrees with medical understanding: blood glucose, body mass and family history are all well-known indicators of diabetes.
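
Reading column numbers off a graph is error-prone, so it can help to pair each importance with its column name. Here is a minimal sketch. The helper rank_features is my own, and the names and values in the demo are placeholders, not the importances from the graph.

```python
def rank_features(feature_names, importances):
    """Pair each feature name with its importance, sorted from most to least important."""
    return sorted(zip(feature_names, importances), key=lambda pair: pair[1], reverse=True)

# Placeholder names and importances, only to demonstrate the helper.
toy_names = ["Age", "Glucose", "BMI"]
toy_importances = [0.2, 0.5, 0.3]
for name, importance in rank_features(toy_names, toy_importances):
    print("{}: {:.2f}".format(name, importance))
```

With the model from this post, the call would be rank_features(X_train.columns, tree.feature_importances_).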

The compiled code can be viewed here:
