NOTE: The code featured in this post requires the Python modules scikit-learn and matplotlib to be installed.
What is Data Visualisation? Why do we need it?
The representation of data in visual formats such as images, charts and graphs is known as data visualisation. Data visualisation greatly helps us while programming machine learning classifiers such as a k-nearest neighbours (kNN) classifier, because we can spot patterns or trends in the charted data ourselves. Data represented in charts, after all, is far easier to analyse than the same data laid out in expansive tables. With the growing importance of big data, data visualisation techniques help us understand increasingly large batches of data.
An Introduction to the Iris Dataset
One of the most famous datasets in machine learning is the Iris dataset. It documents 50 samples of each of three species of iris flower (Iris setosa, Iris versicolor and Iris virginica) and records four features for each sample: sepal length, sepal width, petal length and petal width, all in centimetres. It is available in the scikit-learn module as a dictionary-like object made up of key-value pairs. Three of these pairs matter most here. Firstly, the ‘data’ key, whose value is an array of arrays holding the four features of each of the 150 observations as floating-point numbers (Array(Array[float, float, float, float])). Secondly, the ‘target’ key, whose value is an array of 150 integers, each 0, 1 or 2, giving the species of each observation. Thirdly, the ‘target_names’ key, whose value is an array of strings naming the three iris species. To view and print the dataset, run the following code:
from sklearn.datasets import load_iris
irisDataset = load_iris()
print(irisDataset)
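Printing the whole dataset produces a lot of output. As a minimal sketch, the three keys described above can also be inspected one at a time. The variable names data_shape, species_codes, species_names and first_row are only illustrative.

```python
from sklearn.datasets import load_iris

irisDataset = load_iris()

# 'data' holds one row per flower and one column per feature
data_shape = irisDataset.data.shape
print(data_shape)

# 'target' holds one species code per flower; collect the distinct codes
species_codes = sorted(set(irisDataset.target.tolist()))
print(species_codes)

# 'target_names' gives the species name that belongs to each code
species_names = [str(name) for name in irisDataset.target_names]
print(species_names)

# the first observation: sepal length, sepal width, petal length, petal width
first_row = irisDataset.data[0]
print(first_row)
```

The keys can be read either as attributes (irisDataset.data) or with dictionary syntax (irisDataset['data']).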
For more on the Iris dataset, visit https://archive.ics.uci.edu/ml/datasets/iris
Data Visualisation in Python
To aid with data visualisation in Python, we will use a module called matplotlib. Its pyplot interface, matplotlib.pyplot, is a powerful tool for creating 2-D charts of various types. In this post, we shall compare sepal length with sepal width, and petal length with petal width, using scatterplots. To make a scatterplot, we use the plot.scatter(x, y) function (plot being the name under which we import matplotlib.pyplot), which needs two main arguments: a list of x-coordinates and a list of y-coordinates. Additionally, we can set the colours through the keyword argument c, which we will use later on to differentiate between the three iris species. A simple scatter plot program is as follows:
import matplotlib.pyplot as plot
x = [0, 1, 2, 3, 4, 5]
y = [6, 7, 8, 9, 10, 11]
plot.scatter(x, y)
plot.show()
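The colour argument can be tried out on the same points. The sketch below assumes a made-up list called groups that assigns each point to group 0, 1 or 2; matplotlib then maps each number to a colour from its default colour map.

```python
import matplotlib.pyplot as plot

x = [0, 1, 2, 3, 4, 5]
y = [6, 7, 8, 9, 10, 11]

# hypothetical group label for each point, in the same order as x and y
groups = [0, 0, 1, 1, 2, 2]

# c takes one value per point; equal values are drawn in the same colour
points = plot.scatter(x, y, c=groups)
plot.xlabel("x")
plot.ylabel("y")
# call plot.show() afterwards to display the chart
```

This is the same mechanism used with the 'target' array later in the post, where the group numbers are the species codes.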
Visualising the Iris Dataset, Two Features at a Time
The following code draws two scatterplots in separate windows. The first shows the relation between sepal length and sepal width for the three species. The second shows the relation between petal length and petal width for the three species. In both, each point is coloured by its species, and a colour bar beside the chart labels the colours with the species names.
import matplotlib.pyplot as plot
from matplotlib.ticker import FuncFormatter
from sklearn.datasets import load_iris
irisDataset = load_iris()
#making individual arrays for the four attributes (one column each)
sepal_length = irisDataset.data[:, 0]
sepal_width = irisDataset.data[:, 1]
petal_length = irisDataset.data[:, 2]
petal_width = irisDataset.data[:, 3]
#this formatter labels the colour bar ticks 0, 1 and 2 with the species names
formatterplot = FuncFormatter(lambda index, *args: irisDataset.target_names[int(index)])
#plotting first plot, colouring each point as per the 'target' key
plot.figure(1)
plot.scatter(sepal_length, sepal_width, c=irisDataset.target)
plot.xlabel("sepal length (cm)")
plot.ylabel("sepal width (cm)")
plot.colorbar(ticks=[0, 1, 2], format=formatterplot)
#plotting second plot in the same way
plot.figure(2)
plot.scatter(petal_length, petal_width, c=irisDataset.target)
plot.xlabel("petal length (cm)")
plot.ylabel("petal width (cm)")
plot.colorbar(ticks=[0, 1, 2], format=formatterplot)
#showing the plots
plot.show()
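A colour bar is one way of labelling the species. As an alternative sketch (not the approach used in the program above), each species can be drawn with its own scatter call and named in a legend. The name is_species is illustrative, and only the petal measurements are plotted here.

```python
import matplotlib.pyplot as plot
from sklearn.datasets import load_iris

irisDataset = load_iris()
petal_length = irisDataset.data[:, 2]
petal_width = irisDataset.data[:, 3]

figure, axes = plot.subplots()
for code, name in enumerate(irisDataset.target_names):
    # boolean mask that picks out the rows belonging to one species
    is_species = irisDataset.target == code
    # each call gets the next colour in matplotlib's default cycle
    axes.scatter(petal_length[is_species], petal_width[is_species], label=str(name))

axes.set_xlabel("petal length (cm)")
axes.set_ylabel("petal width (cm)")
axes.legend(title="species")
# call plot.show() afterwards to display the chart
```

A legend suits a small number of distinct categories such as the three species, whereas a colour bar is the more natural choice when the colours represent a continuous quantity.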
The GitHub link for the above program is:
Also, you can check out my post on data visualisation in R: