NOTE: The code featured in this post requires the Python modules scikit-learn and matplotlib to be installed.

What is Data Visualisation? Why do we need it?

The representation of data in visual formats such as images, charts and graphs is known as data visualisation. Data visualisation greatly helps us while programming machine learning classifiers such as a k-nearest neighbours (kNN) classifier, because we can spot patterns or trends in the data ourselves once it is drawn as a chart. Data represented in charts, after all, is far easier to analyse than expansive tables of the same data. With the growing importance of big data, data visualisation techniques help us understand increasingly large batches of data.

An Introduction to the Iris Dataset

One of the most famous datasets from a machine-learning point of view is the Iris dataset. It documents 50 samples of each of three species of iris flowers (Iris setosa, Iris versicolor and Iris virginica) and records four features for each sample: sepal length, sepal width, petal length and petal width. In scikit-learn, it is available as a dictionary-like object made up of key-value pairs. Three of these pairs are as follows. Firstly, the ‘data’ key, whose value is an array of arrays holding the four features of each of the 150 observations, stored as floating-point numbers (Array(Array[float, float, float, float])). Secondly, the ‘target’ key, whose value is an array of 150 integers (each 0, 1 or 2), one per observation, identifying its species. Thirdly, the ‘target_names’ key, whose value is an array of strings naming the three iris species. To view and print the dataset, run the following code:

from sklearn.datasets import load_iris

irisDataset = load_iris()

print(irisDataset)
An excerpt of the output obtained.

For more on the Iris dataset, visit https://archive.ics.uci.edu/ml/datasets/iris
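
Printing the whole dataset produces a lot of output. As a minimal sketch, the snippet below inspects a few of the keys described above instead. The variable name iris is only illustrative.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# The dataset behaves like a dictionary, so its keys can be listed
print(list(iris.keys()))

# 150 observations with 4 measurements each
print(iris.data.shape)   # (150, 4)

# The measurements are stored as floating-point numbers
print(iris.data.dtype)   # float64

# The first observation and its species label
print(iris.data[0], iris.target[0])

# Species names, indexed by the values found in 'target'
print(iris.target_names)
```

Each value in 'target' is an index into 'target_names', which is what lets us colour and label the plots by species later on.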

Data Visualisation in Python

To aid with data visualisation in Python, we will use a module called matplotlib. Its matplotlib.pyplot interface is a powerful tool for creating 2-D charts of various types in Python. In this post, we shall compare sepal length with sepal width, and petal length with petal width, using scatterplots. To make a scatterplot, we use the plot.scatter(x, y) function, where plot is the name under which we import matplotlib.pyplot. It needs 2 main arguments: the x-coordinates in a list and the y-coordinates in a list. Additionally, we can pass a colour argument (c), which we will use later on to differentiate between the three iris species. A simple scatter plot program is as follows:

import matplotlib.pyplot as plot

x = [0, 1, 2, 3, 4, 5]
y = [6, 7, 8, 9, 10, 11]

plot.scatter(x, y)
plot.show()
The output obtained upon running the above code.
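
The colour argument can be tried out on the same points. This is a small sketch: the groups list is made up purely for illustration and assigns each point to one of two hypothetical groups.

```python
import matplotlib.pyplot as plot

x = [0, 1, 2, 3, 4, 5]
y = [6, 7, 8, 9, 10, 11]

# Hypothetical group labels, one per point (illustrative data only)
groups = [0, 0, 0, 1, 1, 1]

# c takes one value per point; matplotlib maps each value to a colour
points = plot.scatter(x, y, c=groups)
plot.xlabel("x")
plot.ylabel("y")
plot.show()
```

The first three points should appear in one colour and the last three in another, which is the same idea we rely on when colouring the iris samples by species.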

Visualising the Iris Dataset, Two Features at a Time

The code below plots 2 scatterplots in separate windows. The first shows the relation between sepal length and sepal width for the three species. The second shows the relation between petal length and petal width for the three species.


import matplotlib.pyplot as plot
from sklearn.datasets import load_iris

irisDataset = load_iris()

#making individual lists for attributes
sepal_length = irisDataset.data[:, 0]
sepal_width = irisDataset.data[:, 1]
petal_length = irisDataset.data[:, 2]
petal_width = irisDataset.data[:, 3]

#this formatter labels the colourbar ticks (0, 1, 2) with the matching species names
formatterplot = plot.FuncFormatter(lambda index, *args: irisDataset.target_names[int(index)])

#plotting first plot
plot.figure(1)
plot.scatter(sepal_length, sepal_width, c=irisDataset.target)
plot.xlabel("sepal length (cm)")
plot.ylabel("sepal width (cm)")
plot.colorbar(ticks=[0, 1, 2], format=formatterplot)

#plotting second plot
plot.figure(2)
plot.scatter(petal_length, petal_width, c=irisDataset.target)
plot.xlabel("petal length (cm)")
plot.ylabel("petal width (cm)")
plot.colorbar(ticks=[0, 1, 2], format=formatterplot)  #labels the colourbar ticks using the 'target' and 'target_names' keys of the dataset

#showing the plots
plot.show()
The plots obtained as part of the output. The species are represented in different colours as shown above.

The github link for the above program is:

https://github.com/adityapentyala/Python/blob/master/plot_iris.py
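
As a variation on the program above (a sketch, not part of the linked file), both plots can be drawn side by side in one window, with a legend in place of the colourbar. Drawing each species with its own scatter call is one way to get a separate legend entry per species. The variable names here are illustrative.

```python
import matplotlib.pyplot as plot
from sklearn.datasets import load_iris

iris = load_iris()

# One figure holding two side-by-side plotting areas
figure, (sepal_axes, petal_axes) = plot.subplots(1, 2, figsize=(10, 4))

# Plot each species separately so that each gets its own legend entry
for species_index, species_name in enumerate(iris.target_names):
    rows = iris.target == species_index  # True for samples of this species
    sepal_axes.scatter(iris.data[rows, 0], iris.data[rows, 1], label=species_name)
    petal_axes.scatter(iris.data[rows, 2], iris.data[rows, 3], label=species_name)

sepal_axes.set_xlabel("sepal length (cm)")
sepal_axes.set_ylabel("sepal width (cm)")
petal_axes.set_xlabel("petal length (cm)")
petal_axes.set_ylabel("petal width (cm)")
sepal_axes.legend()
petal_axes.legend()

plot.show()
```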

Also, you can check out my post on data visualisation in R:
