This post will cover the basics of data visualisation with an example project. The best way to learn is by doing it yourself!
Recently, some of my friends and I participated in a statistics/data science competition called ‘segfault’. We were given a dataset of 4 lakh (400,000) records of COVID-19 patients in an imaginary city of 400 sq km. Our goal was to draw valuable insights from this dataset using various visualisation techniques.
We were given two datasets with the following information in them:
The first dataset consisted of the population in each square kilometre of this city. Since each entry covers exactly one square kilometre, the population figure doubles as the population density of that area.
The second dataset consisted of information about each case in this city.
It had various parameters for each case. The parameters, their types and their explanations are given below:
• ‘Time of Infection’: This is the day that the patient is infected. The possible values it could take are ‘0’, ‘1’, . . ., ‘239’.
• ‘Time of Reporting’: This is the day that the patient reports their infection. The possible values it could take are ‘0’, ‘1’, . . ., ‘246’.
• ‘x location’: This is the x coordinate of the patient’s residential area.
• ‘y location’: This is the y coordinate of the patient’s residential area.
• ‘Age’: Age of the patient.
• ‘Diabetes’: This indicates whether the patient has Diabetes; ‘True’ means they have it, ‘False’ means they don’t.
• ‘Respiratory Illnesses’: This indicates whether the patient has any chronic Respiratory Illnesses; ‘True’ means they have, ‘False’ means they don’t.
• ‘Abnormal Blood Pressure’: This indicates whether the patient has Abnormal (High or Low) Blood Pressure; ‘True’ means they have, ‘False’ means they don’t.
• ‘Outcome’: This is the outcome of the case. ‘Dead’ means the patient died. ‘Alive’ means the patient survived.
The problem statement:
Descriptive Analytics on the data provided to gain key insights.
Now, let me tell you what we did.
We created a bunch of different graphs to find relationships between variables, population densities, hotspots, movement of hotspots, heatmaps and much more.
In this post we will look at only some of the easier ones like the relationship between some variables. The hotspots and heatmaps will be taken up in a later post, as this is just an introduction.
I will link our presentation at the end of this post so that you can refer to that for more info on what we did.
What libraries will we need?
For what we are about to do, we will need mainly 2 libraries:
- Pandas: to read and perform operations on the dataset.
- Matplotlib: to plot graphs and other visuals.
Let’s start by trying to find a relationship between the age of a person and their death rate.
First of all, let us think about what kind of graph we can plot for this.
One thing that comes to mind is a bar graph where the x-axis shows the age and the y-axis shows the corresponding death rate.
Now, we can put all the values of age on the x-axis but can we come up with a better idea?
Instead of plotting the value for each age from 0-80, we can plot age ranges. For example, 0-10, 10-20 etc. Plotting age ranges is better since it makes the graph look less cluttered while still conveying basically the same information.
Another thing to consider is whether to plot death percentages or absolute death counts.
Here is a histogram I plotted for the distribution of people among different ages for this dataset:
As we can see, the distribution forms an almost perfect bell curve centred at around 40 years.
This means that the patients are not evenly distributed among the ages: the group around 40 years is the largest, and there are far fewer patients at the extremes.
Thus, if we plot the absolute number of deaths instead of the percentage, the graph might be misleading: an age group could show more deaths simply because it has more cases.
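To make this concrete, here is a tiny sketch with made-up numbers (they are not taken from the competition dataset). It compares two hypothetical age groups by death count and by death rate, and the two measures point in opposite directions:

```python
# Made-up numbers purely for illustration; they are NOT from the competition data.
cases = {"30-40": 1000, "70-80": 100}   # total cases per age group
deaths = {"30-40": 20, "70-80": 15}     # deaths per age group

# Death rate (%) = deaths / cases * 100 for each group
rates = {group: deaths[group] / cases[group] * 100 for group in cases}

for group in cases:
    print(group, "deaths:", deaths[group], "rate (%):", round(rates[group], 1))
```

In this toy example the 30-40 group has more deaths in absolute terms, yet the 70-80 group has a much higher death rate, which is the effect a percentage plot is meant to expose.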
Now that we have thought enough about it, let’s start writing some code!
I wrote the code in a Jupyter notebook, and I will walk through it in this section.
Firstly, let’s import all the relevant libraries:
import pandas as pd
import math
import matplotlib.pyplot as plt
Next, let’s read the dataset and store it in a pandas dataframe (make sure the dataset is in the same folder as your code):
covid_df = pd.read_csv('COVID_Dataset.csv')
Just to see what we are working with, display the top 5 rows of the dataframe (in pandas, something like covid_df.head() does this):
This is what gets displayed:
As we decided, we must create another column in the dataframe that gives the age group of each person:
covid_df['Age_Cat'] = ""
for i in range(len(covid_df['Age_Cat'])):
    x = int(covid_df['Age'][i])
    if x%10 == 0:
        t = str(x)+'-'+str(x+10)
    else:
        t = str(math.floor(x/10)*10)+'-'+str(math.ceil(x/10)*10)
    covid_df.at[i,'Age_Cat'] = t
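This simple piece of code adds another column to the dataframe called ‘Age_Cat’ and classifies each person into an age group based on their age. Ages that are exact multiples of 10 go into the group that starts at that age (so 20 falls in ‘20-30’); every other age is rounded down and up to the nearest multiples of 10 (so 27 also falls in ‘20-30’).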
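On 4 lakh rows a Python loop can be slow, so here is a hedged sketch of a loop-free alternative that should produce the same labels using integer floor division. The `toy_df` frame and its ages are made up for illustration; only the ‘Age’ and ‘Age_Cat’ column names come from the dataset described above.

```python
import pandas as pd

# Toy ages, made up for illustration (not the competition data)
toy_df = pd.DataFrame({"Age": [5, 10, 19, 40, 87]})

# Lower edge of each 10-year bin: 5 -> 0, 10 -> 10, 19 -> 10, 40 -> 40, 87 -> 80
lower = (toy_df["Age"].astype(int) // 10) * 10

# Build labels such as '10-20', in the same format the loop produces
toy_df["Age_Cat"] = lower.astype(str) + "-" + (lower + 10).astype(str)

print(toy_df)
```

The same two lines should work on the real dataframe by swapping `toy_df` for `covid_df`, assuming the ‘Age’ column holds non-negative numbers.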
Now, we use a pandas function called crosstab. This function creates a frequency table of the two factors passed to it. This may sound complicated, but seeing it in use should clear up most doubts.
We will create a crosstab of the ‘Age_Cat’ group and the outcome group. Essentially this means it will give the frequency of Dead/Alive for each age category.
This is the code for it:
temp_df = pd.crosstab(covid_df.Age_Cat,covid_df.Outcome)
temp_df
This creates a crosstab of the variables and stores the resulting dataframe in a new variable called temp_df. This is how the new dataframe looks:
Now, we have the absolute values for the frequencies for Dead/Alive for each age category.
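If the idea of a crosstab is still fuzzy, a toy example may help. The sketch below uses five made-up rows (not the competition data) with the same two column names, so the result is small enough to check by hand:

```python
import pandas as pd

# Five made-up cases, for illustration only
toy_df = pd.DataFrame({
    "Age_Cat": ["20-30", "20-30", "80-90", "80-90", "80-90"],
    "Outcome": ["Alive", "Alive", "Alive", "Dead", "Dead"],
})

# Rows are age groups, columns are outcomes, cells are counts of cases
toy_counts = pd.crosstab(toy_df["Age_Cat"], toy_df["Outcome"])

print(toy_counts)
```

Each cell counts how many rows have that combination of age group and outcome; combinations that never occur (here, ‘20-30’ with ‘Dead’) are filled with 0.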
All we have left is to convert it to percentage values.
The code for this is given below and is pretty simple. First we create columns called ‘Dead%’ and ‘Alive%’ and for each age group enter the percentage of Dead and Alive. Alive% is calculated by the formula:
(Number Alive/ Total cases in age group) * 100
Dead% is just 100 – Alive%
The code for this:
temp_df['Alive%'] = 0.0
temp_df['Dead%'] = 0.0
# m is the age-group label of each row, e.g. '20-30'
for m in temp_df.index:
    total = temp_df.at[m,'Alive'] + temp_df.at[m,'Dead']
    alive_pct = temp_df.at[m,'Alive'] / total * 100
    # keep two decimal places
    temp_df.at[m,'Alive%'] = round(alive_pct, 2)
    temp_df.at[m,'Dead%'] = 100 - temp_df.at[m,'Alive%']
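As an aside, pandas can also compute row percentages directly: `pd.crosstab` accepts a `normalize` argument, and `normalize='index'` divides every row by its row total. The sketch below shows this on made-up rows (not the competition data); it is an alternative to the loop, not the approach we used in the competition.

```python
import pandas as pd

# Eight made-up cases, for illustration only
toy_df = pd.DataFrame({
    "Age_Cat": ["20-30"] * 4 + ["80-90"] * 4,
    "Outcome": ["Alive", "Alive", "Alive", "Alive",
                "Alive", "Dead", "Dead", "Dead"],
})

# normalize='index' turns the counts in each row into fractions of that row's total;
# multiplying by 100 gives percentages
toy_pct = pd.crosstab(toy_df["Age_Cat"], toy_df["Outcome"], normalize="index") * 100

print(toy_pct.round(2))
```

With this approach the ‘Dead’ column already holds the death rate in percent, so no separate ‘Alive%’ or ‘Dead%’ columns are needed.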
Next, we remove the extra columns, i.e., ‘Alive’, ‘Dead’ and ‘Alive%’, so that all that remains in the dataframe is the death rate for each age group:
temp_df.drop('Dead', inplace=True, axis = 1)
temp_df.drop('Alive', inplace=True, axis = 1)
temp_df.drop('Alive%', inplace=True, axis = 1)
All that’s left to do now is to plot the graph. We use the following code:
ax = temp_df.plot(kind='bar',color = 'orange',title = 'Age Vs Fatality')
print(ax)
The following graph appears:
A beautiful correlation is seen: the 80-90 age category has the highest death rate and the 20-30 category has the lowest.
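A bar chart is easier to read when its axes are labelled. Here is a minimal sketch of how that could be done; the death rates, the axis label text and the output file name are all made up for illustration, and the `matplotlib.use("Agg")` line is only there so the script runs without a display (leave it out inside a Jupyter notebook).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import pandas as pd

# Made-up death rates, for illustration only (not the competition results)
toy_rates = pd.DataFrame(
    {"Dead%": [0.5, 2.0, 12.0]},
    index=["20-30", "50-60", "80-90"],
)

ax = toy_rates.plot(kind="bar", color="orange", title="Age Vs Fatality", legend=False)
ax.set_xlabel("Age group (years)")   # label for the x-axis
ax.set_ylabel("Death rate (%)")      # label for the y-axis

ax.figure.tight_layout()
ax.figure.savefig("age_vs_fatality.png")  # hypothetical output file name
```

The `plot` call returns a Matplotlib Axes object, which is why methods such as `set_xlabel` and `set_ylabel` can be called on `ax` afterwards.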
Many valuable insights can be drawn from this.
I will be attaching the datasets and our presentation in this post.
I challenge you to make a similar plot for Diseases vs Fatality rate. Do it on your own, because that’s the best way to learn!
A continuation of this is coming soon with more complicated plots.