Exploration of COVID19 dataset: Time-Series daily cases plot

Recap & Grabbing Time-Series Data

First, customary importing of packages and data if you haven't already done so from the previous sections.

import pandas as pd
import matplotlib.pyplot as plt
ASEAN_countries_list = ['Brunei', 'Cambodia', 'Indonesia', 'Laos', 'Malaysia', 'Burma', 'Philippines', 'Singapore', 'Vietnam']

link = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv'
covid19_data = pd.read_csv(link)
asean_df = covid19_data[covid19_data['Country/Region'].isin(ASEAN_countries_list)]
asean_df.set_index('Country/Region', inplace = True)
asean_df_dropped = asean_df.drop(columns = ['Province/State', 'Lat', 'Long'])
asean_t = asean_df_dropped.T
asean_t
  

Country/Region	Brunei	Burma	Cambodia	Indonesia	Laos	Malaysia	Philippines	Singapore	Vietnam
1/22/20	0	0	0	0	0	0	0	0	0
1/23/20	0	0	0	0	0	0	0	1	2
1/24/20	0	0	0	0	0	0	0	3	2
1/25/20	0	0	0	0	0	3	0	3	2
1/26/20	0	0	0	0	0	4	0	4	2
...	...	...	...	...	...	...	...	...	...
7/30/21	336	294460	76585	3372374	5919	1095486	1580824	64861	141122
7/31/21	337	299185	77243	3409658	6299	1113272	1588965	64981	150060
8/1/21	337	302665	77914	3440396	6566	1130422	1597689	65102	157507
8/2/21	338	306354	78474	3462800	6765	1146186	1605762	65213	157507
8/3/21	338	311067	79051	3496700	7015	1163291	1612541	65315	174461

560 rows × 9 columns

Using daily cumulative cases to calculate daily increase

Previously we looked at simple plotting of cumulative cases against dates. What if we would want to look at daily cases instead? We would need to do some maths and manipulation (i promise its nothing too scary).

Imagine our dataframe for 1 country now looks like this:

Day	Cases
0	0
1	0
2	1
3	3
4	3
5	4

How we are going to do this is to make make a new dataframe which brings a row to align with the previous day's data. So we can do some quick maths later!

Day	Cases_O	Cases_ShiftF
1	0	NaN
2	1	0
3	3	1
4	3	3
5	4	3

Then by adding a new column called "difference" we can calculate the difference / increase in cases for each day!

Day	Cases_O	Cases_ShiftF	difference
1	0	NaN	NaN
2	1	0	1
3	3	1	2
4	3	3	0
5	4	3	1

We can then plot the difference!

We can try do this for Singapore first.

import numpy as np

#Make a copy so we don't actually change the previous dataframe
sg_s = asean_t['Singapore'].copy()

#Use this to shift the entire column down by 1
sg_s_F = sg_s.shift(1)

#Merge the two columns by using the same dates
sg_df = pd.concat([sg_s, sg_s_F], axis=1)

#Rename the columns for easier reference later
sg_df.columns = ['Cases_O', 'Cases_ShiftF']

#Creating a new column
sg_df['difference'] = sg_df['Cases_O'] - sg_df['Cases_ShiftF']
sg_df
  

	Cases_O	Cases_ShiftF	difference
1/22/20	0	NaN	NaN
1/23/20	1	0.0	1.0
1/24/20	3	1.0	2.0
1/25/20	3	3.0	0.0
1/26/20	4	3.0	1.0
...	...	...	...
7/30/21	64861	64722.0	139.0
7/31/21	64981	64861.0	120.0
8/1/21	65102	64981.0	121.0
8/2/21	65213	65102.0	111.0
8/3/21	65315	65213.0	102.0

560 rows × 3 columns

If we just refer to the code snippet above, we can easily make a function that does this for each country! Lets call this function country_diff() which takes in the original dataframe (asean_t) and the country of choice. In turn, make it return a new modified dataframe!

def country_diff(df, country):

    #Make a copy so we don't actually change the previous dataframe
    df_s = df[country].copy()

    #Use this to shift the entire column down by 1
    df_s_F = df_s.shift(1)

    #Merge the two columns by using the same dates
    country_df = pd.concat([df_s, df_s_F], axis=1)

    #Rename the columns for easier reference later
    country_df.columns = [f'{country}_Cases_O', f'{country}_Cases_ShiftF']

    #Creating a new column
    country_df[f'{country}_difference'] = country_df[f'{country}_Cases_O'] - country_df[f'{country}_Cases_ShiftF']
    return country_df

#Test the function
#brunei_df = country_diff(asean_t, 'Brunei')
  

Notice that our function is generally the same as our Singapore example, except we made some changes in the naming conventions (you'll see why later) to better reference them later!

We can now make use of ASEAN_countries_list to grab the daily cases of all our countries!

We can then make use of the pd.concat function again to have a dataframe consisting solely of daily cases for each country!

#Create a new dataframe
asean_daily_df = pd.DataFrame()

#Run a for loop across all countries
#Reminder:
#ASEAN_countries_list = ['Brunei', 'Cambodia', 'Indonesia', 'Laos', 
#                        'Malaysia', 'Burma', 'Philippines', 'Singapore', 'Vietnam']

for country in ASEAN_countries_list:
    #Run the function and storing it into a temporary dataframe
    #per country
    temp_df = country_diff(asean_t, country)
    
    #Attach this temporary df to the actual dataframe
    #While selecting for the "Difference" column only!
    asean_daily_df = pd.concat([asean_daily_df, temp_df[f'{country}_difference']],sort = False, axis = 1)

    
#Drop the very first row because it makes no sense to have a change between 0th day and -1th day
asean_daily_df = asean_daily_df.iloc[1:,]
asean_daily_df
  

	Brunei_difference	Cambodia_difference	Indonesia_difference	Laos_difference	Malaysia_difference	Burma_difference	Philippines_difference	Singapore_difference	Vietnam_difference
1/23/20	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	2.0
1/24/20	0.0	0.0	0.0	0.0	0.0	0.0	0.0	2.0	0.0
1/25/20	0.0	0.0	0.0	0.0	3.0	0.0	0.0	0.0	0.0
1/26/20	0.0	0.0	0.0	0.0	1.0	0.0	0.0	1.0	0.0
1/27/20	0.0	1.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0
...	...	...	...	...	...	...	...	...	...
7/30/21	3.0	668.0	41168.0	244.0	16840.0	5127.0	8537.0	139.0	7717.0
7/31/21	1.0	658.0	37284.0	380.0	17786.0	4725.0	8141.0	120.0	8938.0
8/1/21	0.0	671.0	30738.0	267.0	17150.0	3480.0	8724.0	121.0	7447.0
8/2/21	1.0	560.0	22404.0	199.0	15764.0	3689.0	8073.0	111.0	0.0
8/3/21	0.0	577.0	33900.0	250.0	17105.0	4713.0	6779.0	102.0	16954.0

559 rows × 9 columns

Edit on 15th August 2021, I found out that there is a function called pd.diff() which does this exact thing. I will now demonstrate it

# Yes thats it...
asean_daily_diff = asean_t.diff().iloc[1:,]
asean_daily_diff
  

	Brunei_difference	Cambodia_difference	Indonesia_difference	Laos_difference	Malaysia_difference	Burma_difference	Philippines_difference	Singapore_difference	Vietnam_difference
1/23/20	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	2.0
1/24/20	0.0	0.0	0.0	0.0	0.0	0.0	0.0	2.0	0.0
1/25/20	0.0	0.0	0.0	0.0	3.0	0.0	0.0	0.0	0.0
1/26/20	0.0	0.0	0.0	0.0	1.0	0.0	0.0	1.0	0.0
1/27/20	0.0	1.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0
...	...	...	...	...	...	...	...	...	...
7/30/21	3.0	668.0	41168.0	244.0	16840.0	5127.0	8537.0	139.0	7717.0
7/31/21	1.0	658.0	37284.0	380.0	17786.0	4725.0	8141.0	120.0	8938.0
8/1/21	0.0	671.0	30738.0	267.0	17150.0	3480.0	8724.0	121.0	7447.0
8/2/21	1.0	560.0	22404.0	199.0	15764.0	3689.0	8073.0	111.0	0.0
8/3/21	0.0	577.0	33900.0	250.0	17105.0	4713.0	6779.0	102.0	16954.0

559 rows × 9 columns

Plotting Daily Cases for each country

With this new dataframe, we can now make use of the plotting function in the previous section for time-series!

Note the use of f'' to easily specify the different columns :)

fig, ax = plt.subplots(figsize = (10,10))

#Iterate through our countries so we can plot automatically plot them!
for country in ASEAN_countries_list:
    asean_daily_df[f'{country}_difference'].plot(ax=ax, label = country)

ax.set_ylabel('Confirmed Daily Cases')
ax.set_xlabel('Dates')
ax.legend()
plt.show()
  

png

From the graph above, you can tell that recently, Indonesia has the highest confirmed daily cases. Followed by Malaysia and Vietnam.

The interesting thing is that over time, we can tell that Indonesia's daily cases are slowly decreasing.