Worldwide Analysis

The main goal of this analysis is to provide a visual and updated overview of the worldwide Covid-19 pandemic including:

  • Overall numbers in a nutshell.
  • Interactive worldwide map of active cases.
  • Details and tree map of the top fifteen countries with most active cases.
  • Time trends for top five countries by number of active cases.
  • Comparison of the situation of the top five countries with China and South Korea, where the outbreak peak has passed and the number of active cases is still going down.

Data cleaning and first observations

Import the latest reports from CSSE at Johns Hopkins University. The datasets we will use are:

  1. da_world: Worldwide data (time series).
  2. da_confirmed_w: Number of confirmed cases per country (time series).
  3. da_fatalities_w: Number of deaths per country (time series).
  4. da_recovered_w: Number of recovered cases per country (time series).


The latest data we gather is from "yesterday": the files are updated daily at midnight, so using yesterday's date ensures that the report is available.
We use the date variable "yesterday" to build the URLs dynamically.
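A minimal sketch of how such a URL could be built. The base path below is the location of the CSSE daily reports on GitHub as we understand it and should be treated as an assumption; the names BASE_URL and daily_report_url are hypothetical and not taken from the original notebook.

```python
from datetime import date, timedelta

# Assumed location of the CSSE daily reports; verify against the repository.
BASE_URL = (
    "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/"
    "csse_covid_19_data/csse_covid_19_daily_reports/"
)


def daily_report_url(today: date) -> str:
    """Return the URL of the daily report for the day before `today`."""
    yesterday = today - timedelta(days=1)
    # The daily report files are named MM-DD-YYYY.csv.
    return f"{BASE_URL}{yesterday.strftime('%m-%d-%Y')}.csv"


url = daily_report_url(date.today())
print(url)
```

The resulting string can then be passed to pandas.read_csv to load the report.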

First look at the overall data.

FIPS Admin2 Province_State Country_Region Last_Update Lat Long_ Confirmed Deaths Recovered Active Combined_Key Incident_Rate Case_Fatality_Ratio
0 NaN NaN NaN Afghanistan 2021-09-14 04:21:41 33.93911 67.709953 154094 7169 NaN NaN Afghanistan 395.840141 4.652355
1 NaN NaN NaN Albania 2021-09-14 04:21:41 41.15330 20.168300 157436 2548 NaN NaN Albania 5470.706790 1.618435
2 NaN NaN NaN Algeria 2021-09-14 04:21:41 28.03390 1.659600 200301 5596 NaN NaN Algeria 456.775908 2.793795
3 NaN NaN NaN Andorra 2021-09-14 04:21:41 42.50630 1.521800 15096 130 NaN NaN Andorra 19537.953795 0.861155
4 NaN NaN NaN Angola 2021-09-14 04:21:41 -11.20270 17.873900 50738 1345 NaN NaN Angola 154.377126 2.650873

For our analysis we will use the following columns:

  • Country_Region
  • Last_Update
  • Lat
  • Long_
  • Confirmed
  • Deaths
  • Recovered
  • Active

Let's create a dataframe with these columns.

Country_Region Last_Update Lat Long_ Confirmed Deaths Recovered Active
0 Afghanistan 2021-09-14 04:21:41 33.93911 67.709953 154094 7169 NaN NaN
1 Albania 2021-09-14 04:21:41 41.15330 20.168300 157436 2548 NaN NaN
2 Algeria 2021-09-14 04:21:41 28.03390 1.659600 200301 5596 NaN NaN
3 Andorra 2021-09-14 04:21:41 42.50630 1.521800 15096 130 NaN NaN
4 Angola 2021-09-14 04:21:41 -11.20270 17.873900 50738 1345 NaN NaN

We rename the columns to the values originally used in this notebook as the column names from the sources have been changing over time.
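The selection and renaming steps could look roughly like this. The target names are the ones visible in the outputs of this notebook; the helper select_and_rename and the one-row sample frame are illustrative assumptions.

```python
import pandas as pd

# Columns kept for the analysis, as listed above.
KEPT_COLUMNS = ["Country_Region", "Last_Update", "Lat", "Long_",
                "Confirmed", "Deaths", "Recovered", "Active"]

# Source name -> name used in the rest of the notebook.
RENAMED_COLUMNS = {
    "Country_Region": "Country/Region",
    "Lat": "Latitude",
    "Long_": "Longitude",
}


def select_and_rename(raw: pd.DataFrame) -> pd.DataFrame:
    """Keep only the analysis columns and give them stable names."""
    return raw[KEPT_COLUMNS].rename(columns=RENAMED_COLUMNS)


# One-row sample shaped like the daily report (extra columns are dropped).
raw_sample = pd.DataFrame({
    "FIPS": [None],
    "Country_Region": ["Afghanistan"],
    "Last_Update": ["2021-09-14 04:21:41"],
    "Lat": [33.93911],
    "Long_": [67.709953],
    "Confirmed": [154094],
    "Deaths": [7169],
    "Recovered": [None],
    "Active": [None],
    "Combined_Key": ["Afghanistan"],
})
da_world = select_and_rename(raw_sample)
print(list(da_world.columns))
```

Keeping the mapping in one dictionary means that a change of column names in the source only has to be handled in one place.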

Let's have a look at the nan values and fix them.

Country/Region       0
Last_Update          0
Latitude            88
Longitude           88
Confirmed            0
Deaths               0
Recovered         3987
Active            3987
dtype: int64

After having a look at those rows, the nan values we fix are in the columns Longitude and Latitude (88 rows in the output above).
First we create a list of the countries without geographical coordinates, and then we define a function to add these values manually.

array(['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Angola',
       'Antigua and Barbuda', 'Argentina', 'Armenia', 'Australia',
       'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh',
       'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bhutan',
       'Bolivia', 'Bosnia and Herzegovina', 'Botswana', 'Brazil',
       'Brunei', 'Bulgaria', 'Burkina Faso', 'Burma', 'Burundi',
       'Cabo Verde', 'Cambodia', 'Cameroon', 'Canada',
       'Central African Republic', 'Chad', 'Chile', 'China', 'Colombia',
       'Comoros', 'Congo (Brazzaville)', 'Congo (Kinshasa)', 'Costa Rica',
       "Cote d'Ivoire", 'Croatia', 'Cuba', 'Cyprus', 'Czechia', 'Denmark',
       'Diamond Princess', 'Djibouti', 'Dominica', 'Dominican Republic',
       'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea', 'Eritrea',
       'Estonia', 'Eswatini', 'Ethiopia', 'Fiji', 'Finland', 'France',
       'Gabon', 'Gambia', 'Georgia', 'Germany', 'Ghana', 'Greece',
       'Grenada', 'Guatemala', 'Guinea', 'Guinea-Bissau', 'Guyana',
       'Haiti', 'Holy See', 'Honduras', 'Hungary', 'Iceland', 'India',
       'Indonesia', 'Iran', 'Iraq', 'Ireland', 'Israel', 'Italy',
       'Jamaica', 'Japan', 'Jordan', 'Kazakhstan', 'Kenya', 'Kiribati',
       'Korea, South', 'Kosovo', 'Kuwait', 'Kyrgyzstan', 'Laos', 'Latvia',
       'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Liechtenstein',
       'Lithuania', 'Luxembourg', 'MS Zaandam', 'Madagascar', 'Malawi',
       'Malaysia', 'Maldives', 'Mali', 'Malta', 'Marshall Islands',
       'Mauritania', 'Mauritius', 'Mexico', 'Micronesia', 'Moldova',
       'Monaco', 'Mongolia', 'Montenegro', 'Morocco', 'Mozambique',
       'Namibia', 'Nepal', 'Netherlands', 'New Zealand', 'Nicaragua',
       'Niger', 'Nigeria', 'North Macedonia', 'Norway', 'Oman',
       'Pakistan', 'Palau', 'Panama', 'Papua New Guinea', 'Paraguay',
       'Peru', 'Philippines', 'Poland', 'Portugal', 'Qatar', 'Romania',
       'Russia', 'Rwanda', 'Saint Kitts and Nevis', 'Saint Lucia',
       'Saint Vincent and the Grenadines', 'Samoa', 'San Marino',
       'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia',
       'Seychelles', 'Sierra Leone', 'Singapore', 'Slovakia', 'Slovenia',
       'Solomon Islands', 'Somalia', 'South Africa', 'South Sudan',
       'Spain', 'Sri Lanka', 'Sudan', 'Summer Olympics 2020', 'Suriname',
       'Sweden', 'Switzerland', 'Syria', 'Taiwan*', 'Tajikistan',
       'Tanzania', 'Thailand', 'Timor-Leste', 'Togo',
       'Trinidad and Tobago', 'Tunisia', 'Turkey', 'US', 'Uganda',
       'Ukraine', 'United Arab Emirates', 'United Kingdom', 'Uruguay',
       'Uzbekistan', 'Vanuatu', 'Venezuela', 'Vietnam',
       'West Bank and Gaza', 'Yemen', 'Zambia', 'Zimbabwe'], dtype=object)
Country/Region       0
Last_Update          0
Latitude             4
Longitude            4
Confirmed            0
Deaths               0
Recovered         3987
Active            3987
dtype: int64
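The two steps described above, listing the countries without coordinates and adding the values by hand, might be sketched as follows. The helper names and the sample frame are hypothetical; the US coordinates are the ones shown later in this notebook.

```python
import pandas as pd


def countries_without_coordinates(df: pd.DataFrame) -> list:
    """Return the sorted country names that lack a latitude or longitude."""
    missing = df["Latitude"].isna() | df["Longitude"].isna()
    return sorted(df.loc[missing, "Country/Region"].unique())


def fill_coordinates(df: pd.DataFrame, lookup: dict) -> pd.DataFrame:
    """Fill missing coordinates from a {country: (lat, lon)} lookup table."""
    out = df.copy()
    for country, (lat, lon) in lookup.items():
        rows = out["Country/Region"] == country
        # Only touch cells that are actually empty.
        out.loc[rows & out["Latitude"].isna(), "Latitude"] = lat
        out.loc[rows & out["Longitude"].isna(), "Longitude"] = lon
    return out


sample = pd.DataFrame({
    "Country/Region": ["US", "Angola"],
    "Latitude": [float("nan"), -11.2027],
    "Longitude": [float("nan"), 17.8739],
})
print(countries_without_coordinates(sample))

filled = fill_coordinates(sample, {"US": (38.9072, -77.0369)})
print(filled.isna().sum())
```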

We clean up the data and consolidate the names of countries with several variations or with a comma using the function format_country.
From the first look at the datasets, we found that South Korea appears as "Korea, South" in the original dataset and that "Congo" is used for both the "Republic of the Congo" and the "Democratic Republic of the Congo".
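The body of format_country is not shown in this export, so the following is only a guess at its shape: a lookup table of known variations with a fallback to the trimmed original name. The alias table is an assumption based on the names that appear in the dataset.

```python
# Hypothetical alias table; extend it as new variations show up in the source.
COUNTRY_ALIASES = {
    "Korea, South": "South Korea",
    "Congo (Brazzaville)": "Republic of the Congo",
    "Congo (Kinshasa)": "Democratic Republic of the Congo",
}


def format_country(name: str) -> str:
    """Return a consolidated country name, or the trimmed input if unknown."""
    cleaned = name.strip()
    return COUNTRY_ALIASES.get(cleaned, cleaned)


print(format_country("Korea, South"))
print(format_country(" Italy "))
```

With pandas, such a function can be applied to the whole column, e.g. with Series.map.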

We add a column for the active positive cases.
Our main interest is to see how the numbers of active cases are changing.
Note: This column wasn't available in the original dataset when this notebook started to take shape.
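The formula for the active column is not spelled out here. A common definition, and the one assumed in this sketch, is confirmed cases minus deaths minus recovered cases; the helper add_active and the sample values are illustrative.

```python
import pandas as pd


def add_active(df: pd.DataFrame) -> pd.DataFrame:
    """Add an 'Active' column, assuming Active = Confirmed - Deaths - Recovered."""
    out = df.copy()
    # A nan in any of the three inputs propagates to the result.
    out["Active"] = out["Confirmed"] - out["Deaths"] - out["Recovered"]
    return out


# Illustrative values only.
sample = pd.DataFrame({
    "Country/Region": ["A", "B"],
    "Confirmed": [100, 2000],
    "Deaths": [5, 100],
    "Recovered": [40, 900],
})
with_active = add_active(sample)
print(with_active)
```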

Global numbers in a nutshell

Confirmed    225270139.0
Active               0.0
Recovered            0.0
Deaths         4639619.0

We group the dataset by countries to have a total value per nation and list the top five countries with active cases.

Country/Region Confirmed Active Recovered Deaths
Afghanistan 154094 0.000000 0.000000 7169
Saint Vincent and the Grenadines 2487 0.000000 0.000000 12
Nicaragua 12350 0.000000 0.000000 201
Niger 5929 0.000000 0.000000 200
Nigeria 199538 0.000000 0.000000 2619
Country/Region Confirmed Active Recovered Deaths Latitude Longitude
0 Afghanistan 154094 0.0 0.0 7169 NaN NaN
1 Saint Vincent and the Grenadines 2487 0.0 0.0 12 NaN NaN
2 Nicaragua 12350 0.0 0.0 201 NaN NaN
3 Niger 5929 0.0 0.0 200 NaN NaN
4 Nigeria 199538 0.0 0.0 2619 NaN NaN
Country/Region Confirmed Active Recovered Deaths Latitude Longitude
58 US 41221266 0.0 0.0 662106 38.9072 -77.0369
Country/Region    0
Confirmed         0
Active            0
Recovered         0
Deaths            0
Latitude          0
Longitude         0
dtype: int64
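The grouping and ranking described in this section can be sketched with groupby and nlargest. The helper names and the sample values are illustrative and not taken from the original notebook.

```python
import pandas as pd

VALUE_COLUMNS = ["Confirmed", "Active", "Recovered", "Deaths"]


def totals_per_country(df: pd.DataFrame) -> pd.DataFrame:
    """Sum the case counts of all entries that belong to the same country."""
    return df.groupby("Country/Region", as_index=False)[VALUE_COLUMNS].sum()


def top_by_active(totals: pd.DataFrame, n: int = 5) -> pd.DataFrame:
    """Return the n countries with the most active cases."""
    return totals.nlargest(n, "Active").reset_index(drop=True)


# Illustrative values only; country "A" has two entries (e.g. two states).
sample = pd.DataFrame({
    "Country/Region": ["A", "A", "B", "C"],
    "Confirmed": [100, 50, 80, 10],
    "Active": [40, 20, 30, 5],
    "Recovered": [55, 28, 45, 5],
    "Deaths": [5, 2, 5, 0],
})
totals = totals_per_country(sample)
top_two = top_by_active(totals, n=2)
print(top_two)
```

Note that when all values in the Active column are equal, as in the output above, the ranking carries no information and simply reflects the row order.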

Looking at the assigned coordinates, countries with several entries in the dataset (e.g. autonomous territories within a country, or countries with one entry per state) may have been assigned a point far from the capital city; for Denmark, for example, the assigned coordinates were those of Greenland. We will adjust these points with the function add_coordinates.

We create a map where each country with active cases is labeled as follows:
Blue circle: less than 1000 reported active cases.
Orange circle: more than 1000 and less than 10000 reported active cases.
Red circle: more than 10000 reported active cases.
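The colour rule above can be captured in a small helper, independent of the mapping library. The text does not say how exactly 1000 or 10000 active cases are treated, so the boundary handling below is an assumption, as is the name marker_color.

```python
def marker_color(active_cases: float) -> str:
    """Map a number of active cases to the circle colour used on the map."""
    if active_cases < 1000:
        return "blue"
    if active_cases < 10000:
        # Assumption: exactly 1000 cases falls in the orange band.
        return "orange"
    # Assumption: exactly 10000 cases falls in the red band.
    return "red"


for cases in (250, 5000, 20000):
    print(cases, marker_color(cases))
```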

<ipython-input-33-a4fbbc01c585>:20: RuntimeWarning:

invalid value encountered in double_scalars

Summary data per country is shown if you click on each country's circle.

Top 15 countries by number of active cases

We list the fifteen countries with most active cases.
We also calculate the death rate as the number of deaths divided by the number of confirmed positive cases and include it in the column "Death rate [%]".
The death rate can be interpreted from several perspectives and is a controversial value, as each country counts deaths due to Covid-19 differently. For example, Italy runs Covid-19 post-mortem tests, while Germany counts as Covid-19 deaths only those of people who tested positive while alive.
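As a worked example of this calculation, the sketch below reproduces the 4.65% shown for Afghanistan in the table that follows. The helper add_death_rate is hypothetical; masking rows with zero confirmed cases is our own precaution against division by zero, not something the notebook is known to do.

```python
import pandas as pd


def add_death_rate(df: pd.DataFrame) -> pd.DataFrame:
    """Add 'Death rate [%]' = deaths / confirmed * 100, rounded to 2 decimals."""
    out = df.copy()
    # Rows with zero confirmed cases become nan instead of dividing by zero.
    confirmed = out["Confirmed"].where(out["Confirmed"] > 0)
    out["Death rate [%]"] = (out["Deaths"] / confirmed * 100).round(2)
    return out


sample = pd.DataFrame({
    "Country/Region": ["Afghanistan", "Empty"],
    "Confirmed": [154094, 0],
    "Deaths": [7169, 0],
})
rates = add_death_rate(sample)
print(rates)
```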

Country/Region Confirmed Active Recovered Deaths Latitude Longitude Death rate [%]
0 Afghanistan 154094 0.0 0.0 7169 33.939110 67.709953 4.65
1 Saint Vincent and the Grenadines 2487 0.0 0.0 12 12.984300 -61.287200 0.48
2 Nicaragua 12350 0.0 0.0 201 12.865416 -85.207229 1.63
3 Niger 5929 0.0 0.0 200 17.607789 8.081666 3.37
4 Nigeria 199538 0.0 0.0 2619 9.082000 8.675300 1.31
5 North Macedonia 183705 0.0 0.0 6277 41.608600 21.745300 3.42
6 Norway 177262 0.0 0.0 827 60.472000 8.468900 0.47
7 Oman 303163 0.0 0.0 4089 21.512583 55.923255 1.35
8 Pakistan 1210082 0.0 0.0 26865 34.027401 73.947253 2.22
9 Palau 2 0.0 0.0 0 7.515000 134.582500 0.00
10 Panama 462224 0.0 0.0 7137 8.538000 -80.782100 1.54
11 Papua New Guinea 18412 0.0 0.0 196 -6.314993 143.955550 1.06
12 Paraguay 459340 0.0 0.0 16109 -23.442500 -58.443800 3.51
13 Peru 2161358 0.0 0.0 198799 -9.190000 -75.015200 9.20
14 Philippines 2248071 0.0 0.0 35307 12.879721 121.774017 1.57


As of 28.04.20, Belgium has the highest death rate. Note that Belgium, unlike the UK, Spain or Italy, includes in its Covid-19 death toll fatalities outside hospitals as well as people suspected of having died of Covid-19.

As of 13.03.20, most of the Covid-19-positive victims in Italy were 70 years old or older 1.
According to a study by the Leverhulme Centre for Demographic Science at the University of Oxford 1 and a publication in FAZ 2, some of the main reasons for Italy's high death rate are:

  • Italy has the world's second-oldest population after Japan, with 23% of the population aged 65 years and older compared to 13.2% younger than 16 years old.
  • As of 26.03, the average age of cases in Italy is 63 years.
  • Northern Italy, the area with the most cases, is the industrial heart of the country, with high pollution levels and many inhabitants suffering from respiratory problems.
  • Family is very important in Italy and contact between generations is close: even when several generations do not live under the same roof, they live close by and see each other frequently.
  • Hugs and kisses on the cheek are common when greeting people in Italy.
  • The initial lockdown measures in the northern provinces had a boomerang effect: they caused an exodus, and many students from the North with family in the South travelled home, contributing to the spread of the virus.


In contrast, Germany has a low death rate among the top-10 countries by active cases, even though it was in fourth place by active cases as of 05.04. Possible factors that may influence this number are:

  • As of 01.04, Germany is the country that has carried out the most Covid-19 tests in the world in order to identify positive cases, trace them and contain the spread. Italy is second and South Korea third. 3
  • Over 70% of the cases in Germany are in the age range between 15 and 59 years 4.
  • As of 26.03, the average age of cases is 45 years.
  • The elderly generation in Germany tends to be independent and not in as close contact with younger generations as in Italy or Spain.
  • The testing and tracing strategy carried out since the early days of the outbreak and the social structure in Germany might be the main reasons why the spread hasn't hit the elderly population as badly as in Italy, keeping the death rate low.
  • Germany does not run Covid-19 post-mortem tests.

Summary data per country is shown if you hover over each country's block.

We check the length of the datasets for worldwide confirmed, recovered and fatalities and group them by country to avoid several entries per country.

Length of worldwide confirmed cases table: 279
Length of worldwide fatalities table: 279
Length of worldwide recovered table: 264
Length of unique country values for worldwide confirmed cases: 195
Length of unique country values for worldwide fatalities: 195
Length of unique country values for worldwide recovered: 195
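A sketch of the grouping step, assuming the time series tables have the identifier columns Province/State, Country/Region, Lat and Long followed by one column per date. The helper group_by_country and the sample values are illustrative.

```python
import pandas as pd

# Assumed identifier columns of the time series tables.
ID_COLUMNS = ["Province/State", "Country/Region", "Lat", "Long"]


def group_by_country(ts: pd.DataFrame) -> pd.DataFrame:
    """Collapse province-level rows into one row of daily totals per country."""
    date_columns = list(ts.columns.drop(ID_COLUMNS))
    return ts.groupby("Country/Region")[date_columns].sum()


# Illustrative values only; China is reported per province.
sample = pd.DataFrame({
    "Province/State": ["Hubei", "Beijing", None],
    "Country/Region": ["China", "China", "Italy"],
    "Lat": [30.97, 40.18, 41.87],
    "Long": [112.27, 116.41, 12.57],
    "1/22/20": [10, 2, 0],
    "1/23/20": [15, 3, 1],
})
grouped = group_by_country(sample)
print(len(sample), len(grouped))
```

After grouping, the table length equals the number of unique countries, which is what the counts above compare.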

China and South Korea are added to the list for comparison, as they were the first two countries to slow down and decrease the number of active cases after an outbreak.

Index(['Date', 'Afghanistan confirmed',
       'Saint Vincent and the Grenadines confirmed', 'Nicaragua confirmed',
       'Niger confirmed', 'Nigeria confirmed', 'China confirmed',
       'South Korea confirmed', 'Afghanistan fatalities',
       'Saint Vincent and the Grenadines fatalities', 'Nicaragua fatalities',
       'Niger fatalities', 'Nigeria fatalities', 'China fatalities',
       'South Korea fatalities', 'Afghanistan recovered',
       'Saint Vincent and the Grenadines recovered', 'Nicaragua recovered',
       'Niger recovered', 'Nigeria recovered', 'China recovered',
       'South Korea recovered'],
      dtype='object', name='Country/Region')
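Given a table with columns named as above ("<country> confirmed", "<country> fatalities", "<country> recovered"), the curve of active cases for one country could be derived as follows. This assumes the same definition of active cases as confirmed minus fatalities minus recovered; the helper active_series and the sample values are illustrative.

```python
import pandas as pd


def active_series(trends: pd.DataFrame, country: str) -> pd.Series:
    """Return the daily active cases of one country from the wide trends table."""
    active = (
        trends[f"{country} confirmed"]
        - trends[f"{country} fatalities"]
        - trends[f"{country} recovered"]
    )
    return active.rename(f"{country} active")


# Illustrative values only.
trends = pd.DataFrame({
    "Date": ["1/22/20", "1/23/20"],
    "China confirmed": [100, 150],
    "China fatalities": [2, 3],
    "China recovered": [10, 30],
})
china_active = active_series(trends, "China")
print(china_active)
```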

On 26.03, the US surpassed Italy in number of active cases and China in number of total positives.
The US curve of active cases has grown dramatically since mid-March.
Italy's cases started to grow strongly around the carnival festivities in the third week of February, whereas Germany's and Spain's curves of active cases started to rise at a fast pace around two weeks later. In Germany, many of the initial cases were connected to people who went on a winter holiday in northern Italy in the third week of February.
As of 05.04, the growth of the active-case curves in the top European countries is showing signs of slowing down.
Question: Why did the curve in the US gain momentum in such a short timeframe in comparison to Italy, Germany and Spain? Might it be related to the lack of early contingency measures and the lack of initial testing?
As in many of the countries with the most active cases, the densest and most international cities are the ones worst hit by the outbreak. In the US, the biggest hub of cases is located in NYC.

We compare the curves above with the one from China, the country with the first reported outbreak. It is worth mentioning that the official data provided by China has been questioned by the international community.
We also added the graph for South Korea, where the active cases have been dropping since mid-March.
In both countries the peak of active cases appears in the graphs around a month after the curve started to increase at a fast pace.
Both countries took strict measures to contain the spread of the virus, including lockdowns of hotspots, social distancing, self-isolation and the closing of public places and schools. In addition, South Korea ran a massive testing campaign to identify and trace cases in the earlier stages and limit the spread.
As of 15.06, the first wave of Covid-19 cases has peaked in many Asian and European countries, and restrictions in these countries have started to be eased. However, the risk of a second wave remains, as does the risk of out-of-control outbreaks developing in African countries, as is now happening in Brazil.

References

1 Jennifer Beam Dowd, Valentina Rotondi, Liliana Andriano, David M. Brazel, Per Block, Xuejie Ding, Yan Liu, Melinda C. Mills.
Demographic science aids in understanding the spread and fatality rates of COVID-19. March 15, 2020

2 Andreas Rossmann. Warum ist es in Italien so schlimm? Frankfurter Allgemeine Zeitung. March 23, 2020.

3 Frauke Suhr. So oft wird auf COVID-19 getestet. May 4, 2020.

4 Corona-Infektionen (COVID-19) in Deutschland nach Altersgruppen. May 17, 2020.