| country | country_code | 2000 | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | AFG | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1100.0 |
| 1 | Angola | AGO | 19000.0 | 20000.0 | 21000.0 | 22000.0 | 23000.0 | 23000.0 | 24000.0 | 24000.0 | 24000.0 | 24000.0 | 24000.0 | 23000.0 |
| 2 | Argentina | ARG | 7000.0 | 7300.0 | 7400.0 | 7500.0 | 7600.0 | 7700.0 | 7700.0 | 7600.0 | 7500.0 | 7500.0 | NaN | 5600.0 |
| 3 | Armenia | ARM | 120.0 | 120.0 | 120.0 | 120.0 | 120.0 | 120.0 | 120.0 | 120.0 | 300.0 | 300.0 | NaN | 350.0 |
| 4 | Australia | AUS | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1100.0 |
Global HIV/AIDS Analysis, 2000-2011
Introduction
In this workshop, you’ll learn how to clean and analyze HIV and AIDS data across different countries and years. We will start with raw datasets containing estimates of new and total HIV cases per country-year. The goal is to clean and merge these datasets with population data.
Load Libraries
Data Import
First, we import two raw datasets, each containing estimates per country-year. These data were accessed from the Gapminder foundation, at www.gapminder.org/data.
People living with HIV: This dataset contains the total number of people currently infected with HIV (data/hiv_prevalence.csv).
New HIV infections: This dataset provides the total number of people newly infected with HIV during a given year (data/hiv_incidence.csv).
Load the datasets into Python using pd.read_csv() then explore them to understand their structure.
Now, we import data on population estimates per country per year from the World Bank:
| country | indicator | 2000 | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ABW | SP.URB.TOTL | 4.162500e+04 | 4.202500e+04 | 4.219400e+04 | 4.227700e+04 | 4.231700e+04 | 4.239900e+04 | 4.255500e+04 | 4.272900e+04 | 4.290600e+04 | 4.307900e+04 | 4.320600e+04 | 4.349300e+04 | 4.386400e+04 | 4.422800e+04 | 4.458800e+04 | 4.494300e+04 | 4.529700e+04 | 4.564800e+04 |
| 1 | ABW | SP.URB.GROW | 1.664222e+00 | 9.563731e-01 | 4.013352e-01 | 1.965172e-01 | 9.456936e-02 | 1.935880e-01 | 3.672580e-01 | 4.080490e-01 | 4.133830e-01 | 4.023963e-01 | 2.943735e-01 | 6.620631e-01 | 8.493932e-01 | 8.264135e-01 | 8.106692e-01 | 7.930256e-01 | 7.845785e-01 | 7.718989e-01 |
| 2 | ABW | SP.POP.TOTL | 8.910100e+04 | 9.069100e+04 | 9.178100e+04 | 9.270100e+04 | 9.354000e+04 | 9.448300e+04 | 9.560600e+04 | 9.678700e+04 | 9.799600e+04 | 9.921200e+04 | 1.003410e+05 | 1.012880e+05 | 1.021120e+05 | 1.028800e+05 | 1.035940e+05 | 1.042570e+05 | 1.048740e+05 | 1.054390e+05 |
| 3 | ABW | SP.POP.GROW | 2.539234e+00 | 1.768757e+00 | 1.194718e+00 | 9.973955e-01 | 9.009892e-01 | 1.003077e+00 | 1.181566e+00 | 1.227711e+00 | 1.241397e+00 | 1.233231e+00 | 1.131541e+00 | 9.393559e-01 | 8.102306e-01 | 7.493010e-01 | 6.916153e-01 | 6.379592e-01 | 5.900625e-01 | 5.372957e-01 |
| 4 | AFE | SP.URB.TOTL | 1.155517e+08 | 1.197755e+08 | 1.242275e+08 | 1.288340e+08 | 1.336475e+08 | 1.387456e+08 | 1.440267e+08 | 1.492313e+08 | 1.553838e+08 | 1.617762e+08 | 1.684561e+08 | 1.754157e+08 | 1.825587e+08 | 1.901087e+08 | 1.980733e+08 | 2.065563e+08 | 2.150833e+08 | 2.237321e+08 |
Data Pivoting
Now that we have imported the data it’s time to pivot it.
The datasets are currently in wide format, with each year as a separate column, something like this:
| country | 1990 | 1991 | 1992 | 1993 | 1994 |
|---|---|---|---|---|---|
| Afghanistan | 600 | 600 | 1000 | 1100 | 1200 |
| Angola | 33000 | 39000 | 46000 | 54000 | 62000 |
| Argentina | 27000 | 31000 | 35000 | 39000 | 43000 |
| Armenia | 350 | 600 | 600 | 600 | 1000 |
| Australia | 12000 | 12000 | 12000 | 11000 | 11000 |
| Austria | 600 | 600 | 1000 | 1300 | 1500 |
We need to convert this to long format, where each row represents a single year-country combination.
For example:
| country | year | cases |
|---|---|---|
| Afghanistan | 1990 | 600 |
| Afghanistan | 1991 | 600 |
| Afghanistan | 1992 | 1000 |
| Afghanistan | 1993 | 1100 |
| Afghanistan | 1994 | 1200 |
Instructions for Pivoting new_hiv_raw Data
Reshape the new_hiv_raw dataset by converting the columns that represent different years into a new column called year, and place the corresponding values into a new column called new_cases.
Hints:
Use
melt()onnew_hiv_raw.Exclude the
countryandcountry_codecolumns from being melted by specifyingid_vars=['country', 'country_code'].Set the names of the new columns inside
melt()using:var_name='year'to label the column that will store the years.value_name='new_cases'to store the values representing new HIV cases under thenew_casescolumn.
| country | country_code | year | new_cases | |
|---|---|---|---|---|
| 0 | Afghanistan | AFG | 2000 | NaN |
| 142 | Afghanistan | AFG | 2001 | NaN |
| 284 | Afghanistan | AFG | 2002 | NaN |
| 426 | Afghanistan | AFG | 2003 | NaN |
| 568 | Afghanistan | AFG | 2004 | NaN |
Instructions for Pivoting total_hiv_raw Data
In a similar manner, pivot the total_hiv_raw dataset to a long format. Convert the year columns into a new column called year and the counts into total_cases.
| country | country_code | year | total_cases | |
|---|---|---|---|---|
| 0 | Afghanistan | AFG | 2000 | 1900.0 |
| 148 | Afghanistan | AFG | 2001 | 2000.0 |
| 296 | Afghanistan | AFG | 2002 | 2200.0 |
| 444 | Afghanistan | AFG | 2003 | 2400.0 |
| 592 | Afghanistan | AFG | 2004 | 2700.0 |
CHECKPOINT:
After performing these steps, make sure that you have two reshaped datasets:
new_hiv_longwith 4 columns:country,country_code,year, andnew_cases.total_hiv_longwith 4 columns:country,country_code,year, andtotal_cases.
Joining the HIV Datasets
Now that both datasets are in long format, we can join them.
Note that you need to use the country, country_code, and year columns as the keys (on=['country', 'country_code', 'year']).
Use pd.merge() with how='outer' so that we can keep as much information as possible.
| country | country_code | year | new_cases | total_cases | |
|---|---|---|---|---|---|
| 0 | Afghanistan | AFG | 2000 | NaN | 1900.0 |
| 1 | Afghanistan | AFG | 2001 | NaN | 2000.0 |
| 2 | Afghanistan | AFG | 2002 | NaN | 2200.0 |
| 3 | Afghanistan | AFG | 2003 | NaN | 2400.0 |
| 4 | Afghanistan | AFG | 2004 | NaN | 2700.0 |
CHECKPOINT: The combined dataset should have columns for country, country_code, year, total_cases, and new_cases. There may some missing values; that is okay.
Add Population Data
The population dataset contains information on population per year from 2000 to 2017.
| country | indicator | 2000 | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ABW | SP.URB.TOTL | 4.162500e+04 | 4.202500e+04 | 4.219400e+04 | 4.227700e+04 | 4.231700e+04 | 4.239900e+04 | 4.255500e+04 | 4.272900e+04 | 4.290600e+04 | 4.307900e+04 | 4.320600e+04 | 4.349300e+04 | 4.386400e+04 | 4.422800e+04 | 4.458800e+04 | 4.494300e+04 | 4.529700e+04 | 4.564800e+04 |
| 1 | ABW | SP.URB.GROW | 1.664222e+00 | 9.563731e-01 | 4.013352e-01 | 1.965172e-01 | 9.456936e-02 | 1.935880e-01 | 3.672580e-01 | 4.080490e-01 | 4.133830e-01 | 4.023963e-01 | 2.943735e-01 | 6.620631e-01 | 8.493932e-01 | 8.264135e-01 | 8.106692e-01 | 7.930256e-01 | 7.845785e-01 | 7.718989e-01 |
| 2 | ABW | SP.POP.TOTL | 8.910100e+04 | 9.069100e+04 | 9.178100e+04 | 9.270100e+04 | 9.354000e+04 | 9.448300e+04 | 9.560600e+04 | 9.678700e+04 | 9.799600e+04 | 9.921200e+04 | 1.003410e+05 | 1.012880e+05 | 1.021120e+05 | 1.028800e+05 | 1.035940e+05 | 1.042570e+05 | 1.048740e+05 | 1.054390e+05 |
| 3 | ABW | SP.POP.GROW | 2.539234e+00 | 1.768757e+00 | 1.194718e+00 | 9.973955e-01 | 9.009892e-01 | 1.003077e+00 | 1.181566e+00 | 1.227711e+00 | 1.241397e+00 | 1.233231e+00 | 1.131541e+00 | 9.393559e-01 | 8.102306e-01 | 7.493010e-01 | 6.916153e-01 | 6.379592e-01 | 5.900625e-01 | 5.372957e-01 |
| 4 | AFE | SP.URB.TOTL | 1.155517e+08 | 1.197755e+08 | 1.242275e+08 | 1.288340e+08 | 1.336475e+08 | 1.387456e+08 | 1.440267e+08 | 1.492313e+08 | 1.553838e+08 | 1.617762e+08 | 1.684561e+08 | 1.754157e+08 | 1.825587e+08 | 1.901087e+08 | 1.980733e+08 | 2.065563e+08 | 2.150833e+08 | 2.237321e+08 |
Note that we are interested in the total population of each country, which is labelled SP.POP.TOTL in the dataset.
Additionally, note that the country column doesn’t have full country names, but the 3-letter country codes.
This will make it easier to join with our HIV dataset, since country codes are consistent across datasets (unlike full country names).
Filter and Prepare Population Data
Next, complete the following transformations on the population dataset:
Query the data to just the
SP.POP.TOTLindicatorMelt the data to a long format.
Rename the “country” column to “country_code” so that we can join it properly later on.
Drop the
indicatorcolumn.
Store your final result as population_long.
CHECKPOINT:
After performing these steps, make sure that you have a dataset named population_long with 3 columns: country_code, year, and population.
Join Population Data with HIV Data
Use pd.merge() to add the population information to your HIV data combined dataset (the HIV data should be the left or primary dataset). You should use both “country_code” and “year” as keys.
| country | country_code | year | new_cases | total_cases | population | |
|---|---|---|---|---|---|---|
| 0 | Afghanistan | AFG | 2000 | NaN | 1900.0 | 19542982.0 |
| 1 | Afghanistan | AFG | 2001 | NaN | 2000.0 | 19688632.0 |
| 2 | Afghanistan | AFG | 2002 | NaN | 2200.0 | 21000256.0 |
| 3 | Afghanistan | AFG | 2003 | NaN | 2400.0 | 22645130.0 |
| 4 | Afghanistan | AFG | 2004 | NaN | 2700.0 | 23553551.0 |
CHECKPOINT: Verify that hiv_data_combined_pop now contains a population column.
Calculate Incidence Rate and Prevalence Rate
Lastly, calculate the incidence rate and prevalence rate per 100,000 people:
- Incidence Rate = New Cases / Population * 100000
- Prevalence Rate = Total Cases / Population * 100000
| country | country_code | year | new_cases | total_cases | population | incidence_rate | prevalence_rate | |
|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | AFG | 2000 | NaN | 1900.0 | 19542982.0 | NaN | 9.722160 |
| 1 | Afghanistan | AFG | 2001 | NaN | 2000.0 | 19688632.0 | NaN | 10.158146 |
| 2 | Afghanistan | AFG | 2002 | NaN | 2200.0 | 21000256.0 | NaN | 10.476063 |
| 3 | Afghanistan | AFG | 2003 | NaN | 2400.0 | 22645130.0 | NaN | 10.598305 |
| 4 | Afghanistan | AFG | 2004 | NaN | 2700.0 | 23553551.0 | NaN | 11.463240 |
Plot Incidence Rates
Now, you’ll select 3 countries of your choice and plot their incidence rate over time.
- Query the data to narrow down the countries.
- Convert the
yearcolumn to anumerictype withastype(int).
Store the result as hiv_data_mini.
Note that some countries have many years with missing data, especially for new_cases.
country country_code year new_cases total_cases population \
12 Angola AGO 2000 19000.0 120000.0 16394062.0
13 Angola AGO 2001 20000.0 130000.0 16941587.0
14 Angola AGO 2002 21000.0 140000.0 17516139.0
15 Angola AGO 2003 22000.0 150000.0 18124342.0
16 Angola AGO 2004 23000.0 160000.0 18771125.0
17 Angola AGO 2005 23000.0 170000.0 19450959.0
18 Angola AGO 2006 24000.0 180000.0 20162340.0
19 Angola AGO 2007 24000.0 190000.0 20909684.0
20 Angola AGO 2008 24000.0 200000.0 21691522.0
21 Angola AGO 2009 24000.0 210000.0 22507674.0
22 Angola AGO 2010 24000.0 220000.0 23364185.0
23 Angola AGO 2011 23000.0 230000.0 24259111.0
1752 Zambia ZMB 2000 90000.0 840000.0 9891136.0
1753 Zambia ZMB 2001 92000.0 860000.0 10191964.0
1754 Zambia ZMB 2002 94000.0 880000.0 10508294.0
1755 Zambia ZMB 2003 96000.0 900000.0 10837973.0
1756 Zambia ZMB 2004 96000.0 910000.0 11188040.0
1757 Zambia ZMB 2005 93000.0 920000.0 11564870.0
1758 Zambia ZMB 2006 88000.0 920000.0 11971567.0
1759 Zambia ZMB 2007 81000.0 930000.0 12402073.0
1760 Zambia ZMB 2008 72000.0 940000.0 12852966.0
1761 Zambia ZMB 2009 76000.0 950000.0 13318087.0
1762 Zambia ZMB 2010 NaN 960000.0 13792086.0
1763 Zambia ZMB 2011 NaN 970000.0 14265814.0
1764 Zimbabwe ZWE 2000 160000.0 1900000.0 11834676.0
1765 Zimbabwe ZWE 2001 140000.0 1800000.0 11910978.0
1766 Zimbabwe ZWE 2002 130000.0 1700000.0 11984644.0
1767 Zimbabwe ZWE 2003 110000.0 1700000.0 12075828.0
1768 Zimbabwe ZWE 2004 110000.0 1600000.0 12160881.0
1769 Zimbabwe ZWE 2005 110000.0 1500000.0 12224753.0
1770 Zimbabwe ZWE 2006 100000.0 1400000.0 12330490.0
1771 Zimbabwe ZWE 2007 100000.0 1300000.0 12450568.0
1772 Zimbabwe ZWE 2008 100000.0 1300000.0 12550347.0
1773 Zimbabwe ZWE 2009 95000.0 1200000.0 12679810.0
1774 Zimbabwe ZWE 2010 87000.0 1200000.0 12839771.0
1775 Zimbabwe ZWE 2011 74000.0 1200000.0 13025785.0
incidence_rate prevalence_rate
12 115.895621 731.972345
13 118.052695 767.342516
14 119.889435 799.262897
15 121.383717 827.616252
16 122.528618 852.372993
17 118.246098 873.992897
18 119.033803 892.753520
19 114.779353 908.669878
20 110.642305 922.019211
21 106.630299 933.015113
22 102.721323 941.612130
23 94.809740 948.097397
1752 909.905596 8492.452232
1753 902.671948 8438.020385
1754 894.531501 8374.337452
1755 885.774489 8304.135838
1756 858.059142 8133.685614
1757 804.159493 7955.126171
1758 735.075032 7684.875338
1759 653.116620 7498.746379
1760 560.181984 7313.487019
1761 570.652527 7133.156586
1762 NaN 6960.513442
1763 NaN 6799.471800
1764 1351.959276 16054.516406
1765 1175.386270 15112.109182
1766 1084.721415 14184.818506
1767 910.910622 14077.709620
1768 904.539729 13156.941508
1769 899.813681 12270.186563
1770 810.997779 11353.968901
1771 803.176209 10441.290711
1772 796.790718 10358.279337
1773 749.222583 9463.864206
1774 677.582178 9345.961077
1775 568.103957 9212.496598
Now plot incidence rates over time using Plotly Express.
Plot Prevalence Rates
Plot their prevalence rates over time, using a line graph.
Optional Ungraded Challenge
In this challenge, you will:
- Add Regional Information: Use the
country_converterpackage to get regional (continent) information for each country.
| country | country_code | year | new_cases | total_cases | population | incidence_rate | prevalence_rate | continent | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | AFG | 2000 | NaN | 1900.0 | 19542982.0 | NaN | 9.722160 | Asia |
| 1 | Afghanistan | AFG | 2001 | NaN | 2000.0 | 19688632.0 | NaN | 10.158146 | Asia |
| 2 | Afghanistan | AFG | 2002 | NaN | 2200.0 | 21000256.0 | NaN | 10.476063 | Asia |
| 3 | Afghanistan | AFG | 2003 | NaN | 2400.0 | 22645130.0 | NaN | 10.598305 | Asia |
| 4 | Afghanistan | AFG | 2004 | NaN | 2700.0 | 23553551.0 | NaN | 11.463240 | Asia |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1771 | Zimbabwe | ZWE | 2007 | 100000.0 | 1300000.0 | 12450568.0 | 803.176209 | 10441.290711 | Africa |
| 1772 | Zimbabwe | ZWE | 2008 | 100000.0 | 1300000.0 | 12550347.0 | 796.790718 | 10358.279337 | Africa |
| 1773 | Zimbabwe | ZWE | 2009 | 95000.0 | 1200000.0 | 12679810.0 | 749.222583 | 9463.864206 | Africa |
| 1774 | Zimbabwe | ZWE | 2010 | 87000.0 | 1200000.0 | 12839771.0 | 677.582178 | 9345.961077 | Africa |
| 1775 | Zimbabwe | ZWE | 2011 | 74000.0 | 1200000.0 | 13025785.0 | 568.103957 | 9212.496598 | Africa |
1776 rows × 9 columns
- Identify Highest Prevalence Rates per Continent: Find the countries with the highest prevalence rate in 2011 for each continent
continent country prevalence_rate
551 Africa Eswatini 17188.799055
95 America Bahamas 1719.804207
1607 Asia Thailand 713.112654
539 Europe Estonia 745.796982
1283 Oceania Papua New Guinea 358.669168
- Create a Bar Chart: Make a bar chart comparing the 2011 prevalence rates of these countries.
Analysis of HIV and AIDS data across different countries and years.
These data were accessed from the Gapminder foundation, at www.gapminder.org/data.
People living with HIV: This dataset contains the total number of people currently infected with HIV (data/hiv_prevalence.csv).
New HIV infections: This dataset provides the total number of people newly infected with HIV during a given year (data/hiv_incidence.csv).
Two HIV data have been merged with population data
The incidence Rate and prevalence Rate is calculated and added as columns
| country | country_code | year | new_cases | total_cases | population | incidence_rate | prevalence_rate | continent |
|---|---|---|---|---|---|---|---|---|
| Loading ITables v2.3.0 from the internet... (need help?) |
HIV Incidence Rate Trends in Selected Countries
Using the plotly express package, we plotted the HIV incidence rate trends over time in three selected African countries, Angola, Zambia and Zimbabwe.
The plot illustrates the temporal trends in HIV incidence rates (per 100,000 population) from the year 2000 to approximately 2011 across three Sub-Saharan African countries: Angola, Zambia, and Zimbabwe.
Zimbabwe begins the observed period with the highest HIV incidence rate, exceeding 1300 per 100,000 in 2000. However, a notable downward trend is observed throughout the decade, with incidence rates decreasing to under 600 by 2011. This substantial reduction suggests the effectiveness of nationwide prevention strategies, treatment scale-up, and public health interventions during this period.
Zambia also exhibits a declining trend, although it starts at a lower incidence than Zimbabwe. The decrease is more gradual, indicating consistent though less dramatic progress in reducing new HIV infections.
Angola, on the other hand, maintains a relatively low and stable incidence rate, hovering around 100–150 per 100,000. This stability could reflect either consistently lower transmission dynamics or, alternatively, underreporting or under-detection of cases due to limited surveillance capacity or healthcare infrastructure.
HIV prevalence rates over time in the same selected countries
This plot presents the HIV prevalence rates (per 100,000 population) from 2000 to 2011 in three Sub-Saharan African countries: Angola, Zambia, and Zimbabwe. Prevalence reflects the total number of people living with HIV at a given time, incorporating both new infections and existing cases.
Zimbabwe again begins the period with the highest prevalence rate, peaking above 16,000 per 100,000 (~16%) in 2000. A marked decline in prevalence is observed over the subsequent decade, dropping to about 10,000 per 100,000 by 2011. This trend, while positive, also suggests that while new infections decreased (as seen in the incidence plot), many individuals remained infected and survived longer, likely due to the expanded access to antiretroviral therapy (ART).
Zambia shows a similar downward trajectory, though starting from a slightly lower base. The reduction from approximately 12,000 per 100,000 to under 9,000 over a decade indicates parallel gains in HIV management and reduced transmission, though the sustained high prevalence still underscores a heavy disease burden.
Angola exhibits a consistently low prevalence rate throughout the time period, ranging between 1,000 and 1,500 per 100,000. This may reflect a genuinely lower rate of HIV transmission in the country, but it also raises questions about surveillance coverage, diagnostic capacity, and reporting accuracy, especially in the early 2000s.
The country with the highest prevalence rate for each continent in the year 2011
| continent | country | prevalence_rate | |
|---|---|---|---|
| 551 | Africa | Eswatini | 17188.799055 |
| 95 | America | Bahamas | 1719.804207 |
| 1607 | Asia | Thailand | 713.112654 |
| 539 | Europe | Estonia | 745.796982 |
| 1283 | Oceania | Papua New Guinea | 358.669168 |
“Visualization of countries with the highest Prevalence Rate by Continent (2011)”
Public Health Analysis
The declining prevalence in Zambia and Zimbabwe, alongside falling incidence rates (from the previous plot), suggests mature epidemics transitioning into better-controlled public health challenges. However, high prevalence despite declining incidence implies that more individuals are living longer with HIV, a hallmark of successful ART scale-up, but also an indication of the ongoing need for long-term care, support, and monitoring.
Angola’s flat trend in both incidence and prevalence calls for deeper investigation. A true low-prevalence setting is promising, but only if surveillance and data systems are robust. Without that assurance, these numbers may underestimate the real burden of the disease.