Global HIV/AIDS Analysis, 2000-2011

Author

Olugbenga Asaolu

Introduction

In this workshop, you’ll learn how to clean and analyze HIV and AIDS data across different countries and years. We will start with raw datasets containing estimates of new and total HIV cases per country-year. The goal is to clean and merge these datasets with population data.

Load Libraries

Data Import

First, we import two raw datasets, each containing estimates per country-year. These data were accessed from the Gapminder foundation, at www.gapminder.org/data.

People living with HIV: This dataset contains the total number of people currently infected with HIV (data/hiv_prevalence.csv).
New HIV infections: This dataset provides the total number of people newly infected with HIV during a given year (data/hiv_incidence.csv).

Load the datasets into Python using pd.read_csv() then explore them to understand their structure.

	country	country_code	2000	2001	2002	2003	2004	2005	2006	2007	2008	2009	2010	2011
0	Afghanistan	AFG	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1100.0
1	Angola	AGO	19000.0	20000.0	21000.0	22000.0	23000.0	23000.0	24000.0	24000.0	24000.0	24000.0	24000.0	23000.0
2	Argentina	ARG	7000.0	7300.0	7400.0	7500.0	7600.0	7700.0	7700.0	7600.0	7500.0	7500.0	NaN	5600.0
3	Armenia	ARM	120.0	120.0	120.0	120.0	120.0	120.0	120.0	120.0	300.0	300.0	NaN	350.0
4	Australia	AUS	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1100.0

Now, we import data on population estimates per country per year from the World Bank:

	country	indicator	2000	2001	2002	2003	2004	2005	2006	2007	2008	2009	2010	2011	2012	2013	2014	2015	2016	2017
0	ABW	SP.URB.TOTL	4.162500e+04	4.202500e+04	4.219400e+04	4.227700e+04	4.231700e+04	4.239900e+04	4.255500e+04	4.272900e+04	4.290600e+04	4.307900e+04	4.320600e+04	4.349300e+04	4.386400e+04	4.422800e+04	4.458800e+04	4.494300e+04	4.529700e+04	4.564800e+04
1	ABW	SP.URB.GROW	1.664222e+00	9.563731e-01	4.013352e-01	1.965172e-01	9.456936e-02	1.935880e-01	3.672580e-01	4.080490e-01	4.133830e-01	4.023963e-01	2.943735e-01	6.620631e-01	8.493932e-01	8.264135e-01	8.106692e-01	7.930256e-01	7.845785e-01	7.718989e-01
2	ABW	SP.POP.TOTL	8.910100e+04	9.069100e+04	9.178100e+04	9.270100e+04	9.354000e+04	9.448300e+04	9.560600e+04	9.678700e+04	9.799600e+04	9.921200e+04	1.003410e+05	1.012880e+05	1.021120e+05	1.028800e+05	1.035940e+05	1.042570e+05	1.048740e+05	1.054390e+05
3	ABW	SP.POP.GROW	2.539234e+00	1.768757e+00	1.194718e+00	9.973955e-01	9.009892e-01	1.003077e+00	1.181566e+00	1.227711e+00	1.241397e+00	1.233231e+00	1.131541e+00	9.393559e-01	8.102306e-01	7.493010e-01	6.916153e-01	6.379592e-01	5.900625e-01	5.372957e-01
4	AFE	SP.URB.TOTL	1.155517e+08	1.197755e+08	1.242275e+08	1.288340e+08	1.336475e+08	1.387456e+08	1.440267e+08	1.492313e+08	1.553838e+08	1.617762e+08	1.684561e+08	1.754157e+08	1.825587e+08	1.901087e+08	1.980733e+08	2.065563e+08	2.150833e+08	2.237321e+08

Data Pivoting

Now that we have imported the data it’s time to pivot it.

The datasets are currently in wide format, with each year as a separate column, something like this:

country	1990	1991	1992	1993	1994
Afghanistan	600	600	1000	1100	1200
Angola	33000	39000	46000	54000	62000
Argentina	27000	31000	35000	39000	43000
Armenia	350	600	600	600	1000
Australia	12000	12000	12000	11000	11000
Austria	600	600	1000	1300	1500

We need to convert this to long format, where each row represents a single year-country combination.

For example:

country	year	cases
Afghanistan	1990	600
Afghanistan	1991	600
Afghanistan	1992	1000
Afghanistan	1993	1100
Afghanistan	1994	1200

Instructions for Pivoting `new_hiv_raw` Data

Reshape the new_hiv_raw dataset by converting the columns that represent different years into a new column called year, and place the corresponding values into a new column called new_cases.

Hints:

Use melt() on new_hiv_raw.
Exclude the country and country_code columns from being melted by specifying id_vars=['country', 'country_code'].
Set the names of the new columns inside melt() using:
- var_name='year' to label the column that will store the years.
- value_name='new_cases' to store the values representing new HIV cases under the new_cases column.

	country	country_code	year	new_cases
0	Afghanistan	AFG	2000	NaN
142	Afghanistan	AFG	2001	NaN
284	Afghanistan	AFG	2002	NaN
426	Afghanistan	AFG	2003	NaN
568	Afghanistan	AFG	2004	NaN

Instructions for Pivoting `total_hiv_raw` Data

In a similar manner, pivot the total_hiv_raw dataset to a long format. Convert the year columns into a new column called year and the counts into total_cases.

	country	country_code	year	total_cases
0	Afghanistan	AFG	2000	1900.0
148	Afghanistan	AFG	2001	2000.0
296	Afghanistan	AFG	2002	2200.0
444	Afghanistan	AFG	2003	2400.0
592	Afghanistan	AFG	2004	2700.0

CHECKPOINT:

After performing these steps, make sure that you have two reshaped datasets:

new_hiv_long with 4 columns: country, country_code, year, and new_cases.
total_hiv_long with 4 columns: country, country_code, year, and total_cases.

Joining the HIV Datasets

Now that both datasets are in long format, we can join them.

Note that you need to use the country, country_code, and year columns as the keys (on=['country', 'country_code', 'year']).

Use pd.merge() with how='outer' so that we can keep as much information as possible.

	country	country_code	year	new_cases	total_cases
0	Afghanistan	AFG	2000	NaN	1900.0
1	Afghanistan	AFG	2001	NaN	2000.0
2	Afghanistan	AFG	2002	NaN	2200.0
3	Afghanistan	AFG	2003	NaN	2400.0
4	Afghanistan	AFG	2004	NaN	2700.0

CHECKPOINT: The combined dataset should have columns for country, country_code, year, total_cases, and new_cases. There may some missing values; that is okay.

Add Population Data

The population dataset contains information on population per year from 2000 to 2017.

	country	indicator	2000	2001	2002	2003	2004	2005	2006	2007	2008	2009	2010	2011	2012	2013	2014	2015	2016	2017
0	ABW	SP.URB.TOTL	4.162500e+04	4.202500e+04	4.219400e+04	4.227700e+04	4.231700e+04	4.239900e+04	4.255500e+04	4.272900e+04	4.290600e+04	4.307900e+04	4.320600e+04	4.349300e+04	4.386400e+04	4.422800e+04	4.458800e+04	4.494300e+04	4.529700e+04	4.564800e+04
1	ABW	SP.URB.GROW	1.664222e+00	9.563731e-01	4.013352e-01	1.965172e-01	9.456936e-02	1.935880e-01	3.672580e-01	4.080490e-01	4.133830e-01	4.023963e-01	2.943735e-01	6.620631e-01	8.493932e-01	8.264135e-01	8.106692e-01	7.930256e-01	7.845785e-01	7.718989e-01
2	ABW	SP.POP.TOTL	8.910100e+04	9.069100e+04	9.178100e+04	9.270100e+04	9.354000e+04	9.448300e+04	9.560600e+04	9.678700e+04	9.799600e+04	9.921200e+04	1.003410e+05	1.012880e+05	1.021120e+05	1.028800e+05	1.035940e+05	1.042570e+05	1.048740e+05	1.054390e+05
3	ABW	SP.POP.GROW	2.539234e+00	1.768757e+00	1.194718e+00	9.973955e-01	9.009892e-01	1.003077e+00	1.181566e+00	1.227711e+00	1.241397e+00	1.233231e+00	1.131541e+00	9.393559e-01	8.102306e-01	7.493010e-01	6.916153e-01	6.379592e-01	5.900625e-01	5.372957e-01
4	AFE	SP.URB.TOTL	1.155517e+08	1.197755e+08	1.242275e+08	1.288340e+08	1.336475e+08	1.387456e+08	1.440267e+08	1.492313e+08	1.553838e+08	1.617762e+08	1.684561e+08	1.754157e+08	1.825587e+08	1.901087e+08	1.980733e+08	2.065563e+08	2.150833e+08	2.237321e+08

Note that we are interested in the total population of each country, which is labelled SP.POP.TOTL in the dataset.

Additionally, note that the country column doesn’t have full country names, but the 3-letter country codes.

This will make it easier to join with our HIV dataset, since country codes are consistent across datasets (unlike full country names).

Filter and Prepare Population Data

Next, complete the following transformations on the population dataset:

Query the data to just the SP.POP.TOTL indicator
Melt the data to a long format.
Rename the “country” column to “country_code” so that we can join it properly later on.
Drop the indicator column.

Store your final result as population_long.

CHECKPOINT:

After performing these steps, make sure that you have a dataset named population_long with 3 columns: country_code, year, and population.

Join Population Data with HIV Data

Use pd.merge() to add the population information to your HIV data combined dataset (the HIV data should be the left or primary dataset). You should use both “country_code” and “year” as keys.

	country	country_code	year	new_cases	total_cases	population
0	Afghanistan	AFG	2000	NaN	1900.0	19542982.0
1	Afghanistan	AFG	2001	NaN	2000.0	19688632.0
2	Afghanistan	AFG	2002	NaN	2200.0	21000256.0
3	Afghanistan	AFG	2003	NaN	2400.0	22645130.0
4	Afghanistan	AFG	2004	NaN	2700.0	23553551.0

CHECKPOINT: Verify that hiv_data_combined_pop now contains a population column.

Calculate Incidence Rate and Prevalence Rate

Lastly, calculate the incidence rate and prevalence rate per 100,000 people:

Incidence Rate = New Cases / Population * 100000
Prevalence Rate = Total Cases / Population * 100000

	country	country_code	year	new_cases	total_cases	population	incidence_rate	prevalence_rate
0	Afghanistan	AFG	2000	NaN	1900.0	19542982.0	NaN	9.722160
1	Afghanistan	AFG	2001	NaN	2000.0	19688632.0	NaN	10.158146
2	Afghanistan	AFG	2002	NaN	2200.0	21000256.0	NaN	10.476063
3	Afghanistan	AFG	2003	NaN	2400.0	22645130.0	NaN	10.598305
4	Afghanistan	AFG	2004	NaN	2700.0	23553551.0	NaN	11.463240

Plot Incidence Rates

Now, you’ll select 3 countries of your choice and plot their incidence rate over time.

Query the data to narrow down the countries.
Convert the year column to a numeric type with astype(int).

Store the result as hiv_data_mini.

Note that some countries have many years with missing data, especially for new_cases.

       country country_code  year  new_cases  total_cases  population  \
12      Angola          AGO  2000    19000.0     120000.0  16394062.0   
13      Angola          AGO  2001    20000.0     130000.0  16941587.0   
14      Angola          AGO  2002    21000.0     140000.0  17516139.0   
15      Angola          AGO  2003    22000.0     150000.0  18124342.0   
16      Angola          AGO  2004    23000.0     160000.0  18771125.0   
17      Angola          AGO  2005    23000.0     170000.0  19450959.0   
18      Angola          AGO  2006    24000.0     180000.0  20162340.0   
19      Angola          AGO  2007    24000.0     190000.0  20909684.0   
20      Angola          AGO  2008    24000.0     200000.0  21691522.0   
21      Angola          AGO  2009    24000.0     210000.0  22507674.0   
22      Angola          AGO  2010    24000.0     220000.0  23364185.0   
23      Angola          AGO  2011    23000.0     230000.0  24259111.0   
1752    Zambia          ZMB  2000    90000.0     840000.0   9891136.0   
1753    Zambia          ZMB  2001    92000.0     860000.0  10191964.0   
1754    Zambia          ZMB  2002    94000.0     880000.0  10508294.0   
1755    Zambia          ZMB  2003    96000.0     900000.0  10837973.0   
1756    Zambia          ZMB  2004    96000.0     910000.0  11188040.0   
1757    Zambia          ZMB  2005    93000.0     920000.0  11564870.0   
1758    Zambia          ZMB  2006    88000.0     920000.0  11971567.0   
1759    Zambia          ZMB  2007    81000.0     930000.0  12402073.0   
1760    Zambia          ZMB  2008    72000.0     940000.0  12852966.0   
1761    Zambia          ZMB  2009    76000.0     950000.0  13318087.0   
1762    Zambia          ZMB  2010        NaN     960000.0  13792086.0   
1763    Zambia          ZMB  2011        NaN     970000.0  14265814.0   
1764  Zimbabwe          ZWE  2000   160000.0    1900000.0  11834676.0   
1765  Zimbabwe          ZWE  2001   140000.0    1800000.0  11910978.0   
1766  Zimbabwe          ZWE  2002   130000.0    1700000.0  11984644.0   
1767  Zimbabwe          ZWE  2003   110000.0    1700000.0  12075828.0   
1768  Zimbabwe          ZWE  2004   110000.0    1600000.0  12160881.0   
1769  Zimbabwe          ZWE  2005   110000.0    1500000.0  12224753.0   
1770  Zimbabwe          ZWE  2006   100000.0    1400000.0  12330490.0   
1771  Zimbabwe          ZWE  2007   100000.0    1300000.0  12450568.0   
1772  Zimbabwe          ZWE  2008   100000.0    1300000.0  12550347.0   
1773  Zimbabwe          ZWE  2009    95000.0    1200000.0  12679810.0   
1774  Zimbabwe          ZWE  2010    87000.0    1200000.0  12839771.0   
1775  Zimbabwe          ZWE  2011    74000.0    1200000.0  13025785.0   

      incidence_rate  prevalence_rate  
12        115.895621       731.972345  
13        118.052695       767.342516  
14        119.889435       799.262897  
15        121.383717       827.616252  
16        122.528618       852.372993  
17        118.246098       873.992897  
18        119.033803       892.753520  
19        114.779353       908.669878  
20        110.642305       922.019211  
21        106.630299       933.015113  
22        102.721323       941.612130  
23         94.809740       948.097397  
1752      909.905596      8492.452232  
1753      902.671948      8438.020385  
1754      894.531501      8374.337452  
1755      885.774489      8304.135838  
1756      858.059142      8133.685614  
1757      804.159493      7955.126171  
1758      735.075032      7684.875338  
1759      653.116620      7498.746379  
1760      560.181984      7313.487019  
1761      570.652527      7133.156586  
1762             NaN      6960.513442  
1763             NaN      6799.471800  
1764     1351.959276     16054.516406  
1765     1175.386270     15112.109182  
1766     1084.721415     14184.818506  
1767      910.910622     14077.709620  
1768      904.539729     13156.941508  
1769      899.813681     12270.186563  
1770      810.997779     11353.968901  
1771      803.176209     10441.290711  
1772      796.790718     10358.279337  
1773      749.222583      9463.864206  
1774      677.582178      9345.961077  
1775      568.103957      9212.496598

Now plot incidence rates over time using Plotly Express.

Plot Prevalence Rates

Plot their prevalence rates over time, using a line graph.

Optional Ungraded Challenge

In this challenge, you will:

Add Regional Information: Use the country_converter package to get regional (continent) information for each country.

	country	country_code	year	new_cases	total_cases	population	incidence_rate	prevalence_rate	continent
0	Afghanistan	AFG	2000	NaN	1900.0	19542982.0	NaN	9.722160	Asia
1	Afghanistan	AFG	2001	NaN	2000.0	19688632.0	NaN	10.158146	Asia
2	Afghanistan	AFG	2002	NaN	2200.0	21000256.0	NaN	10.476063	Asia
3	Afghanistan	AFG	2003	NaN	2400.0	22645130.0	NaN	10.598305	Asia
4	Afghanistan	AFG	2004	NaN	2700.0	23553551.0	NaN	11.463240	Asia
...	...	...	...	...	...	...	...	...	...
1771	Zimbabwe	ZWE	2007	100000.0	1300000.0	12450568.0	803.176209	10441.290711	Africa
1772	Zimbabwe	ZWE	2008	100000.0	1300000.0	12550347.0	796.790718	10358.279337	Africa
1773	Zimbabwe	ZWE	2009	95000.0	1200000.0	12679810.0	749.222583	9463.864206	Africa
1774	Zimbabwe	ZWE	2010	87000.0	1200000.0	12839771.0	677.582178	9345.961077	Africa
1775	Zimbabwe	ZWE	2011	74000.0	1200000.0	13025785.0	568.103957	9212.496598	Africa

1776 rows × 9 columns

Identify Highest Prevalence Rates per Continent: Find the countries with the highest prevalence rate in 2011 for each continent

     continent           country  prevalence_rate
551     Africa          Eswatini     17188.799055
95     America           Bahamas      1719.804207
1607      Asia          Thailand       713.112654
539     Europe           Estonia       745.796982
1283   Oceania  Papua New Guinea       358.669168

Create a Bar Chart: Make a bar chart comparing the 2011 prevalence rates of these countries.

Analysis of HIV and AIDS data across different countries and years.

These data were accessed from the Gapminder foundation, at www.gapminder.org/data.

People living with HIV: This dataset contains the total number of people currently infected with HIV (data/hiv_prevalence.csv).
New HIV infections: This dataset provides the total number of people newly infected with HIV during a given year (data/hiv_incidence.csv).

Two HIV data have been merged with population data

The incidence Rate and prevalence Rate is calculated and added as columns

country	country_code	year	new_cases	total_cases	population	incidence_rate	prevalence_rate	continent
Loading ITables v2.3.0 from the internet... (need help?)

HIV Incidence Rate Trends in Selected Countries

Using the plotly express package, we plotted the HIV incidence rate trends over time in three selected African countries, Angola, Zambia and Zimbabwe.

The plot illustrates the temporal trends in HIV incidence rates (per 100,000 population) from the year 2000 to approximately 2011 across three Sub-Saharan African countries: Angola, Zambia, and Zimbabwe.

Zimbabwe begins the observed period with the highest HIV incidence rate, exceeding 1300 per 100,000 in 2000. However, a notable downward trend is observed throughout the decade, with incidence rates decreasing to under 600 by 2011. This substantial reduction suggests the effectiveness of nationwide prevention strategies, treatment scale-up, and public health interventions during this period.

Zambia also exhibits a declining trend, although it starts at a lower incidence than Zimbabwe. The decrease is more gradual, indicating consistent though less dramatic progress in reducing new HIV infections.

Angola, on the other hand, maintains a relatively low and stable incidence rate, hovering around 100–150 per 100,000. This stability could reflect either consistently lower transmission dynamics or, alternatively, underreporting or under-detection of cases due to limited surveillance capacity or healthcare infrastructure.

HIV prevalence rates over time in the same selected countries

This plot presents the HIV prevalence rates (per 100,000 population) from 2000 to 2011 in three Sub-Saharan African countries: Angola, Zambia, and Zimbabwe. Prevalence reflects the total number of people living with HIV at a given time, incorporating both new infections and existing cases.

Zimbabwe again begins the period with the highest prevalence rate, peaking above 16,000 per 100,000 (~16%) in 2000. A marked decline in prevalence is observed over the subsequent decade, dropping to about 10,000 per 100,000 by 2011. This trend, while positive, also suggests that while new infections decreased (as seen in the incidence plot), many individuals remained infected and survived longer, likely due to the expanded access to antiretroviral therapy (ART).

Zambia shows a similar downward trajectory, though starting from a slightly lower base. The reduction from approximately 12,000 per 100,000 to under 9,000 over a decade indicates parallel gains in HIV management and reduced transmission, though the sustained high prevalence still underscores a heavy disease burden.

Angola exhibits a consistently low prevalence rate throughout the time period, ranging between 1,000 and 1,500 per 100,000. This may reflect a genuinely lower rate of HIV transmission in the country, but it also raises questions about surveillance coverage, diagnostic capacity, and reporting accuracy, especially in the early 2000s.

The country with the highest prevalence rate for each continent in the year 2011

	continent	country	prevalence_rate
551	Africa	Eswatini	17188.799055
95	America	Bahamas	1719.804207
1607	Asia	Thailand	713.112654
539	Europe	Estonia	745.796982
1283	Oceania	Papua New Guinea	358.669168

“Visualization of countries with the highest Prevalence Rate by Continent (2011)”

Public Health Analysis

The declining prevalence in Zambia and Zimbabwe, alongside falling incidence rates (from the previous plot), suggests mature epidemics transitioning into better-controlled public health challenges. However, high prevalence despite declining incidence implies that more individuals are living longer with HIV, a hallmark of successful ART scale-up, but also an indication of the ongoing need for long-term care, support, and monitoring.

Angola’s flat trend in both incidence and prevalence calls for deeper investigation. A true low-prevalence setting is promising, but only if surveillance and data systems are robust. Without that assurance, these numbers may underestimate the real burden of the disease.