Alyssa Bockman & Thalia Koutsougeras

Final Project: Disease Prevalence Across the States

Main Data from: https://think.cs.vt.edu/corgis/csv/health/

GitHub: https://tkoooo.github.io/Final-Portfolio/

Project Plan

Project Description and Goals: This dataset will allow us to examine the prevalence of various diseases across all 50 states over the last century. We plan to research factors such as vaccination rates and try to understand the peaks and valleys in the dataset. We also want to examine which regions and states have been most and least affected and compare that with factors like weather patterns and population density.

Question: How does location correlate with the probability of catching any one disease within the 20th century? Furthermore, do other factors such as income and population density show this same correlation? This could potentially explain why certain states showed a higher incidence of disease when compared to others.

Collaboration Plan: Our collaboration plan is to continue meeting at least once a week over zoom on Wednesday nights.

Data Collection - Disease Incidence Data

Data from: https://think.cs.vt.edu/corgis/csv/health/

In [121]:
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/My Drive/cmps3160
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/My Drive/cmps3160
In [122]:
%cd /content/drive/My Drive/cmps3160/finalProject
/content/drive/My Drive/cmps3160/finalProject

Data Processing - National Incidence and Case Numbers

In [123]:
import pandas as pd
import matplotlib.pyplot as plt
#import pandas and matplotlib for graphing

df = pd.read_csv("../finalProject/health.csv", sep=",")
#read dataset into variable df

#check what values are missing
df[df.isnull().any(axis=1)]
Out[123]:
disease increase loc number population year

Luckily for us, no data is missing and we can proceed to analysis!

In [124]:
#Data types analysis
df.dtypes #check the column dtypes before analysis
In [125]:
measles = df[df['disease'] == "MEASLES"]
measles['increase'].min()
Out[125]:
0.0
In [126]:
#in order to visualize the case prevalence across the entire country, we will create
#separate dataframes for each disease (dropping the loc and increase columns is left commented out since we need loc later for the state-level analysis)
#df = df.drop(columns=['loc', 'increase'])
measles = df[df['disease'] == "MEASLES"]
polio = df[df['disease'] == "POLIO"]
smallpox = df[df['disease'] == "SMALLPOX"]
pertussis = df[df['disease'] == "PERTUSSIS"]
hepa = df[df['disease'] == "HEPATITIS A"]
rubella = df[df['disease'] == "RUBELLA"]
mumps = df[df['disease'] == "MUMPS"]

Now that we have datasets that hold only the specific disease information for each illness, we can begin to perform some basic analysis.

We will aggregate the data to get nationwide numbers instead of state-by-state numbers and then perform further analysis to calculate the incidence of each disease each year.

In public health there are two main ways to measure how a disease affects a population: incidence and prevalence.

Incidence refers to the number of people who develop or are diagnosed with a disease within a time period divided by the total sample size, while prevalence refers to the number of people actively affected by the disease (whether they were diagnosed within the time period or not) divided by the total sample size.

Because this dataset provides us with the number of new cases per year and nothing about how long people were sick, we will focus on calculating the incidence of each disease across the entire country.

This will provide one helpful dataset per disease that will allow us to graph both the national case numbers and the proportion of people affected over the years.
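As a quick illustration of the incidence formula (a hypothetical spot check using the Alabama 1928 measles numbers that appear in the state-level tables further below):

#hypothetical spot check of the incidence calculation
new_cases = 8843        #new measles cases reported in Alabama in 1928
population = 2640000    #Alabama's population in 1928
incidence = new_cases / population
print(incidence)        #about 0.00335, i.e. roughly 335 new cases per 100,000 people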

In [127]:
#Basic function to create dataframes

def initial_dataframes(df):
  #grab the case numbers country wide for each year
  country = df.groupby(by='year')['number'].sum()

  #grab the population size country wide for each year
  population = df.groupby(by='year')['population'].sum()

  #transform the number and population series into dataframes for future operations
  country = country.to_frame()
  population = population.to_frame()

  #join the number and population dataframes to allow us to create a new incidence column
  incidence = population.join(other=country, on='year')

  #create a new column called incidence that shows the proportion of the population affected each year
  incidence["incidence"] = incidence["number"]/incidence["population"]

  #change the index to regular numbers rather than the year 
  incidence.reset_index(inplace=True)

  return country, population, incidence

We also need to bring in the vaccine dataframe before plotting our disease case numbers. The years for each vaccine we are investigating were taken from: https://www.immunize.org/timeline/

In [128]:
vaccine_data = [("Measles", 1963), ("Polio", 1955), ("Smallpox", 1798), ("Pertussis", 1915), ("Hepatitis A", 1995), ("Rubella", 1971), ("Mumps", 1971)]
vaccine_df = pd.DataFrame(vaccine_data, columns=['Disease', 'Year']) #Creating a dataframe out of each disease and the year their vaccine came out
vaccine_df = vaccine_df.sort_values('Year') #We sort by year because Rubella and Mumps share the same year, which causes annotation issues later if the rows are out of order
vaccine_df.reset_index(inplace=True)
vaccine_df.drop(columns=['index'], inplace=True) #Cleaning up data
vaccine_df
Out[128]:
Disease Year
0 Smallpox 1798
1 Pertussis 1915
2 Polio 1955
3 Measles 1963
4 Rubella 1971
5 Mumps 1971
6 Hepatitis A 1995

Measles

In [129]:
measles_country, measles_population, measles_incidence = initial_dataframes(measles)
measles_incidence.plot.line(x = 'year', y = 'number', figsize=(20,10), legend=False, xlabel='Year', ylabel='Case Numbers') #Plotting case numbers

#Now to plot when the vaccine came out:
plt.axvline(x=vaccine_df[vaccine_df['Disease'] == 'Measles']['Year'].tolist()[0], color = 'black'); #Plotting a vertical line
plt.annotate(s = 'Vaccine', xy=(vaccine_df['Year'].loc[vaccine_df['Disease'] == 'Measles'], 800000), xytext=(20, 0.0055), textcoords='offset points', annotation_clip=True, arrowprops={'width': 1, 'headwidth': 1, 'headlength': 1, 'shrink':0.05});

plt.title("Measles Case Numbers Over Time"); #Giving graph a title

Polio

In [130]:
polio_country, polio_population, polio_incidence = initial_dataframes(polio)
polio_incidence.plot.line(x = 'year', y = 'number', figsize=(20,10), legend=False, xlabel='Year', ylabel='Case Numbers') #Plotting case numbers

#Now to plot when the vaccine came out:
plt.axvline(x=vaccine_df[vaccine_df['Disease'] == 'Polio']['Year'].tolist()[0], color = 'black'); #Plotting a vertical line
plt.annotate(s = 'Vaccine', xy=(vaccine_df['Year'].loc[vaccine_df['Disease'] == 'Polio'], 50000), xytext=(20, 0.0055), textcoords='offset points', annotation_clip=True, arrowprops={'width': 1, 'headwidth': 1, 'headlength': 1, 'shrink':0.05});

plt.title("Polio Case Numbers Over Time"); #Giving graph a title

Smallpox

In [131]:
smallpox_country, smallpox_population, smallpox_incidence = initial_dataframes(smallpox)
smallpox_incidence.plot.line(x = 'year', y = 'number', figsize=(20,10), legend=False, xlabel='Year', ylabel='Case Numbers')
plt.title("Smallpox Case Numbers Over Time"); #Giving graph a title

Pertussis

In [132]:
pertussis_country, pertussis_population, pertussis_incidence = initial_dataframes(pertussis)
pertussis_incidence.plot.line(x = 'year', y = 'number', figsize=(20,10), legend=False, xlabel='Year', ylabel='Case Numbers')
plt.title("Pertussis Case Numbers Over Time"); #Giving graph a title

Hepatitis A

In [133]:
hepa_country, hepa_population, hepa_incidence = initial_dataframes(hepa)
hepa_incidence.plot.line(x = 'year', y = 'number', figsize=(20,10),legend=False, xlabel='Year', ylabel='Case Numbers') #Plotting case numbers

#Now to plot when the vaccine came out:
plt.axvline(x=vaccine_df[vaccine_df['Disease'] == 'Hepatitis A']['Year'].tolist()[0], color = 'black'); #Plotting a vertical line
plt.annotate(s = 'Vaccine', xy=(vaccine_df['Year'].loc[vaccine_df['Disease'] == 'Hepatitis A'], 55000), xytext=(20, 0.0055), textcoords='offset points', annotation_clip=True, arrowprops={'width': 1, 'headwidth': 1, 'headlength': 1, 'shrink':0.05});

plt.title("Hepatitis A Case Numbers Over Time"); #Giving graph a title

Rubella

In [134]:
rubella_country, rubella_population, rubella_incidence = initial_dataframes(rubella)
rubella_incidence.plot.line(x = 'year', y = 'number', figsize=(20,10), legend=False, xlabel='Year', ylabel='Case Numbers') #Plotting case numbers

#Now to plot when the vaccine came out:
plt.axvline(x=vaccine_df[vaccine_df['Disease'] == 'Rubella']['Year'].tolist()[0], color = 'black'); #Plotting a vertical line
plt.annotate(s = 'Vaccine', xy=(vaccine_df['Year'].loc[vaccine_df['Disease'] == 'Rubella'], 50000), xytext=(20, 0.0055), textcoords='offset points', annotation_clip=True, arrowprops={'width': 1, 'headwidth': 1, 'headlength': 1, 'shrink':0.05});

plt.title("Rubella Case Numbers Over Time"); #Giving graph a title

Mumps

In [135]:
mumps_country, mumps_population, mumps_incidence = initial_dataframes(mumps)
mumps_incidence.plot.line(x = 'year', y = 'number', figsize=(20,10), legend=False, xlabel='Year', ylabel='Case Numbers') #Plotting case numbers

#Now to plot when the vaccine came out:
plt.axvline(x=vaccine_df[vaccine_df['Disease'] == 'Mumps']['Year'].tolist()[0], color = 'black'); #Plotting a vertical line
plt.annotate(s = 'Vaccine', xy=(vaccine_df['Year'].loc[vaccine_df['Disease'] == 'Mumps'], 130000), xytext=(20, 0.0055), textcoords='offset points', annotation_clip=True, arrowprops={'width': 1, 'headwidth': 1, 'headlength': 1, 'shrink':0.05});

plt.title("Mumps Case Numbers Over Time"); #Giving graph a title

National Case Numbers and National Incidence

In [136]:
fig = plt.figure(figsize=(20,10));

colors = ['red', 'orange', 'green', 'blue', 'purple', 'brown', 'pink']; #colors for legend
colors_2 = ['green', 'blue', 'orange', 'red', 'brown', 'pink', 'purple']; #colors matching vaccine dataset
i = 0

# Plotting each disease over time
for frame in [measles_incidence, polio_incidence, smallpox_incidence, pertussis_incidence, hepa_incidence, rubella_incidence, mumps_incidence]:
    plt.plot(frame['year'], frame['number'], color=f'{colors[i]}');
    i += 1

for i in range(0, len(vaccine_df)): #this plots vertical lines where the vaccines came out
  if vaccine_df.iloc[i].Year < 1930: #this removes the whitespace that appears before the graph because of the vaccines that came out before 1930
    continue
  plt.axvline(x=vaccine_df.iloc[i].Year, color = f'{colors_2[i]}'); #this matches the color of the vertical line to the color of the disease in the legend
  plt.annotate(s = f'{vaccine_df.iloc[i].Disease}', xy=(vaccine_df.iloc[i].Year, 900000 - i*50000), xytext=(20, 0.0055), textcoords='offset points', annotation_clip=True, arrowprops={'width': 1, 'headwidth': 1, 'headlength': 1, 'shrink':0.05});

plt.legend(["Measles", "Polio", "Smallpox", "Pertussis", "Hepatitis A", "Rubella", "Mumps"]);
plt.title("Disease Case Numbers");
plt.xlim(1925, 2015); #we have to cut off the limits because 2 of the vaccines appeared way before 1930 and make the graph hard to read
In [137]:
fig = plt.figure(figsize=(20,10));

colors = ['red', 'orange', 'green', 'blue', 'purple', 'brown', 'pink']; #colors for legend
colors_2 = ['green', 'blue', 'orange', 'red', 'brown', 'pink', 'purple']; #colors matching vaccine dataset
i = 0

# Plotting each disease over time
for frame in [measles_incidence, polio_incidence, smallpox_incidence, pertussis_incidence, hepa_incidence, rubella_incidence, mumps_incidence]:
    plt.plot(frame['year'], frame['incidence'], color=f'{colors[i]}');
    i += 1

for i in range(0, len(vaccine_df)): #this plots vertical lines where the vaccines came out
  if vaccine_df.iloc[i].Year < 1930: #this removes the whitespace that appears before the graph because of the vaccines that came out before 1930
    continue
  plt.axvline(x=vaccine_df.iloc[i].Year, color = f'{colors_2[i]}'); #this matches the color of the vertical line to the color of the disease in the legend
  plt.annotate(s = f'{vaccine_df.iloc[i].Disease}', xy=(vaccine_df.iloc[i].Year, 0.006 - i*0.0002), xytext=(20, 0.0055), textcoords='offset points', annotation_clip=True, arrowprops={'width': 1, 'headwidth': 1, 'headlength': 1, 'shrink':0.05});

plt.legend(["Measles", "Polio", "Smallpox", "Pertussis", "Hepatitis A", "Rubella", "Mumps"]);
plt.title("Disease Incidence");
plt.xlim(1925, 2015); #we have to cut off the limits because 2 of the vaccines appeared way before 1930 and make the graph hard to read

It is worth noting that the vaccines for Pertussis and Smallpox were released prior to the first collection year of this dataset, so they are not marked.

All of the vaccines seem to have been released after the peak of the disease, which may speak to either the speed and efficiency of US public health, the perceived need, or the state of science at the time.
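One quick way to sanity-check this observation is to look up the year with the highest national case count for each disease and compare it to the vaccine year in vaccine_df (just a sketch using the national DataFrames built above):

#find the year with the most national cases for each disease
national_frames = {'Measles': measles_incidence, 'Polio': polio_incidence, 'Smallpox': smallpox_incidence,
                   'Pertussis': pertussis_incidence, 'Hepatitis A': hepa_incidence,
                   'Rubella': rubella_incidence, 'Mumps': mumps_incidence}
for name, frame in national_frames.items():
    peak_year = int(frame.loc[frame['number'].idxmax(), 'year'])
    print(f'{name} peaked nationally in {peak_year}')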

Exploratory Analysis & Data Visualization Part 1 - State to State Incidence

Now that we have a solid idea of what the trends look like on a national scale, we can zoom back in to individual states to see how they compare.

One great way to visualize disease incidence per state is to make a spatial heatmap of the US. Given any year and disease, we can use the GeoPandas library to visualize the incidence of the disease in each state.

Preparing a Dataset for Graphing

Cartographic Boundary Files 2012 - Shapefile of the United States: https://www.sciencebase.gov/catalog/item/52c78623e4b060b9ebca5be5

We originally got our data from the US Census, but that shapefile contained far more detail than we needed (each state was broken into multiple geometries for its regions). We found this dataset on sciencebase.gov, which is a U.S. Geological Survey Trusted Digital Repository. Its files are simpler and contain a single geometry for each state. We want only one row of data per state so that we can easily join this data with our other datasets.

We followed these tutorials to implement our code:

https://mohammadimranhasan.com/geospatial-data-mapping-with-python/

https://towardsdatascience.com/lets-make-a-map-using-geopandas-pandas-and-matplotlib-to-make-a-chloropleth-map-dddc31c1983d

First, we need to install geopandas in order to create maps and import our coordinates file.

In [138]:
!pip install geopandas
#!pip install --upgrade pyshp
#!pip install --upgrade shapely
#!pip install --upgrade descartes
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Requirement already satisfied: geopandas in /usr/local/lib/python3.8/dist-packages (0.12.2)
Requirement already satisfied: pandas>=1.0.0 in /usr/local/lib/python3.8/dist-packages (from geopandas) (1.3.5)
Requirement already satisfied: packaging in /usr/local/lib/python3.8/dist-packages (from geopandas) (21.3)
Requirement already satisfied: shapely>=1.7 in /usr/local/lib/python3.8/dist-packages (from geopandas) (1.8.5.post1)
Requirement already satisfied: pyproj>=2.6.1.post1 in /usr/local/lib/python3.8/dist-packages (from geopandas) (3.4.1)
Requirement already satisfied: fiona>=1.8 in /usr/local/lib/python3.8/dist-packages (from geopandas) (1.8.22)
Requirement already satisfied: attrs>=17 in /usr/local/lib/python3.8/dist-packages (from fiona>=1.8->geopandas) (22.1.0)
Requirement already satisfied: munch in /usr/local/lib/python3.8/dist-packages (from fiona>=1.8->geopandas) (2.5.0)
Requirement already satisfied: click-plugins>=1.0 in /usr/local/lib/python3.8/dist-packages (from fiona>=1.8->geopandas) (1.1.1)
Requirement already satisfied: cligj>=0.5 in /usr/local/lib/python3.8/dist-packages (from fiona>=1.8->geopandas) (0.7.2)
Requirement already satisfied: setuptools in /usr/local/lib/python3.8/dist-packages (from fiona>=1.8->geopandas) (57.4.0)
Requirement already satisfied: six>=1.7 in /usr/local/lib/python3.8/dist-packages (from fiona>=1.8->geopandas) (1.15.0)
Requirement already satisfied: click>=4.0 in /usr/local/lib/python3.8/dist-packages (from fiona>=1.8->geopandas) (7.1.2)
Requirement already satisfied: certifi in /usr/local/lib/python3.8/dist-packages (from fiona>=1.8->geopandas) (2022.9.24)
Requirement already satisfied: numpy>=1.17.3 in /usr/local/lib/python3.8/dist-packages (from pandas>=1.0.0->geopandas) (1.21.6)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.8/dist-packages (from pandas>=1.0.0->geopandas) (2.8.2)
Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.8/dist-packages (from pandas>=1.0.0->geopandas) (2022.6)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /usr/local/lib/python3.8/dist-packages (from packaging->geopandas) (3.0.9)

We need to read in the zip file.

In [139]:
import geopandas
states = geopandas.read_file("tl_2012_us_state.zip")

states.head()
Out[139]:
OBJECTID REGION DIVISION STATEFP STATENS GEOID STUSPS NAME LSAD MTFCC FUNCSTAT ALAND AWATER INTPTLAT INTPTLON Shape_Leng Shape_Area geometry
0 1 4 9 15 01779782 15 HI Hawaii 00 G4000 A 1.663425e+10 1.167874e+10 +19.8097670 -155.5061027 2.419034e+06 3.268482e+10 MULTIPOLYGON (((-17361760.204 2164544.588, -17...
1 2 3 7 05 00068085 05 AR Arkansas 00 G4000 A 1.347726e+11 2.959210e+09 +34.8955256 -092.4446262 2.656648e+06 2.053261e+11 POLYGON ((-10515267.713 4101325.818, -10515269...
2 3 4 8 35 00897535 35 NM New Mexico 00 G4000 A 3.141611e+11 7.564385e+08 +34.4346843 -106.1316181 2.900368e+06 4.643927e+11 POLYGON ((-12138963.727 4106855.170, -12138964...
3 4 4 8 30 00767982 30 MT Montana 00 G4000 A 3.769636e+11 3.868565e+09 +47.0511771 -109.6348174 4.607246e+06 8.201836e+11 POLYGON ((-12727480.235 5886876.586, -12727567...
4 5 1 2 36 01779796 36 NY New York 00 G4000 A 1.220579e+11 1.923885e+10 +42.9133974 -075.5962723 3.212630e+06 2.637702e+11 MULTIPOLYGON (((-8866092.533 5160809.769, -886...

Cleaning up the dataset

There are some columns we don't need, and we can rename the columns to make them look neater.

In [140]:
states.columns.unique() #grabbing column names
Out[140]:
Index(['OBJECTID', 'REGION', 'DIVISION', 'STATEFP', 'STATENS', 'GEOID',
       'STUSPS', 'NAME', 'LSAD', 'MTFCC', 'FUNCSTAT', 'ALAND', 'AWATER',
       'INTPTLAT', 'INTPTLON', 'Shape_Leng', 'Shape_Area', 'geometry'],
      dtype='object')
In [141]:
states.drop(columns=['OBJECTID', 'REGION', 'DIVISION', 'STATEFP', 'STATENS', 'GEOID',
       'STUSPS', 'LSAD', 'MTFCC', 'FUNCSTAT', 'ALAND', 'AWATER',
       'INTPTLAT', 'INTPTLON', 'Shape_Leng', 'Shape_Area'], inplace=True) #dropping unnecessary columns
states.rename(columns={'NAME':'State', 'geometry': 'Geometry'}, inplace=True) #renaming the columns to make them look prettier
states['State'] = states['State'].str.upper() #in order to join with previous datasets we want this to match
states.head()
Out[141]:
State Geometry
0 HAWAII MULTIPOLYGON (((-17361760.204 2164544.588, -17...
1 ARKANSAS POLYGON ((-10515267.713 4101325.818, -10515269...
2 NEW MEXICO POLYGON ((-12138963.727 4106855.170, -12138964...
3 MONTANA POLYGON ((-12727480.235 5886876.586, -12727567...
4 NEW YORK MULTIPOLYGON (((-8866092.533 5160809.769, -886...
In [142]:
states.shape[0]
Out[142]:
56

From the code above, we see that there are 56 rows, even though we know there are only 50 states. Looking at the data, we see that U.S. territories such as Puerto Rico and Guam are also included. We need to keep this in mind when joining our data.
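A quick (hypothetical) way to see exactly which rows will not match our disease data is to compare the shapefile names against the locations in the health dataset:

#shapefile rows whose names do not appear in the disease data (territories, etc.)
sorted(set(states['State']) - set(df['loc'].unique()))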

We need to check for NaN values.

In [143]:
states.isnull().sum()
Out[143]:
State       0
Geometry    0
dtype: int64

Perfect! We have no NaN values.

Now we check the dtypes. We expect State to be an object and Geometry to be a geometry dtype (because of the way it was loaded with geopandas).
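A minimal check looks like this (just a sketch):

states.dtypes #State should come back as object, Geometry as geometry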

Just to test that we are using GeoPandas correctly, here is a simple map of the US.

In [144]:
states = geopandas.GeoDataFrame(states, geometry='Geometry') #turning the dataframe into a geodataframe object
states.plot(figsize=(10, 10))
plt.xlim(-2.25e7, -0.5e7)
plt.ylim(0.0, 1.2e7)
Out[144]:
(0.0, 12000000.0)

Great! We are good to move on with the analysis. This work will be referred to later in the separate sections for each disease.

Incidence Column Function

This function takes in one of the generic disease DataFrames and adds a column called incidence that contains new cases divided by total population for every state. It returns two things: a DataFrame called states_disease_incidence, which has the incidence for every state in every year for the given disease, and a dictionary called states_disease, which holds a separate DataFrame for each state with the year and case numbers for the given disease.

In [145]:
#Incidence Column Function
def incidence_df(df):
  #create a dictionary of dataframes for every state
  states_disease = dict()
  for k, v in df.groupby('loc'):
    states_disease[k] = v.copy() #copy each group so adding the incidence column below does not trigger SettingWithCopyWarning

  #create an incidence column for every dataframe in the dictionary
  for i in states_disease:
    states_disease[i]['incidence'] = states_disease[i]['number']/states_disease[i]['population']
    #set the index to year for easy combining later
    states_disease[i].set_index('year', inplace = True)

  #create new dataframe of all of the incidences by state for every year
  states_disease_incidence = pd.DataFrame()

  #enter the incidence columns from the state datasets into the incidence dataset
  for i in states_disease:
    states_disease_incidence[i] = states_disease[i]['incidence']
  
  return states_disease_incidence, states_disease

Map DF function

This function takes in the states_disease_incidence DataFrame created by the incidence_df function and formats it properly for easy graphing with GeoPandas.

In [146]:
#create a DataFrame to make the visual map of incidence
def map_df(states_disease_incidence):
  map_disease_df = states_disease_incidence.T #transpose to get states in a column
  map_disease_df.reset_index(inplace=True)
  map_disease_df.rename(columns={'index':'State'}, inplace=True)
  map_disease_df = pd.merge(map_disease_df, states, on="State", how='left') #we merge onto the incidence keys because we know those are the 50 states (plus DC), not Guam, Puerto Rico, etc.
  return map_disease_df

Create Map Function

The create_map function takes in 'sel', the selected year we want to examine (an integer); 'map_disease_df', the DataFrame we created in the map_df function specifically for graphing; and the name of the disease (a string) used to label the resulting graph.

The color-mapped scalebar on the right changes per disease, but does not change depending on the year chosen, because it is based on the minimum and maximum values across all years for that disease.

In [147]:
def create_map(sel, map_disease_df, disease):
  map_disease_df = geopandas.GeoDataFrame(map_disease_df, geometry='Geometry') #turning the dataframe into a geodataframe object to be used with geopandas to plot

  #Here we find the min and max values to use for our scalebar
  #We don't recompute this for each year because changing the scalebar between years would make comparisons over time misleading
  merged_cols = map_disease_df.columns.unique()

  merged_cols = merged_cols[1:-1] #only grabbing the year columns (dropping State and Geometry)
  vmin = map_disease_df[merged_cols].min().min()
  vmax = map_disease_df[merged_cols].max().max()

  #Plotting
  fig, ax = plt.subplots(1, figsize=(10,7))
  map_disease_df.plot(column=sel, figsize=(10, 7), cmap='Blues', linewidth=0.8, ax=ax, edgecolor='0.8')
  plt.xlim(-2.25e7, -0.5e7) #setting axis limits to focus on the US
  plt.ylim(0.0, 1.2e7)
  ax.axis('off') #remove axis because it means nothing for us

  ax.set_title(f'Incidence Rate of {disease} in {sel} across the US')

  sm = plt.cm.ScalarMappable(cmap='Blues', norm=plt.Normalize(vmin=vmin, vmax=vmax)) #for color scalebar
  sm._A = []
  cbar = fig.colorbar(sm)

Creating and Graphing Incidence

We will now use the three functions created above to visualize the state incidences for any given year.

Measles

In [148]:
#utilize the above functions to create dictionary of state specific DataFrames
states_measles_incidence, states_measles = incidence_df(measles)
In [149]:
states_measles_incidence.head()
Out[149]:
ALABAMA ALASKA ARIZONA ARKANSAS CALIFORNIA COLORADO CONNECTICUT DELAWARE DISTRICT OF COLUMBIA FLORIDA ... SOUTH DAKOTA TENNESSEE TEXAS UTAH VERMONT VIRGINIA WASHINGTON WEST VIRGINIA WISCONSIN WYOMING
year
1928 0.003350 NaN 0.002007 0.004818 0.000692 0.002070 0.006350 0.002562 0.005357 0.001196 ... 0.001601 0.003154 0.000973 0.000169 0.003349 NaN 0.003448 0.001959 0.001246 0.002273
1929 0.001119 NaN 0.000549 0.000672 0.000728 0.000742 0.006148 0.002398 0.000942 0.000780 ... 0.001678 0.000330 0.000713 0.000689 0.001053 NaN 0.002486 0.003801 0.010165 0.003121
1930 0.001570 NaN 0.004664 0.000535 0.007602 0.011328 0.001122 0.001092 0.001822 0.003566 ... 0.003463 0.001799 0.000732 0.010448 0.002367 NaN 0.006316 0.001577 0.007485 0.003416
1931 0.003373 NaN 0.004977 0.000459 0.004775 0.004533 0.007905 0.010033 0.008329 0.002608 ... 0.002124 0.001348 0.000395 0.000298 0.003184 NaN 0.001974 0.002914 0.005066 0.000607
1932 0.000102 NaN 0.000202 0.000054 0.002141 0.002229 0.003483 0.000159 0.000532 0.000136 ... 0.000964 0.000690 0.000766 0.000139 0.011461 0.000534 0.006319 0.005997 0.009353 0.002422

5 rows × 51 columns

In [150]:
states_measles['ALABAMA']
Out[150]:
disease increase loc number population incidence
year
1928 MEASLES 334.99 ALABAMA 8843 2640000 3.349621e-03
1929 MEASLES 111.93 ALABAMA 2959 2644000 1.119138e-03
1930 MEASLES 157.00 ALABAMA 4156 2647000 1.570079e-03
1931 MEASLES 337.29 ALABAMA 8934 2649000 3.372593e-03
1932 MEASLES 10.21 ALABAMA 270 2653000 1.017716e-04
... ... ... ... ... ... ...
1991 MEASLES 0.02 ALABAMA 1 4099156 2.439527e-07
1993 MEASLES 0.02 ALABAMA 1 4214202 2.372928e-07
1997 MEASLES 0.14 ALABAMA 6 4367935 1.373647e-06
1998 MEASLES 0.00 ALABAMA 0 4404701 0.000000e+00
2002 MEASLES 0.16 ALABAMA 7 4472420 1.565148e-06

66 rows × 6 columns

In [151]:
map_measles_df = map_df(states_measles_incidence)
map_measles_df.head()
Out[151]:
State 1928 1929 1930 1931 1932 1933 1934 1935 1936 ... 1987 1988 1989 1990 1991 1993 1997 1998 2002 Geometry
0 ALABAMA 0.003350 0.001119 0.001570 0.003373 0.000102 0.000652 0.005903 0.002653 0.000208 ... 9.940743e-07 2.481253e-07 0.000015 0.000007 2.439527e-07 2.372928e-07 0.000001 0.000000e+00 0.000002 POLYGON ((-9841333.855 3579579.495, -9841349.6...
1 ALASKA NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN 1.827622e-06 0.000000 0.000049 1.753792e-06 NaN 0.000002 2.580928e-05 NaN MULTIPOLYGON (((-15108567.654 8339788.089, -15...
2 ARIZONA 0.002007 0.000549 0.004664 0.004977 0.000202 0.002960 0.002388 0.001350 0.005255 ... 9.900478e-06 8.282294e-07 0.000040 0.000077 9.027138e-05 7.379275e-07 0.000002 2.047778e-06 NaN POLYGON ((-12761162.105 4147165.875, -12761214...
3 ARKANSAS 0.004818 0.000672 0.000535 0.000459 0.000054 0.002933 0.003846 0.000803 0.000057 ... NaN NaN 0.000002 0.000017 NaN NaN 0.000001 NaN NaN POLYGON ((-10515267.713 4101325.818, -10515269...
4 CALIFORNIA 0.000692 0.000728 0.007602 0.004775 0.002141 0.004453 0.004233 0.004664 0.007735 ... 2.719200e-05 2.515558e-05 0.000074 0.000320 6.442247e-05 2.014393e-06 0.000001 1.212574e-07 0.000000 MULTIPOLYGON (((-13060108.516 3854208.959, -13...

5 rows × 68 columns

In [152]:
create_map(1960, map_measles_df, 'Measles')

Polio

In [153]:
#utilize the above functions to create dictionary of state specific DataFrames
states_polio_incidence, states_polio = incidence_df(polio)
In [154]:
states_polio_incidence.head()
Out[154]:
ALABAMA ALASKA ARIZONA ARKANSAS CALIFORNIA COLORADO CONNECTICUT DELAWARE DISTRICT OF COLUMBIA FLORIDA ... SOUTH DAKOTA TENNESSEE TEXAS UTAH VERMONT VIRGINIA WASHINGTON WEST VIRGINIA WISCONSIN WYOMING
year
1928 0.000023 NaN 0.000026 0.000005 0.000051 0.000070 0.000046 0.000034 0.000069 0.000015 ... 0.000064 0.000013 0.000007 0.000018 0.000073 0.000006 0.000208 0.000114 0.000017 0.000027
1929 0.000022 NaN 0.000021 0.000002 0.000028 0.000013 0.000013 0.000021 0.000012 0.000024 ... 0.000004 0.000036 0.000003 0.000008 0.000053 0.000099 0.000017 0.000034 0.000007 0.000013
1930 0.000025 NaN 0.000074 0.000042 0.000307 0.000063 0.000046 0.000025 0.000018 0.000008 ... 0.000202 0.000023 0.000022 0.000006 0.000008 0.000005 0.000030 0.000024 0.000061 0.000243
1931 0.000020 NaN 0.000026 0.000009 0.000043 0.000009 0.000700 0.000021 0.000030 0.000011 ... 0.000042 0.000018 0.000006 0.000010 0.000248 0.000010 0.000054 0.000047 0.000246 0.000026
1932 0.000013 NaN 0.000016 0.000008 0.000029 0.000008 0.000020 0.000045 0.000066 0.000005 ... 0.000022 0.000020 0.000012 0.000006 0.000006 0.000022 0.000047 0.000025 0.000019 0.000026

5 rows × 51 columns

In [155]:
states_polio['ALABAMA']
Out[155]:
disease increase loc number population incidence
year
1928 POLIO 2.39 ALABAMA 62 2640000 2.348485e-05
1929 POLIO 2.25 ALABAMA 58 2644000 2.193646e-05
1930 POLIO 2.57 ALABAMA 67 2647000 2.531167e-05
1931 POLIO 2.07 ALABAMA 54 2649000 2.038505e-05
1932 POLIO 1.38 ALABAMA 35 2653000 1.319261e-05
1933 POLIO 1.12 ALABAMA 28 2661000 1.052236e-05
1934 POLIO 1.89 ALABAMA 50 2685000 1.862197e-05
1935 POLIO 2.27 ALABAMA 61 2719000 2.243472e-05
1936 POLIO 15.36 ALABAMA 420 2743000 1.531170e-04
1937 POLIO 3.07 ALABAMA 84 2762000 3.041274e-05
1938 POLIO 3.36 ALABAMA 96 2787000 3.444564e-05
1939 POLIO 1.98 ALABAMA 56 2814000 1.990050e-05
1940 POLIO 1.62 ALABAMA 51 2845000 1.792619e-05
1941 POLIO 27.43 ALABAMA 796 2902000 2.742936e-04
1942 POLIO 4.61 ALABAMA 139 2941000 4.726284e-05
1943 POLIO 1.27 ALABAMA 40 2902000 1.378360e-05
1944 POLIO 3.39 ALABAMA 98 2802000 3.497502e-05
1945 POLIO 5.00 ALABAMA 140 2775000 5.045045e-05
1946 POLIO 10.85 ALABAMA 318 2911000 1.092408e-04
1947 POLIO 2.18 ALABAMA 68 2942000 2.311353e-05
1948 POLIO 6.83 ALABAMA 208 2969000 7.005726e-05
1949 POLIO 8.55 ALABAMA 256 3000000 8.533333e-05
1950 POLIO 9.13 ALABAMA 279 3058000 9.123610e-05
1951 POLIO 22.15 ALABAMA 677 3059000 2.213142e-04
1952 POLIO 13.56 ALABAMA 419 3068000 1.365711e-04
1953 POLIO 14.93 ALABAMA 457 3053000 1.496888e-04
1954 POLIO 12.47 ALABAMA 376 3014000 1.247512e-04
1955 POLIO 5.52 ALABAMA 169 3050000 5.540984e-05
1956 POLIO 3.46 ALABAMA 105 3071000 3.419082e-05
1957 POLIO 1.85 ALABAMA 59 3109000 1.897716e-05
1958 POLIO 1.92 ALABAMA 62 3163000 1.960164e-05
1959 POLIO 4.79 ALABAMA 154 3204000 4.806492e-05
1960 POLIO 0.69 ALABAMA 23 3274000 7.025046e-06
1961 POLIO 0.33 ALABAMA 11 3316000 3.317250e-06
1962 POLIO 0.66 ALABAMA 22 3323000 6.620524e-06
1963 POLIO 1.59 ALABAMA 53 3358000 1.578320e-05
1964 POLIO 0.06 ALABAMA 2 3395000 5.891016e-07
1966 POLIO 0.00 ALABAMA 0 3464000 0.000000e+00
1968 POLIO 0.03 ALABAMA 1 3446000 2.901915e-07
In [156]:
map_polio_df = map_df(states_polio_incidence)
map_polio_df.head()
Out[156]:
State 1928 1929 1930 1931 1932 1933 1934 1935 1936 ... 1958 1959 1960 1961 1962 1963 1964 1966 1968 Geometry
0 ALABAMA 0.000023 0.000022 0.000025 0.000020 0.000013 0.000011 0.000019 0.000022 0.000153 ... 0.000020 0.000048 0.000007 0.000003 0.000007 0.000016 5.891016e-07 0.000000e+00 2.901915e-07 POLYGON ((-9841333.855 3579579.495, -9841349.6...
1 ALASKA NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 0.000009 0.000089 0.000009 NaN NaN NaN NaN NaN NaN MULTIPOLYGON (((-15108567.654 8339788.089, -15...
2 ARIZONA 0.000026 0.000021 0.000074 0.000026 0.000016 0.000038 0.000257 0.000048 0.000020 ... 0.000028 0.000049 0.000007 0.000006 0.000004 0.000002 NaN NaN 5.945303e-07 POLYGON ((-12761162.105 4147165.875, -12761214...
3 ARKANSAS 0.000005 0.000002 0.000042 0.000009 0.000008 0.000007 0.000007 0.000015 0.000030 ... 0.000024 0.000110 0.000023 0.000014 0.000014 0.000003 NaN 5.265929e-07 5.257624e-07 POLYGON ((-10515267.713 4101325.818, -10515269...
4 CALIFORNIA 0.000051 0.000028 0.000307 0.000043 0.000029 0.000025 0.000540 0.000126 0.000059 ... 0.000021 0.000022 0.000026 0.000006 0.000005 0.000001 1.101868e-07 1.590837e-07 5.671857e-07 MULTIPOLYGON (((-13060108.516 3854208.959, -13...

5 rows × 41 columns

In [157]:
create_map(1960, map_polio_df, 'Polio')

Pertussis

In [158]:
#utilize the above functions to create dictionary of state specific DataFrames
states_pertussis_incidence, states_pertussis = incidence_df(pertussis)
In [159]:
states_pertussis_incidence.head()
Out[159]:
ALABAMA ALASKA ARIZONA ARKANSAS CALIFORNIA COLORADO CONNECTICUT DELAWARE DISTRICT OF COLUMBIA FLORIDA ... SOUTH DAKOTA TENNESSEE TEXAS UTAH VERMONT VIRGINIA WASHINGTON WEST VIRGINIA WISCONSIN WYOMING
year
1938 0.000661 NaN 0.003489 0.000804 0.002271 0.001477 0.002166 0.002292 0.000925 0.000458 ... 0.001160 0.000736 0.001573 0.004725 0.006935 0.001475 0.002615 0.001649 0.004112 0.002272
1939 0.000702 NaN 0.002138 0.000483 0.001245 0.001817 0.002378 0.002897 0.002271 0.000607 ... 0.000419 0.000785 0.000870 0.005843 0.008475 0.001275 0.000945 0.000809 0.003520 0.001907
1940 0.000434 NaN 0.001788 0.000415 0.001823 0.001107 0.001799 0.002264 0.001094 0.000349 ... 0.000284 0.000722 0.001278 0.007196 0.003793 0.001134 0.001164 0.000966 0.002249 0.000828
1941 0.000540 NaN 0.002400 0.000568 0.002622 0.004182 0.001591 0.001004 0.000937 0.000371 ... 0.000954 0.000865 0.001620 0.005018 0.001828 0.001467 0.002582 0.001201 0.002861 0.001534
1942 0.000558 NaN 0.002048 0.000464 0.001556 0.001626 0.002181 0.000545 0.001090 0.000381 ... 0.000411 0.000656 0.001155 0.002614 0.006061 0.000847 0.001579 0.000765 0.003420 0.001382

5 rows × 51 columns

In [160]:
states_pertussis['ALABAMA']
Out[160]:
disease increase loc number population incidence
year
1938 PERTUSSIS 66.01 ALABAMA 1843 2787000 6.612845e-04
1939 PERTUSSIS 69.98 ALABAMA 1975 2814000 7.018479e-04
1940 PERTUSSIS 43.18 ALABAMA 1235 2845000 4.340949e-04
1941 PERTUSSIS 53.93 ALABAMA 1566 2902000 5.396278e-04
1942 PERTUSSIS 55.64 ALABAMA 1640 2941000 5.576335e-04
1943 PERTUSSIS 70.66 ALABAMA 2054 2902000 7.077877e-04
1944 PERTUSSIS 46.83 ALABAMA 1314 2802000 4.689507e-04
1945 PERTUSSIS 38.78 ALABAMA 1081 2775000 3.895495e-04
1946 PERTUSSIS 35.99 ALABAMA 1048 2911000 3.600137e-04
1947 PERTUSSIS 76.40 ALABAMA 2249 2942000 7.644460e-04
1948 PERTUSSIS 42.48 ALABAMA 1263 2969000 4.253958e-04
1949 PERTUSSIS 17.20 ALABAMA 515 3000000 1.716667e-04
1950 PERTUSSIS 38.47 ALABAMA 1177 3058000 3.848921e-04
1951 PERTUSSIS 39.51 ALABAMA 1209 3059000 3.952272e-04
1952 PERTUSSIS 30.46 ALABAMA 945 3068000 3.080183e-04
1953 PERTUSSIS 10.38 ALABAMA 317 3053000 1.038323e-04
1954 PERTUSSIS 24.79 ALABAMA 748 3014000 2.481752e-04
1955 PERTUSSIS 58.12 ALABAMA 1774 3050000 5.816393e-04
1974 PERTUSSIS 0.52 ALABAMA 19 3678814 5.164708e-06
1975 PERTUSSIS 1.24 ALABAMA 46 3735139 1.231547e-05
1976 PERTUSSIS 0.17 ALABAMA 6 3780403 1.587132e-06
1977 PERTUSSIS 0.11 ALABAMA 4 3831836 1.043886e-06
1978 PERTUSSIS 0.08 ALABAMA 3 3866248 7.759461e-07
1979 PERTUSSIS 0.12 ALABAMA 4 3893888 1.027251e-06
1980 PERTUSSIS 0.11 ALABAMA 4 3918531 1.020791e-06
1982 PERTUSSIS 0.14 ALABAMA 5 3934102 1.270938e-06
1983 PERTUSSIS 0.11 ALABAMA 4 3951820 1.012192e-06
1984 PERTUSSIS 0.03 ALABAMA 1 3972523 2.517292e-07
1985 PERTUSSIS 0.77 ALABAMA 29 3991569 7.265313e-06
1986 PERTUSSIS 0.52 ALABAMA 24 4015264 5.977191e-06
1987 PERTUSSIS 0.59 ALABAMA 26 4023844 6.461483e-06
1988 PERTUSSIS 1.48 ALABAMA 62 4030222 1.538377e-05
1989 PERTUSSIS 2.32 ALABAMA 97 4040587 2.400641e-05
1990 PERTUSSIS 1.77 ALABAMA 76 4050055 1.876518e-05
1991 PERTUSSIS 1.53 ALABAMA 65 4099156 1.585692e-05
1992 PERTUSSIS 0.51 ALABAMA 24 4154014 5.777544e-06
1993 PERTUSSIS 1.54 ALABAMA 67 4214202 1.589862e-05
1994 PERTUSSIS 0.88 ALABAMA 39 4260229 9.154437e-06
1995 PERTUSSIS 0.88 ALABAMA 38 4296800 8.843791e-06
1996 PERTUSSIS 0.61 ALABAMA 28 4331102 6.464867e-06
1997 PERTUSSIS 0.81 ALABAMA 37 4367935 8.470822e-06
1998 PERTUSSIS 0.55 ALABAMA 24 4404701 5.448724e-06
1999 PERTUSSIS 0.45 ALABAMA 20 4430141 4.514529e-06
2000 PERTUSSIS 0.35 ALABAMA 17 4451849 3.818638e-06
2001 PERTUSSIS 0.73 ALABAMA 35 4464034 7.840442e-06
2002 PERTUSSIS 0.78 ALABAMA 37 4472420 8.272926e-06
2003 PERTUSSIS 0.28 ALABAMA 14 4490591 3.117630e-06
2004 PERTUSSIS 0.89 ALABAMA 41 4512190 9.086497e-06
2005 PERTUSSIS 1.76 ALABAMA 82 4545049 1.804161e-05
2006 PERTUSSIS 1.94 ALABAMA 90 4597688 1.957506e-05
2007 PERTUSSIS 1.54 ALABAMA 72 4637904 1.552425e-05
2008 PERTUSSIS 1.05 ALABAMA 50 4677464 1.068955e-05
2009 PERTUSSIS 3.49 ALABAMA 167 4708708 3.546620e-05
2010 PERTUSSIS 2.34 ALABAMA 113 4729656 2.389180e-05
2011 PERTUSSIS 2.16 ALABAMA 106 4802740 2.207073e-05
In [161]:
map_pertussis_df = map_df(states_pertussis_incidence)
map_pertussis_df.head()
Out[161]:
State 1938 1939 1940 1941 1942 1943 1944 1945 1946 ... 2003 2004 2005 2006 2007 2008 2009 2010 2011 Geometry
0 ALABAMA 0.000661 0.000702 0.000434 0.000540 0.000558 0.000708 0.000469 0.000390 0.000360 ... 0.000003 0.000009 0.000018 0.000020 0.000016 0.000011 0.000035 0.000024 0.000022 POLYGON ((-9841333.855 3579579.495, -9841349.6...
1 ALASKA NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 0.000009 0.000017 0.000184 0.000078 0.000054 0.000278 0.000100 0.000059 0.000037 MULTIPOLYGON (((-15108567.654 8339788.089, -15...
2 ARIZONA 0.003489 0.002138 0.001788 0.002400 0.002048 0.001366 0.001598 0.001436 0.001135 ... 0.000016 0.000036 0.000159 0.000057 0.000031 0.000016 0.000023 0.000027 0.000041 POLYGON ((-12761162.105 4147165.875, -12761214...
3 ARKANSAS 0.000804 0.000483 0.000415 0.000568 0.000464 0.000794 0.000552 0.000392 0.000271 ... 0.000011 0.000028 0.000104 0.000021 0.000030 0.000020 0.000036 0.000023 0.000017 POLYGON ((-10515267.713 4101325.818, -10515269...
4 CALIFORNIA 0.002271 0.001245 0.001823 0.002622 0.001556 0.001501 0.000620 0.001200 0.000424 ... 0.000018 0.000014 0.000045 0.000023 0.000003 0.000008 0.000011 0.000034 0.000030 MULTIPOLYGON (((-13060108.516 3854208.959, -13...

5 rows × 57 columns

In [162]:
create_map(2011, map_pertussis_df, 'Pertussis')

Smallpox

In [163]:
#utilize the above functions to create dictionary of state specific DataFrames
states_smallpox_incidence, states_smallpox = incidence_df(smallpox)
In [164]:
states_smallpox_incidence.head()
Out[164]:
ALABAMA ARIZONA ARKANSAS CALIFORNIA COLORADO CONNECTICUT DELAWARE DISTRICT OF COLUMBIA FLORIDA GEORGIA ... SOUTH DAKOTA TENNESSEE TEXAS UTAH VERMONT VIRGINIA WASHINGTON WEST VIRGINIA WISCONSIN WYOMING
year
1928 0.000129 0.000972 0.000134 0.000185 0.000335 0.000102 0.000009 0.000021 0.000110 0.000005 ... 0.000511 0.000258 0.000316 0.000821 0.000028 8.213552e-07 0.001031 0.000767 0.000283 0.000732
1929 0.000143 0.000642 0.000107 0.000363 0.000837 0.000027 0.000008 0.000000 0.000025 0.000019 ... 0.001587 0.000094 0.000424 0.000327 0.000387 2.597938e-05 0.001329 0.000351 0.000232 0.001498
1930 0.000073 0.001263 0.000221 0.000402 0.000540 0.000000 0.000000 0.000000 0.000019 0.000000 ... 0.002007 0.000162 0.000468 0.000053 0.000219 4.120313e-06 0.001496 0.000451 0.000308 0.001111
1931 0.000111 0.000107 0.000407 0.000239 0.000251 0.000061 0.000000 0.000000 0.000019 0.000016 ... 0.000964 0.000133 0.000305 0.000033 0.000663 4.907975e-06 0.000657 0.000144 0.000124 0.000236
1932 0.000176 0.000016 0.000196 0.000084 0.000057 0.000053 0.000000 0.000002 0.000022 0.000005 ... 0.000228 0.000162 0.000179 0.000008 0.000486 4.078303e-06 0.000420 0.000035 0.000029 0.000100

5 rows × 49 columns

In [165]:
states_smallpox['ALABAMA']
Out[165]:
disease increase loc number population incidence
year
1928 SMALLPOX 12.96 ALABAMA 341 2640000 1.291667e-04
1929 SMALLPOX 14.33 ALABAMA 378 2644000 1.429652e-04
1930 SMALLPOX 7.28 ALABAMA 192 2647000 7.253495e-05
1931 SMALLPOX 11.18 ALABAMA 295 2649000 1.113628e-04
1932 SMALLPOX 17.64 ALABAMA 467 2653000 1.760271e-04
1933 SMALLPOX 3.14 ALABAMA 82 2661000 3.081548e-05
1934 SMALLPOX 0.87 ALABAMA 23 2685000 8.566108e-06
1935 SMALLPOX 1.55 ALABAMA 42 2719000 1.544686e-05
1936 SMALLPOX 0.45 ALABAMA 12 2743000 4.374772e-06
1937 SMALLPOX 1.97 ALABAMA 54 2762000 1.955105e-05
1938 SMALLPOX 2.55 ALABAMA 74 2787000 2.655185e-05
1939 SMALLPOX 1.74 ALABAMA 52 2814000 1.847903e-05
1940 SMALLPOX 1.65 ALABAMA 49 2845000 1.722320e-05
1941 SMALLPOX 0.34 ALABAMA 12 2902000 4.135079e-06
1942 SMALLPOX 0.41 ALABAMA 13 2941000 4.420265e-06
1943 SMALLPOX 0.61 ALABAMA 20 2902000 6.891799e-06
1944 SMALLPOX 0.42 ALABAMA 12 2802000 4.282655e-06
1945 SMALLPOX 0.20 ALABAMA 5 2775000 1.801802e-06
1946 SMALLPOX 0.16 ALABAMA 5 2911000 1.717623e-06
1947 SMALLPOX 0.18 ALABAMA 6 2942000 2.039429e-06
1948 SMALLPOX 0.45 ALABAMA 15 2969000 5.052206e-06
1951 SMALLPOX 0.03 ALABAMA 1 3059000 3.269042e-07
In [166]:
map_smallpox_df = map_df(states_smallpox_incidence)
map_smallpox_df.head()
Out[166]:
State 1928 1929 1930 1931 1932 1933 1934 1935 1936 ... 1941 1942 1943 1944 1945 1946 1947 1948 1951 Geometry
0 ALABAMA 0.000129 0.000143 0.000073 0.000111 0.000176 0.000031 0.000009 0.000015 0.000004 ... 0.000004 4.420265e-06 6.891799e-06 0.000004 1.801802e-06 0.000002 2.039429e-06 0.000005 3.269042e-07 POLYGON ((-9841333.855 3579579.495, -9841349.6...
1 ARIZONA 0.000972 0.000642 0.001263 0.000107 0.000016 0.000023 0.000007 0.000005 0.000007 ... 0.000033 9.541985e-06 8.670520e-06 0.000011 5.050505e-06 0.000003 0.000000e+00 0.000003 3.821656e-06 POLYGON ((-12761162.105 4147165.875, -12761214...
2 ARKANSAS 0.000134 0.000107 0.000221 0.000407 0.000196 0.000124 0.000057 0.000032 0.000007 ... 0.000017 3.136065e-05 3.092784e-05 0.000012 1.759364e-05 0.000006 2.178649e-06 0.000011 NaN POLYGON ((-10515267.713 4101325.818, -10515269...
3 CALIFORNIA 0.000185 0.000363 0.000402 0.000239 0.000084 0.000178 0.000030 0.000048 0.000018 ... 0.000002 9.049774e-07 4.702563e-07 0.000002 5.351027e-07 0.000001 1.017087e-07 0.000004 NaN MULTIPOLYGON (((-13060108.516 3854208.959, -13...
4 COLORADO 0.000335 0.000837 0.000540 0.000251 0.000057 0.000156 0.000143 0.000160 0.000180 ... 0.000045 1.886792e-05 1.040763e-05 0.000011 5.376344e-06 0.000011 8.084074e-07 0.000013 NaN POLYGON ((-12139399.612 4944791.641, -12139399...

5 rows × 24 columns

In [167]:
create_map(1940, map_smallpox_df, 'Smallpox')

Hepatitis A

In [168]:
#utilize the above functions to create dictionary of state specific DataFrames
states_hepa_incidence, states_hepa = incidence_df(hepa)
In [169]:
states_hepa_incidence.head()
Out[169]:
ALABAMA ALASKA ARIZONA ARKANSAS CALIFORNIA COLORADO CONNECTICUT DELAWARE DISTRICT OF COLUMBIA FLORIDA ... SOUTH DAKOTA TENNESSEE TEXAS UTAH VERMONT VIRGINIA WASHINGTON WEST VIRGINIA WISCONSIN WYOMING
year
1966 0.000093 0.000332 0.000257 0.000157 0.000315 0.000145 0.000115 0.000153 0.000048 0.000106 ... 0.000020 0.000261 0.000172 0.000168 0.000056 0.000130 0.000189 0.000137 0.000088 0.000167
1967 0.000084 0.000140 0.000237 0.000155 0.000390 0.000143 0.000199 0.000330 0.000053 0.000103 ... 0.000019 0.000217 0.000257 0.000155 0.000035 0.000157 0.000277 0.000205 0.000104 0.000202
1968 0.000091 0.000295 0.000253 0.000087 0.000558 0.000337 0.000169 0.000273 0.000086 0.000173 ... 0.000115 0.000267 0.000202 0.000252 0.000156 0.000127 0.000364 0.000189 0.000125 0.000114
1969 0.000110 0.000213 0.000304 0.000099 0.000459 0.000271 0.000206 0.000124 0.000076 0.000204 ... 0.000129 0.000260 0.000209 0.000271 0.000341 0.000093 0.000377 0.000194 0.000095 0.000246
1970 0.000118 0.000358 0.000324 0.000107 0.000463 0.000397 0.000281 0.000239 0.000116 0.000252 ... 0.000031 0.000316 0.000238 0.000282 0.000808 0.000285 0.000399 0.000216 0.000108 0.000244

5 rows × 51 columns

In [170]:
states_hepa['ALABAMA']
Out[170]:
disease increase loc number population incidence
year
1966 HEPATITIS A 9.28 ALABAMA 321 3464000 0.000093
1967 HEPATITIS A 8.42 ALABAMA 291 3458000 0.000084
1968 HEPATITIS A 9.18 ALABAMA 314 3446000 0.000091
1969 HEPATITIS A 11.02 ALABAMA 380 3440000 0.000110
1970 HEPATITIS A 11.80 ALABAMA 413 3497076 0.000118
1971 HEPATITIS A 10.66 ALABAMA 378 3539400 0.000107
1972 HEPATITIS A 9.57 ALABAMA 342 3579780 0.000096
1973 HEPATITIS A 12.93 ALABAMA 467 3626499 0.000129
1974 HEPATITIS A 6.58 ALABAMA 244 3678814 0.000066
1975 HEPATITIS A 7.65 ALABAMA 286 3735139 0.000077
1976 HEPATITIS A 5.86 ALABAMA 220 3780403 0.000058
1977 HEPATITIS A 5.38 ALABAMA 206 3831836 0.000054
1978 HEPATITIS A 5.29 ALABAMA 203 3866248 0.000053
1979 HEPATITIS A 6.64 ALABAMA 257 3893888 0.000066
1980 HEPATITIS A 5.09 ALABAMA 200 3918531 0.000051
1981 HEPATITIS A 3.59 ALABAMA 138 3925266 0.000035
1982 HEPATITIS A 3.33 ALABAMA 128 3934102 0.000033
1983 HEPATITIS A 4.56 ALABAMA 178 3951820 0.000045
1984 HEPATITIS A 2.97 ALABAMA 115 3972523 0.000029
1985 HEPATITIS A 1.01 ALABAMA 37 3991569 0.000009
1986 HEPATITIS A 1.18 ALABAMA 51 4015264 0.000013
1987 HEPATITIS A 1.91 ALABAMA 81 4023844 0.000020
1988 HEPATITIS A 0.30 ALABAMA 13 4030222 0.000003
1989 HEPATITIS A 1.29 ALABAMA 56 4040587 0.000014
1990 HEPATITIS A 2.05 ALABAMA 86 4050055 0.000021
1991 HEPATITIS A 0.87 ALABAMA 39 4099156 0.000010
1992 HEPATITIS A 0.81 ALABAMA 35 4154014 0.000008
1993 HEPATITIS A 0.94 ALABAMA 40 4214202 0.000009
1994 HEPATITIS A 1.68 ALABAMA 72 4260229 0.000017
1995 HEPATITIS A 1.77 ALABAMA 75 4296800 0.000017
1996 HEPATITIS A 4.55 ALABAMA 196 4331102 0.000045
1997 HEPATITIS A 0.90 ALABAMA 40 4367935 0.000009
1998 HEPATITIS A 1.47 ALABAMA 64 4404701 0.000015
1999 HEPATITIS A 1.17 ALABAMA 52 4430141 0.000012
2000 HEPATITIS A 1.15 ALABAMA 53 4451849 0.000012
2001 HEPATITIS A 1.41 ALABAMA 66 4464034 0.000015
2002 HEPATITIS A 0.75 ALABAMA 35 4472420 0.000008
2003 HEPATITIS A 0.28 ALABAMA 14 4490591 0.000003
2004 HEPATITIS A 0.12 ALABAMA 6 4512190 0.000001
2005 HEPATITIS A 0.71 ALABAMA 34 4545049 0.000007
2006 HEPATITIS A 0.35 ALABAMA 17 4597688 0.000004
2007 HEPATITIS A 0.41 ALABAMA 20 4637904 0.000004
2008 HEPATITIS A 0.20 ALABAMA 10 4677464 0.000002
2009 HEPATITIS A 0.20 ALABAMA 10 4708708 0.000002
2010 HEPATITIS A 0.16 ALABAMA 8 4729656 0.000002
2011 HEPATITIS A 0.14 ALABAMA 7 4802740 0.000001
In [171]:
map_hepa_df = map_df(states_hepa_incidence)
map_hepa_df.head()
Out[171]:
State 1966 1967 1968 1969 1970 1971 1972 1973 1974 ... 2003 2004 2005 2006 2007 2008 2009 2010 2011 Geometry
0 ALABAMA 0.000093 0.000084 0.000091 0.000110 0.000118 0.000107 0.000096 0.000129 0.000066 ... 0.000003 0.000001 0.000007 0.000004 0.000004 0.000002 0.000002 0.000002 0.000001 POLYGON ((-9841333.855 3579579.495, -9841349.6...
1 ALASKA 0.000332 0.000140 0.000295 0.000213 0.000358 0.000296 0.000602 0.000660 0.000659 ... 0.000012 0.000006 0.000004 NaN 0.000004 0.000004 0.000009 0.000006 0.000001 MULTIPOLYGON (((-15108567.654 8339788.089, -15...
2 ARIZONA 0.000257 0.000237 0.000253 0.000304 0.000324 0.000446 0.000311 0.000072 0.000224 ... 0.000042 0.000036 0.000040 0.000011 0.000026 0.000016 0.000011 0.000011 0.000003 POLYGON ((-12761162.105 4147165.875, -12761214...
3 ARKANSAS 0.000157 0.000155 0.000087 0.000099 0.000107 0.000117 0.000135 0.000067 0.000206 ... 0.000006 0.000019 0.000006 0.000008 0.000004 0.000001 0.000002 0.000000 0.000000 POLYGON ((-10515267.713 4101325.818, -10515269...
4 CALIFORNIA 0.000315 0.000390 0.000558 0.000459 0.000463 0.000467 0.000345 0.000325 0.000258 ... 0.000030 0.000022 0.000025 0.000019 0.000009 0.000009 0.000006 0.000005 0.000005 MULTIPOLYGON (((-13060108.516 3854208.959, -13...

5 rows × 48 columns

In [172]:
create_map(2011, map_hepa_df, 'Hepatitis A')

Rubella

In [173]:
#utilize the above functions to create dictionary of state specific DataFrames
states_rubella_incidence, states_rubella = incidence_df(rubella)
In [174]:
states_rubella_incidence.head()
Out[174]:
ALABAMA ALASKA ARIZONA ARKANSAS CALIFORNIA COLORADO CONNECTICUT DELAWARE DISTRICT OF COLUMBIA FLORIDA ... SOUTH DAKOTA TENNESSEE TEXAS UTAH VERMONT VIRGINIA WASHINGTON WEST VIRGINIA WISCONSIN WYOMING
year
1966 0.000032 0.000413 0.001595 0.000005 0.000149 0.000370 0.000756 0.000103 0.000019 0.000231 ... 0.000003 0.000676 0.000013 0.000080 0.000385 0.000216 0.001024 0.000573 0.001194 0.000740
1967 0.000062 0.001137 0.000593 0.000009 0.000473 0.000882 0.000622 0.000141 0.000009 0.000184 ... 0.000004 0.000333 0.000059 0.000080 0.000506 0.000144 0.000955 0.000334 0.000713 0.000012
1968 0.000117 0.001014 0.000416 0.000002 0.000252 0.000418 0.001025 0.000281 0.000018 0.000229 ... NaN 0.000293 0.000265 0.000107 0.000212 0.000140 0.000561 0.000499 0.000686 0.000037
1969 0.000040 0.001574 0.000471 0.000104 0.000306 0.000635 0.000548 0.000365 0.000209 0.000299 ... NaN 0.000428 0.000378 0.000146 0.000275 0.000334 0.000540 0.001348 0.000720 0.000246
1970 0.000109 0.000269 0.000328 0.000019 0.000258 0.000188 0.000203 0.000081 0.000028 0.000466 ... 0.000006 0.000381 0.000723 0.000157 0.000150 0.000162 0.001415 0.000800 0.000514 0.000400

5 rows × 51 columns

In [175]:
states_rubella['ALABAMA']
Out[175]:
disease increase loc number population incidence
year
1966 RUBELLA 3.25 ALABAMA 112 3464000 3.233256e-05
1967 RUBELLA 6.22 ALABAMA 214 3458000 6.188548e-05
1968 RUBELLA 11.75 ALABAMA 404 3446000 1.172374e-04
1969 RUBELLA 3.99 ALABAMA 136 3440000 3.953488e-05
1970 RUBELLA 10.84 ALABAMA 380 3497076 1.086622e-04
1971 RUBELLA 6.41 ALABAMA 226 3539400 6.385263e-05
1972 RUBELLA 1.95 ALABAMA 68 3579780 1.899558e-05
1973 RUBELLA 6.14 ALABAMA 220 3626499 6.066457e-05
1974 RUBELLA 1.78 ALABAMA 65 3678814 1.766874e-05
1975 RUBELLA 0.64 ALABAMA 23 3735139 6.157736e-06
1976 RUBELLA 0.06 ALABAMA 2 3780403 5.290441e-07
1977 RUBELLA 2.85 ALABAMA 109 3831836 2.844589e-05
1978 RUBELLA 0.82 ALABAMA 31 3866248 8.018110e-06
1979 RUBELLA 1.17 ALABAMA 44 3893888 1.129976e-05
1980 RUBELLA 0.12 ALABAMA 4 3918531 1.020791e-06
1981 RUBELLA 0.03 ALABAMA 1 3925266 2.547598e-07
1983 RUBELLA 0.03 ALABAMA 1 3951820 2.530480e-07
1984 RUBELLA 0.08 ALABAMA 3 3972523 7.551876e-07
1989 RUBELLA 0.02 ALABAMA 1 4040587 2.474888e-07
1990 RUBELLA 0.02 ALABAMA 1 4050055 2.469102e-07
1996 RUBELLA 0.05 ALABAMA 2 4331102 4.617762e-07
1998 RUBELLA 0.00 ALABAMA 0 4404701 0.000000e+00
1999 RUBELLA 0.02 ALABAMA 1 4430141 2.257264e-07
2000 RUBELLA 0.07 ALABAMA 3 4451849 6.738773e-07
In [176]:
map_rubella_df = map_df(states_rubella_incidence)
map_rubella_df.head()
Out[176]:
State 1966 1967 1968 1969 1970 1971 1972 1973 1974 ... 1981 1983 1984 1989 1990 1996 1998 1999 2000 Geometry
0 ALABAMA 0.000032 0.000062 0.000117 0.000040 0.000109 0.000064 0.000019 0.000061 1.766874e-05 ... 2.547598e-07 2.530480e-07 7.551876e-07 2.474888e-07 2.469102e-07 4.617762e-07 0.000000e+00 2.257264e-07 6.738773e-07 POLYGON ((-9841333.855 3579579.495, -9841349.6...
1 ALASKA 0.000413 0.001137 0.001014 0.001574 0.000269 0.000126 0.000085 0.000085 NaN ... 2.224170e-06 1.946654e-06 1.877952e-06 NaN NaN NaN NaN NaN NaN MULTIPOLYGON (((-15108567.654 8339788.089, -15...
2 ARIZONA 0.001595 0.000593 0.000416 0.000471 0.000328 0.000200 0.000173 0.000012 8.753321e-07 ... 7.612823e-06 2.608297e-06 1.256464e-06 NaN 8.414545e-06 2.180103e-07 4.095556e-07 2.189568e-06 1.935473e-07 POLYGON ((-12761162.105 4147165.875, -12761214...
3 ARKANSAS 0.000005 0.000009 0.000002 0.000104 0.000019 0.000167 0.000035 0.000049 3.704517e-06 ... 3.051097e-06 NaN 1.289188e-06 4.254007e-07 1.273028e-06 NaN NaN 7.541876e-07 NaN POLYGON ((-10515267.713 4101325.818, -10515269...
4 CALIFORNIA 0.000149 0.000473 0.000252 0.000306 0.000258 0.000407 0.000230 0.000119 6.639475e-05 ... 1.913779e-05 1.029237e-05 6.089003e-06 3.797040e-06 1.552095e-05 1.499118e-06 6.062870e-08 1.492573e-07 3.235811e-07 MULTIPOLYGON (((-13060108.516 3854208.959, -13...

5 rows × 26 columns

In [177]:
create_map(1970, map_rubella_df, 'Rubella')

Mumps

In [178]:
#utilize the above functions to create dictionary of state specific DataFrames
states_mumps_incidence, states_mumps = incidence_df(mumps)
In [179]:
states_mumps_incidence.head()
Out[179]:
ALABAMA ALASKA ARIZONA ARKANSAS CALIFORNIA COLORADO CONNECTICUT DELAWARE DISTRICT OF COLUMBIA FLORIDA ... SOUTH DAKOTA TENNESSEE TEXAS UTAH VERMONT VIRGINIA WASHINGTON WEST VIRGINIA WISCONSIN WYOMING
year
1968 0.000159 0.002519 0.001197 0.000010 0.000871 0.001325 0.001294 0.000566 0.000303 0.000347 ... NaN 0.001532 0.001039 0.000675 0.002898 0.000278 0.002006 0.001976 0.003547 0.001231
1969 0.000068 0.003108 0.001210 0.000047 0.000407 0.000397 0.001328 0.000241 0.000121 0.000248 ... NaN 0.000585 0.000731 0.000319 0.002872 0.000280 0.001302 0.001574 0.001948 0.000024
1970 0.000112 0.000919 0.000521 0.000079 0.000306 0.000666 0.001192 0.000648 0.000291 0.000406 ... 0.000086 0.000823 0.000519 0.000117 0.001620 0.000469 0.001691 0.001432 0.002543 0.000129
1971 0.000290 0.000287 0.000620 0.000078 0.000365 0.000656 0.000516 0.000321 0.000132 0.000384 ... 0.000444 0.001245 0.000750 0.000172 0.000764 0.000222 0.001847 0.001431 0.003726 0.001226
1972 0.000237 0.000771 0.000459 0.000105 0.000341 0.000331 0.000365 0.000306 0.000042 0.000180 ... 0.000177 0.000512 0.000410 0.000121 0.000292 0.000268 0.001139 0.001411 0.001797 0.000983

5 rows × 51 columns

In [180]:
states_mumps['ALABAMA']
Out[180]:
disease increase loc number population incidence
year
1968 MUMPS 15.94 ALABAMA 548 3446000 1.590250e-04
1969 MUMPS 6.85 ALABAMA 234 3440000 6.802326e-05
1970 MUMPS 11.20 ALABAMA 392 3497076 1.120936e-04
1971 MUMPS 28.97 ALABAMA 1025 3539400 2.895971e-04
1972 MUMPS 23.72 ALABAMA 849 3579780 2.371654e-04
1973 MUMPS 20.85 ALABAMA 755 3626499 2.081898e-04
1974 MUMPS 16.66 ALABAMA 613 3678814 1.666298e-04
1975 MUMPS 12.17 ALABAMA 453 3735139 1.212806e-04
1976 MUMPS 10.30 ALABAMA 389 3780403 1.028991e-04
1977 MUMPS 7.54 ALABAMA 288 3831836 7.515979e-05
1978 MUMPS 11.30 ALABAMA 437 3866248 1.130295e-04
1979 MUMPS 0.77 ALABAMA 28 3893888 7.190756e-06
1980 MUMPS 0.92 ALABAMA 34 3918531 8.676721e-06
1981 MUMPS 0.51 ALABAMA 19 3925266 4.840436e-06
1982 MUMPS 0.29 ALABAMA 10 3934102 2.541876e-06
1983 MUMPS 0.06 ALABAMA 2 3951820 5.060959e-07
1984 MUMPS 0.15 ALABAMA 5 3972523 1.258646e-06
1985 MUMPS 0.03 ALABAMA 1 3991569 2.505281e-07
1986 MUMPS 0.08 ALABAMA 4 4015264 9.961985e-07
1987 MUMPS 1.52 ALABAMA 64 4023844 1.590519e-05
1988 MUMPS 0.41 ALABAMA 18 4030222 4.466255e-06
1989 MUMPS 0.66 ALABAMA 30 4040587 7.424664e-06
1990 MUMPS 0.42 ALABAMA 19 4050055 4.691294e-06
1991 MUMPS 0.29 ALABAMA 14 4099156 3.415337e-06
1992 MUMPS 0.25 ALABAMA 12 4154014 2.888772e-06
1993 MUMPS 0.49 ALABAMA 22 4214202 5.220443e-06
1994 MUMPS 0.29 ALABAMA 12 4260229 2.816750e-06
1995 MUMPS 0.04 ALABAMA 2 4296800 4.654627e-07
1996 MUMPS 0.15 ALABAMA 7 4331102 1.616217e-06
1997 MUMPS 0.16 ALABAMA 8 4367935 1.831529e-06
1998 MUMPS 0.08 ALABAMA 4 4404701 9.081207e-07
1999 MUMPS 0.24 ALABAMA 11 4430141 2.482991e-06
2000 MUMPS 0.08 ALABAMA 4 4451849 8.985031e-07
2002 MUMPS 0.04 ALABAMA 2 4472420 4.471852e-07
In [181]:
map_mumps_df = map_df(states_mumps_incidence)
map_mumps_df.head()
Out[181]:
State 1968 1969 1970 1971 1972 1973 1974 1975 1976 ... 1993 1994 1995 1996 1997 1998 1999 2000 2002 Geometry
0 ALABAMA 0.000159 0.000068 0.000112 0.000290 0.000237 0.000208 0.000167 0.000121 0.000103 ... 0.000005 0.000003 4.654627e-07 1.616217e-06 0.000002 9.081207e-07 0.000002 8.985031e-07 4.471852e-07 POLYGON ((-9841333.855 3579579.495, -9841349.6...
1 ALASKA 0.002519 0.003108 0.000919 0.000287 0.000771 0.002557 0.000396 0.000130 0.000094 ... 0.000018 0.000007 1.985401e-05 4.929597e-06 0.000007 3.226160e-06 0.000005 1.274902e-05 NaN MULTIPOLYGON (((-15108567.654 8339788.089, -15...
2 ARIZONA 0.001197 0.001210 0.000521 0.000620 0.000459 0.000074 NaN NaN NaN ... 0.000005 0.000023 1.353638e-06 2.180103e-07 0.000007 1.228667e-06 0.000002 1.161284e-06 0.000000e+00 POLYGON ((-12761162.105 4147165.875, -12761214...
3 ARKANSAS 0.000010 0.000047 0.000079 0.000078 0.000105 0.000160 0.000085 0.000085 0.000037 ... 0.000002 0.000002 3.944152e-06 3.887860e-07 0.000000 3.426889e-06 NaN 1.866864e-06 NaN POLYGON ((-10515267.713 4101325.818, -10515269...
4 CALIFORNIA 0.000871 0.000407 0.000306 0.000365 0.000341 0.000268 0.000116 0.000170 0.000079 ... 0.000005 0.000005 6.467574e-06 3.185625e-06 0.000004 1.940119e-06 0.000002 1.853237e-06 1.892408e-06 MULTIPOLYGON (((-13060108.516 3854208.959, -13...

5 rows × 36 columns

In [182]:
create_map(1970, map_mumps_df, 'Mumps')

Exploratory Analysis & Data Visualization Part 2 - Incidence Analysis

Now that we have datasets portraying incidence on both a national and state-wide scale, we can see how each state compares to the national average and establish a ranking of which states were most often affected by a given disease.

We will start by creating a dataset that has the incidence for every state every year as well as the entire country.

In [183]:
def US_state_df(disease_incidence, states_disease_incidence):
  #reset index to year for joining
  disease_incidence.set_index('year', inplace = True)

  #grab the national incidence and add it into a new dataframe with state incidences
  US = disease_incidence['incidence']
  US_state_disease_incidence = states_disease_incidence.join(US)

  #rename column 
  US_state_disease_incidence.rename(columns={'incidence':'UNITED STATES'}, inplace = True)

  return US_state_disease_incidence
In [184]:
def graph_state_incidence(US_state_disease_incidence):

  states = list(US_state_disease_incidence.columns)
  states.remove('UNITED STATES')

  disease_compare_incidence = US_state_disease_incidence[states].div(US_state_disease_incidence['UNITED STATES'], axis=0)

  disease_hist = pd.DataFrame()

  for state in disease_compare_incidence:
    count = disease_compare_incidence[state][disease_compare_incidence[state] > 1].count()
    disease_hist.insert(0, state, [count])

  disease_hist = disease_hist.T
  disease_hist.sort_values(0, ascending = False, inplace = True)

  disease_hist.plot.bar(title='Number Of Times State Incidence is Above National Average Incidence', figsize=(20, 10), legend=False)

Measles

In [185]:
US_state_measles_incidence = US_state_df(measles_incidence, states_measles_incidence)
graph_state_incidence(US_state_measles_incidence)

Polio

In [186]:
US_state_polio_incidence = US_state_df(polio_incidence, states_polio_incidence)
graph_state_incidence(US_state_polio_incidence)

Pertussis

In [187]:
US_state_pertussis_incidence = US_state_df(pertussis_incidence, states_pertussis_incidence)
graph_state_incidence(US_state_pertussis_incidence)

Smallpox

In [188]:
US_state_smallpox_incidence = US_state_df(smallpox_incidence, states_smallpox_incidence)
graph_state_incidence(US_state_smallpox_incidence)

Hepatitis A

In [189]:
US_state_hepa_incidence = US_state_df(hepa_incidence, states_hepa_incidence)
graph_state_incidence(US_state_hepa_incidence)

Rubella

In [190]:
US_state_rubella_incidence = US_state_df(rubella_incidence, states_rubella_incidence)
graph_state_incidence(US_state_rubella_incidence)

Mumps

In [191]:
US_state_mumps_incidence = US_state_df(mumps_incidence, states_mumps_incidence)
graph_state_incidence(US_state_mumps_incidence)

Data Collection and Processing (ETL) - Population Density and Income Data

In [192]:
df_density = pd.read_csv("../finalProject/apportionment.csv", sep=",")

df_density.head()

#grab column names for easy transformation
df_density.columns
#['Name', 'Geography Type', 'Year', 'Resident Population',
       #'Percent Change in Resident Population', 'Resident Population Density',
       #'Resident Population Density Rank', 'Number of Representatives',
       #'Change in Number of Representatives',
       #'Average Apportionment Population Per Representative']

#first we need to get rid of the unnecessary information just to make things cleaner
df_density = df_density.drop(columns=['Resident Population', 'Resident Population Density Rank',
                                      'Percent Change in Resident Population', 
                                      'Number of Representatives',	'Change in Number of Representatives', 
                                      'Average Apportionment Population Per Representative'])
df_density
Out[192]:
Name Geography Type Year Resident Population Density
0 Alabama State 1910 42.2
1 Alaska State 1910 0.1
2 Arizona State 1910 1.8
3 Arkansas State 1910 30.3
4 California State 1910 15.3
... ... ... ... ...
679 Midwest Region Region 2020 NaN
680 Northeast Region Region 2020 NaN
681 South Region Region 2020 NaN
682 West Region Region 2020 NaN
683 United States Nation 2020 93.8

684 rows × 4 columns

There are also some rows within the dataframe that we don't need (at least for initial use with our first dataset), such as the regions and the nationwide information. To fix this, we can just keep the rows with the Geography Type "State".

In [193]:
pop_density = df_density[df_density['Geography Type'] == 'State']
pop_density = pop_density.drop(pop_density[pop_density['Name'] == 'Puerto Rico'].index)
pop_density
Out[193]:
Name Geography Type Year Resident Population Density
0 Alabama State 1910 42.2
1 Alaska State 1910 0.1
2 Arizona State 1910 1.8
3 Arkansas State 1910 30.3
4 California State 1910 15.3
... ... ... ... ...
674 Virginia State 2020 218.6
675 Washington State 2020 115.9
676 West Virginia State 2020 74.6
677 Wisconsin State 2020 108.8
678 Wyoming State 2020 5.9

612 rows × 4 columns

In [194]:
pop_density[pop_density['Name'] == 'Alabama']
Out[194]:
Name Geography Type Year Resident Population Density
0 Alabama State 1910 42.2
57 Alabama State 1920 46.4
114 Alabama State 1930 52.3
171 Alabama State 1940 55.9
228 Alabama State 1950 60.5
285 Alabama State 1960 64.5
342 Alabama State 1970 68
399 Alabama State 1980 76.9
456 Alabama State 1990 79.8
513 Alabama State 2000 87.8
570 Alabama State 2010 94.4
627 Alabama State 2020 99.2
In [195]:
#check the datatypes of the dataset
pop_density.dtypes
Out[195]:
Name                           object
Geography Type                 object
Year                            int64
Resident Population Density    object
dtype: object
In [196]:
#change the resident population density to a float
pop_density['Resident Population Density'] = pd.to_numeric(pop_density['Resident Population Density'],errors='coerce')

pop_density['Resident Population Density'].describe()
Out[196]:
count    592.000000
mean     122.555912
std      175.061531
min        0.100000
25%       23.450000
50%       56.900000
75%      138.000000
max      974.700000
Name: Resident Population Density, dtype: float64
In [197]:
#We also need to make the states uppercase for easy combining later

pop_density['Name'] = pop_density['Name'].str.upper()

pop_density.head()
Out[197]:
Name Geography Type Year Resident Population Density
0 ALABAMA State 1910 42.2
1 ALASKA State 1910 0.1
2 ARIZONA State 1910 1.8
3 ARKANSAS State 1910 30.3
4 CALIFORNIA State 1910 15.3

The final step before combining this with the initial disease dataset is to ensure that all of the years present in the disease dataset also exist in the population density dataset. This data comes from the census, so there are only reports for the first year of every decade. While not exact, we can fill in the expected density for each intervening year based on the decennial values. This is done by creating rows for each missing year and then using interpolate to approximate the NaN values.
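Before doing this with an explicit loop, here is a compact sketch of the same idea (a sketch only, assuming the pop_density frame above; the loop-based version in the next cell is what we actually use):

#sketch: reindex each state's decennial densities to every year 1910-2020, then interpolate
full_years = range(1910, 2021)
density_by_year = (pop_density.set_index('Year')
                              .groupby('Name')['Resident Population Density']
                              .apply(lambda s: s.reindex(full_years).interpolate(method='linear')))
density_by_year.head()

We keep the loop-based approach below because it leaves us with one DataFrame per state, which we reuse later.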

In [198]:
#create list of years needed to fill in the population density data frame
years = list(range(1911, 2020, 1))


years.remove(1920)
years.remove(1930)
years.remove(1940)
years.remove(1950)
years.remove(1960)
years.remove(1970)
years.remove(1980)
years.remove(1990)
years.remove(2000)

years

#iterate over population density to create and fill in a dataframe for each state
states_density = dict()
for k, v in pop_density.groupby('Name'):
  states_density[k] = v
  for year in years:
    states_density[k] = states_density[k].append({'Name':k,	'Geography Type':'State',	'Year':year}, ignore_index=True)
  states_density[k].sort_values(by = 'Year', inplace = True)
In [199]:
#use interpolate to approximate population density for each year in each state
for frame in states_density:
    states_density[frame].interpolate(method='linear', inplace=True)

#As an example of what is being stored, here is one DataFrame from states_density
states_density['WISCONSIN']
Out[199]:
Name Geography Type Year Resident Population Density
0 WISCONSIN State 1910 43.10
12 WISCONSIN State 1911 43.65
13 WISCONSIN State 1912 44.20
14 WISCONSIN State 1913 44.75
15 WISCONSIN State 1914 45.30
... ... ... ... ...
108 WISCONSIN State 2016 107.28
109 WISCONSIN State 2017 107.66
110 WISCONSIN State 2018 108.04
111 WISCONSIN State 2019 108.42
11 WISCONSIN State 2020 108.80

112 rows × 4 columns

In order to be more workable, we need the state population densities to be in one large dataframe instead of a dictionary of dataframes. We wish to find a correlation between disease incidence and population density.

We hypothesize that higher population density would indicate a higher incidence, as disease can more easily spread.

In [200]:
#put all state population density into one large dataframe
years = states_density['ALABAMA']['Year']
list_of_names = states_density.keys()
res_pop_dens = pd.DataFrame()
for name in list_of_names:
  res_pop_dens[name] = states_density[name]["Resident Population Density"]
res_pop_dens.insert(0, 'Year', years)

#set the index to year for easy accessing
res_pop_dens.set_index('Year', inplace=True)

Because we need to find the correlation for every state for every disease, it is best to write a function to calculate and append this information for us.

In [201]:
def correlation_function(states, states_disease_incidence, res_pop_dens, name):
  #create empty array to store correlation data
  data = []
  for i in states: #calculates the correlation between incidence of disease and average population density for each state
    data.append(states_disease_incidence[[i]][i].corr(res_pop_dens[[i]][i]))
  #add this data into the larger Dataframe
  correl_density_df[name] = data
In [202]:
#Making correlation density dataframes per disease into 1 big dataframe
all_states = res_pop_dens.columns.unique()
correl_density_df = pd.DataFrame(columns = ['Location']) #creating a dataframe of 2 columns
correl_density_df['Location'] = all_states #filling in location with state names

#measles
correlation_function(all_states, states_measles_incidence, res_pop_dens, 'Correlation for Measles')

#Hepatitis A
correlation_function(all_states, states_hepa_incidence, res_pop_dens, 'Correlation for Hepatitis A')

#Mumps
correlation_function(all_states, states_mumps_incidence, res_pop_dens, 'Correlation for Mumps')

#Pertussis
correlation_function(all_states, states_pertussis_incidence, res_pop_dens, 'Correlation for Pertussis')

#Polio
correlation_function(all_states, states_polio_incidence, res_pop_dens, 'Correlation for Polio')

#Rubella
correlation_function(all_states, states_rubella_incidence, res_pop_dens, 'Correlation for Rubella')
correl_density_df.head()
Out[202]:
Location Correlation for Measles Correlation for Hepatitis A Correlation for Mumps Correlation for Pertussis Correlation for Polio Correlation for Rubella
0 ALABAMA -0.564066 -0.862142 -0.760208 -0.820631 -0.042730 -0.691614
1 ALASKA -0.625996 -0.430000 -0.586364 -0.591051 -0.411121 -0.632208
2 ARIZONA -0.556976 -0.649866 -0.636456 -0.783381 -0.082528 -0.478183
3 ARKANSAS -0.374926 -0.700985 -0.378505 -0.711537 0.159685 -0.384886
4 CALIFORNIA -0.681544 -0.894775 -0.650937 -0.718160 -0.214112 -0.684958
In [203]:
#report the maximum and minimum correlation states for measles as well as the average
print('Max correlation for measles: ', correl_density_df['Location'].iloc[correl_density_df['Correlation for Measles'].idxmax()])
print(states_density['WEST VIRGINIA'])
print('Min correlation for measles: ', correl_density_df['Location'].iloc[correl_density_df['Correlation for Measles'].idxmin()])
print(states_density['NEW YORK'])
print('Avg correlation for measles: ', correl_density_df['Correlation for Measles'].mean())
Max correlation for measles:  WEST VIRGINIA
              Name Geography Type  Year  Resident Population Density
0    WEST VIRGINIA          State  1910                        50.80
12   WEST VIRGINIA          State  1911                        51.81
13   WEST VIRGINIA          State  1912                        52.82
14   WEST VIRGINIA          State  1913                        53.83
15   WEST VIRGINIA          State  1914                        54.84
..             ...            ...   ...                          ...
108  WEST VIRGINIA          State  2016                        75.60
109  WEST VIRGINIA          State  2017                        75.35
110  WEST VIRGINIA          State  2018                        75.10
111  WEST VIRGINIA          State  2019                        74.85
11   WEST VIRGINIA          State  2020                        74.60

[112 rows x 4 columns]
Min correlation for measles:  NEW YORK
         Name Geography Type  Year  Resident Population Density
0    NEW YORK          State  1910                       193.40
12   NEW YORK          State  1911                       196.10
13   NEW YORK          State  1912                       198.80
14   NEW YORK          State  1913                       201.50
15   NEW YORK          State  1914                       204.20
..        ...            ...   ...                          ...
108  NEW YORK          State  2016                       421.70
109  NEW YORK          State  2017                       423.45
110  NEW YORK          State  2018                       425.20
111  NEW YORK          State  2019                       426.95
11   NEW YORK          State  2020                       428.70

[112 rows x 4 columns]
Avg correlation for measles:  -0.5378643108469423

The average correlation between state population density and measles incidence is about -0.54, which indicates a moderate negative correlation.

Our hypothesis was not supported: it appears that as population density increases, incidence shows a moderate decrease. This could be due to our assumption that all of these diseases spread primarily through close human contact, as well as an oversimplification of which variables determine contraction. It is also worth noting that both variables trend over time (density generally rises while incidence falls, especially after vaccines were introduced), so part of this negative correlation may simply reflect shared time trends rather than a direct relationship.

The maximum (least negative) correlation was for West Virginia, which has a considerably lower average population density than New York, the state with the minimum correlation. This suggests that other variables are at play that may be loosely related to population density but do not follow the same trends, leading to varying correlation values. One such variable could be access to health care and the ability to be diagnosed and treated, which can differ widely between states like WV and NY, whose city centers and population distributions look very different.
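To see where the rest of the states fall between these two extremes, we can sort the correlation table (a quick sketch using the correl_density_df built above):

#sketch: rank states from most negative to least negative measles/density correlation
correl_density_df.sort_values('Correlation for Measles').head(10)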

Income Level

We will attempt to investigate whether average income level by state has any correlation with disease incidence by state.

Data source: https://apps.bea.gov/iTable/?reqid=70&step=1&acrdn=2#eyJhcHBpZCI6NzAsInN0ZXBzIjpbMSwyNCwyOSwyNSwzMSwyNiwyNywzMF0sImRhdGEiOltbIlRhYmxlSWQiLCIyMSJdLFsiQ2xhc3NpZmljYXRpb24iLCJOb24tSW5kdXN0cnkiXSxbIk1ham9yX0FyZWEiLCIwIl0sWyJTdGF0ZSIsWyIwIl1dLFsiQXJlYSIsWyJYWCJdXSxbIlN0YXRpc3RpYyIsWyIxIl1dLFsiVW5pdF9vZl9tZWFzdXJlIiwiTGV2ZWxzIl0sWyJZZWFyIixbIi0xIl1dLFsiWWVhckJlZ2luIiwiLTEiXSxbIlllYXJfRW5kIiwiLTEiXV19

This data source comes from the U.S. Bureau of Economic Analysis (BEA). It contains average incomes for each state over the years 1929 to 2021.

In [204]:
income_df = pd.read_csv("../finalProject/income.csv", sep=",", skiprows = 3)
#reading in income file as a dataframe
#we need to skip the first 3 rows because there are some text explanations written in these rows

income_df.head()
#original dataframe
Out[204]:
GeoFips GeoName 1929 1930 1931 1932 1933 1934 1935 1936 ... 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021
0 00000 United States 85151.0 76394.0 65531.0 50162.0 47114.0 53967.0 60704.0 69063.0 ... 14003346.0 14189228.0 14969527.0 15681233.0 16092713.0 16837337.0 17671054.0 18575467.0 19812171.0 21288709.0
1 01000 Alabama 843.2 697.5 583.7 421.9 435.6 556.3 586.9 683.2 ... 173361.7 175185.3 181078.8 189115.4 192363.5 199191.8 206712.0 216587.5 232040.3 250828.8
2 02000 Alaska * (NA) (NA) (NA) (NA) (NA) (NA) (NA) (NA) ... 39266.0 38978.0 41210.0 42555.1 41837.4 42431.4 43981.2 45056.8 45965.1 48219.2
3 04000 Arizona 257.3 224.7 184.5 137.0 131.8 154.9 180.4 204.8 ... 237809.1 243752.3 257272.3 270815.5 282085.1 299249.2 316896.3 340259.8 375601.3 403739.3
4 05000 Arkansas 560.4 416.5 381.8 278.5 283.2 341.8 386.0 460.2 ... 107877.3 108367.8 114743.3 118719.5 121825.9 125285.9 130033.4 133161.4 142038.5 153185.8

5 rows × 95 columns

Data Filtering

We do not have a use for the GeoFips column, which is just the geo-codes corresponding to each state. We also need to get rid of the footnote rows at the bottom of the document.

The data source notes that estimates prior to 1950 are not available for Alaska and Hawaii.

In [205]:
import numpy as np
income_df
#income_df.drop(income_df[(income_df['GeoFips'] == object)].index, inplace=True)

income_df.drop(income_df.index[list(range(60, 65))], inplace=True)
#We only want to keep the first 60 rows because there are unnecessary lines at the end.

income_df.drop(columns = ["GeoFips"], inplace=True)
#We don't care about the GeoFips column

#We want to change the names that have asterisks to make the data look nicer
index_alaska = income_df.index[income_df['GeoName'] == 'Alaska *'].tolist()[0]
index_hawaii = income_df.index[income_df['GeoName'] == 'Hawaii *'].tolist()[0]
income_df.at[index_alaska, 'GeoName'] = 'Alaska'
income_df.at[index_hawaii, 'GeoName'] = 'Hawaii'

#income_df['GeoName'].loc[income_df['GeoName'].isin(['Alaska *'])] = 'Alaska'
#income_df['GeoName'].loc[income_df['GeoName'].isin(['Hawaii *'])] = 'Hawaii'

income_df.replace('(NA)', np.nan, inplace=True)
#Replacing NA values with NaN

income_df.isnull().sum()[income_df.isnull().sum() > 2]
#Since we know from the data source that 2 states do not have values before 1950, we check whether any of the other states (or DC) have missing values.
#We get back an empty series, so only the two expected states had NaNs.
Out[205]:
Series([], dtype: int64)

Now we display the resulting dataset after this cleanup:

In [206]:
income_df.head() #resulting dataset
Out[206]:
GeoName 1929 1930 1931 1932 1933 1934 1935 1936 1937 ... 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021
0 United States 85151.0 76394.0 65531.0 50162.0 47114.0 53967.0 60704.0 69063.0 74556.0 ... 14003346.0 14189228.0 14969527.0 15681233.0 16092713.0 16837337.0 17671054.0 18575467.0 19812171.0 21288709.0
1 Alabama 843.2 697.5 583.7 421.9 435.6 556.3 586.9 683.2 733.7 ... 173361.7 175185.3 181078.8 189115.4 192363.5 199191.8 206712.0 216587.5 232040.3 250828.8
2 Alaska NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 39266.0 38978.0 41210.0 42555.1 41837.4 42431.4 43981.2 45056.8 45965.1 48219.2
3 Arizona 257.3 224.7 184.5 137.0 131.8 154.9 180.4 204.8 229.0 ... 237809.1 243752.3 257272.3 270815.5 282085.1 299249.2 316896.3 340259.8 375601.3 403739.3
4 Arkansas 560.4 416.5 381.8 278.5 283.2 341.8 386.0 460.2 481.6 ... 107877.3 108367.8 114743.3 118719.5 121825.9 125285.9 130033.4 133161.4 142038.5 153185.8

5 rows × 94 columns

Now we need to complete some transformations to prepare the dataset for analysis.

In [207]:
income_df1 = income_df.set_index('GeoName') #reset index to be by state
income_dfT1 = income_df1.T #transpose
income_dfT1.reset_index(inplace=True) #adding year back as a column to use for plotting
income_dfT1.columns = income_dfT1.columns.str.upper() #need these to be uppercase to join with population density dataset
income_dfT1.rename(columns={'INDEX':'year'}, inplace=True) #renaming column to make more clear
income_dfT1.drop(['NEW ENGLAND', 'MIDEAST', 'GREAT LAKES', 'PLAINS', 'SOUTHEAST', 'SOUTHWEST', 'ROCKY MOUNTAIN', 'FAR WEST'], axis=1, inplace=True) #we want to drop these columns so that they don't mess up any averaging
income_dfT1.head()
Out[207]:
GeoName year UNITED STATES ALABAMA ALASKA ARIZONA ARKANSAS CALIFORNIA COLORADO CONNECTICUT DELAWARE ... SOUTH DAKOTA TENNESSEE TEXAS UTAH VERMONT VIRGINIA WASHINGTON WEST VIRGINIA WISCONSIN WYOMING
0 1929 85151.0 843.2 NaN 257.3 560.4 5495.5 635.6 1640.5 243.4 ... 286.2 971.2 2728.7 278.2 226.3 1054.5 1150.5 787.4 1966.8 149.9
1 1930 76394.0 697.5 NaN 224.7 416.5 5079.4 596.8 1494.7 205.1 ... 247.3 841.9 2381.2 251.6 206.0 938.4 1030.7 706.1 1724.7 131.6
2 1931 65531.0 583.7 NaN 184.5 381.8 4387.5 498.5 1313.1 187.7 ... 167.8 724.9 2038.5 190.8 169.2 906.9 844.7 616.3 1402.6 109.3
3 1932 50162.0 421.9 NaN 137.0 278.5 3441.4 380.3 1026.9 144.9 ... 130.9 526.2 1560.7 157.4 129.9 701.9 637.4 448.2 1094.0 86.4
4 1933 47114.0 435.6 NaN 131.8 283.2 3277.3 378.4 970.3 140.4 ... 93.9 552.6 1530.0 154.6 120.4 704.4 598.8 453.0 1014.1 85.5

5 rows × 53 columns

We need to check that the datatypes are correct.

In [208]:
income_dfT1.dtypes
Out[208]:
GeoName
year                    object
UNITED STATES           object
ALABAMA                 object
ALASKA                  object
ARIZONA                 object
ARKANSAS                object
CALIFORNIA              object
COLORADO                object
CONNECTICUT             object
DELAWARE                object
DISTRICT OF COLUMBIA    object
FLORIDA                 object
GEORGIA                 object
HAWAII                  object
IDAHO                   object
ILLINOIS                object
INDIANA                 object
IOWA                    object
KANSAS                  object
KENTUCKY                object
LOUISIANA               object
MAINE                   object
MARYLAND                object
MASSACHUSETTS           object
MICHIGAN                object
MINNESOTA               object
MISSISSIPPI             object
MISSOURI                object
MONTANA                 object
NEBRASKA                object
NEVADA                  object
NEW HAMPSHIRE           object
NEW JERSEY              object
NEW MEXICO              object
NEW YORK                object
NORTH CAROLINA          object
NORTH DAKOTA            object
OHIO                    object
OKLAHOMA                object
OREGON                  object
PENNSYLVANIA            object
RHODE ISLAND            object
SOUTH CAROLINA          object
SOUTH DAKOTA            object
TENNESSEE               object
TEXAS                   object
UTAH                    object
VERMONT                 object
VIRGINIA                object
WASHINGTON              object
WEST VIRGINIA           object
WISCONSIN               object
WYOMING                 object
dtype: object

Oops! These are all objects, when we expect integers for years and floats for the income amounts.

In [209]:
cols_income = income_dfT1.columns
income_dfT1[cols_income] = income_dfT1[cols_income].apply(pd.to_numeric, errors='coerce') #changing datatypes from objects to floats/integers where appropriate

income_dfT1['ARIZONA'] #checking datatype
Out[209]:
0        257.3
1        224.7
2        184.5
3        137.0
4        131.8
        ...   
88    299249.2
89    316896.3
90    340259.8
91    375601.3
92    403739.3
Name: ARIZONA, Length: 93, dtype: float64
In [210]:
income_dfT1.dtypes
Out[210]:
GeoName
year                      int64
UNITED STATES           float64
ALABAMA                 float64
ALASKA                  float64
ARIZONA                 float64
ARKANSAS                float64
CALIFORNIA              float64
COLORADO                float64
CONNECTICUT             float64
DELAWARE                float64
DISTRICT OF COLUMBIA    float64
FLORIDA                 float64
GEORGIA                 float64
HAWAII                  float64
IDAHO                   float64
ILLINOIS                float64
INDIANA                 float64
IOWA                    float64
KANSAS                  float64
KENTUCKY                float64
LOUISIANA               float64
MAINE                   float64
MARYLAND                float64
MASSACHUSETTS           float64
MICHIGAN                float64
MINNESOTA               float64
MISSISSIPPI             float64
MISSOURI                float64
MONTANA                 float64
NEBRASKA                float64
NEVADA                  float64
NEW HAMPSHIRE           float64
NEW JERSEY              float64
NEW MEXICO              float64
NEW YORK                float64
NORTH CAROLINA          float64
NORTH DAKOTA            float64
OHIO                    float64
OKLAHOMA                float64
OREGON                  float64
PENNSYLVANIA            float64
RHODE ISLAND            float64
SOUTH CAROLINA          float64
SOUTH DAKOTA            float64
TENNESSEE               float64
TEXAS                   float64
UTAH                    float64
VERMONT                 float64
VIRGINIA                float64
WASHINGTON              float64
WEST VIRGINIA           float64
WISCONSIN               float64
WYOMING                 float64
dtype: object

Great! All fixed to be as expected.

Analysis

Now we plot income level by state over time just to get a quick visual of the data.

We only plot the 7 states with the highest incomes, because plotting all 50 states would give us a cluttered legend and a difficult graph to read.

In [211]:
#We sort the max income values in descending order to grab the 7 highest-income states, and then grab the names of those states
income_dfT1.max().sort_values(axis=0, ascending=False)[1:8].index.tolist() #Row 0 is the United States, so we skip it and take the next 7 states

#Then we plot
income_dfT1.plot(x = 'year', y=income_dfT1.max().sort_values(axis=0, ascending=False)[1:8].index.tolist()) 
plt.legend(income_dfT1.max().sort_values(axis=0, ascending=False)[1:8].index.tolist());
plt.title("Income Level Over Time per State")
plt.xlabel('Year')
plt.ylabel('Income')
Out[211]:
Text(0, 0.5, 'Income')

The above plot is mainly useful to note that income generally increased dramatically over time for each state, and that some states, such as California, have significantly higher income levels than others.

It is also important to note that the cost of living is not the same in each state.

In [212]:
income_dfT1.set_index('year', inplace=True)
income_dfT1.head()
Out[212]:
GeoName UNITED STATES ALABAMA ALASKA ARIZONA ARKANSAS CALIFORNIA COLORADO CONNECTICUT DELAWARE DISTRICT OF COLUMBIA ... SOUTH DAKOTA TENNESSEE TEXAS UTAH VERMONT VIRGINIA WASHINGTON WEST VIRGINIA WISCONSIN WYOMING
year
1929 85151.0 843.2 NaN 257.3 560.4 5495.5 635.6 1640.5 243.4 639.3 ... 286.2 971.2 2728.7 278.2 226.3 1054.5 1150.5 787.4 1966.8 149.9
1930 76394.0 697.5 NaN 224.7 416.5 5079.4 596.8 1494.7 205.1 641.4 ... 247.3 841.9 2381.2 251.6 206.0 938.4 1030.7 706.1 1724.7 131.6
1931 65531.0 583.7 NaN 184.5 381.8 4387.5 498.5 1313.1 187.7 631.7 ... 167.8 724.9 2038.5 190.8 169.2 906.9 844.7 616.3 1402.6 109.3
1932 50162.0 421.9 NaN 137.0 278.5 3441.4 380.3 1026.9 144.9 567.9 ... 130.9 526.2 1560.7 157.4 129.9 701.9 637.4 448.2 1094.0 86.4
1933 47114.0 435.6 NaN 131.8 283.2 3277.3 378.4 970.3 140.4 500.2 ... 93.9 552.6 1530.0 154.6 120.4 704.4 598.8 453.0 1014.1 85.5

5 rows × 52 columns

In [213]:
all_states = states_measles_incidence.columns.unique() #grabbing the names of all the states

We wish to explore the relationship between incidence and average income per state.

We can calculate the correlation between the two for each state, over all of the years for which we have data, separately for each disease.

In [214]:
correl_income_df = pd.DataFrame(columns = ['Location']) #creating a dataframe of 2 columns
correl_income_df['Location'] = all_states #filling in location with state names

incidences = [states_measles_incidence, states_hepa_incidence, states_mumps_incidence, states_pertussis_incidence, states_polio_incidence, states_rubella_incidence, states_smallpox_incidence]

#Creating a function to automate the process of iterating through each disease
def correl_income_df_func(incidence, dis_name):
  corr_data = []
  for i in all_states: #calculates the correlation between incidence of the given disease and average income of the state, and appends to an array which we will throw into the dataframe later
    corr_data.append(incidence[[i]][i].corr(income_dfT1[[i]][i]))
  correl_income_df[f'Correlation for {dis_name}'] = corr_data #Creating a new column for this disease and throwing in the correlation data

#Filling in the resultant correlation dataframe
correl_income_df_func(states_measles_incidence, 'Measles')
correl_income_df_func(states_hepa_incidence, 'Hepatitis A')
correl_income_df_func(states_mumps_incidence, 'Mumps')
correl_income_df_func(states_pertussis_incidence, 'Pertussis')
correl_income_df_func(states_polio_incidence, 'Polio')
correl_income_df_func(states_rubella_incidence, 'Rubella')

correl_income_df
Out[214]:
Location Correlation for Measles Correlation for Hepatitis A Correlation for Mumps Correlation for Pertussis Correlation for Polio Correlation for Rubella
0 ALABAMA -0.438864 -0.787450 -0.673812 -0.619376 -0.182555 -0.553569
1 ALASKA -0.504059 -0.421837 -0.557576 -0.223854 -0.281991 -0.580474
2 ARIZONA -0.471913 -0.687155 -0.526443 -0.605962 -0.138901 -0.372119
3 ARKANSAS -0.374091 -0.685318 -0.443986 -0.603096 0.024448 -0.342586
4 CALIFORNIA -0.491681 -0.840580 -0.561661 -0.494432 -0.292186 -0.597785
5 COLORADO -0.473118 -0.790049 -0.520779 -0.477964 -0.030515 -0.475452
6 CONNECTICUT -0.494054 -0.676581 -0.587232 -0.566260 -0.196509 -0.491188
7 DELAWARE -0.352530 -0.691063 -0.706025 -0.448722 0.016508 -0.387056
8 DISTRICT OF COLUMBIA -0.430254 -0.381729 -0.315241 -0.562689 -0.055079 -0.272922
9 FLORIDA -0.393899 -0.701293 -0.611818 -0.532889 0.012449 -0.530016
10 GEORGIA -0.320447 -0.688778 -0.263598 -0.559281 -0.015368 -0.351515
11 HAWAII -0.444390 -0.669737 -0.517535 -0.203510 -0.416792 -0.426629
12 IDAHO -0.460630 -0.692686 -0.566743 -0.471422 0.077795 -0.487007
13 ILLINOIS -0.471243 -0.758112 -0.707377 -0.510833 -0.119096 -0.629050
14 INDIANA -0.471012 -0.784077 -0.626504 -0.644863 -0.023369 -0.595665
15 IOWA -0.418800 -0.603993 -0.603113 -0.464908 0.149623 -0.573497
16 KANSAS -0.466032 -0.741988 -0.662507 -0.538337 0.096425 -0.429905
17 KENTUCKY -0.439927 -0.750518 -0.641025 -0.635103 0.041686 -0.559005
18 LOUISIANA -0.447005 -0.810822 -0.254834 -0.463114 0.056014 -0.485844
19 MAINE -0.479206 -0.516104 -0.567805 -0.633887 -0.046865 -0.556266
20 MARYLAND -0.383021 -0.663977 -0.563129 -0.560587 -0.059396 -0.474399
21 MASSACHUSETTS -0.553791 -0.489817 -0.459569 -0.559990 -0.107367 -0.560347
22 MICHIGAN -0.500868 -0.737273 -0.675314 -0.609570 -0.119926 -0.729918
23 MINNESOTA -0.373499 -0.697321 -0.530679 -0.444325 -0.070244 -0.499605
24 MISSISSIPPI -0.377433 -0.732160 -0.561017 -0.385160 0.055086 -0.373902
25 MISSOURI -0.439282 -0.600504 -0.655783 -0.521782 0.021800 -0.447154
26 MONTANA -0.500560 -0.726915 -0.627714 -0.438677 0.012123 -0.584399
27 NEBRASKA -0.398446 -0.537230 -0.632497 -0.498075 0.176869 -0.289057
28 NEVADA -0.312422 -0.669026 -0.402603 -0.305455 0.116253 -0.503912
29 NEW HAMPSHIRE -0.384313 -0.542423 -0.553658 -0.561887 0.138748 -0.474458
30 NEW JERSEY -0.519256 -0.699206 -0.600998 -0.593741 -0.213640 -0.563064
31 NEW MEXICO -0.452374 -0.796322 -0.514237 -0.517054 -0.087182 -0.526129
32 NEW YORK -0.575878 -0.654657 -0.635314 -0.550308 -0.229062 -0.525775
33 NORTH CAROLINA -0.316167 -0.606512 -0.338804 -0.457097 -0.043573 -0.249987
34 NORTH DAKOTA -0.433816 -0.499573 -0.555261 -0.477584 0.070357 -0.664824
35 OHIO -0.500542 -0.804153 -0.703410 -0.555504 -0.090723 -0.687021
36 OKLAHOMA -0.422502 -0.475569 -0.528256 -0.584851 -0.033737 -0.382634
37 OREGON -0.523861 -0.796372 -0.553439 -0.599129 0.101379 -0.675984
38 PENNSYLVANIA -0.490885 -0.650329 -0.602147 -0.572429 -0.132532 -0.438337
39 RHODE ISLAND -0.394296 -0.617608 -0.605285 -0.621437 0.111882 -0.334214
40 SOUTH CAROLINA -0.392597 -0.645628 -0.478741 -0.499401 -0.039939 -0.537443
41 SOUTH DAKOTA -0.265881 -0.321733 -0.391740 -0.422224 0.151902 0.120758
42 TENNESSEE -0.398904 -0.701850 -0.613782 -0.597027 0.003533 -0.582133
43 TEXAS -0.452337 -0.842644 -0.603819 -0.575867 0.044653 -0.408034
44 UTAH -0.353990 -0.652629 -0.518282 -0.423923 0.116972 -0.612660
45 VERMONT -0.401079 -0.493054 -0.474270 -0.595264 -0.091076 -0.390664
46 VIRGINIA -0.473576 -0.611603 -0.616531 -0.568281 -0.071741 -0.546190
47 WASHINGTON -0.518176 -0.715003 -0.589229 -0.436118 -0.179919 -0.477765
48 WEST VIRGINIA -0.513411 -0.737397 -0.782945 -0.668567 0.002412 -0.677053
49 WISCONSIN -0.547927 -0.723717 -0.628660 -0.543277 0.078701 -0.739234
50 WYOMING -0.391032 -0.619877 -0.521698 -0.547756 0.161935 -0.331384

Nearly every correlation we get is negative! (Most of the exceptions are small positive values in the polio column, plus one rubella value for South Dakota.)

This makes sense because it says that for a higher income, the chance of contracting a given disease tends to be lower, and vice versa: for a lower income, the chance of contracting it is higher. This does not establish causality, but it certainly displays the relationship we expected!

Now we can compute some stats about our table.

In [215]:
print('Max correlation for measles: ', correl_income_df['Location'].iloc[correl_income_df['Correlation for Measles'].idxmax()])
print('Min correlation for measles: ', correl_income_df['Location'].iloc[correl_income_df['Correlation for Measles'].idxmin()])
print('Avg correlation for measles: ', correl_income_df['Correlation for Measles'].mean())
Max correlation for measles:  SOUTH DAKOTA
Min correlation for measles:  NEW YORK
Avg correlation for measles:  -0.4393191688736428

It is interesting to point out that the average correlation over all of the states for the entire period for measles is about -0.44, a moderate negative correlation.

We note that we cannot make strong claims from this data, because income also varies between states based on average living costs.

In addition, we recognize it would have been better to have income data for each person who got sick, since this dataset only gives the average income of a state, which includes people who weren't sick. This analysis only explores the relationship between overall state wealth and illness.

We can compute these stats for the other diseases too.
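A short sketch of what that could look like, looping over the disease columns of correl_income_df:

#sketch: repeat the max/min/average summary for every disease column
for col in correl_income_df.columns.drop('Location'):
  print(col)
  print('  max:', correl_income_df['Location'].iloc[correl_income_df[col].idxmax()])
  print('  min:', correl_income_df['Location'].iloc[correl_income_df[col].idxmin()])
  print('  avg:', correl_income_df[col].mean())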

Analysis, Hypothesis Testing, & ML - Predictive Models

Our idea for a predictive model is to create a hypothetical situation and predict the chances that someone in a particular state would contract a certain disease in a couple of years after.

For instance, we could say, "Imagine you are living in Louisiana in 1955, and you want to know the chances of you catching Measles in 1967."

The independent variables used to train the model would be the year, state, income level, and population density, and the dependent variable to be predicted would be the probability that this person will contract the disease at hand (we build a separate model for each disease).

In terms of methodology, we can train k-nearest neighbors regression models to predict the probability of contracting the disease for different values of k, and calculate the validation root mean squared error (RMSE) of each model. Because this data involves patterns over time, we need to be careful about how we split it. We cannot, for example, split our data in half, shuffle it up, and do regular cross-validation; that would be somewhat 'cheating', because the model would get to see what the future trend looks like. We can plot our errors as a function of k to determine the optimal k to use for our final predicted value.

In order to perform a more robust analysis, we can determine the effects of certain factors on our predictive model. For instance, we can test whether splitting the data differently or taking out certain features improves the error of our model.

Preparing a Dataframe for Modeling

We need a few quick operations to prepare the income dataset for combining with the disease incidence tables.

In [216]:
#removing unnecessary columns from income
income_dfT1.drop(income_dfT1.loc[2012:].index, inplace=True)
income_dfT1.drop(columns=['UNITED STATES'], inplace=True)
income_dfT1
Out[216]:
GeoName ALABAMA ALASKA ARIZONA ARKANSAS CALIFORNIA COLORADO CONNECTICUT DELAWARE DISTRICT OF COLUMBIA FLORIDA ... SOUTH DAKOTA TENNESSEE TEXAS UTAH VERMONT VIRGINIA WASHINGTON WEST VIRGINIA WISCONSIN WYOMING
year
1929 843.2 NaN 257.3 560.4 5495.5 635.6 1640.5 243.4 639.3 749.9 ... 286.2 971.2 2728.7 278.2 226.3 1054.5 1150.5 787.4 1966.8 149.9
1930 697.5 NaN 224.7 416.5 5079.4 596.8 1494.7 205.1 641.4 688.0 ... 247.3 841.9 2381.2 251.6 206.0 938.4 1030.7 706.1 1724.7 131.6
1931 583.7 NaN 184.5 381.8 4387.5 498.5 1313.1 187.7 631.7 592.6 ... 167.8 724.9 2038.5 190.8 169.2 906.9 844.7 616.3 1402.6 109.3
1932 421.9 NaN 137.0 278.5 3441.4 380.3 1026.9 144.9 567.9 482.4 ... 130.9 526.2 1560.7 157.4 129.9 701.9 637.4 448.2 1094.0 86.4
1933 435.6 NaN 131.8 283.2 3277.3 378.4 970.3 140.4 500.2 448.4 ... 93.9 552.6 1530.0 154.6 120.4 704.4 598.8 453.0 1014.1 85.5
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2007 152984.4 29869.1 221244.8 89170.5 1572721.4 202115.9 203467.2 36380.4 34363.2 727260.7 ... 30673.9 210986.9 883810.4 86153.1 24280.9 342962.7 276678.4 54490.8 212179.2 24752.8
2008 158450.6 32907.7 224508.3 92864.7 1594742.5 210029.4 216174.5 36145.2 35329.0 729994.1 ... 32898.7 219027.3 971051.2 90386.9 25603.9 356536.0 290319.4 58167.1 220311.5 27163.2
2009 155913.5 33028.4 212646.0 91170.8 1540872.6 199351.6 212582.6 36432.9 35618.5 686755.8 ... 31893.7 217218.6 924343.9 86761.6 25291.7 349428.5 277305.3 58608.6 216466.3 24949.2
2010 162531.1 35452.0 216873.6 94576.1 1614040.8 205866.1 221058.5 36837.8 38324.4 732457.5 ... 33804.0 227883.0 985165.1 89439.2 26140.0 365245.5 285579.9 60753.8 222982.7 26328.8
2011 168474.3 37981.8 227700.1 100739.2 1715227.0 223492.6 226907.1 39925.0 41395.0 771409.5 ... 36842.8 241617.6 1077403.3 96356.7 27690.6 385984.5 302075.9 64227.3 235433.8 28786.6

83 rows × 51 columns

We make our large dataset that contains all of our data - year, state, income, incidence of each disease, and population density.
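One alternative worth sketching (we don't run it; the loop-based version in the next cell is what actually builds the table) is to melt each wide table into long format and merge on year and state. For example, for the income table:

#sketch: reshape the wide income table into a long (Year, Name, Income) table
income_long = (income_dfT1.reset_index()
                          .melt(id_vars='year', var_name='Name', value_name='Income')
                          .rename(columns={'year': 'Year'}))
income_long.head()

Each incidence table could be melted the same way and then merged on ['Year', 'Name'].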

In [217]:
from itertools import product
copy = income_dfT1
tuples = list(product(copy.index, copy.columns)) #creating all possible combinations
tuples

index = pd.MultiIndex.from_tuples(tuples, names=["year", "state"]) #creating multiindex to use in new dataframe
altogether = pd.DataFrame(index=index, columns=['Income']) #creating empty dataframe to later fill in values for

#Adding in income into total dataframe
for i in income_dfT1: #states - cols
  for j in income_dfT1.index: #years - rows
    altogether.at[(j, i), 'Income'] = income_dfT1.loc[j, i]

#Creating a function that adds each disease incidence dataframe to total dataframe
def add_disease_inc(df, name):
  for i in df: #states - cols
    for j in df.index: #years - rows
      altogether.at[(j, i), f'{name}'] = df.loc[j, i]
#Calling function for each disease
add_disease_inc(states_measles_incidence, 'Measles Incidence')
add_disease_inc(states_mumps_incidence, 'Mumps Incidence')
add_disease_inc(states_hepa_incidence, 'Hep-A Incidence')
add_disease_inc(states_pertussis_incidence, 'Pertussis Incidence')
add_disease_inc(states_polio_incidence, 'Polio Incidence')
add_disease_inc(states_rubella_incidence, 'Rubella Incidence')
add_disease_inc(states_smallpox_incidence, 'Smallpox Incidence')

#Creating dataframe of population densities
d = { k: v.set_index(['Year', 'Name']) for k, v in states_density.items()} #setting index to Year and Name for every dataframe within the dictionary states_density
df_densities = pd.concat(d, axis=0) #putting all of the dataframes into 1 dataframe
df_densities.reset_index(inplace = True)
df_densities.set_index(['Year', 'Name'], inplace=True)
df_densities

#Merging population density dataframe with big dataframe
altogether = altogether.merge(df_densities, left_index=True, right_on=['Year', 'Name'])
altogether.drop(columns= ['level_0', 'Geography Type'], inplace=True) #Don't need these columns
altogether.rename(columns={"Resident Population Density": "Pop Density"}, inplace=True)

#Sorting in order and making year and state name individual columns
altogether.sort_index(inplace=True, ascending=True)
altogether.reset_index(inplace=True)

altogether
Out[217]:
Year Name Income Measles Incidence Mumps Incidence Hep-A Incidence Pertussis Incidence Polio Incidence Rubella Incidence Smallpox Incidence Pop Density
0 1928 ALABAMA NaN 0.003350 NaN NaN NaN 0.000023 NaN 0.000129 51.12
1 1928 ALASKA NaN NaN NaN NaN NaN NaN NaN NaN 0.10
2 1928 ARIZONA NaN 0.002007 NaN NaN NaN 0.000026 NaN 0.000972 3.62
3 1928 ARKANSAS NaN 0.004818 NaN NaN NaN 0.000005 NaN 0.000134 35.22
4 1928 CALIFORNIA NaN 0.000692 NaN NaN NaN 0.000051 NaN 0.000185 33.52
... ... ... ... ... ... ... ... ... ... ... ...
4330 2011 VIRGINIA 385984.5 NaN NaN 3.211223e-06 0.000021 NaN NaN NaN 204.20
4331 2011 WASHINGTON 302075.9 NaN NaN 4.099538e-06 0.000107 NaN NaN NaN 102.67
4332 2011 WEST VIRGINIA 64227.3 NaN NaN 2.155911e-06 0.000020 NaN NaN NaN 76.85
4333 2011 WISCONSIN 235433.8 NaN NaN 8.753858e-07 0.000061 NaN NaN NaN 105.38
4334 2011 WYOMING 28786.6 NaN NaN 1.760074e-06 0.000012 NaN NaN NaN 5.81

4335 rows × 11 columns

Now, to keep things simpler, we examine only the case of measles, keeping in mind that the same process works for any disease.

Additionally, we will drop any NaNs here, since values like disease incidence cannot be linearly interpolated without potentially skewing the data and creating false trends.

In [218]:
#Create shorter version of altogether dataframe
measles_together = altogether.drop(columns=['Mumps Incidence', 'Hep-A Incidence', 'Pertussis Incidence', 'Polio Incidence', 'Rubella Incidence', 'Smallpox Incidence'])

#Drop nans
measles_together.dropna(inplace=True)

#Reset index after drop
measles_together.reset_index(inplace=True)
measles_together.drop(columns=['index'], inplace=True)

measles_together
Out[218]:
Year Name Income Measles Incidence Pop Density
0 1929 ALABAMA 843.2 1.119138e-03 51.710000
1 1929 ARIZONA 257.3 5.488372e-04 3.710000
2 1929 ARKANSAS 560.4 6.722462e-04 35.410000
3 1929 CALIFORNIA 5495.5 7.275357e-04 34.960000
4 1929 COLORADO 635.6 7.420635e-04 9.910000
... ... ... ... ... ...
2944 2002 NEVADA 69843.2 0.000000e+00 19.363636
2945 2002 NEW YORK 709785.0 5.218697e-08 404.245455
2946 2002 OHIO 338738.3 0.000000e+00 278.618182
2947 2002 TEXAS 640381.6 4.606005e-08 82.800000
2948 2002 UTAH 58562.9 0.000000e+00 28.363636

2949 rows × 5 columns

Predictor 1 - Split Data in Half

We can create a predictor for measles that just splits the data in half.

With this predictor, we expect that our predicted incidence value will be much higher than the actual value, because we have to predict on a year just after the vaccine came out, a piece of information that our model does not have and is not trained to account for.

First, we split the data in half. Then, we define our features to train on as the year, state, income, and population density.

We can use a for loop to test out different k values to use for our k-nearest neighbors model and get a validation error.

In [219]:
#JUST MEASLES PREDICTOR
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler

# Fit model to data
from sklearn.neighbors import KNeighborsRegressor

# split data in half
# There are 2949 rows in measles_together, so 2949/2 = 1474.5
train = measles_together.iloc[0:1475]
val = measles_together.iloc[1475:]

# defining features
features = ["Year", 
            "Name",
            "Income",
            "Pop Density"]

X_train_dict = train[features].to_dict(orient="records")
X_val_dict = val[features].to_dict(orient="records")

y_train = train["Measles Incidence"]
y_val = val["Measles Incidence"]

#For loop to determine optimal k to use
k_1 = [i for i in range(1,100)]
error_results = []
for i in k_1:
  def get_val_error(X_train_dict, y_train, X_val_dict, y_val, i):
    # convert categorical variables to dummy variables
    vec = DictVectorizer(sparse=False)
    vec.fit(X_train_dict)
    X_train = vec.transform(X_train_dict)
    X_val = vec.transform(X_val_dict)

    # standardize the data
    scaler = StandardScaler()
    scaler.fit(X_train)
    X_train_sc = scaler.transform(X_train)
    X_val_sc = scaler.transform(X_val)
    
    # Fit a k-nearest neighbors model with n_neighbors=i.
    model = KNeighborsRegressor(n_neighbors=i)
    model.fit(X_train_sc, y_train)
    
    # Make predictions on the validation set.
    y_val_pred = model.predict(X_val_sc)
    rmse = np.sqrt(((y_val - y_val_pred) ** 2).mean())
    
    return rmse
  
  val1 = get_val_error(X_train_dict, y_train, X_val_dict, y_val, i)
  val2 = get_val_error(X_val_dict, y_val, X_train_dict, y_train, i) #reverse
  cross_val = (val1 + val2)/2
  error_results.append(cross_val)

We plot these k values and the corresponding errors they give to determine the best k to use.

In [220]:
#Plotting k with validation error
import matplotlib.pyplot as plt
plt.scatter(x=k_1, y=error_results)
plt.xlabel('k')
plt.ylabel('test error')
plt.title('k for Measles')
ymin = min(error_results)
#xpos = np.where(error_results == ymin)
#xmin = xpos[0][0]
#print('xmin = ', xmin)

From the graph above, we see that the optimal k to use is 57. This makes sense because there are about 50 data points (one per state) for a given year, each of which is likely experiencing similar trends.
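We can also read the best k off programmatically rather than eyeballing the graph (a quick sketch using the k_1 and error_results lists from the previous cell):

#sketch: locate the k with the lowest averaged validation error
best_idx = int(np.argmin(error_results))
print('best k:', k_1[best_idx], 'lowest error:', error_results[best_idx])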

We don't want to pick a k that is too small, because a very small k tends to overfit, matching data points that are nearly identical to the one we are predicting for.

Next, we take this k and make our predictions.

In [221]:
# Use k=57 to predict 
# convert categorical variables to dummy variables
vec = DictVectorizer(sparse=False)
vec.fit(X_train_dict)
X_train = vec.transform(X_train_dict)
X_val = vec.transform(X_val_dict)

# standardize the data
scaler = StandardScaler()
scaler.fit(X_train)
X_train_sc = scaler.transform(X_train)
X_val_sc = scaler.transform(X_val)
    
# Fit a 57-nearest neighbors model.
model = KNeighborsRegressor(n_neighbors=57)
model.fit(X_train_sc, y_train)
    
# Make predictions on the validation set.
y_val_pred = model.predict(X_val_sc)
rmse = np.sqrt(((y_val - y_val_pred) ** 2).mean())

#print(y_val_pred)
print('val error: ', ymin)
val error:  0.0032435563716947326

If we want to know the chances of catching measles in 1970 in Alabama, we need to find the index of this datapoint in the dataframe we predicted on and map it to its position in the resulting prediction array.

In [222]:
val[val['Year'] == 1970]
#Find that 1970 Alabama is at index 1967

#1967 - 1475 = 492 gives us the index in the predicted array
y_val_pred[492]
Out[222]:
0.002460425878952384
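The same position can also be looked up programmatically instead of by hand (a small sketch using the val dataframe and y_val_pred array from above):

#sketch: find the position of 1970 Alabama within the validation set programmatically
row = val[(val['Year'] == 1970) & (val['Name'] == 'ALABAMA')]
pos = val.index.get_indexer(row.index)[0]
y_val_pred[pos]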

If I want to know the chances of catching measles in 1970 in Alabama, the results show that the incidence is 0.002460425878952384.

The expected value is 0.000139. It makes sense that the obtained and expected values are off because we were predicting on a year right after the measles vaccine came out, and the incidence values drastically drop off at this point.

In [223]:
#Calculate percent error
#Percentage Error = ((Estimated Number - Actual Number)/ Actual number) x 100
((.002460425878952384 - .000139)/ .000139) * 100
Out[223]:
1670.090560397399

The percent error here is 1670%! Let's see if we can reduce this.

Predictor 2 - Splitting the Data Into Quarters

Now, to examine another way of splitting the data, we split it into roughly quarters and only examine the period before vaccine release, when the diseases were prevalent. We predict that our predicted incidence value will be closer to the expected value than in Model 1.

Specifically for measles, the vaccine was released in 1963. So we want to examine measles incidence from 1929 to 1963 instead of 1929 to 2002, in order to give our model a better shot at predicting accurately.

In order to figure out which indices to use from measles_together, we can see the last index where we have data for the year 1963.

In [224]:
measles_together[measles_together['Year'] == 1963].tail()
Out[224]:
Year Name Income Measles Incidence Pop Density
1665 1963 VIRGINIA 10106.6 0.001725 105.66
1666 1963 WASHINGTON 8393.9 0.001669 45.42
1667 1963 WEST VIRGINIA 3400.1 0.006963 75.96
1668 1963 WISCONSIN 10135.2 0.014131 75.58
1669 1963 WYOMING 878.9 0.000979 3.40

The last index where we have data for the year 1963 is 1669. Therefore, we will split the data in half from indices 0 to 1669.

We proceed as we did last time but split the data differently.

First, we split this portion of data in half. Then, we define our features to train on as the year, state, income, and population density.

We will again use a for loop to test out different k values to use for our k-nearest neighbors model and get a validation error.

In [225]:
#JUST MEASLES PREDICTOR
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler

# Fit model to data
from sklearn.neighbors import KNeighborsRegressor

# split data in half up to vaccine release in 1963
# 1669/2 = 834.5
train = measles_together.iloc[0:835]
val = measles_together.iloc[835:1669]

# defining features
features = ["Year", 
            "Name",
            "Income",
            "Pop Density"]

X_train_dict = train[features].to_dict(orient="records")
X_val_dict = val[features].to_dict(orient="records")

y_train = train["Measles Incidence"]
y_val = val["Measles Incidence"]

#For loop to determine optimal k to use
k_1 = [i for i in range(1,100)]
error_results = []
for i in k_1:
  def get_val_error(X_train_dict, y_train, X_val_dict, y_val, i):
    # convert categorical variables to dummy variables
    vec = DictVectorizer(sparse=False)
    vec.fit(X_train_dict)
    X_train = vec.transform(X_train_dict)
    X_val = vec.transform(X_val_dict)

    # standardize the data
    scaler = StandardScaler()
    scaler.fit(X_train)
    X_train_sc = scaler.transform(X_train)
    X_val_sc = scaler.transform(X_val)
    
    # Fit a k-nearest neighbors model with n_neighbors=i.
    model = KNeighborsRegressor(n_neighbors=i)
    model.fit(X_train_sc, y_train)
    
    # Make predictions on the validation set.
    y_val_pred = model.predict(X_val_sc)
    rmse = np.sqrt(((y_val - y_val_pred) ** 2).mean())
    
    return rmse
  
  val1 = get_val_error(X_train_dict, y_train, X_val_dict, y_val, i)
  val2 = get_val_error(X_val_dict, y_val, X_train_dict, y_train, i) #reverse
  cross_val = (val1 + val2)/2
  error_results.append(cross_val)

We plot these k values and the corresponding errors they give to determine the best k to use.

In [226]:
#Plotting k with validation error
import matplotlib.pyplot as plt
plt.scatter(x=k_1, y=error_results)
plt.xlabel('k')
plt.ylabel('test error')
plt.title('k for Measles')
ymin = min(error_results)

From the graph above, we see that the optimal k to use is about 25. This is most likely because we cut our training/validation set sizes in half.

We again don't want to pick a k that is too small, because a very small k tends to overfit, matching data points that are nearly identical to the one we are predicting for.

Next, we take this k and make our predictions.

In [227]:
# Use k=25 to predict 
# convert categorical variables to dummy variables
vec = DictVectorizer(sparse=False)
vec.fit(X_train_dict)
X_train = vec.transform(X_train_dict)
X_val = vec.transform(X_val_dict)

# standardize the data
scaler = StandardScaler()
scaler.fit(X_train)
X_train_sc = scaler.transform(X_train)
X_val_sc = scaler.transform(X_val)
    
# Fit a 25-nearest neighbors model.
model = KNeighborsRegressor(n_neighbors=25)
model.fit(X_train_sc, y_train)
    
# Make predictions on the validation set.
y_val_pred = model.predict(X_val_sc)
rmse = np.sqrt(((y_val - y_val_pred) ** 2).mean())

#print(y_val_pred)
print('val error: ', ymin)
val error:  0.003022676298335747

If we want to know the chances of catching measles in 1955 in Alabama (about halfway through the predicted dataset), we need to find the index of this datapoint in the dataframe we predicted on and map it to its position in the resulting prediction array.

In [228]:
val[val['Year'] == 1955]
#Find that 1955 Alabama is at index 1227

#1227 - 835 = 392 gives us the index in the predicted array
y_val_pred[392]
Out[228]:
0.001797885794424771

If I want to know the chances of catching measles in 1955 in Alabama, the results show that the incidence is 0.001797885794424771, compared to the expected value 0.000347.

In [229]:
#Calculate percent error
#Percentage Error = ((Estimated Number - Actual Number)/ Actual number) x 100
((.001797885794424771 - .000347)/ .000347) * 100
Out[229]:
418.1227073270234

The percentage error here is 418%, which is still pretty large, but drastically reduced from Model 1's 1670%!

Additionally, as a note to keep in mind, these incidence values are so small that percent differences between them appear extremely high, so the change in percent error between models matters more than its absolute size.
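Because the values are so small, it can also help to look at the absolute errors alongside the percent errors (a small sketch using the predicted and expected values quoted above):

#sketch: absolute errors for Predictor 1 and Predictor 2, to complement the percent errors
pred1_abs_err = abs(0.002460425878952384 - 0.000139)
pred2_abs_err = abs(0.001797885794424771 - 0.000347)
print(pred1_abs_err, pred2_abs_err)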

Predictor 3 - What Features Impact our Model?

Now, to examine what happens when we remove a feature to predict on, we will remove the income feature. Based on the correlations we observed between disease incidence and income, we predict that the error will go up.

Since our predictor 2 improved the percent error, we will build off of that model and examine the measles incidence from 1929 to 1963 right before the vaccine came out.

First, we split the data in half. Then, we define our features to train on as the year, state, and population density.

We can use a for loop to test out different k values to use for our k-nearest neighbors model and get a validation error.

In [230]:
#JUST MEASLES PREDICTOR - no income
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler

# Fit model to data
from sklearn.neighbors import KNeighborsRegressor

# split data in half up to vaccine release in 1963
# 1669/2 = 834.5
train = measles_together.iloc[0:835]
val = measles_together.iloc[835:1669]

# defining features WITHOUT INCOME
features = ["Year", 
            "Name",
            "Pop Density"]

X_train_dict = train[features].to_dict(orient="records")
X_val_dict = val[features].to_dict(orient="records")

y_train = train["Measles Incidence"]
y_val = val["Measles Incidence"]

#For loop to determine optimal k to use
k_1 = [i for i in range(1,100)]
error_results = []
for i in k_1:
  def get_val_error(X_train_dict, y_train, X_val_dict, y_val, i):
    # convert categorical variables to dummy variables
    vec = DictVectorizer(sparse=False)
    vec.fit(X_train_dict)
    X_train = vec.transform(X_train_dict)
    X_val = vec.transform(X_val_dict)

    # standardize the data
    scaler = StandardScaler()
    scaler.fit(X_train)
    X_train_sc = scaler.transform(X_train)
    X_val_sc = scaler.transform(X_val)
    
    # Fit a k-nearest neighbors model with n_neighbors=i.
    model = KNeighborsRegressor(n_neighbors=i)
    model.fit(X_train_sc, y_train)
    
    # Make predictions on the validation set.
    y_val_pred = model.predict(X_val_sc)
    rmse = np.sqrt(((y_val - y_val_pred) ** 2).mean())
    
    return rmse
  
  val1 = get_val_error(X_train_dict, y_train, X_val_dict, y_val, i)
  val2 = get_val_error(X_val_dict, y_val, X_train_dict, y_train, i) #reverse
  cross_val = (val1 + val2)/2
  error_results.append(cross_val)

We plot these k values and the corresponding errors they give to determine the best k to use.

In [231]:
#Plotting k with validation error
import matplotlib.pyplot as plt
plt.scatter(x=k_1, y=error_results)
plt.xlabel('k')
plt.ylabel('test error')
plt.title('k for Measles')
ymin = min(error_results)

From the graph above, we see that the optimal k to use is about 25 again, similar to our Model 2.

We again don't want to pick a k that is too small, because a very small k tends to overfit, matching data points that are nearly identical to the one we are predicting for.

Next, we take this k and make our predictions.

In [232]:
# Use k=25 to predict 
# convert categorical variables to dummy variables
vec = DictVectorizer(sparse=False)
vec.fit(X_train_dict)
X_train = vec.transform(X_train_dict)
X_val = vec.transform(X_val_dict)

# standardize the data
scaler = StandardScaler()
scaler.fit(X_train)
X_train_sc = scaler.transform(X_train)
X_val_sc = scaler.transform(X_val)
    
# Fit a 25-nearest neighbors model.
model = KNeighborsRegressor(n_neighbors=25)
model.fit(X_train_sc, y_train)
    
# Make predictions on the validation set.
y_val_pred = model.predict(X_val_sc)
rmse = np.sqrt(((y_val - y_val_pred) ** 2).mean())

#print(y_val_pred)
print('val error: ', ymin)
val error:  0.003024193805600776

If we want to know the chances of catching measles in 1955 in Alabama (about halfway through the predicted dataset), we need to find the index of this datapoint in the dataframe we predicted on and map it to its position in the resulting prediction array.

In [233]:
val[val['Year'] == 1955]
#Find that 1955 Alabama is at index 1227

#1227 - 835 = 392 gives us the index in the predicted array
y_val_pred[392]
Out[233]:
0.002084132350947432

If I want to know the chances of catching measles in 1955 in Alabama, the results show that the incidence is 0.002084132350947432, compared to the expected value 0.000347.

In [234]:
#Calculate percent error
#Percentage Error = ((Estimated Number - Actual Number)/ Actual number) x 100
((0.002084132350947432 - .000347)/ .000347) * 100
Out[234]:
500.6145103594905

The percentage error here is 500%, which is still pretty large, and actually increased from Model 2 by about 82 percentage points. This increase is what we expected, and it shows that the correlation we observed earlier between incidence and state income does in fact affect our model.

Predictor 4 - Removing Population Density

Now, we will examine what happens when we remove the population density feature to predict on. Based on the correlations we observed between disease incidence and population density, we predict that the error will go down. Since Predictor 2 improved the percent error, we will build off of that model.

Again, we examine the measles incidence from 1929 to 1963 right before the vaccine came out.

First, we split the data in half. Then, we define our features to train on as the year, state, and income (this time leaving out population density).

We can use a for loop to test out different k values to use for our k-nearest neighbors model and get a validation error.

In [235]:
#JUST MEASLES PREDICTOR - no population density
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler

# Fit model to data
from sklearn.neighbors import KNeighborsRegressor

# split data in half up to vaccine release in 1963
# 1669/2 = 834.5
train = measles_together.iloc[0:835]
val = measles_together.iloc[835:1669]

# defining features WITHOUT POPULATION DENSITY 
features = ["Year", 
            "Name",
            "Income"]

X_train_dict = train[features].to_dict(orient="records")
X_val_dict = val[features].to_dict(orient="records")

y_train = train["Measles Incidence"]
y_val = val["Measles Incidence"]

#For loop to determine optimal k to use
k_1 = [i for i in range(1,100)]
error_results = []
for i in k_1:
  def get_val_error(X_train_dict, y_train, X_val_dict, y_val, i):
    # convert categorical variables to dummy variables
    vec = DictVectorizer(sparse=False)
    vec.fit(X_train_dict)
    X_train = vec.transform(X_train_dict)
    X_val = vec.transform(X_val_dict)

    # standardize the data
    scaler = StandardScaler()
    scaler.fit(X_train)
    X_train_sc = scaler.transform(X_train)
    X_val_sc = scaler.transform(X_val)
    
    # Fit a k-nearest neighbors model for the current value of k (i).
    model = KNeighborsRegressor(n_neighbors=i)
    model.fit(X_train_sc, y_train)
    
    # Make predictions on the validation set.
    y_val_pred = model.predict(X_val_sc)
    rmse = np.sqrt(((y_val - y_val_pred) ** 2).mean())
    
    return rmse
  
  val1 = get_val_error(X_train_dict, y_train, X_val_dict, y_val, i)
  val2 = get_val_error(X_val_dict, y_val, X_train_dict, y_train, i) #reverse
  cross_val = (val1 + val2)/2
  error_results.append(cross_val)

We plot these k values and the corresponding errors they give to determine the best k to use.

In [236]:
#Plotting k with validation error
import matplotlib.pyplot as plt
plt.scatter(x=k_1, y=error_results)
plt.xlabel('k')
plt.ylabel('validation error')
plt.title('k for Measles')
ymin = min(error_results)

From the graph above, we see that the optimal k to use is about 25 again, similar to our Model 2.

As before, we don't want to pick a k too close to 1: the error there looks too good to be true, because the model would just be matching each point to nearly identical neighbors.

Next, we take this k and make our predictions.

In [237]:
# Use k=25 to predict 
# convert categorical variables to dummy variables
vec = DictVectorizer(sparse=False)
vec.fit(X_train_dict)
X_train = vec.transform(X_train_dict)
X_val = vec.transform(X_val_dict)

# standardize the data
scaler = StandardScaler()
scaler.fit(X_train)
X_train_sc = scaler.transform(X_train)
X_val_sc = scaler.transform(X_val)
    
# Fit a 25-nearest neighbors model.
model = KNeighborsRegressor(n_neighbors=25)
model.fit(X_train_sc, y_train)
    
# Make predictions on the validation set.
y_val_pred = model.predict(X_val_sc)
rmse = np.sqrt(((y_val - y_val_pred) ** 2).mean())

#print(y_val_pred)
# ymin is the minimum cross-validation error found in the k search above
print('val error: ', ymin)
val error:  0.003014331688575157

If we want to know the chances of catching measles in 1955 in Alabama (about halfway through the predicted dataset), we need to find the index of that data point in the dataframe we predicted on and map it to its position in the resulting prediction array.

In [238]:
val[val['Year'] == 1955]
#Find that 1955 Alabama is at index 1227

#1227 - 835 = 392 gives us the index in the predicted array
y_val_pred[392]
Out[238]:
0.002045478209686451

For 1955 in Alabama, this model predicts an incidence of 0.002045478209686451, compared to the actual value of 0.000347.

In [239]:
#Calculate percent error
#Percentage Error = ((Estimated Number - Actual Number)/ Actual number) x 100
((0.002045478209686451 - .000347)/ .000347) * 100
Out[239]:
489.4749883822625

The percentage error here is about 489%, which is 71 percentage points higher than Model 2's. This increase is not what we expected; however, it is a slightly smaller increase than Model 3's, suggesting that population density may have had less of an effect on the model than income.
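For reference, here is a small side-by-side of the two feature-removal experiments (a sketch; the numbers are copied from the cells above, and the labels reflect our reading that Model 3 dropped income while Model 4 dropped population density):

#Sketch only: summary table of the two feature-removal experiments
import pandas as pd

summary = pd.DataFrame({
    "feature removed": ["Income", "Population Density"],
    "predicted 1955 AL incidence": [0.002084132350947432, 0.002045478209686451],
    "actual 1955 AL incidence": [0.000347, 0.000347],
    "percent error (%)": [500.6, 489.5],
}, index=["Model 3", "Model 4"])
print(summary)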

In [240]:
%%shell

jupyter nbconvert --to html "/content/drive/MyDrive/cmps3160/finalProject/Final_Project_Disease_Prevalence.ipynb"
[NbConvertApp] WARNING | pattern '/content/drive/MyDrive/cmps3160/finalProject/Final_Project_Disease_Prevalence.ipynb' matched no files
CalledProcessError: Command '
jupyter nbconvert --to html "/content/drive/MyDrive/cmps3160/finalProject/Final_Project_Disease_Prevalence.ipynb"
' returned non-zero exit status 255.

Conclusions and Summary

Throughout this project we performed extensive analysis of disease incidence and of trends between states within the population. We used several visualization techniques to show the variation across the years and between states, both on their own and in comparison to national trends. To understand these numbers in more depth, we brought in additional datasets on state population density and state average income. We measured the correlation between each of these variables and disease incidence for every state over the years, and found that both population density and state average income show a moderate negative correlation with incidence.

From there we built four different predictive models to explore which variables could be related, and we found that the strongest improvement to our predictions came from restricting the model to a smaller time frame. All of this indicates that while population density and state average income are potentially pieces of the complicated puzzle of disease spread, they are certainly not the whole picture, and without a better understanding of the individual people behind those case numbers, our knowledge and the assumptions we can make are limited.