Final Project: Disease Prevalence Across the States
Main Data from: https://think.cs.vt.edu/corgis/csv/health/
GitHub: https://tkoooo.github.io/Final-Portfolio/
Project Description and Goals: This dataset will allow us to examine the prevalence of various diseases across all 50 states over the last century. We plan to research factors such as vaccination rates to try to understand the peaks and valleys in the dataset. We also want to examine which regions and states have been most and least affected and compare that with factors like weather patterns and population density.
Question: How does location correlate with the probability of catching any one disease within the 20th century? Furthermore, do other factors such as income and population density show this same correlation? This could potentially explain why certain states showed a higher incidence of disease when compared to others.
Collaboration Plan: Our collaboration plan is to continue meeting at least once a week over zoom on Wednesday nights.
Data from: https://think.cs.vt.edu/corgis/csv/health/
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/My Drive/cmps3160
%cd /content/drive/My Drive/cmps3160/finalProject
import pandas as pd
import matplotlib.pyplot as plt
#import pandas and matplotlib for graphing
df = pd.read_csv("../finalProject/health.csv", sep=",")
#read dataset into variable df
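Before checking for missing values, it helps to take a quick look at the table's structure. This is a minimal sketch; the columns we rely on throughout (year, loc, disease, number, population, increase) come from the CORGIS health CSV.
#quick peek at the columns we will rely on (sketch)
df.head()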
#check what values are missing
df[df.isnull().any(axis=1)]
Lucky for us, no data is missing and we can proceed to the analysis!
#Data types analysis
measles = df[df['disease'] == "MEASLES"]
measles['increase'].min()
#in order to visualize case prevalence across the entire country, we create separate dataframes for each disease
#(the drop of the loc and increase columns below is left commented out because loc is needed for the state-level analysis later)
#df = df.drop(columns=['loc', 'increase'])
measles = df[df['disease'] == "MEASLES"]
polio = df[df['disease'] == "POLIO"]
smallpox = df[df['disease'] == "SMALLPOX"]
pertussis = df[df['disease'] == "PERTUSSIS"]
hepa = df[df['disease'] == "HEPATITIS A"]
rubella = df[df['disease'] == "RUBELLA"]
mumps = df[df['disease'] == "MUMPS"]
Now that we have datasets that hold only the specific disease information for each illness, we can begin to perform some basic analysis.
We will aggregate the data to get nationwide totals instead of state-level numbers and perform further analysis to calculate the incidence of each disease each year.
In public health there are 2 main ways to measure how disease affects a population: incidence and prevalence.
Incidence refers to the number of people that develop or are diagnosed with the disease within a time period divided by the total sample size, while prevalence refers to the number of people actively affected by the disease (whether they were diagnosed within the time period or not) divided by the total sample size.
Because this data set provides us with the number of new cases per year and nothing about the amount of time people were sick, we will be focusing on calculating the incidence for each disease throughout the entire country.
This will provide one helpful dataset for each disease that will allow us to graph both the national case numbers and the proportion of people affected over the years.
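As a hypothetical worked example: a state reporting 3,000 new cases in a year with a population of 1,000,000 would have an incidence of 3,000 / 1,000,000 = 0.003 for that year.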
#Basic function to create dataframes
def initial_dataframes(df):
    #grab the case numbers country-wide for each year
    country = df.groupby(by='year')['number'].sum()
    #grab the population size country-wide for each year
    population = df.groupby(by='year')['population'].sum()
    #transform the number and population series into dataframes for future operations
    country = country.to_frame()
    population = population.to_frame()
    #join the number and population dataframes to allow us to create a new incidence column
    incidence = population.join(other=country, on='year')
    #create a new column called incidence that shows the proportion of the population affected each year
    incidence["incidence"] = incidence["number"]/incidence["population"]
    #change the index to regular numbers rather than the year
    incidence.reset_index(inplace=True)
    return country, population, incidence
We also need to bring in the vaccine dataframe before plotting our disease case numbers. The years for each vaccine we are investigating were taken from: https://www.immunize.org/timeline/
vaccine_data = [("Measles", 1963), ("Polio", 1955), ("Smallpox", 1798), ("Pertussis", 1915), ("Hepatitis A", 1995), ("Rubella", 1971), ("Mumps", 1971)]
vaccine_df = pd.DataFrame(vaccine_data, columns=['Disease', 'Year']) #Creating a dataframe out of each disease and the year their vaccine came out
vaccine_df = vaccine_df.sort_values('Year') #We want this sorted by year because we run into annotation issues later if not since Rubella and Mumps are in the same year
vaccine_df.reset_index(inplace=True)
vaccine_df.drop(columns=['index'], inplace=True) #Cleaning up data
vaccine_df
measles_country, measles_population, measles_incidence = initial_dataframes(measles)
measles_incidence.plot.line(x = 'year', y = 'number', figsize=(20,10), legend=False, xlabel='Year', ylabel='Case Numbers') #Plotting case numbers
#Now to plot when the vaccine came out:
plt.axvline(x=vaccine_df[vaccine_df['Disease'] == 'Measles']['Year'].tolist()[0], color = 'black'); #Plotting a vertical line
plt.annotate(s = 'Vaccine', xy=(vaccine_df['Year'].loc[vaccine_df['Disease'] == 'Measles'], 800000), xytext=(20, 0.0055), textcoords='offset points', annotation_clip=True, arrowprops={'width': 1, 'headwidth': 1, 'headlength': 1, 'shrink':0.05});
plt.title("Measles Case Numbers Over Time"); #Giving graph a title
polio_country, polio_population, polio_incidence = initial_dataframes(polio)
polio_incidence.plot.line(x = 'year', y = 'number', figsize=(20,10), legend=False, xlabel='Year', ylabel='Case Numbers') #Plotting case numbers
#Now to plot when the vaccine came out:
plt.axvline(x=vaccine_df[vaccine_df['Disease'] == 'Polio']['Year'].tolist()[0], color = 'black'); #Plotting a vertical line
plt.annotate(s = 'Vaccine', xy=(vaccine_df['Year'].loc[vaccine_df['Disease'] == 'Polio'], 50000), xytext=(20, 0.0055), textcoords='offset points', annotation_clip=True, arrowprops={'width': 1, 'headwidth': 1, 'headlength': 1, 'shrink':0.05});
plt.title("Polio Case Numbers Over Time"); #Giving graph a title
smallpox_country, smallpox_population, smallpox_incidence = initial_dataframes(smallpox)
smallpox_incidence.plot.line(x = 'year', y = 'number', figsize=(20,10), legend=False, xlabel='Year', ylabel='Case Numbers')
plt.title("Smallpox Case Numbers Over Time"); #Giving graph a title
pertussis_country, pertussis_population, pertussis_incidence = initial_dataframes(pertussis)
pertussis_incidence.plot.line(x = 'year', y = 'number', figsize=(20,10), legend=False, xlabel='Year', ylabel='Case Numbers')
plt.title("Pertussis Case Numbers Over Time"); #Giving graph a title
hepa_country, hepa_population, hepa_incidence = initial_dataframes(hepa)
hepa_incidence.plot.line(x = 'year', y = 'number', figsize=(20,10),legend=False, xlabel='Year', ylabel='Case Numbers') #Plotting case numbers
#Now to plot when the vaccine came out:
plt.axvline(x=vaccine_df[vaccine_df['Disease'] == 'Hepatitis A']['Year'].tolist()[0], color = 'black'); #Plotting a vertical line
plt.annotate(s = 'Vaccine', xy=(vaccine_df['Year'].loc[vaccine_df['Disease'] == 'Hepatitis A'], 55000), xytext=(20, 0.0055), textcoords='offset points', annotation_clip=True, arrowprops={'width': 1, 'headwidth': 1, 'headlength': 1, 'shrink':0.05});
plt.title("Hepatitis A Case Numbers Over Time"); #Giving graph a title
rubella_country, rubella_population, rubella_incidence = initial_dataframes(rubella)
rubella_incidence.plot.line(x = 'year', y = 'number', figsize=(20,10), legend=False, xlabel='Year', ylabel='Case Numbers') #Plotting case numbers
#Now to plot when the vaccine came out:
plt.axvline(x=vaccine_df[vaccine_df['Disease'] == 'Rubella']['Year'].tolist()[0], color = 'black'); #Plotting a vertical line
plt.annotate(s = 'Vaccine', xy=(vaccine_df['Year'].loc[vaccine_df['Disease'] == 'Rubella'], 50000), xytext=(20, 0.0055), textcoords='offset points', annotation_clip=True, arrowprops={'width': 1, 'headwidth': 1, 'headlength': 1, 'shrink':0.05});
plt.title("Rubella Case Numbers Over Time"); #Giving graph a title
mumps_country, mumps_population, mumps_incidence = initial_dataframes(mumps)
mumps_incidence.plot.line(x = 'year', y = 'number', figsize=(20,10), legend=False, xlabel='Year', ylabel='Case Numbers') #Plotting case numbers
#Now to plot when the vaccine came out:
plt.axvline(x=vaccine_df[vaccine_df['Disease'] == 'Mumps']['Year'].tolist()[0], color = 'black'); #Plotting a vertical line
plt.annotate(s = 'Vaccine', xy=(vaccine_df['Year'].loc[vaccine_df['Disease'] == 'Mumps'], 130000), xytext=(20, 0.0055), textcoords='offset points', annotation_clip=True, arrowprops={'width': 1, 'headwidth': 1, 'headlength': 1, 'shrink':0.05});
plt.title("Mumps Case Numbers Over Time"); #Giving graph a title
fig = plt.figure(figsize=(20,10));
colors = ['red', 'orange', 'green', 'blue', 'purple', 'brown', 'pink']; #colors for legend
colors_2 = ['green', 'blue', 'orange', 'red', 'brown', 'pink', 'purple']; #colors matching vaccine dataset
i = 0
# Plotting each disease over time
for frame in [measles_incidence, polio_incidence, smallpox_incidence, pertussis_incidence, hepa_incidence, rubella_incidence, mumps_incidence]:
    plt.plot(frame['year'], frame['number'], color=f'{colors[i]}');
    i += 1
for i in range(0, len(vaccine_df)): #this plots vertical lines where the vaccines came out
    if vaccine_df.iloc[i].Year < 1930: #this removes the whitespace that appears before the graph because of the vaccines that came out before 1930
        continue
    plt.axvline(x=vaccine_df.iloc[i].Year, color = f'{colors_2[i]}'); #this matches the color of the vertical line to the color of the disease in the legend
    plt.annotate(s = f'{vaccine_df.iloc[i].Disease}', xy=(vaccine_df.iloc[i].Year, 900000 - i*50000), xytext=(20, 0.0055), textcoords='offset points', annotation_clip=True, arrowprops={'width': 1, 'headwidth': 1, 'headlength': 1, 'shrink':0.05});
plt.legend(["Measles", "Polio", "Smallpox", "Pertussis", "Hepatitis A", "Rubella", "Mumps"]);
plt.title("Disease Case Numbers");
plt.xlim(1925, 2015); #we have to cut off the limits because 2 of the vaccines appeared way before 1930 and make the graph hard to read
fig = plt.figure(figsize=(20,10));
colors = ['red', 'orange', 'green', 'blue', 'purple', 'brown', 'pink']; #colors for legend
colors_2 = ['green', 'blue', 'orange', 'red', 'brown', 'pink', 'purple']; #colors matching vaccine dataset
i = 0
# Plotting each disease over time
for frame in [measles_incidence, polio_incidence, smallpox_incidence, pertussis_incidence, hepa_incidence, rubella_incidence, mumps_incidence]:
    plt.plot(frame['year'], frame['incidence'], color=f'{colors[i]}');
    i += 1
for i in range(0, len(vaccine_df)): #this plots vertical lines where the vaccines came out
    if vaccine_df.iloc[i].Year < 1930: #this removes the whitespace that appears before the graph because of the vaccines that came out before 1930
        continue
    plt.axvline(x=vaccine_df.iloc[i].Year, color = f'{colors_2[i]}'); #this matches the color of the vertical line to the color of the disease in the legend
    plt.annotate(s = f'{vaccine_df.iloc[i].Disease}', xy=(vaccine_df.iloc[i].Year, 0.006 - i*0.0002), xytext=(20, 0.0055), textcoords='offset points', annotation_clip=True, arrowprops={'width': 1, 'headwidth': 1, 'headlength': 1, 'shrink':0.05});
plt.legend(["Measles", "Polio", "Smallpox", "Pertussis", "Hepatitis A", "Rubella", "Mumps"]);
plt.title("Disease Incidence");
plt.xlim(1925, 2015); #we have to cut off the limits because 2 of the vaccines appeared way before 1930 and make the graph hard to read
It is worth noting that the vaccines for Pertussis and Smallpox were released prior to the first collection year of this dataset, so they are not marked.
All of the vaccines seem to have been released after the peak of the disease, which may speak to either the speed and efficiency of US public health, the perceived need, or the state of science at the time.
Now that we have a solid idea of what the trends look like on a national scale, we can zoom back in to look at individual states to try and see how they compare.
One great way to visualize disease incidence per state is to make a spatial heatmap of the US. Given any year and disease, we can use the GeoPandas library to visualize the incidence of the disease in each state.
Cartographic Boundary Files 2012 - Shapefile of the United States: https://www.sciencebase.gov/catalog/item/52c78623e4b060b9ebca5be5
We originally got our data from the US Census, but there was too much data (each state had multiple coordinates for its regions). We found this dataset from sciencebase.gov, which is a U.S. Geological Survey Trusted Digital Repository. Their files are simpler and contain single coordinates for each state. We want only one row of data for each state so that we can easily join our data with our other datasets.
We followed these tutorials to implement our code:
https://mohammadimranhasan.com/geospatial-data-mapping-with-python/
First, we need to install geopandas in order to create maps and import our coordinates file.
!pip install geopandas
#!pip install --upgrade pyshp
#!pip install --upgrade shapely
#!pip install --upgrade descartes
We need to read in the zip file.
import geopandas
states = geopandas.read_file("tl_2012_us_state.zip")
states.head()
Cleaning up the dataset
There are some columns we don't need, and we can rename the columns to make them look neater.
states.columns.unique() #grabbing column names
states.drop(columns=['OBJECTID', 'REGION', 'DIVISION', 'STATEFP', 'STATENS', 'GEOID',
'STUSPS', 'LSAD', 'MTFCC', 'FUNCSTAT', 'ALAND', 'AWATER',
'INTPTLAT', 'INTPTLON', 'Shape_Leng', 'Shape_Area'], inplace=True) #dropping unnecessary columns
states.rename(columns={'NAME':'State', 'geometry': 'Geometry'}, inplace=True) #renaming the columns to make them look prettier
states['State'] = states['State'].str.upper() #in order to join with previous datasets we want this to match
states.head()
states.shape[0]
From the code above, we see that there are 56 rows, even though we know there are 50 states. Looking at the data, we see that US territories such as Puerto Rico and Guam are included. We need to keep this in mind when joining our data.
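A quick sketch to list the extra rows that a join on the 50 states would leave unmatched (this assumes the health data's loc column uses the same uppercase state names, which it does elsewhere in this notebook):
#Sketch: which rows in the shapefile have no counterpart in the disease data?
set(states['State']) - set(df['loc'].unique())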
We need to check for NaN values.
states.isnull().sum()
Perfect! We have no NaN values.
Now we check the dtypes. We expect State should be an object and Geometry should be a geometry (because of the way it was loaded with geopandas).
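A one-line check (sketch):
states.dtypes #expect: State -> object, Geometry -> geometry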
Just to test that we are using GeoPandas correctly, here is a simple map of the US.
states = geopandas.GeoDataFrame(states, geometry='Geometry') #turning the dataframe into a geodataframe object
states.plot(figsize=(10, 10))
plt.xlim(-2.25e7, -0.5e7)
plt.ylim(0.0, 1.2e7)
Great! We are good to move on with analysis. This work will be referred to later in the separate sections for the national incidence of each disease.
This function takes in one of the generic disease dfs and adds a column called incidence that calculates the fraction of new cases over total population for every state. It returns two things: a DataFrame called states_disease_incidence that has the incidence for every state for every year for the given disease, and a dictionary called states_disease that has separate DataFrames for each state with the year and case numbers for the given disease.
#Incidence Column Function
def incidence_df(df):
    #create a dictionary of dataframes for every state
    states_disease = dict()
    for k, v in df.groupby('loc'):
        states_disease[k] = v
    #create an incidence column for every dataframe in the dictionary
    for i in states_disease:
        states_disease[i]['incidence'] = states_disease[i]['number']/states_disease[i]['population']
        #set the index to year for easy combining later
        states_disease[i].set_index('year', inplace = True)
    #create new dataframe of all of the incidences by state for every year
    states_disease_incidence = pd.DataFrame()
    #enter the incidence columns from the state datasets into the incidence dataset
    for i in states_disease:
        states_disease_incidence[i] = states_disease[i]['incidence']
    return states_disease_incidence, states_disease
This function takes in the states_disease_incidence DataFrame created in the incidence_df function and properly formats it for ease of graphing with GeoPandas.
#create a DataFrame to make the visual map of incidence
def map_df(states_disease_incidence):
    map_disease_df = states_disease_incidence.T #transpose to get states in a column
    map_disease_df.reset_index(inplace=True)
    map_disease_df.rename(columns={'index':'State'}, inplace=True)
    map_disease_df = pd.merge(map_disease_df, states, on="State", how='left') #we want to use the keys of the incidence values because we know those are 50 states, not including Guam, etc.
    return map_disease_df
The create_map function takes in 'sel', the selected year we want to examine as an integer; 'map_disease_df', the DataFrame we created in the map_df function specifically for graphing; and the name of the disease as a string to properly label the resulting graph.
The colormapped scalebar on the right changes per disease, but does not change depending on the year chosen because it is based on the min and max values per disease.
def create_map(sel, map_disease_df, disease):
    map_disease_df = geopandas.GeoDataFrame(map_disease_df, geometry='Geometry') #turning the dataframe into a geodataframe object to be used with geopandas to plot
    #Here we are trying to find the min and max values to use for our scalebar
    #We don't want to do this for each year because looking at the graph over time would be misleading if you kept changing the scalebars
    merged_cols = map_disease_df.columns.unique()
    merged_cols = merged_cols[1:-1] #only grabbing the years
    vmin = map_disease_df[merged_cols].min().min()
    vmax = map_disease_df[merged_cols].max().max()
    #Plotting
    fig, ax = plt.subplots(1, figsize=(10,7))
    map_disease_df.plot(column=sel, figsize=(10, 7), cmap='Blues', linewidth=0.8, ax=ax, edgecolor='0.8')
    plt.xlim(-2.25e7, -0.5e7) #resetting axes
    plt.ylim(0.5, 1.2e7)
    ax.axis('off') #remove axis because it means nothing for us
    ax.set_title(f'Incidence Rate of {disease} in {sel} across the US')
    sm = plt.cm.ScalarMappable(cmap='Blues', norm=plt.Normalize(vmin=vmin, vmax=vmax)) #for color scalebar
    sm._A = []
    cbar = fig.colorbar(sm)
We will now use the 3 functions created above to visualize the state incidences for any given year
#utilize the above functions to create dictionary of state specific DataFrames
states_measles_incidence, states_measles = incidence_df(measles)
states_measles_incidence.head()
states_measles['ALABAMA']
map_measles_df = map_df(states_measles_incidence)
map_measles_df.head()
create_map(1960, map_measles_df, 'Measles')
#utilize the above functions to create dictionary of state specific DataFrames
states_polio_incidence, states_polio = incidence_df(polio)
states_polio_incidence.head()
states_polio['ALABAMA']
map_polio_df = map_df(states_polio_incidence)
map_polio_df.head()
create_map(1960, map_polio_df, 'Polio')
#utilize the above functions to create dictionary of state specific DataFrames
states_pertussis_incidence, states_pertussis = incidence_df(pertussis)
states_pertussis_incidence.head()
states_pertussis['ALABAMA']
map_pertussis_df = map_df(states_pertussis_incidence)
map_pertussis_df.head()
create_map(2011, map_pertussis_df, 'Pertussis')
#utilize the above functions to create dictionary of state specific DataFrames
states_smallpox_incidence, states_smallpox = incidence_df(smallpox)
states_smallpox_incidence.head()
states_smallpox['ALABAMA']
map_smallpox_df = map_df(states_smallpox_incidence)
map_smallpox_df.head()
create_map(1940, map_smallpox_df, 'Smallpox')
#utilize the above functions to create dictionary of state specific DataFrames
states_hepa_incidence, states_hepa = incidence_df(hepa)
states_hepa_incidence.head()
states_hepa['ALABAMA']
map_hepa_df = map_df(states_hepa_incidence)
map_hepa_df.head()
create_map(2011, map_hepa_df, 'Hepatitis A')
#utilize the above functions to create dictionary of state specific DataFrames
states_rubella_incidence, states_rubella = incidence_df(rubella)
states_rubella_incidence.head()
states_rubella['ALABAMA']
map_rubella_df = map_df(states_rubella_incidence)
map_rubella_df.head()
create_map(1970, map_rubella_df, 'Rubella')
#utilize the above functions to create dictionary of state specific DataFrames
states_mumps_incidence, states_mumps = incidence_df(mumps)
states_mumps_incidence.head()
states_mumps['ALABAMA']
map_mumps_df = map_df(states_mumps_incidence)
map_mumps_df.head()
create_map(1970, map_mumps_df, 'Mumps')
Now that we have datasets portraying incidence on both a national and a statewide scale, we can see how the states compare to the national average and establish a ranking of which states were most affected by a given disease each year.
We will start by creating a dataset that has the incidence for every state every year as well as the entire country.
def US_state_df(disease_incidence, states_disease_incidence):
    #reset index to year for joining
    disease_incidence.set_index('year', inplace = True)
    #grab the national incidence and add it into a new dataframe with state incidences
    US = disease_incidence['incidence']
    US_state_disease_incidence = states_disease_incidence.join(US)
    #rename column
    US_state_disease_incidence.rename(columns={'incidence':'UNITED STATES'}, inplace = True)
    return US_state_disease_incidence
def graph_state_incidence(US_state_disease_incidence):
    states = list(US_state_disease_incidence.columns)
    states.remove('UNITED STATES')
    disease_compare_incidence = US_state_disease_incidence[states].div(US_state_disease_incidence['UNITED STATES'], axis=0)
    disease_hist = pd.DataFrame()
    for state in disease_compare_incidence:
        count = disease_compare_incidence[state][disease_compare_incidence[state] > 1].count()
        disease_hist.insert(0, state, [count])
    disease_hist = disease_hist.T
    disease_hist.sort_values(0, ascending = False, inplace = True)
    disease_hist.plot.bar(title='Number Of Times State Incidence is Above National Average Incidence', figsize=(20, 10), legend=False)
US_state_measles_incidence = US_state_df(measles_incidence, states_measles_incidence)
graph_state_incidence(US_state_measles_incidence)
US_state_polio_incidence = US_state_df(polio_incidence, states_polio_incidence)
graph_state_incidence(US_state_polio_incidence)
US_state_pertussis_incidence = US_state_df(pertussis_incidence, states_pertussis_incidence)
graph_state_incidence(US_state_pertussis_incidence)
US_state_smallpox_incidence = US_state_df(smallpox_incidence, states_smallpox_incidence)
graph_state_incidence(US_state_smallpox_incidence)
US_state_hepa_incidence = US_state_df(hepa_incidence, states_hepa_incidence)
graph_state_incidence(US_state_hepa_incidence)
US_state_rubella_incidence = US_state_df(rubella_incidence, states_rubella_incidence)
graph_state_incidence(US_state_rubella_incidence)
US_state_mumps_incidence = US_state_df(mumps_incidence, states_mumps_incidence)
graph_state_incidence(US_state_mumps_incidence)
Data from the US Census Bureau: https://www.census.gov/data/tables/time-series/dec/density-data-text.html
df_density = pd.read_csv("../finalProject/apportionment.csv", sep=",")
df_density.head()
#grab column names for easy transformation
df_density.columns
#['Name', 'Geography Type', 'Year', 'Resident Population',
#'Percent Change in Resident Population', 'Resident Population Density',
#'Resident Population Density Rank', 'Number of Representatives',
#'Change in Number of Representatives',
#'Average Apportionment Population Per Representative']
#first we need to get rid of the unnecessary information just to make things cleaner
df_density = df_density.drop(columns=['Resident Population', 'Resident Population Density Rank',
'Percent Change in Resident Population',
'Number of Representatives', 'Change in Number of Representatives',
'Average Apportionment Population Per Representative'])
df_density
There are also some values within the dataframe that we don't need (at least for initial use with our first dataset) such as the regions and the nationwide information. To change this we can just grab the rows with the type "State"
pop_density = df_density[df_density['Geography Type'] == 'State']
pop_density = pop_density.drop(pop_density[pop_density['Name'] == 'Puerto Rico'].index)
pop_density
pop_density[pop_density['Name'] == 'Alabama']
#check the datatypes of the dataset
pop_density.dtypes
#change the resident population density to a float
pop_density['Resident Population Density'] = pd.to_numeric(pop_density['Resident Population Density'],errors='coerce')
pop_density['Resident Population Density'].describe()
#We also need to make the states uppercase for easy combining later
pop_density['Name'] = pop_density['Name'].str.upper()
pop_density.head()
The final step to combining this with the initial disease dataset is to ensure that all of the years present in the disease dataset also exist in the population density dataset. This data is grabbed from the census so there are only reports for the 1st year of every decade. While not exact, we can fill in the expected density for each year based on the values provided for each decade. This is done by creating rows for each year and then using interpolate to fill in NaN values by approximating them.
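As a minimal illustration of what interpolate will do here (a sketch with made-up density values between two census years):
#Sketch: linear interpolation fills the in-between years from the decennial values
example = pd.Series([50.0, None, None, None, None, None, None, None, None, None, 60.0], index=range(1950, 1961))
example.interpolate(method='linear') #1951 -> 51.0, 1952 -> 52.0, ..., 1959 -> 59.0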
#create list of years needed to fill in the population density data frame
years = list(range(1911, 2020, 1))
years.remove(1920)
years.remove(1930)
years.remove(1940)
years.remove(1950)
years.remove(1960)
years.remove(1970)
years.remove(1980)
years.remove(1990)
years.remove(2000)
years
#iterate over population density to create and fill in a dataframe for each state
states_density = dict()
for k, v in pop_density.groupby('Name'):
    states_density[k] = v
    for year in years:
        states_density[k] = states_density[k].append({'Name':k, 'Geography Type':'State', 'Year':year}, ignore_index=True)
    states_density[k].sort_values(by = 'Year', inplace = True)
#use interpolate to approximate population density for each year in each state
#(interpolate fills the NaNs in the numeric columns; only Resident Population Density has gaps here)
for frame in states_density:
    states_density[frame].interpolate(method='linear', inplace=True)
#As an example of what is being stored, here is one DataFrame from states_density
states_density['WISCONSIN']
In order to be more workable, we need the state population densities to be in one large dataframe instead of a dictionary of dataframes. We wish to find a correlation between disease incidence and population density.
We hypothesize that a higher population density would indicate a higher incidence, as disease can more easily spread.
#put all state population density into one large dataframe
years = states_density['ALABAMA']['Year']
list_of_names = states_density.keys()
res_pop_dens = pd.DataFrame()
for name in list_of_names:
    res_pop_dens[name] = states_density[name]["Resident Population Density"]
res_pop_dens.insert(0, 'Year', years)
#set the index to year for easy accessing
res_pop_dens.set_index('Year', inplace=True)
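Before computing correlations, a quick visual check of the hypothesis for a single year can be useful. This is only a sketch: 1960 is an arbitrary census year, we assume the Year index holds integers, and the column intersection guards against any naming mismatches between the two tables.
#Sketch: population density vs. measles incidence across states in 1960
common = res_pop_dens.columns.intersection(states_measles_incidence.columns)
plt.figure(figsize=(10, 6))
plt.scatter(res_pop_dens.loc[1960, common], states_measles_incidence.loc[1960, common])
plt.xlabel('Resident Population Density (1960)')
plt.ylabel('Measles Incidence (1960)')
plt.title('Density vs. Measles Incidence, 1960');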
Because we need to find the correlation for every state for every disease, it is best to write a function to calculate and append this information for us.
def correlation_function(states, states_disease_incidence, res_pop_dens, name):
    #create empty array to store correlation data
    data = []
    for i in states: #calculates the correlation between incidence of disease and average population density for each state
        data.append(states_disease_incidence[[i]][i].corr(res_pop_dens[[i]][i]))
    #add this data into the larger DataFrame
    correl_density_df[name] = data
#Making correlation density dataframes per disease into 1 big dataframe
all_states = res_pop_dens.columns.unique()
correl_density_df = pd.DataFrame(columns = ['Location']) #creating a dataframe of 2 columns
correl_density_df['Location'] = all_states #filling in location with state names
#measles
correlation_function(all_states, states_measles_incidence, res_pop_dens, 'Correlation for Measles')
#Hepatitis A
correlation_function(all_states, states_hepa_incidence, res_pop_dens, 'Correlation for Hepatitis A')
#Mumps
correlation_function(all_states, states_mumps_incidence, res_pop_dens, 'Correlation for Mumps')
#Pertussis
correlation_function(all_states, states_pertussis_incidence, res_pop_dens, 'Correlation for Pertussis')
#Polio
correlation_function(all_states, states_polio_incidence, res_pop_dens, 'Correlation for Polio')
#Rubella
correlation_function(all_states, states_rubella_incidence, res_pop_dens, 'Correlation for Rubella')
correl_density_df.head()
#report the maximum and minimum correlation states for measles as well as the average
print('Max correlation for measles: ', correl_density_df['Location'].iloc[correl_density_df['Correlation for Measles'].idxmax()])
print(states_density['WEST VIRGINIA'])
print('Min correlation for measles: ', correl_density_df['Location'].iloc[correl_density_df['Correlation for Measles'].idxmin()])
print(states_density['NEW YORK'])
print('Avg correlation for measles: ', correl_density_df['Correlation for Measles'].mean())
The average correlation between state population density and measles incidence is -0.53, which indicates a moderate negative correlation.
Our hypothesis was not supported: it would appear that as population density increases, incidence shows a moderate decrease. This could be due to bias in our assumption that all diseases spread primarily through close human contact, as well as an oversimplification of what variables determine contraction.
The maximum correlation was for West Virginia, which, compared to New York (the minimum correlation), has a considerably lower average population density. This suggests that other variables may be at play that correlate somewhat with population density but do not follow the same trends, leading to variable correlation values. One example is access to health care and the ability to be diagnosed and treated, which can vary widely between states like WV and NY, where city centers and population distribution look very different.
We will attempt to investigate whether average income level by state has any correlation with the disease prevalence by state.
This data source comes from the U.S. Bureau of Economic Analysis (BEA). It contains average incomes for each state over the years 1929 to 2021.
income_df = pd.read_csv("../finalProject/income.csv", sep=",", skiprows = 3)
#reading in income file as a dataframe
#we need to skip the first 3 rows because there are some text explanations written in these rows
income_df.head()
#original dataframe
Data Filtering
We do not have a use for the GeoFips column, which is just the geo-codes corresponding to each state. We also need to get rid of the footnote rows at the bottom of the document.
The datasource notes that estimates prior to 1950 are not available for Alaska and Hawaii.
import numpy as np
income_df
#income_df.drop(income_df[(income_df['GeoFips'] == object)].index, inplace=True)
income_df.drop(income_df.index[list(range(60, 65))], inplace=True)
#We only want to keep the first 60 rows because there are unnecessary lines at the end.
income_df.drop(columns = ["GeoFips"], inplace=True)
#We don't care about the GeoFips column
#We want to change the names that have asterisks to make the data look nicer
index_alaska = income_df.index[income_df['GeoName'] == 'Alaska *'].tolist()[0]
index_hawaii = income_df.index[income_df['GeoName'] == 'Hawaii *'].tolist()[0]
income_df.at[index_alaska, 'GeoName'] = 'Alaska'
income_df.at[index_hawaii, 'GeoName'] = 'Hawaii'
#income_df['GeoName'].loc[income_df['GeoName'].isin(['Alaska *'])] = 'Alaska'
#income_df['GeoName'].loc[income_df['GeoName'].isin(['Hawaii *'])] = 'Hawaii'
income_df.replace('(NA)', np.nan, inplace=True)
#Replacing NA values with NaN
income_df.isnull().sum()[income_df.isnull().sum() > 2]
#Since we know from the data source that 2 states do not have values before 1950, we check to see if there are any other missing values from the other 48 states.
#We see that we return an empty series, so that means only two states had NaN's like we expected.
Now we display the resulting dataset after this cleanup:
income_df.head() #resulting dataset
Now we need to complete some transformations to prepare the dataset for analysis.
income_df1 = income_df.set_index('GeoName') #reset index to be by state
income_dfT1 = income_df1.T #transpose
income_dfT1.reset_index(inplace=True) #adding year back as a column to use for plotting
income_dfT1.columns = income_dfT1.columns.str.upper() #need these to be uppercase to join with population density dataset
income_dfT1.rename(columns={'INDEX':'year'}, inplace=True) #renaming column to make more clear
income_dfT1.drop(['NEW ENGLAND', 'MIDEAST', 'GREAT LAKES', 'PLAINS', 'SOUTHEAST', 'SOUTHWEST', 'ROCKY MOUNTAIN', 'FAR WEST'], axis=1, inplace=True) #we want to drop these columns so that they don't mess up any averaging
income_dfT1.head()
We need to check that the datatypes are correct.
income_dfT1.dtypes
Oops! These are all objects, when we expect integers for years and floats for the income amounts.
cols_income = income_dfT1.columns
income_dfT1[cols_income] = income_dfT1[cols_income].apply(pd.to_numeric, errors='coerce') #changing datatypes from objects to floats/integers where appropriate
income_dfT1['ARIZONA'] #checking datatype
income_dfT1.dtypes
Great! All fixed to be as expected.
Analysis
Now we plot income level by state over time just to get a quick visual of the data.
We only plot the 7 states with the highest incomes because 50 states would give us a cluttered legend and a difficult graph to read.
#We sort the max income values series in descending order to grab the max 7 states, and then grab the names of those states
income_dfT1.max().sort_values(axis=0, ascending=False)[1:8].index.tolist() #Row 0 is the United States, so we skip it and take the max 7 states
#Then we plot
income_dfT1.plot(x = 'year', y=income_dfT1.max().sort_values(axis=0, ascending=False)[1:8].index.tolist())
plt.legend(income_dfT1.max().sort_values(axis=0, ascending=False)[1:8].index.tolist());
plt.title("Income Level Over Time per State")
plt.xlabel('Year')
plt.ylabel('Income')
The above plot is useful to note that income generally increased dramatically over time for each state, and it is evident that some states, such as California, have a significantly higher average income than others.
It is important to note that the cost of living is not the same in each state as well.
income_dfT1.set_index('year', inplace=True)
income_dfT1.head()
all_states = states_measles_incidence.columns.unique() #grabbing the names of all the states
We wish to explore the relationship between incidence and average income per state.
We can calculate the correlation of the two for each state over all of the years that we have data available for, for each disease separately.
correl_income_df = pd.DataFrame(columns = ['Location']) #creating a dataframe of 2 columns
correl_income_df['Location'] = all_states #filling in location with state names
incidences = [states_measles_incidence, states_hepa_incidence, states_mumps_incidence, states_pertussis_incidence, states_polio_incidence, states_rubella_incidence, states_smallpox_incidence]
#Creating a function to automate the process of iterating through each disease
def correl_income_df_func(incidence, dis_name):
    corr_data = []
    for i in all_states: #calculates the correlation between the disease incidence and the average income of each state and appends to an array which we will throw into the dataframe later
        corr_data.append(incidence[[i]][i].corr(income_dfT1[[i]][i]))
    correl_income_df[f'Correlation for {dis_name}'] = corr_data #Creating a new column for this disease and throwing in the correlation data
#Filling in the resultant correlation dataframe
correl_income_df_func(states_measles_incidence, 'Measles')
correl_income_df_func(states_hepa_incidence, 'Hepatitis A')
correl_income_df_func(states_mumps_incidence, 'Mumps')
correl_income_df_func(states_pertussis_incidence, 'Pertussis')
correl_income_df_func(states_polio_incidence, 'Polio')
correl_income_df_func(states_rubella_incidence, 'Rubella')
correl_income_df
We get a negative correlation each time as our result!
This makes sense because it says that for a higher income, the chances of contracting a given disease are lower, and vice versa: for a lower income, the chances are higher. This does not establish causality, but it certainly displays an expected relationship!
Now we can compute some stats about our table.
print('Max correlation for measles: ', correl_income_df['Location'].iloc[correl_income_df['Correlation for Measles'].idxmax()])
print('Min correlation for measles: ', correl_income_df['Location'].iloc[correl_income_df['Correlation for Measles'].idxmin()])
print('Avg correlation for measles: ', correl_income_df['Correlation for Measles'].mean())
It is interesting to point out that the average correlation across all of the states for the entire period for measles is a moderate negative value with a magnitude just under 0.5.
We note that we cannot make strong claims on this data because income also varies between states based on average living costs.
In addition, we recognize it would have been better to have data for income level of each person who got sick since this income set is just the average income of the state including people who aren't sick. This is just to explore a relationship between overall state wealth and illness.
We can compute these stats for the other diseases too.
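The loop below is one way to get the same summary for every disease in a single pass (a sketch):
#Sketch: max/min/average correlation state for each disease column
for col in correl_income_df.columns.drop('Location'):
    print('Max', col, ':', correl_income_df['Location'].iloc[correl_income_df[col].idxmax()])
    print('Min', col, ':', correl_income_df['Location'].iloc[correl_income_df[col].idxmin()])
    print('Avg', col, ':', correl_income_df[col].mean())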
Our idea for a predictive model is to create a hypothetical situation and predict the chances that someone in a particular state would contract a certain disease in a couple of years after.
For instance, we could say, "Imagine you are living in Louisiana in 1955, and you want to know the chances of you catching Measles in 1967."
The independent variables used to train the model would be the year, state, disease, and income level and the dependent variable to be predicted on would be the probability that this person will contract the disease at hand.
In terms of the methodology, we can train a k-nearest neighbors regression model to predict the probability of contracting the disease for different values of k. We can calculate the training and validation error (RMSE) of each model. Because this data involves patterns over time, we need to be careful in how we split our data. We cannot, for example, split our data in half, shuffle it up, and do regular cross-validation. The reason is that this would be somewhat 'cheating' because the model gets to see what the future trend looks like. We can plot our errors as a function of k to determine which is the optimal k to use in order to get our final predicted value.
In order to perform a more robust analysis, we can determine the effects of certain factors on our predictive model. For instance, we can test whether splitting the data differently or taking out certain features improves the error of our model.
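One way to formalize this time-aware splitting is scikit-learn's TimeSeriesSplit, which always trains on earlier rows and validates on the rows that follow. The sketch below is illustrative only (our models below use a simple chronological half-split instead), and 'data' is a placeholder for any chronologically sorted DataFrame, such as the combined measles table built later.
#Sketch: every fold trains on an earlier block of rows and validates on the block after it
from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, val_idx in tscv.split(data):  #'data' is a placeholder DataFrame sorted by year
    print('train rows', train_idx.min(), '-', train_idx.max(), '| validate rows', val_idx.min(), '-', val_idx.max())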
We need to do a quick operation to prepare the income dataset with the disease incidence tables.
#removing unnecessary columns from income
income_dfT1.drop(income_dfT1.loc[2012:].index, inplace=True)
income_dfT1.drop(columns=['UNITED STATES'], inplace=True)
income_dfT1
We make our large dataset that contains all of our data - year, state, income, incidence of each disease, and population density.
from itertools import product
copy = income_dfT1
tuples = list(product(copy.index, copy.columns)) #creating all possible combinations
tuples
index = pd.MultiIndex.from_tuples(tuples, names=["year", "state"]) #creating multiindex to use in the new dataframe
altogether = pd.DataFrame(index=index, columns=['Income']) #creating empty dataframe to later fill in values for
#Adding in income into total dataframe
for i in income_dfT1: #states - cols
    for j in income_dfT1.index: #years - rows
        altogether.at[(j, i), 'Income'] = income_dfT1.loc[j, i]
#Creating a function that adds each disease incidence dataframe to total dataframe
def add_disease_inc(df, name):
    for i in df: #states - cols
        for j in df.index: #years - rows
            altogether.at[(j, i), f'{name}'] = df.loc[j, i]
#Calling function for each disease
add_disease_inc(states_measles_incidence, 'Measles Incidence')
add_disease_inc(states_mumps_incidence, 'Mumps Incidence')
add_disease_inc(states_hepa_incidence, 'Hep-A Incidence')
add_disease_inc(states_pertussis_incidence, 'Pertussis Incidence')
add_disease_inc(states_polio_incidence, 'Polio Incidence')
add_disease_inc(states_rubella_incidence, 'Rubella Incidence')
add_disease_inc(states_smallpox_incidence, 'Smallpox Incidence')
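As a side note, the nested .at loops above work but are slow for wide tables; a vectorized alternative (a sketch only, not what we use going forward) is stack(), which reshapes each wide year-by-state table into a (year, state)-indexed Series in one call.
#Sketch: build the long-format table with stack() instead of nested loops (altogether_alt is a hypothetical name)
income_long = income_dfT1.stack()
income_long.index.names = ['year', 'state']
altogether_alt = income_long.to_frame('Income')
measles_long = states_measles_incidence.stack()
measles_long.index.names = ['year', 'state']
altogether_alt['Measles Incidence'] = measles_long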
#Creating dataframe of population densities
d = { k: v.set_index(['Year', 'Name']) for k, v in states_density.items()} #setting index to Year and Name for every dataframe within the dictionary states_density
df_densities = pd.concat(d, axis=0) #putting all of the dataframes into 1 dataframe
df_densities.reset_index(inplace = True)
df_densities.set_index(['Year', 'Name'], inplace=True)
df_densities
#Merging population density dataframe with big dataframe
altogether = altogether.merge(df_densities, left_index=True, right_on=['Year', 'Name'])
altogether.drop(columns= ['level_0', 'Geography Type'], inplace=True) #Don't need these columns
altogether.rename(columns={"Resident Population Density": "Pop Density"}, inplace=True)
#Sorting in order and making year and state name individual columns
altogether.sort_index(inplace=True, ascending=True)
altogether.reset_index(inplace=True)
altogether
Now, we examine only the case of measles to make things simpler, and keep in mind that we can examine any disease.
Additionally, we will drop any NaNs here, as data such as the disease incidence cannot be linearly interpolated at this point without potentially skewing it and creating false trends.
#Create shorter version of altogether dataframe
measles_together = altogether.drop(columns=['Mumps Incidence', 'Hep-A Incidence', 'Pertussis Incidence', 'Polio Incidence', 'Rubella Incidence', 'Smallpox Incidence'])
#Drop nans
measles_together.dropna(inplace=True)
#Reset index after drop
measles_together.reset_index(inplace=True)
measles_together.drop(columns=['index'], inplace=True)
measles_together
We can create a predictor for measles that just splits the data in half.
With this predictor, we expect that our predicted incidence value will be much higher than in actuality because we have to predict on a year just after the vaccine came out - this is a piece of information that our model does not have, nor is it trained to account for.
First, we split the data in half. Then, we define our features to train on as the year, state, income, and population density.
We can use a for loop to test out different k values to use for our k-nearest neighbors model and get a validation error.
#JUST MEASLES PREDICTOR
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler
# Fit model to data
from sklearn.neighbors import KNeighborsRegressor
# split data in half
# There are 2949 rows in measles_together, so 2949/2 = 1474.5
train = measles_together.iloc[0:1475]
val = measles_together.iloc[1475:]
# defining features
features = ["Year",
"Name",
"Income",
"Pop Density"]
X_train_dict = train[features].to_dict(orient="records")
X_val_dict = val[features].to_dict(orient="records")
y_train = train["Measles Incidence"]
y_val = val["Measles Incidence"]
#For loop to determine optimal k to use
k_1 = [i for i in range(1,100)]
error_results = []
def get_val_error(X_train_dict, y_train, X_val_dict, y_val, i):
    # convert categorical variables to dummy variables
    vec = DictVectorizer(sparse=False)
    vec.fit(X_train_dict)
    X_train = vec.transform(X_train_dict)
    X_val = vec.transform(X_val_dict)
    # standardize the data
    scaler = StandardScaler()
    scaler.fit(X_train)
    X_train_sc = scaler.transform(X_train)
    X_val_sc = scaler.transform(X_val)
    # Fit a k-nearest neighbors model with the current k (i).
    model = KNeighborsRegressor(n_neighbors=i)
    model.fit(X_train_sc, y_train)
    # Make predictions on the validation set.
    y_val_pred = model.predict(X_val_sc)
    rmse = np.sqrt(((y_val - y_val_pred) ** 2).mean())
    return rmse

for i in k_1:
    val1 = get_val_error(X_train_dict, y_train, X_val_dict, y_val, i)
    val2 = get_val_error(X_val_dict, y_val, X_train_dict, y_train, i) #reverse
    cross_val = (val1 + val2)/2
    error_results.append(cross_val)
We plot these k values and the corresponding errors they give to determine the best k to use.
#Plotting k with validation error
import matplotlib.pyplot as plt
plt.scatter(x=k_1, y=error_results)
plt.xlabel('k')
plt.ylabel('test error')
plt.title('k for Measles')
ymin = min(error_results)
#xpos = np.where(error_results == ymin)
#xmin = xpos[0][0]
#print('xmin = ', xmin)
From the graph above, we see that the optimal k to use is 57. This makes sense because there are 50 states for a given year, each of which is likely experiencing similar trends.
We don't want to pick a k too close to 0, because this error is too good to be true since we would be picking data points too similar to the one we are predicting for.
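We can also confirm the chosen k programmatically rather than reading it off the plot (a sketch):
#Sketch: index of the smallest cross-validated error gives the best k
best_k = k_1[int(np.argmin(error_results))]
print('best k:', best_k, 'with error:', min(error_results))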
Next, we take this k and make our predictions.
# Use k=57 to predict
# convert categorical variables to dummy variables
vec = DictVectorizer(sparse=False)
vec.fit(X_train_dict)
X_train = vec.transform(X_train_dict)
X_val = vec.transform(X_val_dict)
# standardize the data
scaler = StandardScaler()
scaler.fit(X_train)
X_train_sc = scaler.transform(X_train)
X_val_sc = scaler.transform(X_val)
# Fit a 57-nearest neighbors model.
model = KNeighborsRegressor(n_neighbors=57)
model.fit(X_train_sc, y_train)
# Make predictions on the validation set.
y_val_pred = model.predict(X_val_sc)
rmse = np.sqrt(((y_val - y_val_pred) ** 2).mean())
#print(y_val_pred)
print('val error: ', ymin)
If we want to know the chances of catching measles in 1970 in Alabama, we need to find the index of this datapoint in the dataframe we predicted on and map it to its index in our resulting prediction array.
val[val['Year'] == 1970]
#Find that 1970 Alabama is at index 1967
#1967 - 1475 = 492 gives us the index in the predicted array
y_val_pred[492]
If we want to know the chances of catching measles in 1970 in Alabama, the results show that the predicted incidence is 0.002460425878952384.
The expected value is 0.000139. It makes sense that the obtained and expected values are off because we were predicting on a year right after the measles vaccine came out, and the incidence values drastically drop off at this point.
#Calculate percent error
#Percentage Error = ((Estimated Number - Actual Number)/ Actual number) x 100
((.002460425878952384 - .000139)/ .000139) * 100
The percent error here is 1670%! Let's see if we can reduce this.
Now, to examine another way of splitting the data, we can restrict ourselves to the pre-vaccine years (so each split is roughly a quarter of the full dataset) and only examine the period when the disease was prevalent, before vaccine release. We predict that our predicted incidence value will be closer to the expected value than in Model 1.
Specifically for measles, the vaccine was released in 1963, so we want to examine measles incidence from 1929 to 1963 instead of 1929 to 2002 in order to give our model a better shot at genuinely predicting.
In order to figure out which indices to use from measles_together, we can see the last index where we have data for the year 1963.
measles_together[measles_together['Year'] == 1963].tail()
The last index where we have data for the year 1963 is 1669. Therefore, we will split the data in half from indices 0 to 1669.
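That cutoff can also be recovered programmatically instead of reading it off .tail() (a sketch):
#Sketch: last row index for 1963 and the midpoint used for the train/validation split
last_1963 = measles_together[measles_together['Year'] == 1963].index.max()
print(last_1963, (last_1963 + 1) // 2) #expect 1669 and 835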
We proceed as we did last time but split the data differently.
First, we split this portion of data in half. Then, we define our features to train on as the year, state, income, and population density.
We will again use a for loop to test out different k values to use for our k-nearest neighbors model and get a validation error.
#JUST MEASLES PREDICTOR
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler
# Fit model to data
from sklearn.neighbors import KNeighborsRegressor
# split data in half up to vaccine release in 1963
# 1669/2 = 834.5
train = measles_together.iloc[0:835]
val = measles_together.iloc[835:1669]
# defining features
features = ["Year",
"Name",
"Income",
"Pop Density"]
X_train_dict = train[features].to_dict(orient="records")
X_val_dict = val[features].to_dict(orient="records")
y_train = train["Measles Incidence"]
y_val = val["Measles Incidence"]
#For loop to determine optimal k to use
k_1 = [i for i in range(1,100)]
error_results = []
def get_val_error(X_train_dict, y_train, X_val_dict, y_val, i):
    # convert categorical variables to dummy variables
    vec = DictVectorizer(sparse=False)
    vec.fit(X_train_dict)
    X_train = vec.transform(X_train_dict)
    X_val = vec.transform(X_val_dict)
    # standardize the data
    scaler = StandardScaler()
    scaler.fit(X_train)
    X_train_sc = scaler.transform(X_train)
    X_val_sc = scaler.transform(X_val)
    # Fit a k-nearest neighbors model with the current k (i).
    model = KNeighborsRegressor(n_neighbors=i)
    model.fit(X_train_sc, y_train)
    # Make predictions on the validation set.
    y_val_pred = model.predict(X_val_sc)
    rmse = np.sqrt(((y_val - y_val_pred) ** 2).mean())
    return rmse

for i in k_1:
    val1 = get_val_error(X_train_dict, y_train, X_val_dict, y_val, i)
    val2 = get_val_error(X_val_dict, y_val, X_train_dict, y_train, i) #reverse
    cross_val = (val1 + val2)/2
    error_results.append(cross_val)
We plot these k values and the corresponding errors they give to determine the best k to use.
#Plotting k with validation error
import matplotlib.pyplot as plt
plt.scatter(x=k_1, y=error_results)
plt.xlabel('k')
plt.ylabel('test error')
plt.title('k for Measles')
ymin = min(error_results)
From the graph above, we see that the optimal k to use is about 25. This is most likely because we cut our training/validation size in half.
We again don't want to pick a k too close to 0, because this error is too good to be true since we would be picking data points too similar to the one we are predicting for.
Next, we take this k and make our predictions.
# Use k=25 to predict
# convert categorical variables to dummy variables
vec = DictVectorizer(sparse=False)
vec.fit(X_train_dict)
X_train = vec.transform(X_train_dict)
X_val = vec.transform(X_val_dict)
# standardize the data
scaler = StandardScaler()
scaler.fit(X_train)
X_train_sc = scaler.transform(X_train)
X_val_sc = scaler.transform(X_val)
# Fit a 25-nearest neighbors model.
model = KNeighborsRegressor(n_neighbors=25)
model.fit(X_train_sc, y_train)
# Make predictions on the validation set.
y_val_pred = model.predict(X_val_sc)
rmse = np.sqrt(((y_val - y_val_pred) ** 2).mean())
#print(y_val_pred)
print('val error: ', ymin)
If we want to know the chances of catching measles in 1955 in Alabama (about halfway through the predicted dataset), we need to find the index of this datapoint in the dataframe we predicted on and map it to its index in our resulting prediction array.
val[val['Year'] == 1963]
#Find that 1955 Alabama is at index 1227
#1227 - 835 = 392 gives us the index in the predicted array
y_val_pred[392]
If we want to know the chances of catching measles in 1963 in Alabama, the results show that the predicted incidence is 0.001797885794424771, compared to the expected value of 0.000347.
#Calculate percent error
#Percentage Error = ((Estimated Number - Actual Number)/ Actual number) x 100
((.001797885794424771 - .000347)/ .000347) * 100
The percentage error here is 418%, which is still pretty large, but drastically reduced from Model 1's 1670%!
Additionally, as a note to keep in mind, these incidence numbers are so small that percent differences between them look extremely high, so comparing the differences between the models' percent errors is more informative than the raw values.
Now, to examine what happens when we remove a feature to predict on, we will remove the income feature. Based on our correlations between disease incidence and income, we predict that the error will go up.
Since our predictor 2 improved the percent error, we will build off of that model and examine the measles incidence from 1929 to 1963 right before the vaccine came out.
First, we split the data in half. Then, we define our features to train on as the year, state, and population density.
We can use a for loop to test out different k values to use for our k-nearest neighbors model and get a validation error.
#JUST MEASLES PREDICTOR - no income
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler
# Fit model to data
from sklearn.neighbors import KNeighborsRegressor
# split data in half up to vaccine release in 1963
# 1669/2 = 834.5
train = measles_together.iloc[0:835]
val = measles_together.iloc[835:1669]
# defining features WITHOUT INCOME
features = ["Year",
"Name",
"Pop Density"]
X_train_dict = train[features].to_dict(orient="records")
X_val_dict = val[features].to_dict(orient="records")
y_train = train["Measles Incidence"]
y_val = val["Measles Incidence"]
#For loop to determine optimal k to use
k_1 = [i for i in range(1,100)]
error_results = []
def get_val_error(X_train_dict, y_train, X_val_dict, y_val, i):
    # convert categorical variables to dummy variables
    vec = DictVectorizer(sparse=False)
    vec.fit(X_train_dict)
    X_train = vec.transform(X_train_dict)
    X_val = vec.transform(X_val_dict)
    # standardize the data
    scaler = StandardScaler()
    scaler.fit(X_train)
    X_train_sc = scaler.transform(X_train)
    X_val_sc = scaler.transform(X_val)
    # Fit a k-nearest neighbors model with the current k (i).
    model = KNeighborsRegressor(n_neighbors=i)
    model.fit(X_train_sc, y_train)
    # Make predictions on the validation set.
    y_val_pred = model.predict(X_val_sc)
    rmse = np.sqrt(((y_val - y_val_pred) ** 2).mean())
    return rmse

for i in k_1:
    val1 = get_val_error(X_train_dict, y_train, X_val_dict, y_val, i)
    val2 = get_val_error(X_val_dict, y_val, X_train_dict, y_train, i) #reverse
    cross_val = (val1 + val2)/2
    error_results.append(cross_val)
We plot these k values and the corresponding errors they give to determine the best k to use.
#Plotting k with validation error
import matplotlib.pyplot as plt
plt.scatter(x=k_1, y=error_results)
plt.xlabel('k')
plt.ylabel('test error')
plt.title('k for Measles')
ymin = min(error_results)
From the graph above, we see that the optimal k to use is about 25 again, similar to our Model 2.
We again don't want to pick a k too close to 0, because this error is too good to be true since we would be picking data points too similar to the one we are predicting for.
Next, we take this k and make our predictions.
# Use k=25 to predict
# convert categorical variables to dummy variables
vec = DictVectorizer(sparse=False)
vec.fit(X_train_dict)
X_train = vec.transform(X_train_dict)
X_val = vec.transform(X_val_dict)
# standardize the data
scaler = StandardScaler()
scaler.fit(X_train)
X_train_sc = scaler.transform(X_train)
X_val_sc = scaler.transform(X_val)
# Fit a 25-nearest neighbors model.
model = KNeighborsRegressor(n_neighbors=25)
model.fit(X_train_sc, y_train)
# Make predictions on the validation set.
y_val_pred = model.predict(X_val_sc)
rmse = np.sqrt(((y_val - y_val_pred) ** 2).mean())
#print(y_val_pred)
print('val error: ', ymin)
If we want to know the chances of catching measles in 1955 in Alabama (about halfway through the predicted dataset), we need to find the index of this datapoint in the dataframe we predicted on and map it to its index in our resulting prediction array.
val[val['Year'] == 1955]
#Find that 1955 Alabama is at index 1227
#1227 - 835 = 392 gives us the index in the predicted array
y_val_pred[392]
If we want to know the chances of catching measles in 1963 in Alabama, the results show that the predicted incidence is 0.002084132350947432, compared to the expected value of 0.000347.
#Calculate percent error
#Percentage Error = ((Estimated Number - Actual Number)/ Actual number) x 100
((0.002084132350947432 - .000347)/ .000347) * 100
The percentage error here is 500%, which is still pretty large, and actually increased from Model 2 by 82 percentage points. This increase is as we expected and shows that the correlation we observed earlier between incidence and state income does in fact affect our model.
Now, we will examine what happens when we remove the population density feature. Based on our correlations between disease incidence and population density, we predict that the error will go down. Since predictor 2 improved the percent error, we will build off of that model.
Again, we examine the measles incidence from 1929 to 1963 right before the vaccine came out.
First, we split the data in half. Then, we define our features to train on as the year, state, and income.
We can use a for loop to test out different k values to use for our k-nearest neighbors model and get a validation error.
#JUST MEASLES PREDICTOR - no population density
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler
# Fit model to data
from sklearn.neighbors import KNeighborsRegressor
# split data in half up to vaccine release in 1963
# 1669/2 = 834.5
train = measles_together.iloc[0:835]
val = measles_together.iloc[835:1669]
# defining features WITHOUT POPULATION DENSITY
features = ["Year",
"Name",
"Income"]
X_train_dict = train[features].to_dict(orient="records")
X_val_dict = val[features].to_dict(orient="records")
y_train = train["Measles Incidence"]
y_val = val["Measles Incidence"]
#For loop to determine optimal k to use
k_1 = [i for i in range(1,100)]
error_results = []
def get_val_error(X_train_dict, y_train, X_val_dict, y_val, i):
    # convert categorical variables to dummy variables
    vec = DictVectorizer(sparse=False)
    vec.fit(X_train_dict)
    X_train = vec.transform(X_train_dict)
    X_val = vec.transform(X_val_dict)
    # standardize the data
    scaler = StandardScaler()
    scaler.fit(X_train)
    X_train_sc = scaler.transform(X_train)
    X_val_sc = scaler.transform(X_val)
    # Fit a k-nearest neighbors model with the current k (i).
    model = KNeighborsRegressor(n_neighbors=i)
    model.fit(X_train_sc, y_train)
    # Make predictions on the validation set.
    y_val_pred = model.predict(X_val_sc)
    rmse = np.sqrt(((y_val - y_val_pred) ** 2).mean())
    return rmse

for i in k_1:
    val1 = get_val_error(X_train_dict, y_train, X_val_dict, y_val, i)
    val2 = get_val_error(X_val_dict, y_val, X_train_dict, y_train, i) #reverse
    cross_val = (val1 + val2)/2
    error_results.append(cross_val)
We plot these k values and the corresponding errors they give to determine the best k to use.
#Plotting k with validation error
import matplotlib.pyplot as plt
plt.scatter(x=k_1, y=error_results)
plt.xlabel('k')
plt.ylabel('test error')
plt.title('k for Measles')
ymin = min(error_results)
From the graph above, we see that the optimal k to use is about 25 again, similar to our Model 2.
We again don't want to pick a k too close to 0, because this error is too good to be true since we would be picking data points too similar to the one we are predicting for.
Next, we take this k and make our predictions.
# Use k=25 to predict
# convert categorical variables to dummy variables
vec = DictVectorizer(sparse=False)
vec.fit(X_train_dict)
X_train = vec.transform(X_train_dict)
X_val = vec.transform(X_val_dict)
# standardize the data
scaler = StandardScaler()
scaler.fit(X_train)
X_train_sc = scaler.transform(X_train)
X_val_sc = scaler.transform(X_val)
# Fit a 25-nearest neighbors model.
model = KNeighborsRegressor(n_neighbors=25)
model.fit(X_train_sc, y_train)
# Make predictions on the validation set.
y_val_pred = model.predict(X_val_sc)
rmse = np.sqrt(((y_val - y_val_pred) ** 2).mean())
#print(y_val_pred)
print('val error: ', ymin)
If we want to know the chances of catching measles in 1955 in Alabama (about halfway through the predicted dataset), we need to find the index of this datapoint in the dataframe we predicted on and map it to its index in our resulting prediction array.
val[val['Year'] == 1955]
#Find that 1955 Alabama is at index 1227
#1227 - 835 = 392 gives us the index in the predicted array
y_val_pred[392]
If we want to know the chances of catching measles in 1963 in Alabama, the results show that the predicted incidence is 0.002045478209686451, compared to the expected value of 0.000347.
#Calculate percent error
#Percentage Error = ((Estimated Number - Actual Number)/ Actual number) x 100
((0.002045478209686451 - .000347)/ .000347) * 100
The percentage error here is 489%, which increased from Model 2 by 71 percentage points. This increase is not what we expected; however, it is a slightly smaller increase than we saw in Model 3, meaning that population density may have had less of an effect than income.
%%shell
jupyter nbconvert --to html "/content/drive/MyDrive/cmps3160/finalProject/Final_Project_Disease_Prevalence.ipynb"
Throughout this project we completed a lot of analysis on disease incidence and trends between states within the population. We utilized several visualization techniques to demonstrate the variation both across the years and between states, both separately and in comparison to national trends. In an effort to understand these numbers in more depth, we brought in datasets on state population density and state average income level. We measured the correlation between each of these variables and disease incidence for every state over the years, and discovered that both population density and state average income show a moderate negative correlation. From there we built four different predictive models in an attempt to demonstrate which variables could be related, and we discovered that the change that most improved our model's predictions was picking a smaller time window. All of this indicates that while population density and state average income are potentially pieces of the complicated puzzle of disease spread, they are certainly not the whole picture, and without a better understanding of the individual people that make up those case numbers, our knowledge and the assumptions we can make are limited.