Final Project: Disease Prevalence Across the States
Main Data from: https://think.cs.vt.edu/corgis/csv/health/
GitHub: https://tkoooo.github.io/Final-Portfolio/
Project Description and Goals: This dataset will allow us to examine the prevalence of various diseases across all 50 states over the last century. We plan to research factors such as vaccination rates to try to understand the peaks and valleys in the dataset. We also want to examine which regions and states have been most and least affected and compare that with factors like weather patterns and population density.
Question: How does location correlate with the probability of catching any one disease within the 20th century? Furthermore, do other factors such as income and population density show this same correlation? This could potentially explain why certain states showed a higher incidence of disease when compared to others.
Collaboration Plan: Our collaboration plan is to continue meeting at least once a week over zoom on Wednesday nights.
Data from: https://think.cs.vt.edu/corgis/csv/health/
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/My Drive/cmps3160
%cd /content/drive/My Drive/cmps3160/finalProject
import pandas as pd
import matplotlib.pyplot as plt
#import pandas and matplotlib for graphing
df = pd.read_csv("../finalProject/health.csv", sep=",")
#read dataset into variable df
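Before checking for missing values, it helps to take a quick look at the table's structure. This is a minimal sketch; the columns we rely on throughout (year, loc, disease, number, population, increase) come from the CORGIS health CSV.
#quick peek at the columns we will rely on (sketch)
df.head()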
#check what values are missing
df[df.isnull().any(axis=1)]
Lucky for us, no data is missing and we can proceed to the analysis!
#Data types analysis
measles = df[df['disease'] == "MEASLES"]
measles['increase'].min()
#in order to visualize case prevalence across the entire country, we create separate dataframes for each disease
#(the drop of the loc and increase columns below is left commented out because loc is needed for the state-level analysis later)
#df = df.drop(columns=['loc', 'increase'])
measles = df[df['disease'] == "MEASLES"]
polio = df[df['disease'] == "POLIO"]
smallpox = df[df['disease'] == "SMALLPOX"]
pertussis = df[df['disease'] == "PERTUSSIS"]
hepa = df[df['disease'] == "HEPATITIS A"]
rubella = df[df['disease'] == "RUBELLA"]
mumps = df[df['disease'] == "MUMPS"]
Now that we have datasets that hold only the specific disease information for each illness, we can begin to perform some basic analysis.
We will aggregate the data to get nationwide totals instead of state-level numbers and perform further analysis to calculate the incidence of each disease each year.
In public health there are 2 main ways to measure how disease affects a population: incidence and prevalence.
Incidence refers to the number of people that develop or are diagnosed with the disease within a time period divided by the total sample size, while prevalence refers to the number of people actively affected by the disease (whether they were diagnosed within the time period or not) divided by the total sample size.
Because this data set provides us with the number of new cases per year and nothing about the amount of time people were sick, we will be focusing on calculating the incidence for each disease throughout the entire country.
This will provide one helpful dataset for each disease that will allow us to graph both the national case numbers and the proportion of people affected over the years.
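As a hypothetical worked example: a state reporting 3,000 new cases in a year with a population of 1,000,000 would have an incidence of 3,000 / 1,000,000 = 0.003 for that year.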
#Basic function to create dataframes
def initial_dataframes(df):
    #grab the case numbers country-wide for each year
    country = df.groupby(by='year')['number'].sum()
    #grab the population size country-wide for each year
    population = df.groupby(by='year')['population'].sum()
    #transform the number and population series into dataframes for future operations
    country = country.to_frame()
    population = population.to_frame()
    #join the number and population dataframes to allow us to create a new incidence column
    incidence = population.join(other=country, on='year')
    #create a new column called incidence that shows the proportion of the population affected each year
    incidence["incidence"] = incidence["number"]/incidence["population"]
    #change the index to regular numbers rather than the year
    incidence.reset_index(inplace=True)
    return country, population, incidence
We also need to bring in the vaccine dataframe before plotting our disease case numbers. The years for each vaccine we are investigating were taken from: https://www.immunize.org/timeline/
vaccine_data = [("Measles", 1963), ("Polio", 1955), ("Smallpox", 1798), ("Pertussis", 1915), ("Hepatitis A", 1995), ("Rubella", 1971), ("Mumps", 1971)]
vaccine_df = pd.DataFrame(vaccine_data, columns=['Disease', 'Year']) #Creating a dataframe out of each disease and the year their vaccine came out
vaccine_df = vaccine_df.sort_values('Year') #We want this sorted by year because we run into annotation issues later if not since Rubella and Mumps are in the same year
vaccine_df.reset_index(inplace=True)
vaccine_df.drop(columns=['index'], inplace=True) #Cleaning up data
vaccine_df
measles_country, measles_population, measles_incidence = initial_dataframes(measles)
measles_incidence.plot.line(x = 'year', y = 'number', figsize=(20,10), legend=False, xlabel='Year', ylabel='Case Numbers') #Plotting case numbers
#Now to plot when the vaccine came out:
plt.axvline(x=vaccine_df[vaccine_df['Disease'] == 'Measles']['Year'].tolist()[0], color = 'black'); #Plotting a vertical line
plt.annotate(s = 'Vaccine', xy=(vaccine_df['Year'].loc[vaccine_df['Disease'] == 'Measles'], 800000), xytext=(20, 0.0055), textcoords='offset points', annotation_clip=True, arrowprops={'width': 1, 'headwidth': 1, 'headlength': 1, 'shrink':0.05});
plt.title("Measles Case Numbers Over Time"); #Giving graph a title
polio_country, polio_population, polio_incidence = initial_dataframes(polio)
polio_incidence.plot.line(x = 'year', y = 'number', figsize=(20,10), legend=False, xlabel='Year', ylabel='Case Numbers') #Plotting case numbers
#Now to plot when the vaccine came out:
plt.axvline(x=vaccine_df[vaccine_df['Disease'] == 'Polio']['Year'].tolist()[0], color = 'black'); #Plotting a vertical line
plt.annotate(s = 'Vaccine', xy=(vaccine_df['Year'].loc[vaccine_df['Disease'] == 'Polio'], 50000), xytext=(20, 0.0055), textcoords='offset points', annotation_clip=True, arrowprops={'width': 1, 'headwidth': 1, 'headlength': 1, 'shrink':0.05});
plt.title("Polio Case Numbers Over Time"); #Giving graph a title
smallpox_country, smallpox_population, smallpox_incidence = initial_dataframes(smallpox)
smallpox_incidence.plot.line(x = 'year', y = 'number', figsize=(20,10), legend=False, xlabel='Year', ylabel='Case Numbers')
plt.title("Smallpox Case Numbers Over Time"); #Giving graph a title
pertussis_country, pertussis_population, pertussis_incidence = initial_dataframes(pertussis)
pertussis_incidence.plot.line(x = 'year', y = 'number', figsize=(20,10), legend=False, xlabel='Year', ylabel='Case Numbers')
plt.title("Pertussis Case Numbers Over Time"); #Giving graph a title
hepa_country, hepa_population, hepa_incidence = initial_dataframes(hepa)
hepa_incidence.plot.line(x = 'year', y = 'number', figsize=(20,10),legend=False, xlabel='Year', ylabel='Case Numbers') #Plotting case numbers
#Now to plot when the vaccine came out:
plt.axvline(x=vaccine_df[vaccine_df['Disease'] == 'Hepatitis A']['Year'].tolist()[0], color = 'black'); #Plotting a vertical line
plt.annotate(s = 'Vaccine', xy=(vaccine_df['Year'].loc[vaccine_df['Disease'] == 'Hepatitis A'], 55000), xytext=(20, 0.0055), textcoords='offset points', annotation_clip=True, arrowprops={'width': 1, 'headwidth': 1, 'headlength': 1, 'shrink':0.05});
plt.title("Hepatitis A Case Numbers Over Time"); #Giving graph a title
rubella_country, rubella_population, rubella_incidence = initial_dataframes(rubella)
rubella_incidence.plot.line(x = 'year', y = 'number', figsize=(20,10), legend=False, xlabel='Year', ylabel='Case Numbers') #Plotting case numbers
#Now to plot when the vaccine came out:
plt.axvline(x=vaccine_df[vaccine_df['Disease'] == 'Rubella']['Year'].tolist()[0], color = 'black'); #Plotting a vertical line
plt.annotate(s = 'Vaccine', xy=(vaccine_df['Year'].loc[vaccine_df['Disease'] == 'Rubella'], 50000), xytext=(20, 0.0055), textcoords='offset points', annotation_clip=True, arrowprops={'width': 1, 'headwidth': 1, 'headlength': 1, 'shrink':0.05});
plt.title("Rubella Case Numbers Over Time"); #Giving graph a title
mumps_country, mumps_population, mumps_incidence = initial_dataframes(mumps)
mumps_incidence.plot.line(x = 'year', y = 'number', figsize=(20,10), legend=False, xlabel='Year', ylabel='Case Numbers') #Plotting case numbers
#Now to plot when the vaccine came out:
plt.axvline(x=vaccine_df[vaccine_df['Disease'] == 'Mumps']['Year'].tolist()[0], color = 'black'); #Plotting a vertical line
plt.annotate(s = 'Vaccine', xy=(vaccine_df['Year'].loc[vaccine_df['Disease'] == 'Mumps'], 130000), xytext=(20, 0.0055), textcoords='offset points', annotation_clip=True, arrowprops={'width': 1, 'headwidth': 1, 'headlength': 1, 'shrink':0.05});
plt.title("Mumps Case Numbers Over Time"); #Giving graph a title
fig = plt.figure(figsize=(20,10));
colors = ['red', 'orange', 'green', 'blue', 'purple', 'brown', 'pink']; #colors for legend
colors_2 = ['green', 'blue', 'orange', 'red', 'brown', 'pink', 'purple']; #colors matching vaccine dataset
i = 0
# Plotting each disease over time
for frame in [measles_incidence, polio_incidence, smallpox_incidence, pertussis_incidence, hepa_incidence, rubella_incidence, mumps_incidence]:
    plt.plot(frame['year'], frame['number'], color=f'{colors[i]}');
    i += 1
for i in range(0, len(vaccine_df)): #this plots vertical lines where the vaccines came out
    if vaccine_df.iloc[i].Year < 1930: #this removes the whitespace that appears before the graph because of the vaccines that came out before 1930
        continue
    plt.axvline(x=vaccine_df.iloc[i].Year, color = f'{colors_2[i]}'); #this matches the color of the vertical line to the color of the disease in the legend
    plt.annotate(s = f'{vaccine_df.iloc[i].Disease}', xy=(vaccine_df.iloc[i].Year, 900000 - i*50000), xytext=(20, 0.0055), textcoords='offset points', annotation_clip=True, arrowprops={'width': 1, 'headwidth': 1, 'headlength': 1, 'shrink':0.05});
plt.legend(["Measles", "Polio", "Smallpox", "Pertussis", "Hepatitis A", "Rubella", "Mumps"]);
plt.title("Disease Case Numbers");
plt.xlim(1925, 2015); #we have to cut off the limits because 2 of the vaccines appeared way before 1930 and make the graph hard to read
fig = plt.figure(figsize=(20,10));
colors = ['red', 'orange', 'green', 'blue', 'purple', 'brown', 'pink']; #colors for legend
colors_2 = ['green', 'blue', 'orange', 'red', 'brown', 'pink', 'purple']; #colors matching vaccine dataset
i = 0
# Plotting each disease over time
for frame in [measles_incidence, polio_incidence, smallpox_incidence, pertussis_incidence, hepa_incidence, rubella_incidence, mumps_incidence]:
    plt.plot(frame['year'], frame['incidence'], color=f'{colors[i]}');
    i += 1
for i in range(0, len(vaccine_df)): #this plots vertical lines where the vaccines came out
    if vaccine_df.iloc[i].Year < 1930: #this removes the whitespace that appears before the graph because of the vaccines that came out before 1930
        continue
    plt.axvline(x=vaccine_df.iloc[i].Year, color = f'{colors_2[i]}'); #this matches the color of the vertical line to the color of the disease in the legend
    plt.annotate(s = f'{vaccine_df.iloc[i].Disease}', xy=(vaccine_df.iloc[i].Year, 0.006 - i*0.0002), xytext=(20, 0.0055), textcoords='offset points', annotation_clip=True, arrowprops={'width': 1, 'headwidth': 1, 'headlength': 1, 'shrink':0.05});
plt.legend(["Measles", "Polio", "Smallpox", "Pertussis", "Hepatitis A", "Rubella", "Mumps"]);
plt.title("Disease Incidence");
plt.xlim(1925, 2015); #we have to cut off the limits because 2 of the vaccines appeared way before 1930 and make the graph hard to read
It is worth noting that the vaccines for Pertussis and Smallpox were released prior to the first collection year of this dataset, so they are not marked.
All of the vaccines seem to have been released after the peak of the disease, which may speak to either the speed and efficiency of US public health, the perceived need, or the state of science at the time.
Now that we have a solid idea of what the trends look like on a national scale, we can zoom back in to look at individual states to try and see how they compare.
One great way to visualize disease incidence per state is to make a spatial heatmap of the US. Given any year and disease, we can use the GeoPandas library to visualize the incidence of the disease in each state.
Cartographic Boundary Files 2012 - Shapefile of the United States: https://www.sciencebase.gov/catalog/item/52c78623e4b060b9ebca5be5
We originally got our data from the US Census, but there was too much data (each state had multiple coordinates for its regions). We found this dataset from sciencebase.gov, which is a U.S. Geological Survey Trusted Digital Repository. Their files are simpler and contain single coordinates for each state. We want only one row of data for each state so that we can easily join our data with our other datasets.
We followed these tutorials to implement our code:
https://mohammadimranhasan.com/geospatial-data-mapping-with-python/
First, we need to install geopandas in order to create maps and import our coordinates file.
!pip install geopandas
#!pip install --upgrade pyshp
#!pip install --upgrade shapely
#!pip install --upgrade descartes
We need to read in the zip file.
import geopandas
states = geopandas.read_file("tl_2012_us_state.zip")
states.head()
Cleaning up the dataset
There are some columns we don't need, and we can rename the columns to make them look neater.
states.columns.unique() #grabbing column names
states.drop(columns=['OBJECTID', 'REGION', 'DIVISION', 'STATEFP', 'STATENS', 'GEOID',
'STUSPS', 'LSAD', 'MTFCC', 'FUNCSTAT', 'ALAND', 'AWATER',
'INTPTLAT', 'INTPTLON', 'Shape_Leng', 'Shape_Area'], inplace=True) #dropping unnecessary columns
states.rename(columns={'NAME':'State', 'geometry': 'Geometry'}, inplace=True) #renaming the columns to make them look prettier
states['State'] = states['State'].str.upper() #in order to join with previous datasets we want this to match
states.head()
states.shape[0]
From the code above, we see that there are 56 rows, even though we know there are 50 states. Looking at the data, we see that US territories such as Puerto Rico and Guam are included. We need to keep this in mind when joining our data.
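A quick sketch to list the extra rows that a join on the 50 states would leave unmatched (this assumes the health data's loc column uses the same uppercase state names, which it does elsewhere in this notebook):
#Sketch: which rows in the shapefile have no counterpart in the disease data?
set(states['State']) - set(df['loc'].unique())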
We need to check for NaN values.
states.isnull().sum()
Perfect! We have no NaN values.
Now we check the dtypes. We expect State should be an object and Geometry should be a geometry (because of the way it was loaded with geopandas).
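A one-line check (sketch):
states.dtypes #expect: State -> object, Geometry -> geometry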
Just to test that we are using GeoPandas correctly, here is a simple map of the US.
states = geopandas.GeoDataFrame(states, geometry='Geometry') #turning the dataframe into a geodataframe object
states.plot(figsize=(10, 10))
plt.xlim(-2.25e7, -0.5e7)
plt.ylim(0.0, 1.2e7)
Great! We are good to move on with analysis. This work will be referred to later in the separate sections for the national incidence of each disease.
This function takes in one of the generic disease dfs and adds a column called incidence that calculates the fraction of new cases over total population for every state. It returns two things: a DataFrame called states_disease_incidence that has the incidence for every state for every year for the given disease, and a dictionary called states_disease that has separate DataFrames for each state with the year and case numbers for the given disease.
#Incidence Column Function
def incidence_df(df):
    #create a dictionary of dataframes for every state
    states_disease = dict()
    for k, v in df.groupby('loc'):
        states_disease[k] = v
    #create an incidence column for every dataframe in the dictionary
    for i in states_disease:
        states_disease[i]['incidence'] = states_disease[i]['number']/states_disease[i]['population']
        #set the index to year for easy combining later
        states_disease[i].set_index('year', inplace = True)
    #create new dataframe of all of the incidences by state for every year
    states_disease_incidence = pd.DataFrame()
    #enter the incidence columns from the state datasets into the incidence dataset
    for i in states_disease:
        states_disease_incidence[i] = states_disease[i]['incidence']
    return states_disease_incidence, states_disease
This function takes in the states_disease_incidence DataFrame created in the incidence_df function and properly formats it for ease of graphing with GeoPandas.
#create a DataFrame to make the visual map of incidence
def map_df(states_disease_incidence):
    map_disease_df = states_disease_incidence.T #transpose to get states in a column
    map_disease_df.reset_index(inplace=True)
    map_disease_df.rename(columns={'index':'State'}, inplace=True)
    map_disease_df = pd.merge(map_disease_df, states, on="State", how='left') #we want to use the keys of the incidence values because we know those are 50 states, not including Guam, etc.
    return map_disease_df
The create_map function takes in 'sel', the selected year we want to examine as an integer; 'map_disease_df', the DataFrame we created in the map_df function specifically for graphing; and the name of the disease as a string to properly label the resulting graph.
The colormapped scalebar on the right changes per disease, but does not change depending on the year chosen because it is based on the min and max values per disease.
def create_map(sel, map_disease_df, disease):
    map_disease_df = geopandas.GeoDataFrame(map_disease_df, geometry='Geometry') #turning the dataframe into a geodataframe object to be used with geopandas to plot
    #Here we are trying to find the min and max values to use for our scalebar
    #We don't want to do this for each year because looking at the graph over time would be misleading if you kept changing the scalebars
    merged_cols = map_disease_df.columns.unique()
    merged_cols = merged_cols[1:-1] #only grabbing the years
    vmin = map_disease_df[merged_cols].min().min()
    vmax = map_disease_df[merged_cols].max().max()
    #Plotting
    fig, ax = plt.subplots(1, figsize=(10,7))
    map_disease_df.plot(column=sel, figsize=(10, 7), cmap='Blues', linewidth=0.8, ax=ax, edgecolor='0.8')
    plt.xlim(-2.25e7, -0.5e7) #resetting axes
    plt.ylim(0.5, 1.2e7)
    ax.axis('off') #remove axis because it means nothing for us
    ax.set_title(f'Incidence Rate of {disease} in {sel} across the US')
    sm = plt.cm.ScalarMappable(cmap='Blues', norm=plt.Normalize(vmin=vmin, vmax=vmax)) #for color scalebar
    sm._A = []
    cbar = fig.colorbar(sm)
We will now use the 3 functions created above to visualize the state incidences for any given year
#utilize the above functions to create dictionary of state specific DataFrames
states_measles_incidence, states_measles = incidence_df(measles)
states_measles_incidence.head()
states_measles['ALABAMA']
map_measles_df = map_df(states_measles_incidence)
map_measles_df.head()
create_map(1960, map_measles_df, 'Measles')
#utilize the above functions to create dictionary of state specific DataFrames
states_polio_incidence, states_polio = incidence_df(polio)
states_polio_incidence.head()
states_polio['ALABAMA']
map_polio_df = map_df(states_polio_incidence)
map_polio_df.head()
create_map(1960, map_polio_df, 'Polio')
#utilize the above functions to create dictionary of state specific DataFrames
states_pertussis_incidence, states_pertussis = incidence_df(pertussis)
states_pertussis_incidence.head()
states_pertussis['ALABAMA']
map_pertussis_df = map_df(states_pertussis_incidence)
map_pertussis_df.head()
create_map(2011, map_pertussis_df, 'Pertussis')
#utilize the above functions to create dictionary of state specific DataFrames
states_smallpox_incidence, states_smallpox = incidence_df(smallpox)
states_smallpox_incidence.head()
states_smallpox['ALABAMA']
map_smallpox_df = map_df(states_smallpox_incidence)
map_smallpox_df.head()
create_map(1940, map_smallpox_df, 'Smallpox')
#utilize the above functions to create dictionary of state specific DataFrames
states_hepa_incidence, states_hepa = incidence_df(hepa)
states_hepa_incidence.head()
states_hepa['ALABAMA']
map_hepa_df = map_df(states_hepa_incidence)
map_hepa_df.head()
create_map(2011, map_hepa_df, 'Hepatitis A')
#utilize the above functions to create dictionary of state specific DataFrames
states_rubella_incidence, states_rubella = incidence_df(rubella)
states_rubella_incidence.head()
states_rubella['ALABAMA']
map_rubella_df = map_df(states_rubella_incidence)
map_rubella_df.head()
create_map(1970, map_rubella_df, 'Rubella')
#utilize the above functions to create dictionary of state specific DataFrames
states_mumps_incidence, states_mumps = incidence_df(mumps)
states_mumps_incidence.head()
states_mumps['ALABAMA']
map_mumps_df = map_df(states_mumps_incidence)
map_mumps_df.head()
create_map(1970, map_mumps_df, 'Mumps')
Now that we have datasets portraying incidence on both a national and a statewide scale, we can see how the states compare to the national average and establish a ranking of which states were most affected by a given disease each year.
We will start by creating a dataset that has the incidence for every state every year as well as the entire country.
def US_state_df(disease_incidence, states_disease_incidence):
    #reset index to year for joining
    disease_incidence.set_index('year', inplace = True)
    #grab the national incidence and add it into a new dataframe with state incidences
    US = disease_incidence['incidence']
    US_state_disease_incidence = states_disease_incidence.join(US)
    #rename column
    US_state_disease_incidence.rename(columns={'incidence':'UNITED STATES'}, inplace = True)
    return US_state_disease_incidence
def graph_state_incidence(US_state_disease_incidence):
    states = list(US_state_disease_incidence.columns)
    states.remove('UNITED STATES')
    disease_compare_incidence = US_state_disease_incidence[states].div(US_state_disease_incidence['UNITED STATES'], axis=0)
    disease_hist = pd.DataFrame()
    for state in disease_compare_incidence:
        count = disease_compare_incidence[state][disease_compare_incidence[state] > 1].count()
        disease_hist.insert(0, state, [count])
    disease_hist = disease_hist.T
    disease_hist.sort_values(0, ascending = False, inplace = True)
    disease_hist.plot.bar(title='Number Of Times State Incidence is Above National Average Incidence', figsize=(20, 10), legend=False)
US_state_measles_incidence = US_state_df(measles_incidence, states_measles_incidence)
graph_state_incidence(US_state_measles_incidence)
US_state_polio_incidence = US_state_df(polio_incidence, states_polio_incidence)
graph_state_incidence(US_state_polio_incidence)
US_state_pertussis_incidence = US_state_df(pertussis_incidence, states_pertussis_incidence)
graph_state_incidence(US_state_pertussis_incidence)
US_state_smallpox_incidence = US_state_df(smallpox_incidence, states_smallpox_incidence)
graph_state_incidence(US_state_smallpox_incidence)
US_state_hepa_incidence = US_state_df(hepa_incidence, states_hepa_incidence)
graph_state_incidence(US_state_hepa_incidence)
US_state_rubella_incidence = US_state_df(rubella_incidence, states_rubella_incidence)
graph_state_incidence(US_state_rubella_incidence)
US_state_mumps_incidence = US_state_df(mumps_incidence, states_mumps_incidence)
graph_state_incidence(US_state_mumps_incidence)
Data from the US Census Bureau: https://www.census.gov/data/tables/time-series/dec/density-data-text.html
df_density = pd.read_csv("../finalProject/apportionment.csv", sep=",")
df_density.head()
#grab column names for easy transformation
df_density.columns
#['Name', 'Geography Type', 'Year', 'Resident Population',
#'Percent Change in Resident Population', 'Resident Population Density',
#'Resident Population Density Rank', 'Number of Representatives',
#'Change in Number of Representatives',
#'Average Apportionment Population Per Representative']
#first we need to get rid of the unnecessary information just to make things cleaner
df_density = df_density.drop(columns=['Resident Population', 'Resident Population Density Rank',
'Percent Change in Resident Population',
'Number of Representatives', 'Change in Number of Representatives',
'Average Apportionment Population Per Representative'])
df_density
There are also some values within the dataframe that we don't need (at least for initial use with our first dataset) such as the regions and the nationwide information. To change this we can just grab the rows with the type "State"
pop_density = df_density[df_density['Geography Type'] == 'State']
pop_density = pop_density.drop(pop_density[pop_density['Name'] == 'Puerto Rico'].index)
pop_density
pop_density[pop_density['Name'] == 'Alabama']
#check the datatypes of the dataset
pop_density.dtypes
#change the resident population density to a float
pop_density['Resident Population Density'] = pd.to_numeric(pop_density['Resident Population Density'],errors='coerce')
pop_density['Resident Population Density'].describe()
#We also need to make the states uppercase for easy combining later
pop_density['Name'] = pop_density['Name'].str.upper()
pop_density.head()
The final step to combining this with the initial disease dataset is to ensure that all of the years present in the disease dataset also exist in the population density dataset. This data is grabbed from the census so there are only reports for the 1st year of every decade. While not exact, we can fill in the expected density for each year based on the values provided for each decade. This is done by creating rows for each year and then using interpolate to fill in NaN values by approximating them.
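As a minimal illustration of what interpolate will do here (a sketch with made-up density values between two census years):
#Sketch: linear interpolation fills the in-between years from the decennial values
example = pd.Series([50.0, None, None, None, None, None, None, None, None, None, 60.0], index=range(1950, 1961))
example.interpolate(method='linear') #1951 -> 51.0, 1952 -> 52.0, ..., 1959 -> 59.0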
#create list of years needed to fill in the population density data frame
years = list(range(1911, 2020, 1))
years.remove(1920)
years.remove(1930)
years.remove(1940)
years.remove(1950)
years.remove(1960)
years.remove(1970)
years.remove(1980)
years.remove(1990)
years.remove(2000)
years
#iterate over population density to create and fill in a dataframe for each state
states_density = dict()
for k, v in pop_density.groupby('Name'):
    states_density[k] = v
    for year in years:
        states_density[k] = states_density[k].append({'Name':k, 'Geography Type':'State', 'Year':year}, ignore_index=True)
    states_density[k].sort_values(by = 'Year', inplace = True)
#use interpolate to approximate population density for each year in each state
#(interpolate fills the NaNs in the numeric columns; only Resident Population Density has gaps here)
for frame in states_density:
    states_density[frame].interpolate(method='linear', inplace=True)
#As an example of what is being stored, here is one DataFrame from states_density
states_density['WISCONSIN']
In order to be more workable, we need the state population densities to be in one large dataframe instead of a dictionary of dataframes. We wish to find a correlation between disease incidence and population density.
We hypothesize that a higher population density would indicate a higher incidence, as disease can more easily spread.
#put all state population density into one large dataframe
years = states_density['ALABAMA']['Year']
list_of_names = states_density.keys()
res_pop_dens = pd.DataFrame()
for name in list_of_names:
    res_pop_dens[name] = states_density[name]["Resident Population Density"]
res_pop_dens.insert(0, 'Year', years)
#set the index to year for easy accessing
res_pop_dens.set_index('Year', inplace=True)
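Before computing correlations, a quick visual check of the hypothesis for a single year can be useful. This is only a sketch: 1960 is an arbitrary census year, we assume the Year index holds integers, and the column intersection guards against any naming mismatches between the two tables.
#Sketch: population density vs. measles incidence across states in 1960
common = res_pop_dens.columns.intersection(states_measles_incidence.columns)
plt.figure(figsize=(10, 6))
plt.scatter(res_pop_dens.loc[1960, common], states_measles_incidence.loc[1960, common])
plt.xlabel('Resident Population Density (1960)')
plt.ylabel('Measles Incidence (1960)')
plt.title('Density vs. Measles Incidence, 1960');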
Because we need to find the correlation for every state for every disease, it is best to write a function to calculate and append this information for us.
def correlation_function(states, states_disease_incidence, res_pop_dens, name):
    #create empty array to store correlation data
    data = []
    for i in states: #calculates the correlation between incidence of disease and average population density for each state
        data.append(states_disease_incidence[[i]][i].corr(res_pop_dens[[i]][i]))
    #add this data into the larger DataFrame
    correl_density_df[name] = data
#Making correlation density dataframes per disease into 1 big dataframe
all_states = res_pop_dens.columns.unique()
correl_density_df = pd.DataFrame(columns = ['Location']) #creating a dataframe of 2 columns
correl_density_df['Location'] = all_states #filling in location with state names
#measles
correlation_function(all_states, states_measles_incidence, res_pop_dens, 'Correlation for Measles')
#Hepatitis A
correlation_function(all_states, states_hepa_incidence, res_pop_dens, 'Correlation for Hepatitis A')
#Mumps
correlation_function(all_states, states_mumps_incidence, res_pop_dens, 'Correlation for Mumps')
#Pertussis
correlation_function(all_states, states_pertussis_incidence, res_pop_dens, 'Correlation for Pertussis')
#Polio
correlation_function(all_states, states_polio_incidence, res_pop_dens, 'Correlation for Polio')
#Rubella
correlation_function(all_states, states_rubella_incidence, res_pop_dens, 'Correlation for Rubella')
correl_density_df.head()
#report the maximum and minimum correlation states for measles as well as the average
print('Max correlation for measles: ', correl_density_df['Location'].iloc[correl_density_df['Correlation for Measles'].idxmax()])
print(states_density['WEST VIRGINIA'])
print('Min correlation for measles: ', correl_density_df['Location'].iloc[correl_density_df['Correlation for Measles'].idxmin()])
print(states_density['NEW YORK'])
print('Avg correlation for measles: ', correl_density_df['Correlation for Measles'].mean())
The average correlation between state population density and measles incidence is -0.53, which indicates a moderate negative correlation.
Our hypothesis was not supported: it would appear that as population density increases, incidence shows a moderate decrease. This could be due to bias in our assumption that all diseases spread primarily through close human contact, as well as an oversimplification of what variables determine contraction.
The maximum correlation was for West Virginia, which, compared to New York (the minimum correlation), has a considerably lower average population density. This suggests that other variables may be at play that correlate somewhat with population density but do not follow the same trends, leading to variable correlation values. One example is access to health care and the ability to be diagnosed and treated, which can vary widely between states like WV and NY, where city centers and population distribution look very different.
We will attempt to investigate whether average income level by state has any correlation with the disease prevalence by state.
This data source comes from the U.S. Bureau of Economic Analysis (BEA). It contains average incomes for each state over the years 1929 to 2021.
income_df = pd.read_csv("../finalProject/income.csv", sep=",", skiprows = 3)
#reading in income file as a dataframe
#we need to skip the first 3 rows because there are some text explanations written in these rows
income_df.head()
#original dataframe
Data Filtering
We do not have a use for the GeoFips column, which is just the geo-codes corresponding to each state. We also need to get rid of the footnote rows at the bottom of the document.
The datasource notes that estimates prior to 1950 are not available for Alaska and Hawaii.
import numpy as np
income_df
#income_df.drop(income_df[(income_df['GeoFips'] == object)].index, inplace=True)
income_df.drop(income_df.index[list(range(60, 65))], inplace=True)
#We only want to keep the first 60 rows because there are unnecessary lines at the end.
income_df.drop(columns = ["GeoFips"], inplace=True)
#We don't care about the GeoFips column
#We want to change the names that have asterisks to make the data look nicer
index_alaska = income_df.index[income_df['GeoName'] == 'Alaska *'].tolist()[0]
index_hawaii = income_df.index[income_df['GeoName'] == 'Hawaii *'].tolist()[0]
income_df.at[index_alaska, 'GeoName'] = 'Alaska'
income_df.at[index_hawaii, 'GeoName'] = 'Hawaii'
#income_df['GeoName'].loc[income_df['GeoName'].isin(['Alaska *'])] = 'Alaska'
#income_df['GeoName'].loc[income_df['GeoName'].isin(['Hawaii *'])] = 'Hawaii'
income_df.replace('(NA)', np.nan, inplace=True)
#Replacing NA values with NaN
income_df.isnull().sum()[income_df.isnull().sum() > 2]
#Since we know from the data source that 2 states do not have values before 1950, we check to see if there are any other missing values from the other 48 states.
#We see that we return an empty series, so that means only two states had NaN's like we expected.
Now we display the resulting dataset after this cleanup:
income_df.head() #resulting dataset
Now we need to complete some transformations to prepare the dataset for analysis.
income_df1 = income_df.set_index('GeoName') #reset index to be by state
income_dfT1 = income_df1.T #transpose
income_dfT1.reset_index(inplace=True) #adding year back as a column to use for plotting
income_dfT1.columns = income_dfT1.columns.str.upper() #need these to be uppercase to join with population density dataset
income_dfT1.rename(columns={'INDEX':'year'}, inplace=True) #renaming column to make more clear
income_dfT1.drop(['NEW ENGLAND', 'MIDEAST', 'GREAT LAKES', 'PLAINS', 'SOUTHEAST', 'SOUTHWEST', 'ROCKY MOUNTAIN', 'FAR WEST'], axis=1, inplace=True) #we want to drop these columns so that they don't mess up any averaging
income_dfT1.head()
We need to check that the datatypes are correct.
income_dfT1.dtypes
Oops! These are all objects, when we expect integers for years and floats for the income amounts.
cols_income = income_dfT1.columns
income_dfT1[cols_income] = income_dfT1[cols_income].apply(pd.to_numeric, errors='coerce') #changing datatypes from objects to floats/integers where appropriate
income_dfT1['ARIZONA'] #checking datatype
income_dfT1.dtypes
Great! All fixed to be as expected.
Analysis
Now we plot income level by state over time just to get a quick visual of the data.
We only plot the 7 states with the highest incomes because 50 states would give us a cluttered legend and a difficult graph to read.
#We sort the max income values series in descending order to grab the max 7 states, and then grab the names of those states
income_dfT1.max().sort_values(axis=0, ascending=False)[1:8].index.tolist() #Row 0 is the United States, so we skip it and take the max 7 states
#Then we plot
income_dfT1.plot(x = 'year', y=income_dfT1.max().sort_values(axis=0, ascending=False)[1:8].index.tolist())
plt.legend(income_dfT1.max().sort_values(axis=0, ascending=False)[1:8].index.tolist());
plt.title("Income Level Over Time per State")
plt.xlabel('Year')
plt.ylabel('Income')
The above plot is useful to note that income generally increased dramatically over time for each state, and it is evident that some states, such as California, have a significantly higher average income than others.
It is important to note that the cost of living is not the same in each state as well.
income_dfT1.set_index('year', inplace=True)
income_dfT1.head()
all_states = states_measles_incidence.columns.unique() #grabbing the names of all the states
We wish to explore the relationship between incidence and average income per state.
We can calculate the correlation of the two for each state over all of the years that we have data available for, for each disease separately.
correl_income_df = pd.DataFrame(columns = ['Location']) #creating a dataframe of 2 columns
correl_income_df['Location'] = all_states #filling in location with state names
incidences = [states_measles_incidence, states_hepa_incidence, states_mumps_incidence, states_pertussis_incidence, states_polio_incidence, states_rubella_incidence, states_smallpox_incidence]
#Creating a function to automate the process of iterating through each disease
def correl_income_df_func(incidence, dis_name):
    corr_data = []
    for i in all_states: #calculates the correlation between the disease incidence and the average income of each state and appends to an array which we will throw into the dataframe later
        corr_data.append(incidence[[i]][i].corr(income_dfT1[[i]][i]))
    correl_income_df[f'Correlation for {dis_name}'] = corr_data #Creating a new column for this disease and throwing in the correlation data
#Filling in the resultant correlation dataframe
correl_income_df_func(states_measles_incidence, 'Measles')
correl_income_df_func(states_hepa_incidence, 'Hepatitis A')
correl_income_df_func(states_mumps_incidence, 'Mumps')
correl_income_df_func(states_pertussis_incidence, 'Pertussis')
correl_income_df_func(states_polio_incidence, 'Polio')
correl_income_df_func(states_rubella_incidence, 'Rubella')
correl_income_df
We get a negative correlation each time as our result!
This makes sense because it says that for a higher income, the chances of contracting a given disease are lower, and vice versa: for a lower income, the chances are higher. This does not establish causality, but it certainly displays an expected relationship!
Now we can compute some stats about our table.
print('Max correlation for measles: ', correl_income_df['Location'].iloc[correl_income_df['Correlation for Measles'].idxmax()])
print('Min correlation for measles: ', correl_income_df['Location'].iloc[correl_income_df['Correlation for Measles'].idxmin()])
print('Avg correlation for measles: ', correl_income_df['Correlation for Measles'].mean())
It is interesting to point out that the average correlation across all of the states for the entire period for measles is a moderate negative value with a magnitude just under 0.5.
We note that we cannot make strong claims on this data because income also varies between states based on average living costs.
In addition, we recognize it would have been better to have data for income level of each person who got sick since this income set is just the average income of the state including people who aren't sick. This is just to explore a relationship between overall state wealth and illness.
We can compute these stats for the other diseases too.
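The loop below is one way to get the same summary for every disease in a single pass (a sketch):
#Sketch: max/min/average correlation state for each disease column
for col in correl_income_df.columns.drop('Location'):
    print('Max', col, ':', correl_income_df['Location'].iloc[correl_income_df[col].idxmax()])
    print('Min', col, ':', correl_income_df['Location'].iloc[correl_income_df[col].idxmin()])
    print('Avg', col, ':', correl_income_df[col].mean())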
Our idea for a predictive model is to create a hypothetical situation and predict the chances that someone in a particular state would contract a certain disease in a couple of years after.
For instance, we could say, "Imagine you are living in Louisiana in 1955, and you want to know the chances of you catching Measles in 1967."
The independent variables used to train the model would be the year, state, disease, and income level and the dependent variable to be predicted on would be the probability that this person will contract the disease at hand.
In terms of the methodology, we can train a k-nearest neighbors regression model to predict the probability of contracting the disease for different values of k. We can calculate the training and validation error (RMSE) of each model. Because this data involves patterns over time, we need to be careful in how we split our data. We cannot, for example, split our data in half, shuffle it up, and do regular cross-validation. The reason is that this would be somewhat 'cheating' because the model gets to see what the future trend looks like. We can plot our errors as a function of k to determine which is the optimal k to use in order to get our final predicted value.
In order to perform a more robust analysis, we can determine the effects of certain factors on our predictive model. For instance, we can test whether splitting the data differently or taking out certain features improves the error of our model.
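One way to formalize this time-aware splitting is scikit-learn's TimeSeriesSplit, which always trains on earlier rows and validates on the rows that follow. The sketch below is illustrative only (our models below use a simple chronological half-split instead), and 'data' is a placeholder for any chronologically sorted DataFrame, such as the combined measles table built later.
#Sketch: every fold trains on an earlier block of rows and validates on the block after it
from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, val_idx in tscv.split(data):  #'data' is a placeholder DataFrame sorted by year
    print('train rows', train_idx.min(), '-', train_idx.max(), '| validate rows', val_idx.min(), '-', val_idx.max())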
We need to do a quick operation to prepare the income dataset with the disease incidence tables.
#removing unnecessary columns from income
income_dfT1.drop(income_dfT1.loc[2012:].index, inplace=True)
income_dfT1.drop(columns=['UNITED STATES'], inplace=True)
income_dfT1
We make our large dataset that contains all of our data - year, state, income, incidence of each disease, and population density.
from itertools import product
copy = income_dfT1
tuples = list(product(copy.index, copy.columns)) #creating all possible combinations
tuples
index = pd.MultiIndex.from_tuples(tuples, names=["year", "state"]) #creating multiindex to use in the new dataframe
altogether = pd.DataFrame(index=index, columns=['Income']) #creating empty dataframe to later fill in values for
#Adding in income into total dataframe
for i in income_dfT1: #states - cols
    for j in income_dfT1.index: #years - rows
        altogether.at[(j, i), 'Income'] = income_dfT1.loc[j, i]
#Creating a function that adds each disease incidence dataframe to total dataframe
def add_disease_inc(df, name):
    for i in df: #states - cols
        for j in df.index: #years - rows
            altogether.at[(j, i), f'{name}'] = df.loc[j, i]
#Calling function for each disease
add_disease_inc(states_measles_incidence, 'Measles Incidence')
add_disease_inc(states_mumps_incidence, 'Mumps Incidence')
add_disease_inc(states_hepa_incidence, 'Hep-A Incidence')
add_disease_inc(states_pertussis_incidence, 'Pertussis Incidence')
add_disease_inc(states_polio_incidence, 'Polio Incidence')
add_disease_inc(states_rubella_incidence, 'Rubella Incidence')
add_disease_inc(states_smallpox_incidence, 'Smallpox Incidence')
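As a side note, the nested .at loops above work but are slow for wide tables; a vectorized alternative (a sketch only, not what we use going forward) is stack(), which reshapes each wide year-by-state table into a (year, state)-indexed Series in one call.
#Sketch: build the long-format table with stack() instead of nested loops (altogether_alt is a hypothetical name)
income_long = income_dfT1.stack()
income_long.index.names = ['year', 'state']
altogether_alt = income_long.to_frame('Income')
measles_long = states_measles_incidence.stack()
measles_long.index.names = ['year', 'state']
altogether_alt['Measles Incidence'] = measles_long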
#Creating dataframe of population densities
d = { k: v.set_index(['Year', 'Name']) for k, v in states_density.items()} #setting index to Year and Name for every dataframe within the dictionary states_density
df_densities = pd.concat(d, axis=0) #putting all of the dataframes into 1 dataframe
df_densities.reset_index(inplace = True)
df_densities.set_index(['Year', 'Name'], inplace=True)
df_densities
#Merging population density dataframe with big dataframe
altogether = altogether.merge(df_densities, left_index=True, right_on=['Year', 'Name'])
altogether.drop(columns= ['level_0', 'Geography Type'], inplace=True) #Don't need these columns
altogether.rename(columns={"Resident Population Density": "Pop Density"}, inplace=True)
#Sorting in order and making year and state name individual columns
altogether.sort_index(inplace=True, ascending=True)
altogether.reset_index(inplace=True)
altogether
Now, we examine only the case of measles to make things simpler, and keep in mind that we can examine any disease.
Additionally, we will drop any NaNs here, as data such as the disease incidence cannot be linearly interpolated at this point without potentially skewing it and creating false trends.
#Create shorter version of altogether dataframe
measles_together = altogether.drop(columns=['Mumps Incidence', 'Hep-A Incidence', 'Pertussis Incidence', 'Polio Incidence', 'Rubella Incidence', 'Smallpox Incidence'])
#Drop nans
measles_together.dropna(inplace=True)
#Reset index after drop
measles_together.reset_index(inplace=True)
measles_together.drop(columns=['index'], inplace=True)
measles_together
We can create a predictor for measles that just splits the data in half.
With this predictor, we expect that our predicted incidence value will be much higher than in actuality because we have to predict on a year just after the vaccine came out - this is a piece of information that our model does not have, nor is it trained to account for.
First, we split the data in half. Then, we define our features to train on as the year, state, income, and population density.
We can use a for loop to test out different k values to use for our k-nearest neighbors model and get a validation error.
#JUST MEASLES PREDICTOR
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler
# Fit model to data
from sklearn.neighbors import KNeighborsRegressor
# split data in half
# There are 2949 rows in measles_together, so 2949/2 = 1474.5
train = measles_together.iloc[0:1475]
val = measles_together.iloc[1475:]
# defining features
features = ["Year",
"Name",
"Income",
"Pop Density"]
X_train_dict = train[features].to_dict(orient="records")
X_val_dict = val[features].to_dict(orient="records")
y_train = train["Measles Incidence"]
y_val = val["Measles Incidence"]
#For loop to determine optimal k to use
k_1 = [i for i in range(1,100)]
error_results = []
def get_val_error(X_train_dict, y_train, X_val_dict, y_val, i):
    # convert categorical variables to dummy variables
    vec = DictVectorizer(sparse=False)
    vec.fit(X_train_dict)
    X_train = vec.transform(X_train_dict)
    X_val = vec.transform(X_val_dict)
    # standardize the data
    scaler = StandardScaler()
    scaler.fit(X_train)
    X_train_sc = scaler.transform(X_train)
    X_val_sc = scaler.transform(X_val)
    # Fit a k-nearest neighbors model with the current k (i).
    model = KNeighborsRegressor(n_neighbors=i)
    model.fit(X_train_sc, y_train)
    # Make predictions on the validation set.
    y_val_pred = model.predict(X_val_sc)
    rmse = np.sqrt(((y_val - y_val_pred) ** 2).mean())
    return rmse

for i in k_1:
    val1 = get_val_error(X_train_dict, y_train, X_val_dict, y_val, i)
    val2 = get_val_error(X_val_dict, y_val, X_train_dict, y_train, i) #reverse
    cross_val = (val1 + val2)/2
    error_results.append(cross_val)
We plot these k values and the corresponding errors they give to determine the best k to use.
#Plotting k with validation error
import matplotlib.pyplot as plt
plt.scatter(x=k_1, y=error_results)
plt.xlabel('k')
plt.ylabel('test error')
plt.title('k for Measles')
ymin = min(error_results)
#xpos = np.where(error_results == ymin)
#xmin = xpos[0][0]
#print('xmin = ', xmin)
From the graph above, we see that the optimal k to use is 57. This makes sense because there are 50 states for a given year, each of which is likely experiencing similar trends.
We don't want to pick a k too close to 0, because this error is too good to be true since we would be picking data points too similar to the one we are predicting for.
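We can also confirm the chosen k programmatically rather than reading it off the plot (a sketch):
#Sketch: index of the smallest cross-validated error gives the best k
best_k = k_1[int(np.argmin(error_results))]
print('best k:', best_k, 'with error:', min(error_results))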
Next, we take this k and make our predictions.
# Use k=57 to predict
# convert categorical variables to dummy variables
vec = DictVectorizer(sparse=False)
vec.fit(X_train_dict)
X_train = vec.transform(X_train_dict)
X_val = vec.transform(X_val_dict)
# standardize the data
scaler = StandardScaler()
scaler.fit(X_train)
X_train_sc = scaler.transform(X_train)
X_val_sc = scaler.transform(X_val)
# Fit a 57-nearest neighbors model.
model = KNeighborsRegressor(n_neighbors=57)
model.fit(X_train_sc, y_train)
# Make predictions on the validation set.
y_val_pred = model.predict(X_val_sc)
rmse = np.sqrt(((y_val - y_val_pred) ** 2).mean())
#print(y_val_pred)
print('val error: ', ymin)
If we want to know the chances of catching measles in 1970 in Alabama, we need to find the index of this datapoint in the dataframe we predicted on and map it to its index in our resulting prediction array.
val[val['Year'] == 1970]
#Find that 1970 Alabama is at index 1967
#1967 - 1475 = 492 gives us the index in the predicted array
y_val_pred[492]
If we want to know the chances of catching measles in 1970 in Alabama, the results show that the predicted incidence is 0.002460425878952384.
The expected value is 0.000139. It makes sense that the obtained and expected values are off because we were predicting on a year right after the measles vaccine came out, and the incidence values drastically drop off at this point.
#Calculate percent error
#Percentage Error = ((Estimated Number - Actual Number)/ Actual number) x 100
((.002460425878952384 - .000139)/ .000139) * 100
The percent error here is 1670%! Let's see if we can reduce this.
Now, to examine another way of splitting the data, we can restrict ourselves to the pre-vaccine years (so each split is roughly a quarter of the full dataset) and only examine the period when the disease was prevalent, before vaccine release. We predict that our predicted incidence value will be closer to the expected value than in Model 1.
Specifically for measles, the vaccine was released in 1963, so we want to examine measles incidence from 1929 to 1963 instead of 1929 to 2002 in order to give our model a better shot at genuinely predicting.
In order to figure out which indices to use from measles_together, we can see the last index where we have data for the year 1963.
measles_together[measles_together['Year'] == 1963].tail()
The last index where we have data for the year 1963 is 1669. Therefore, we will split the data in half from indices 0 to 1669.
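That cutoff can also be recovered programmatically instead of reading it off .tail() (a sketch):
#Sketch: last row index for 1963 and the midpoint used for the train/validation split
last_1963 = measles_together[measles_together['Year'] == 1963].index.max()
print(last_1963, (last_1963 + 1) // 2) #expect 1669 and 835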
We proceed as we did last time but split the data differently.
First, we split this portion of data in half. Then, we define our features to train on as the year, state, income, and population density.
We will again use a for loop to test out different k values to use for our k-nearest neighbors model and get a validation error.
#JUST MEASLES PREDICTOR
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler
# Fit model to data
from sklearn.neighbors import KNeighborsRegressor
# split data in half up to vaccine release in 1963
# 1669/2 = 834.5
train = measles_together.iloc[0:835]
val = measles_together.iloc[835:1669]
# defining features
features = ["Year",
"Name",
"Income",
"Pop Density"]
X_train_dict = train[features].to_dict(orient="records")
X_val_dict = val[features].to_dict(orient="records")
y_train = train["Measles Incidence"]
y_val = val["Measles Incidence"]
#For loop to determine optimal k to use
k_1 = [i for i in range(1,100)]
error_results = []
def get_val_error(X_train_dict, y_train, X_val_dict, y_val, i):
    # convert categorical variables to dummy variables
    vec = DictVectorizer(sparse=False)
    vec.fit(X_train_dict)
    X_train = vec.transform(X_train_dict)
    X_val = vec.transform(X_val_dict)
    # standardize the data
    scaler = StandardScaler()
    scaler.fit(X_train)
    X_train_sc = scaler.transform(X_train)
    X_val_sc = scaler.transform(X_val)
    # Fit a k-nearest neighbors model with the current k (i).
    model = KNeighborsRegressor(n_neighbors=i)
    model.fit(X_train_sc, y_train)
    # Make predictions on the validation set.
    y_val_pred = model.predict(X_val_sc)
    rmse = np.sqrt(((y_val - y_val_pred) ** 2).mean())
    return rmse

for i in k_1:
    val1 = get_val_error(X_train_dict, y_train, X_val_dict, y_val, i)
    val2 = get_val_error(X_val_dict, y_val, X_train_dict, y_train, i) #reverse
    cross_val = (val1 + val2)/2
    error_results.append(cross_val)
We plot these k values and the corresponding errors they give to determine the best k to use.
#Plotting k with validation error
import matplotlib.pyplot as plt
plt.scatter(x=k_1, y=error_results)
plt.xlabel('k')
plt.ylabel('test error')
plt.title('k for Measles')
ymin = min(error_results)
From the graph above, we see that the optimal k to use is about 25. This is most likely because we cut our training/validation size in half.
We again don't want to pick a k too close to 0, because this error is too good to be true since we would be picking data points too similar to the one we are predicting for.
Next, we take this k and make our predictions.
# Use k=25 to predict
# convert categorical variables to dummy variables
vec = DictVectorizer(sparse=False)
vec.fit(X_train_dict)
X_train = vec.transform(X_train_dict)
X_val = vec.transform(X_val_dict)
# standardize the data
scaler = StandardScaler()
scaler.fit(X_train)
X_train_sc = scaler.transform(X_train)
X_val_sc = scaler.transform(X_val)
# Fit a 25-nearest neighbors model.
model = KNeighborsRegressor(n_neighbors=25)
model.fit(X_train_sc, y_train)
# Make predictions on the validation set.
y_val_pred = model.predict(X_val_sc)
rmse = np.sqrt(((y_val - y_val_pred) ** 2).mean())
#print(y_val_pred)
print('val error: ', ymin)
If we want to know the chances of catching measles in 1955 in Alabama (about halfway through the predicted dataset), we need to find the index of this datapoint in the dataframe we predicted on and map it to its index in our resulting prediction array.
val[val['Year'] == 1963]
#Find that 1955 Alabama is at index 1227
#1227 - 835 = 392 gives us the index in the predicted array
y_val_pred[392]
If we want to know the chances of catching measles in 1963 in Alabama, the results show that the predicted incidence is 0.001797885794424771, compared to the expected value of 0.000347.
#Calculate percent error
#Percentage Error = ((Estimated Number - Actual Number)/ Actual number) x 100
((.001797885794424771 - .000347)/ .000347) * 100
The percentage error here is 418%, which is still pretty large, but drastically reduced from Model 1's 1670%!
Additionally, as a note to keep in mind, these incidence numbers are so small that percent differences between them look extremely high, so comparing the differences between the models' percent errors is more informative than the raw values.
Now, to examine what happens when we remove a feature to predict on, we will remove the income feature. Based on our correlations between disease incidence and income, we predict that the error will go up.
Since our predictor 2 improved the percent error, we will build off of that model and examine the measles incidence from 1929 to 1963 right before the vaccine came out.
First, we split the data in half. Then, we define our features to train on as the year, state, and population density.
We can use a for loop to test out different k values to use for our k-nearest neighbors model and get a validation error.
#JUST MEASLES PREDICTOR - no income
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler
# Fit model to data
from sklearn.neighbors import KNeighborsRegressor
# split data in half up to vaccine release in 1963
# 1669/2 = 834.5
train = measles_together.iloc[0:835]
val = measles_together.iloc[835:1669]
# defining features WITHOUT INCOME
features = ["Year",
"Name",
"Pop Density"]
X_train_dict = train[features].to_dict(orient="records")
X_val_dict = val[features].to_dict(orient="records")
y_train = train["Measles Incidence"]
y_val = val["Measles Incidence"]
#For loop to determine optimal k to use
k_1 = [i for i in range(1,100)]
error_results = []
def get_val_error(X_train_dict, y_train, X_val_dict, y_val, i):
    # convert categorical variables to dummy variables
    vec = DictVectorizer(sparse=False)
    vec.fit(X_train_dict)
    X_train = vec.transform(X_train_dict)
    X_val = vec.transform(X_val_dict)
    # standardize the data
    scaler = StandardScaler()
    scaler.fit(X_train)
    X_train_sc = scaler.transform(X_train)
    X_val_sc = scaler.transform(X_val)
    # Fit a k-nearest neighbors model with the current k (i).
    model = KNeighborsRegressor(n_neighbors=i)
    model.fit(X_train_sc, y_train)
    # Make predictions on the validation set.
    y_val_pred = model.predict(X_val_sc)
    rmse = np.sqrt(((y_val - y_val_pred) ** 2).mean())
    return rmse

for i in k_1:
    val1 = get_val_error(X_train_dict, y_train, X_val_dict, y_val, i)
    val2 = get_val_error(X_val_dict, y_val, X_train_dict, y_train, i) #reverse
    cross_val = (val1 + val2)/2
    error_results.append(cross_val)
We plot these k values and the corresponding errors they give to determine the best k to use.
#Plotting k with validation error
import matplotlib.pyplot as plt
plt.scatter(x=k_1, y=error_results)
plt.xlabel('k')
plt.ylabel('test error')
plt.title('k for Measles')
ymin = min(error_results)
From the graph above, we see that the optimal k to use is about 25 again, similar to our Model 2.
We again don't want to pick a k too close to 0, because this error is too good to be true since we would be picking data points too similar to the one we are predicting for.
Next, we take this k and make our predictions.
# Use k=25 to predict
# convert categorical variables to dummy variables
vec = DictVectorizer(sparse=False)
vec.fit(X_train_dict)
X_train = vec.transform(X_train_dict)
X_val = vec.transform(X_val_dict)
# standardize the data
scaler = StandardScaler()
scaler.fit(X_train)
X_train_sc = scaler.transform(X_train)
X_val_sc = scaler.transform(X_val)
# Fit a 25-nearest neighbors model.
model = KNeighborsRegressor(n_neighbors=25)
model.fit(X_train_sc, y_train)
# Make predictions on the validation set.
y_val_pred = model.predict(X_val_sc)
rmse = np.sqrt(((y_val - y_val_pred) ** 2).mean())
#print(y_val_pred)
print('val error: ', ymin)
If we want to know the chances of catching measles in 1955 in Alabama (about halfway through the predicted dataset), we need to find the index of this datapoint in the dataframe we predicted on and map it to its index in our resulting prediction array.
val[val['Year'] == 1955]
#Find that 1955 Alabama is at index 1227
#1227 - 835 = 392 gives us the index in the predicted array
y_val_pred[392]
If we want to know the chances of catching measles in 1963 in Alabama, the results show that the predicted incidence is 0.002084132350947432, compared to the expected value of 0.000347.
#Calculate percent error
#Percentage Error = ((Estimated Number - Actual Number)/ Actual number) x 100
((0.002084132350947432 - .000347)/ .000347) * 100
The percentage error here is 500%, which is still pretty large, and actually increased from Model 2 by 82 percentage points. This increase is as we expected and shows that the correlation we observed earlier between incidence and state income does in fact affect our model.
Now, we will examine what happens when we remove the population density feature. Based on our correlations between disease incidence and population density, we predict that the error will go down. Since predictor 2 improved the percent error, we will build off of that model.
Again, we examine the measles incidence from 1929 to 1963 right before the vaccine came out.
First, we split the data in half. Then, we define our features to train on as the year, state, and income.
We can use a for loop to test out different k values to use for our k-nearest neighbors model and get a validation error.
#JUST MEASLES PREDICTOR - no population density
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler
# Fit model to data
from sklearn.neighbors import KNeighborsRegressor
# split data in half up to vaccine release in 1963
# 1669/2 = 834.5
train = measles_together.iloc[0:835]
val = measles_together.iloc[835:1669]
# defining features WITHOUT POPULATION DENSITY
features = ["Year",
"Name",
"Income"]
X_train_dict = train[features].to_dict(orient="records")
X_val_dict = val[features].to_dict(orient="records")
y_train = train["Measles Incidence"]
y_val = val["Measles Incidence"]
#For loop to determine optimal k to use
k_1 = [i for i in range(1,100)]
error_results = []
def get_val_error(X_train_dict, y_train, X_val_dict, y_val, i):
    # convert categorical variables to dummy variables
    vec = DictVectorizer(sparse=False)
    vec.fit(X_train_dict)
    X_train = vec.transform(X_train_dict)
    X_val = vec.transform(X_val_dict)
    # standardize the data
    scaler = StandardScaler()
    scaler.fit(X_train)
    X_train_sc = scaler.transform(X_train)
    X_val_sc = scaler.transform(X_val)
    # Fit a k-nearest neighbors model with the current k (i).
    model = KNeighborsRegressor(n_neighbors=i)
    model.fit(X_train_sc, y_train)
    # Make predictions on the validation set.
    y_val_pred = model.predict(X_val_sc)
    rmse = np.sqrt(((y_val - y_val_pred) ** 2).mean())
    return rmse

for i in k_1:
    val1 = get_val_error(X_train_dict, y_train, X_val_dict, y_val, i)
    val2 = get_val_error(X_val_dict, y_val, X_train_dict, y_train, i) #reverse
    cross_val = (val1 + val2)/2
    error_results.append(cross_val)
We plot these k values and the corresponding errors they give to determine the best k to use.
#Plotting k with validation error
import matplotlib.pyplot as plt
plt.scatter(x=k_1, y=error_results)
plt.xlabel('k')
plt.ylabel('test error')
plt.title('k for Measles')
ymin = min(error_results)
From the graph above, we see that the optimal k to use is about 25 again, similar to our Model 2.
We again don't want to pick a k too close to 0, because this error is too good to be true since we would be picking data points too similar to the one we are predicting for.
Next, we take this k and make our predictions.
# Use k=25 to predict
# convert categorical variables to dummy variables
vec = DictVectorizer(sparse=False)
vec.fit(X_train_dict)
X_train = vec.transform(X_train_dict)
X_val = vec.transform(X_val_dict)
# standardize the data
scaler = StandardScaler()
scaler.fit(X_train)
X_train_sc = scaler.transform(X_train)
X_val_sc = scaler.transform(X_val)
# Fit a 25-nearest neighbors model.
model = KNeighborsRegressor(n_neighbors=25)
model.fit(X_train_sc, y_train)
# Make predictions on the validation set.
y_val_pred = model.predict(X_val_sc)
rmse = np.sqrt(((y_val - y_val_pred) ** 2).mean())
#print(y_val_pred)
print('val error: ', ymin)
If we want to know the chances of catching measles in 1955 in Alabama (about halfway through the predicted dataset), we need to find the index of this datapoint in the dataframe we predicted on and map it to its index in our resulting prediction array.
val[val['Year'] == 1955]
#Find that 1955 Alabama is at index 1227
#1227 - 835 = 392 gives us the index in the predicted array
y_val_pred[392]
If we want to know the chances of catching measles in 1963 in Alabama, the results show that the predicted incidence is 0.002045478209686451, compared to the expected value of 0.000347.
#Calculate percent error
#Percentage Error = ((Estimated Number - Actual Number)/ Actual number) x 100
((0.002045478209686451 - .000347)/ .000347) * 100
The percentage error here is 489%, which increased from Model 2 by 71 percentage points. This increase is not what we expected; however, it is a slightly smaller increase than we saw in Model 3, meaning that population density may have had less of an effect than income.
%%shell
jupyter nbconvert --to html "/content/drive/MyDrive/cmps3160/finalProject/Final_Project_Disease_Prevalence.ipynb"
Throughout this project we completed a lot of analysis on disease incidence and trends between states within the population. We utilized several visualization techniques to demonstrate the variation both across the years and between states, both separately and in comparison to national trends. In an effort to understand these numbers in more depth, we brought in datasets on state population density and state average income level. We measured the correlation between each of these variables and disease incidence for every state over the years, and discovered that both population density and state average income show a moderate negative correlation. From there we built four different predictive models in an attempt to demonstrate which variables could be related, and we discovered that the change that most improved our model's predictions was picking a smaller time window. All of this indicates that while population density and state average income are potentially pieces of the complicated puzzle of disease spread, they are certainly not the whole picture, and without a better understanding of the individual people that make up those case numbers, our knowledge and the assumptions we can make are limited.