If you’ve spent time on social media in the last few years, you may have noticed one particular topic trending often. Now, more than ever, discussions of climate change and its impacts are mainstream. As students in the School of Public Health, our group was united by an interest in climate change and public health. We wanted to quantify the relationship between climate measure variables like particulate matter, ozone levels, and UV radiation, and see how they may impact health outcomes like asthma hospitalizations, lung cancer, and melanoma.
In exploring our outcomes, we asked the following questions in getting to know our data sets:
What are the current incidence rates of lung cancer, melanoma, and asthma emergency room visits? Do these trends differ significantly at the county and state levels?
How have these rates changed by state over time?
In analyzing our exposures and how they impact our outcomes, we asked the following:
How do lung cancer age-adjusted incidence rates vary annually based on median particulate matter levels? Does this change from state to state?
How do emergency room visits and/or hospitalization rates for asthma vary annually based on
median particulate matter levels? Does this change from state to state?
How do melanoma age-adjusted incidence rates vary annually based on UV levels? Does this change from state to state?
Is annual median particulate matter level a good predictor of lung cancer age-adjusted incidence rates? How does the regression change when state is added to the model?
Is annual median UV level a good predictor of melanoma age-adjusted incidence rates? How does the regression change when state is added to the model?
Is annual median particulate matter level a good predictor of annual asthma-related emergency room visits or hospitalizations rates? How does the regression change when state is added to the model?
Our ozone data and our air pollution data were both sourced from the CDC’s National Environmental Public Health Tracking website.
For county level cancer outcomes, each state’s data set was sourced from the NIH State Cancer Profiles.
For state level outcomes, each data set was sourced directly from health departments of each respective state.
A significant amount of cleaning was required, as can be seen in the code chunks below:
Code can be found on the Climate Data page.
#state-level lung cancer PA data https://www.phaim1.health.pa.gov/EDD/
pa_cancer = read.csv('./data/lung/pa_state_lung.csv', sep = ";", header = T, skip = 3) %>%
janitor::clean_names() %>%
select(year, rate_ratio_result) %>%
map_df(str_replace, pattern = ",", replacement = ".") %>%
map_df(as.numeric) %>%
rename(c("age_adjusted_incidence_rate" = "rate_ratio_result")) %>%
add_column(state = "PA")
#state-level lung cancer OH data https://publicapps.odh.ohio.gov/EDW/DataBrowser/Browse/StateLayoutLockdownCancers
oh_cancer = read.csv('./data/lung/oh_state_lung.csv', sep = ";") %>%
janitor::clean_names() %>%
rename(year = cancer_year_year) %>%
select(year, age_adjusted_rate) %>%
map_df(str_replace, pattern = ",", replacement = ".") %>%
map_df(as.numeric) %>%
rename(c("age_adjusted_incidence_rate" = "age_adjusted_rate")) %>%
add_column(state = "OH") %>%
distinct() %>%
filter(complete.cases(.))
#state-level lung cancer NY data https://www.health.ny.gov/statistics/cancer/registry/table2/tb2lungnys.htm
ny_cancer = read.csv("./data/lung/ny_state_lung.csv", skip = 2)[ ,1:3] %>%
janitor::clean_names() %>%
rename(year = x) %>%
select(year, rate_per_100_000_population) %>%
rename(c("age_adjusted_incidence_rate" = "rate_per_100_000_population")) %>%
add_column(state = "NY")
# Read in the data and set the column names
# Add a state column and make all rows = "NY"
# Make the year column of type integer
# Add a cancer type column and make all rows = "melanoma"
# Select the state, year, cancer_type and mf_case_rate columns
# Rename mf_case_rate column "age_adjusted_incidence_rate"
nys_melanoma_age_adjusted <- read_csv(here("data",
"incidence_melanoma_data",
"nys_melanoma_incidence_mortality_year.csv"),
skip = 3,
col_names = c("year",
"mf_cases",
"mf_case_rate",
"mf_case_rate_ci",
"m_cases",
"m_case_rate",
"m_case_rate_ci",
"f_cases",
"f_case_rate",
"f_case_rate_ci",
"mf_deaths",
"mf_death_rate",
"mf_death_rate_ci",
"m_deaths",
"m_death_rate",
"m_death_rate_ci",
"f_deaths",
"f_death_rate",
"f_death_rate_ci"
)) %>%
mutate(state = "NY",
year = as.integer(year),
cancer_type = "melanoma") %>%
select(state, year, cancer_type, mf_case_rate) %>%
rename(age_adjusted_incidence_rate = mf_case_rate)
# Read in data set
ohio_cancer_incidence <- readxl::read_xls(here("data",
"incidence_melanoma_data",
"ohio_cancer_incidence_year.xls"),
skip = 1)
###############################################################################
# Melanoma cases
# Get cases for each year
ohio_cancer_cases <- ohio_cancer_incidence %>%
select("Site/Type", starts_with("Case"))
# Update column names for the Ohio cancer cases data frame
colnames(ohio_cancer_cases) <- c("type", 1996:2018, "total")
# Get melanoma of the skin cancer cases for Ohio
# Pivot longer year
# Select year and cases variables
ohio_melanoma_cases <- ohio_cancer_cases %>%
filter(type == "Melanoma of Skin") %>%
pivot_longer("1996":total,
names_to = "year",
values_to = "cases") %>%
select(year, cases)
###############################################################################
# Melanoma age-adjusted
# Get age-adjusted for each year
ohio_cancer_age_adjusted <- ohio_cancer_incidence %>%
select("Site/Type", starts_with("Age"))
# Update column names for Ohio cancer age-adjusted data frame
colnames(ohio_cancer_age_adjusted) <- c("type", 1996:2018, "total")
# Get melanoma of the skin cancer age-adjusted for Ohio
# Pivot longer year
# Select year and age_adjusted variables
# Remove total row
# Add a state column and set all rows = "OH"
# Convert type of year column to be integer
# Add a cancer type column and set all rows = "melanoma"
# Convert type of age_adjusted_incidence_rate column to be double
# Reorder columns: state, year, cancer_type, age_adjusted_incidence_rate
ohio_melanoma_age_adjusted <- ohio_cancer_age_adjusted %>%
filter(type == "Melanoma of Skin") %>%
pivot_longer("1996":total,
names_to = "year",
values_to = "age_adjusted_incidence_rate") %>%
select(year, age_adjusted_incidence_rate) %>%
filter(year != "total") %>%
mutate(state = "OH",
year = as.integer(year),
cancer_type = "melanoma",
age_adjusted_incidence_rate = as.double(age_adjusted_incidence_rate)) %>%
select(state, year, cancer_type, age_adjusted_incidence_rate)
# Final data frames: ohio_melanoma_cases and ohio_melanoma_age_adjusted
# Read in data, skip first three lines
# Clean variable names
# Select relevant variables
# Add a state column and set all rows = "PA"
# Convert type of year column to be integer
# Add a cancer_type column and set all rows = "melanoma"
# Rename rate_ratio_result column to "age_adjusted_incidence_rate"
# Reorder columns: state, year, cancer_type, age_adjusted_incidence_rate
# Sort data frame by ascending year
pa_melanoma_age_adjusted <- read_csv(here("data",
"incidence_melanoma_data",
"pa_melanoma_incidence_year.csv"),
skip = 3) %>%
janitor::clean_names() %>%
select(year, rate_ratio_result) %>%
mutate(state = "PA",
year = as.integer(year),
cancer_type = "melanoma") %>%
rename(age_adjusted_incidence_rate = rate_ratio_result) %>%
select(state, year, cancer_type, age_adjusted_incidence_rate) %>%
arrange(year)
##### asthma data
#new york asthma data
ny_asthma <-
read_csv("./data/asthma_data/NYS_ASTHMA.csv") %>%
janitor::clean_names()
# ohio asthma data
oh_asthma <-
read_csv("./data/asthma_data/ohio_all.csv") %>%
janitor::clean_names()
#pennsylvania asthma data
pa_asthma <-
read_csv("./data/asthma_data/pa_asthma_all.csv", col_names = TRUE) %>%
janitor::clean_names()
## creating state level data for asthma
#ohio state level data
oh_state <-
oh_asthma %>%
filter(age_group == "All",
month == "All",
race == "Total",
sex == "All",
geography == "Ohio",
visit == "Inpatient",
frequency > 0) %>%
rename(state = geography) %>%
mutate(rate = rate * 10,
outcome = "asthma",
state = "OH") %>%
select(year, state, outcome, rate)
#pennsylvania state level data
pa_state <-
pa_asthma %>%
filter(age == "All Ages",
race == "All Races",
sex == "Total",
geography_code == "0",
rate_adj == "Age-Adjusted Rate") %>%
rename(state = geography) %>%
mutate(outcome = "asthma",
state = "PA") %>%
select(year, state, outcome, rate)
#new york state level data
ny_state <-
ny_asthma %>%
filter(visit == "Inpatient",
geography == "New York State",
age_group == "Total",
sex == "Total",
rate_adj == "Age-Adjusted Rate",
month == "Total",
str_detect(year, "-20", negate = TRUE)) %>%
rename(rate = rate_per_10k) %>%
mutate(rate = round(rate * 10, 2),
state = "NY",
outcome = "asthma") %>%
select(year, state, outcome, rate)
# merged state level asthma
asthma_state <-
rbind(ny_state, oh_state, pa_state) %>%
write_csv("./asthma_state.csv")
read_and_clean_county_data <- function(file_path_name, state, outcome) {
read_csv(file_path_name,
skip = 8) %>%
janitor::clean_names() %>%
select(county, age_adjusted_incidence_rate_rate_note_cases_per_100_000) %>%
mutate(county = stringr::str_remove(county, "\\(.\\)$"),
state = state,
outcome = outcome,
age_adjusted_incidence_rate =
as.numeric(age_adjusted_incidence_rate_rate_note_cases_per_100_000)) %>%
select(state, county, outcome, age_adjusted_incidence_rate)
}
#county-level lung cancer PA data https://statecancerprofiles.cancer.gov/
pa_county_lc = read_and_clean_county_data('./data/lung/pa_lung_data_county.csv',
state = "PA",
outcome = "lung cancer") %>%
filter(complete.cases(.)) %>%
slice(-c(1, 2))
#county-level lung cancer OH data https://statecancerprofiles.cancer.gov/
oh_county_lc = read_and_clean_county_data('./data/lung/oh_lung_data_county.csv',
state = "OH",
outcome = "lung cancer") %>%
filter(complete.cases(.)) %>%
slice(-c(1, 2))
#county-level lung cancer NY data https://statecancerprofiles.cancer.gov/
ny_county_lc = read_and_clean_county_data('./data/lung/ny_lung_data_county.csv',
state = "NY",
outcome = "lung cancer") %>%
filter(complete.cases(.)) %>%
slice(-c(1, 2))
# Read in New York melanoma incidence county-level data averaged from 2014 to
# 2018
nys_county_melanoma_incidence_2014_2018 <-
read_and_clean_county_data(here("data",
"incidence_melanoma_data",
"nys_county_melanoma_incidence_2014_2018.csv"),
"NY",
"melanoma")
# Read in Ohio melanoma incidence county-level data averaged from 2014 to 2018
ohio_county_melanoma_incidence_2014_2018 <-
read_and_clean_county_data(here("data",
"incidence_melanoma_data",
"ohio_county_melanoma_incidence_2014_2018.csv"),
"OH",
"melanoma")
# Read in Pennsylvania melanoma incidence county-level data averaged from
# 2014 to 2018
pa_county_melanoma_incidence_2014_2018 <-
read_and_clean_county_data(here("data",
"incidence_melanoma_data",
"pa_county_melanoma_incidence_2014_2018.csv"),
"PA",
"melanoma")
# new york county level asthma data
ny_co <-
ny_asthma %>%
filter(age_group == "Total",
month == "Total",
year == "2015-2017",
visit == "Inpatient",
rate_adj == "Age-Adjusted Rate",
geography != "New York State",
geography != "New York City") %>%
mutate(county = tolower(geography),
rate = rate_per_10k,
region = "new york",
county = replace(county,
county == "new york county", "new york")) %>%
select(region, county, rate) %>%
group_by(county)
# ohio county level asthma data
oh_co <-
oh_asthma %>%
filter(geography != "Ohio",
age_group == "All",
visit == "Inpatient") %>%
mutate(county = tolower(geography),
region = "ohio") %>%
select(region, county, rate) %>%
group_by(county)
#pennsylvania county data
pa_co <-
pa_asthma %>%
filter(geography_code != "0",
rate_adj == "Age-Adjusted Rate",
year == "2016",
age == "All Ages") %>%
mutate(county = tolower(geography),
region = "pennsylvania",
rate = as.numeric(rate) / 10) %>%
select(region, county, rate) %>%
group_by(county)
# merge county data
county_2016 <-
bind_rows(oh_co, ny_co, pa_co)
Code can be found on the Climate Data page
year
. Year
countyfips
. County-level unique identifier
county
. Name of county
state
. 2-character abbreviation of state name
season
. Period of year in which data was collected
pm25_max_pred
. Daily county-level maximum \(PM_{2.5}\) recorded/simulated, \(\mu g/m^3\)
pm25_med_pred
. Daily county-level median \(PM_{2.5}\) recorded/simulated, \(\mu g/m^3\)
pm25_mean_pred
. Daily county-level mean \(PM_{2.5}\) recorded/simulated, \(\mu g/m^3\)
pm25_pop_pred
. Daily county-level population weighted mean \(PM_{2.5}\) recorded/simulated, \(\mu g/m^3\)
o3_max_pred
. Daily county-level maximum \(O_{3}\) recorded/simulated, \(ppm\)
o3_med_pred
. Daily county-level median \(O_{3}\) recorded/simulated, \(ppm\)
o3_mean_pred
. Daily county-level mean \(O_{3}\) recorded/simulated, \(ppm\)
o3_pop_pred
. Daily county-level population weighted mean \(O_{3}\) recorded/simulated, \(ppm\)
edd
. Daily county-level population weighted erythemally weighted daily dose, \(J/m^2\)
edr
. Daily county-level population weighted erythemally weighted irradiance at local solar noon time, \(mW/m^2\)
i305
. Daily county-level population weighted spectral irradiance at local solar noon time at 305 nm, \(mW/m^2/nm\)
i310
. Daily county-level population weighted spectral irradiance at local solar noon time at 310 nm, \(mW/m^2/nm\)
i324
. Daily county-level population weighted spectral irradiance at local solar noon time at 324 nm, \(mW/m^2/nm\)
i380
. Daily county-level population weighted spectral irradiance at local solar noon time at 380 nm, \(mW/m^2/nm\)
Next, we were able to merge all of our state level outcomes into one data set
# merging state level data for lung cancer
state_lc <- rbind(pa_cancer, oh_cancer, ny_cancer)
state_lc_wide = state_lc %>%
pivot_wider(
names_from = state,
values_from = age_adjusted_incidence_rate) %>%
rename(c("PA_AAIR" = "PA", "NY_AAIR" = "NY", "OH_AAIR" = "OH"))
write_csv(state_lc_wide, "./data/state_lc_wide.csv")
# Merge state-level data frames with bind_rows
state_melanoma_age_adjusted <-
bind_rows(nys_melanoma_age_adjusted,
ohio_melanoma_age_adjusted,
pa_melanoma_age_adjusted)
# read in longitudinal state data for lung and skin cancer
# longitudinal state level lung cancer data
state_lc_data = read_csv("./data/lung/state_lc_data.csv")
# longitudinal state level melanoma data
state_mel_data = read_csv("./data/incidence_melanoma_data/state_melanoma_age_adjusted.csv") %>%
rename(c("outcome" = "cancer_type"))
# combine cancer outcome data
lc_mel_state = bind_rows(state_lc_data, state_mel_data) %>%
select(state, year, outcome, age_adjusted_incidence_rate)
# Read in longitudinal state-level asthma data
# Rename and select columns
asthma_state <- read_csv(here::here("data", "asthma_state.csv")) %>%
rename(age_adjusted_incidence_rate = rate) %>%
select(state, year, outcome, age_adjusted_incidence_rate)
# Bind rows of longitudinal asthma and longitudinal cancer by state and arrange
# accordingly
lc_mel_asthma_state <- bind_rows(lc_mel_state, asthma_state) %>%
arrange(state, year, outcome)
state
. 2-letter abbreviation
year
. Year
outcome
. type of outcome (lung cancer, melanoma, asthma ED visit)
age_adjusted_incidence_rate
. per 100,000
Then, we could do the same with our county-level outcomes to get a merged set.
fips_codes = read.csv('./data/fips_codes.csv') %>%
janitor::clean_names() %>%
filter(state %in% c("NY", "PA", "OH")) %>%
rename(c("county" = "name"))
fips_codes$county[fips_codes$county == "St Lawrence"] <- "St. Lawrence"
county_lc <- rbind(pa_county_lc, oh_county_lc, ny_county_lc) %>%
mutate(county = (gsub('.{7}$', '', county)))
lc_fips <- merge(county_lc, fips_codes, by = c("state", "county"))
# Merge county-level data frames with bind_rows
state_county_melanoma_incidence_2014_2018 <-
bind_rows(nys_county_melanoma_incidence_2014_2018,
ohio_county_melanoma_incidence_2014_2018,
pa_county_melanoma_incidence_2014_2018)
county_lc = read_csv("./data/lung/county_lc.csv")
county_mel = read_csv("./data/incidence_melanoma_data/state_county_melanoma_incidence_2014_2018.csv")
# cleaning to make the data sets compatible
county_mel = county_mel %>%
rename(c("outcome" = "cancer_type")) %>%
drop_na()
county_mel <- county_mel[!(county_mel$county == "Ohio" | county_mel$county == "Pennsylvania" |
county_mel$county == "New York"), ] %>%
filter(!grepl('SEER', county))
county_mel$county <- gsub(" County","", county_mel$county)
county_mel$county[county_mel$county == "St Lawrence"] <- "St. Lawrence"
county_lc$county <- gsub(" County","", county_lc$county)
# binding rows
lc_mel_county = bind_rows(county_lc, county_mel) %>%
filter(complete.cases(.))
# Add FIPS column to the data
fips = fips_codes %>%
janitor::clean_names()
# fips df has 'St Lawrence' instead of 'St. Lawrence'
fips$county[fips$county == "St Lawrence"] <- "St. Lawrence"
lc_mel_county = merge(lc_mel_county, fips, by = c("state", "county"))
# Read in cross-sectional asthma data
# Only get county data for the year 2016 since we need county data at
# a fixed time point (i.e. not longitudinal)
# Adjust column names and types in prep for merge
asthma_county <- read_csv(here::here("data", "asthma_county.csv")) %>%
filter(year == 2016) %>%
select(-year) %>%
rename(age_adjusted_incidence_rate = rate) %>%
mutate(age_adjusted_incidence_rate = as.double(age_adjusted_incidence_rate))
# asthma_county has 'New York County' instead of 'New York' in fips df, and "GAllia" instead of "Gallia"
asthma_county$county[asthma_county$county == "New York County"] <- "New York"
asthma_county$county[asthma_county$county == "GAllia"] <- "Gallia"
asthma_county <- merge(asthma_county, fips, by = c("state", "county"))
# Bind rows of cancer and asthma by county and arrange accordingly
lc_mel_asthma_county <- bind_rows(lc_mel_county, asthma_county) %>%
arrange(state, county, outcome)
state
. 2-letter abbreviation
county
. name of county
outcome
. type of outcome (lung cancer, melanoma, asthma ED visit)
age_adjusted_incidence_rate
. per 100,000
fips
. county-level unique identifier
Once all the data was gathered for both exposures and outcomes of interest, we sought to visualize the data in a cross-sectional manner with incidence rates of outcomes plotted onto maps indicating the extent of exposures for a given year. In this case, we used data from 2016 (with the exception of UV data which was not available and the 2015 data was used in its place). This was accomplished by first plotting maps with a gradient fill to indicate relative levels of the exposure (i.e UV radiation, particulate matter, or low-atmospheric ozone) at the county level. Bubble markers were then overlaid with each bubble’s size and shading indicating the relative incidence of outcomes (i.e. hospitalization due to asthma, melanoma incidence, and lung cancer incidence). These layers were then put into plot_ly
via ggplotly
for interactivity with each marker giving all available data for a given county.
The follow code chunk prepares the data frames for both the map and the data.
# Load relavant libraries and setup
library(viridis)
library(mapproj)
library(maps)
library(housingData)
options(
ggplot2.continuous.colour = "viridis",
ggplot2.continuous.fill = "viridis",
knitr.kable.NA = ""
)
scale_colour_discrete = scale_color_viridis_d
scale_fill_discrete = scale_fill_viridis_d
# Prepare map data
state_df <-
map_data("state") %>%
filter(region == "ohio"|
region == "new york" |
region == "pennsylvania" )
county_df <-
map_data("county") %>%
filter(region == "new york" |
region == "pennsylvania" |
region == "ohio")
# obtain list of unique counties with a lat-long measurement associated to it
counties_unique <-
county_df %>%
filter(region == "ohio"|
region == "new york" |
region == "pennsylvania" ) %>%
as.tibble() %>%
distinct(subregion, region, .keep_all = TRUE) %>%
select(subregion, region, everything()) %>%
unite("subreg_reg", subregion:region, remove = FALSE)
# Prepare exposure data
o3 <- read_csv("./ap/ap_uv/o3.csv") %>%
janitor::clean_names() %>%
filter(state == "NY" | state == "OH" | state == "PA",
season == "Summer",
year == 2016) %>%
select(-hover, -season) %>%
mutate(county = tolower(county))
pm25 <- read_csv("./ap/ap_uv/pm25.csv") %>%
janitor::clean_names() %>%
filter(state == "NY" | state == "OH" | state == "PA",
season == "Summer",
year == 2016) %>%
select(-hover, -season) %>%
mutate(county = tolower(county))
# There is no UV data for 2016 so values from 2015 will be used
uv <- read_csv("./ap/ap_uv/edd.csv") %>%
janitor::clean_names() %>%
filter(state == "NY" | state == "OH" | state == "PA",
season == "Summer",
year == 2015) %>%
select(countyfips, uv)
# joing the dataframes for all exposures
exposures <-
o3 %>%
select(countyfips, o3) %>%
left_join(pm25, by = "countyfips") %>%
left_join(uv, by = "countyfips") %>%
rename(region = state,
subregion = county) %>%
mutate(region = str_replace(region, "NY", "new york"),
region = str_replace(region, "PA", "pennsylvania"),
region = str_replace(region, "OH", "ohio")) %>%
select(-year)
# preparing outcomes dataframe (all taken from county level data availabl for 2016)
outcomes <-
read_csv("./data/lc_mel_asthma_county.csv") %>%
rename(rate = age_adjusted_incidence_rate) %>%
pivot_wider(names_from = outcome, values_from = rate) %>%
janitor::clean_names() %>%
rename(region = state,
subregion = county) %>%
mutate(subregion = tolower(subregion),
region = str_replace(region, "NY", "new york"),
region = str_replace(region, "PA", "pennsylvania"),
region = str_replace(region, "OH", "ohio"))
# combine outcomes and expousre into one datafram based on counties
county_df <-
exposures %>%
left_join(outcomes, by = c("subregion", "region")) %>%
select(-fips) %>%
mutate(countyfips = as.character(countyfips)) %>%
select(subregion, region, everything()) %>%
unite("subreg_reg", subregion:region, remove = TRUE)
# combine county datafram with exposure and outcome data to unqiue lat-long (will be used for data points)
centers <-
geoCounty %>%
filter(state == "NY" | state == "PA" | state == "OH") %>%
select(-county, -state) %>%
rename(countyfips = fips,
region = rMapState,
subregion = rMapCounty) %>%
as.tibble
text_df <-
county_df %>%
left_join(centers, by = "countyfips") %>%
janitor::clean_names() %>%
mutate(mytext = paste(
"Location: ", subregion, ", ",region, "\n",
"Melanoma Incidence: ", melanoma, "\n",
"Lung Cancer Incidence: ", lung_cancer, "\n",
"Asthma Hospitalization Incidence: ", asthma, "\n",
"Summer UV (2015): ", uv, "\n",
"Summer PM2.5: ", pm2_5, "\n",
"Summer ozone: ", o3,sep=""))
# prepare combined dataframe for combination to map database with borders for counties (will be used for plotting county-polygons and associated gradient fill)
map_df <-
text_df %>%
select(-lat,-lon, -mytext)
county_map <- map_data("county") %>%
filter(region == "ohio"|
region == "new york" |
region == "pennsylvania" ) %>%
select(subregion, region, everything()) %>%
unite("subreg_reg", subregion:region, remove = TRUE) %>%
left_join(map_df, by = "subreg_reg")
This next code chunk sets up each plot.
# Start plotting
# PM2.5 on the map and datapoints for lung cancer ####
p_pm_lc <-
ggplot(county_map) +
geom_polygon(aes(x = long, y = lat, group = group,
fill = pm2_5),
alpha = 0.6,
color = "black") +
scale_fill_gradient(low = "white", high = "black", name = "PM2.5 (µg/m^3)") +
geom_point(data = text_df,
aes(x = lon, y = lat,
size = lung_cancer,
color = lung_cancer,
text = mytext),
alpha = 0.6
) +
scale_color_viridis(name = "Lung Cancer Incidence\n(per 100 000)") +
scale_alpha_continuous() +
theme_minimal() + coord_map() +
scale_x_continuous("") +
scale_y_continuous("")
p_pm_lc <- ggplotly(p_pm_lc, tooltip = "mytext")
# PM2.5 on the map and datapoints for asthma ####
p_pm_as <-
ggplot(county_map) +
geom_polygon(aes(x = long, y = lat, group = group,
fill = pm2_5),
alpha = 0.6,
color = "black") +
scale_fill_gradient(low = "white", high = "black", name = "PM2.5 (ug/m^3)") +
geom_point(data = text_df,
aes(x = lon, y = lat,
size = asthma,
color = asthma,
text = mytext),
alpha = 0.6
) +
scale_color_viridis(name = "Asthma Hospitalization\nIncidence\n(per 100 000)") +
scale_alpha_continuous() +
theme_minimal() + coord_map() +
scale_x_continuous("") +
scale_y_continuous("")
p_pm_as <- ggplotly(p_pm_as, tooltip = "mytext")
# O3 on the map and datapoints for lung cancer ####
p_o3_lc <-
ggplot(county_map) +
geom_polygon(aes(x = long, y = lat, group = group,
fill = o3),
alpha = 0.6,
color = "black") +
scale_fill_gradient(low = "white", high = "black", name = "Ozone (ppb)") +
geom_point(data = text_df,
aes(x = lon, y = lat,
size = lung_cancer,
color = lung_cancer,
text = mytext),
alpha = 0.7
) +
scale_color_viridis(name = "Lung Cancer Incidence\n(per 100 000)") +
scale_alpha_continuous() +
theme_minimal() + coord_map() +
scale_x_continuous("") +
scale_y_continuous("")
p_o3_lc <- ggplotly(p_o3_lc, tooltip = "mytext")
# O3 on the map and datapoints for asthma ####
p_o3_as <-
ggplot(county_map) +
geom_polygon(aes(x = long, y = lat, group = group,
fill = o3),
alpha = 0.6,
color = "black") +
scale_fill_gradient(low = "white", high = "black", name = "Ozone (ppb)" ) +
geom_point(data = text_df,
aes(x = lon, y = lat,
size = asthma,
color = asthma,
text = mytext),
alpha = 0.7
) +
scale_color_viridis(name = "Asthma Hospitalization\nIncidence\n(per 100 000)") +
scale_alpha_continuous() +
theme_minimal() + coord_map() +
scale_x_continuous("") +
scale_y_continuous("")
p_o3_as <- ggplotly(p_o3_as, tooltip = "mytext")
# UV on the map and datapoints for melanoma ####
p_uv_mel <-
ggplot(county_map) +
geom_polygon(aes(x = long, y = lat, group = group,
fill = uv),
alpha = 0.6,
color = "black") +
scale_fill_gradient(low = "white", high = "black",
name = "UV (J/m^2)\n(erthemally weighted daily dose,\nSummer 2015)" ) +
geom_point(data = text_df,
aes(x = lon, y = lat,
size = melanoma,
color = melanoma,
text = mytext),
alpha = 0.7
) +
scale_color_viridis(name = "Melanoma Incidence\n(per 100 000, 2016)") +
scale_alpha_continuous() +
theme_minimal() + coord_map() +
scale_x_continuous("") +
scale_y_continuous("")
p_uv_mel <- ggplotly(p_uv_mel, tooltip = "mytext")
The plots are as follows:
Summer 2016 mean atmospheric ozone concentration (in ppb) plotted on the map with markers for lung cancer incidence rates in 2016 (age-adjusted per 100,000 people)
p_o3_lc
Summer 2016 mean atmospheric ozone concentration (in ppb) plotted on the map with markers for asthma hospitalization rates in 2016 (age-adjusted per 100,000 people)
p_o3_as
Summer 2016 mean PM 2.5 (in µg/m3) plotted on the map with markers for lung cancer incidence rates in 2016 (age-adjusted per 100,000 people)
p_pm_lc
Summer 2016 mean PM 2.5 (in µg/m3) plotted on the map with markers for asthma hospitalization rates in 2016 (age-adjusted per 100,000 people)
p_pm_as
Summer 2015 mean UV (in J/m2, erthemally weighted daily dose) plotted on the map with markers for melanoma incidence rates in 2016 (age-adjusted per 100,000 people)
p_uv_mel
There are no striking trends in the data but it is noted that some of the highest hospitalization rates due to asthma are located in the NYC metropolitan area where PM2.5 and ozone concentrations were relatively high during the summer of 2016. However, lung cancer rates were grestest in Southern Ohio and Northern New York where there are low to moderate levels of ozone and PM2.5. Melanoma incidence rates for 2016 were overall higher in Ohio where there were also moderate to high levels of UV radiation the year prior in 2015.
These initial cross-sectional visualizations hint at possible correlations but further analyses are warranted to see the extent of associations between our exposures and outcomes of interest if they exist.
More extensive data visualizations and spatial analyses can be found on the Health Outcomes Data and Data Exploration: Map pages.
In addition to the numerous data visualizations seen both in this report, we did extensive work on running regression models and seeing how our exposure data correlated with and could be used to predict our outcomes.
The following code was used to set up our outcome data for these models:
outcomes_state <- read_csv("data/lc_mel_asthma_state.csv")
outcomes_county <- read_csv("data/lc_mel_asthma_county.csv")
ap_uv <- read_csv("ap/ap_uv/apuv.csv")
#State-level analysis - fewer predictors
ap_uv_state_slim <- ap_uv %>%
select(-c(countyfips, county,
pm25_max_pred, pm25_mean_pred, pm25_pop_pred,
o3_max_pred, o3_mean_pred, o3_pop_pred,
i310, i305, i380, edr)) %>%
group_by(state, year, season) %>%
#Take medians across county for each season
summarize(across(pm25_med_pred:edd, median)) %>%
pivot_wider(names_from = season, values_from = pm25_med_pred:edd)
#State-level analysis - more predictors
ap_uv_state <- ap_uv %>%
select(-c(countyfips, county)) %>%
group_by(state, year, season) %>%
#Take medians across county for each season
summarize(across(pm25_max_pred:i380, median)) %>%
pivot_wider(names_from = season, values_from = pm25_max_pred:i380)
outcomes_wider_state <- outcomes_state %>%
pivot_wider(names_from = outcome, values_from = age_adjusted_incidence_rate)
state_analysis <- left_join(outcomes_wider_state, ap_uv_state_slim)
outcomes_wider_county <- outcomes_county %>%
pivot_wider(names_from = outcome, values_from = age_adjusted_incidence_rate)
#county-level analysis - fewer predictors
ap_uv_county_slim <- ap_uv %>%
filter(state %in% c("NY","OH","PA")) %>%
select(-c(pm25_max_pred, pm25_mean_pred, pm25_pop_pred,
o3_max_pred, o3_mean_pred, o3_pop_pred, i324,
i310, i305, i380, edr)) %>%
group_by(county, state, season) %>%
#Take medians across years for each season
summarize(across(pm25_med_pred:edd, median, na.rm = TRUE)) %>%
pivot_wider(names_from = season, values_from = pm25_med_pred:edd)
county_df <- right_join(outcomes_wider_county, ap_uv_county_slim)
county_df2 <- county_df %>%
mutate(pm25_med_pred = rowMeans(select(., starts_with("pm25")))) %>%
select(-c(pm25_med_pred_Fall:pm25_med_pred_Winter))
state_2001_2015 <- state_analysis %>%
filter(year %in% 2001:2015) %>%
#exclude asthma, lots of missingness
select(-asthma)
aq_lung_df <- state_2001_2015 %>%
select(-c(melanoma, starts_with("edd")))
fit_aq <- lm(`lung cancer` ~ . +
year*(o3_med_pred_Spring +
o3_med_pred_Summer +
o3_med_pred_Fall) +
year*(pm25_med_pred_Spring +
pm25_med_pred_Summer +
pm25_med_pred_Fall) +
state*year, data = aq_lung_df)
step_bic_aq <- step(fit_aq, trace = 0, k = log(nobs(fit_aq)), direction = "backward")
## CHART LC MODEL
step_bic_aq %>%
tidy() %>%
mutate(
p.value = scales::pvalue(p.value),
) %>%
kable(
caption = "Coefficient-Level Estimates for a Model Fitted to Estimate Variation in Lung Cancer Rates (Longitudinal).",
col.names = c("Predictor", "B", "SE", "t", "p"),
digits = c(0, 2, 3, 2, 3),
align = c("l", "r", "r", "r", "r")
)
Predictor | B | SE | t | p |
---|---|---|---|---|
(Intercept) | 3402.04 | 485.874 | 7.00 | <0.001 |
stateOH | 1209.53 | 266.457 | 4.54 | <0.001 |
statePA | 543.09 | 196.193 | 2.77 | 0.009 |
year | -1.66 | 0.242 | -6.86 | <0.001 |
pm25_med_pred_Fall | -299.23 | 67.687 | -4.42 | <0.001 |
pm25_med_pred_Spring | -0.66 | 0.193 | -3.42 | 0.002 |
o3_med_pred_Summer | -0.11 | 0.047 | -2.23 | 0.033 |
year:pm25_med_pred_Fall | 0.15 | 0.034 | 4.43 | <0.001 |
stateOH:year | -0.60 | 0.133 | -4.50 | <0.001 |
statePA:year | -0.27 | 0.098 | -2.74 | 0.010 |
uv_df <- state_2001_2015 %>%
filter(year >= 2005) %>%
select(-c(starts_with("pm25"), starts_with("o3"), `lung cancer`))
mel_fit <- lm(melanoma ~ .^2, data = uv_df)
step_bic_mel <- step(mel_fit, trace = 0, k = log(nobs(mel_fit)), direction = "backward")
### CHART MELANOMA MODEL
step_bic_mel %>%
tidy() %>%
mutate(
p.value = scales::pvalue(p.value),
) %>%
kable(
caption = "Coefficient-Level Estimates for a Model Fitted to Estimate Variation in Melanoma Rates (Longitudinal).",
col.names = c("Predictor", "B", "SE", "t", "p"),
digits = c(0, 2, 3, 2, 3),
align = c("l", "r", "r", "r", "r")
)
Predictor | B | SE | t | p |
---|---|---|---|---|
(Intercept) | -3905.70 | 3715.708 | -1.05 | 0.318 |
stateOH | 69.41 | 648.713 | 0.11 | 0.917 |
statePA | -1974.55 | 595.803 | -3.31 | 0.008 |
year | 1.67 | 1.774 | 0.94 | 0.367 |
edd_Fall | 4.84 | 2.199 | 2.20 | 0.052 |
edd_Spring | -3.22 | 1.205 | -2.67 | 0.023 |
edd_Summer | 0.16 | 0.062 | 2.58 | 0.027 |
edd_Winter | 11.39 | 5.538 | 2.06 | 0.067 |
stateOH:year | -0.15 | 0.320 | -0.46 | 0.653 |
statePA:year | 0.93 | 0.288 | 3.23 | 0.009 |
stateOH:edd_Fall | 0.03 | 0.013 | 2.52 | 0.030 |
statePA:edd_Fall | 0.01 | 0.006 | 2.24 | 0.049 |
stateOH:edd_Spring | 0.02 | 0.009 | 1.80 | 0.102 |
statePA:edd_Spring | 0.00 | 0.005 | 0.53 | 0.608 |
stateOH:edd_Summer | 0.04 | 0.013 | 3.05 | 0.012 |
statePA:edd_Summer | 0.02 | 0.009 | 2.78 | 0.019 |
year:edd_Fall | 0.00 | 0.001 | -2.15 | 0.057 |
year:edd_Spring | 0.00 | 0.001 | 2.77 | 0.020 |
year:edd_Winter | -0.01 | 0.003 | -2.04 | 0.068 |
edd_Fall:edd_Spring | 0.00 | 0.000 | 1.64 | 0.131 |
edd_Fall:edd_Summer | 0.00 | 0.000 | -2.65 | 0.024 |
edd_Fall:edd_Winter | 0.00 | 0.000 | -1.60 | 0.140 |
edd_Spring:edd_Summer | 0.00 | 0.000 | -2.07 | 0.066 |
For both exposure/outcome pairings, we worked to understand their relationship through regression models.
After paring down our longitudinal lung cancer model with BIC, we observed an \(R^2\) value of 0.97, which felt unreasonably high. Model diagnostics were conducted to assess normality and scedasticity, and check for high-leverage points, with plots viewable on the Regression page.
With our longitudinal melanoma BIC model, we observed an adjusted \(R^2\) value of 0.88, which also raised some red flags. Similarly to the lung cancer model, model diagnostics for the melanoma model are also viewable on the Regression page.
aq_lung_df2 <- county_df2 %>%
select(-c(melanoma, asthma, starts_with("edd"), county, fips))
fit_aq_cty <- lm(`lung cancer` ~ .^2, data = aq_lung_df2)
step_bic_aq_cty <- step(fit_aq_cty, trace = 0, k = log(nobs(fit_aq)), direction = "backward")
### CHART LC MODEL
step_bic_aq_cty %>%
tidy() %>%
mutate(
p.value = scales::pvalue(p.value),
) %>%
kable(
caption = "Coefficient-Level Estimates for a Model Predicting Lung Cancer as a Function of Air Quality (Cross-Sectional).",
col.names = c("Predictor", "B", "SE", "t", "p"),
digits = c(0, 2, 3, 2, 3),
align = c("l", "r", "r", "r", "r")
)
Predictor | B | SE | t | p |
---|---|---|---|---|
(Intercept) | -479.64 | 191.389 | -2.51 | 0.013 |
stateOH | 4.80 | 57.925 | 0.08 | 0.934 |
statePA | 157.73 | 58.087 | 2.72 | 0.007 |
o3_med_pred_Fall | -34.91 | 15.597 | -2.24 | 0.026 |
o3_med_pred_Spring | 36.95 | 14.429 | 2.56 | 0.011 |
o3_med_pred_Winter | 16.65 | 7.277 | 2.29 | 0.023 |
stateOH:o3_med_pred_Fall | 13.13 | 2.687 | 4.89 | <0.001 |
statePA:o3_med_pred_Fall | 7.90 | 2.713 | 2.91 | 0.004 |
stateOH:o3_med_pred_Spring | -9.58 | 2.798 | -3.43 | <0.001 |
statePA:o3_med_pred_Spring | -9.39 | 2.632 | -3.57 | <0.001 |
o3_med_pred_Fall:o3_med_pred_Winter | 1.02 | 0.511 | 1.99 | 0.048 |
o3_med_pred_Spring:o3_med_pred_Winter | -1.09 | 0.482 | -2.26 | 0.025 |
With an \(R^2\) value of 0.37, this lung cancer model appears to be far more reasonable with our expectations. Again, multiple diagnostic plots were used to assess scedasticity and normality, observable on the Regression page.
mel_cty <- county_df2 %>%
select(state, melanoma, starts_with("edd"))
fit_mel_cty <- lm(melanoma ~ .^2 +
I(edd_Spring^3) +
I(edd_Summer^3) +
I(edd_Fall^3) +
I(edd_Winter^3), data = mel_cty)
step_bic_mel_cty <- step(fit_mel_cty, trace = 0, k = log(nobs(fit_aq)), direction = "backward")
### CHART MELANOMA MODEL
step_bic_mel_cty %>%
tidy() %>%
mutate(
p.value = scales::pvalue(p.value),
) %>%
kable(
caption = "Coefficient-Level Estimates for a Model Predicting Melanoma as a Function of UV Radiation (Cross-Sectional).",
col.names = c("Predictor", "B", "SE", "t", "p"),
digits = c(0, 2, 3, 2, 3),
align = c("l", "r", "r", "r", "r")
)
Predictor | B | SE | t | p |
---|---|---|---|---|
(Intercept) | 86.42 | 98.855 | 0.87 | 0.383 |
stateOH | -29.45 | 10.614 | -2.77 | 0.006 |
statePA | -44.57 | 7.531 | -5.92 | <0.001 |
edd_Spring | 0.09 | 0.023 | 3.87 | <0.001 |
edd_Summer | -0.06 | 0.029 | -2.08 | 0.039 |
edd_Winter | -0.27 | 0.217 | -1.23 | 0.219 |
stateOH:edd_Winter | 0.06 | 0.023 | 2.55 | 0.012 |
statePA:edd_Winter | 0.10 | 0.016 | 6.13 | <0.001 |
edd_Spring:edd_Winter | 0.00 | 0.000 | -3.66 | <0.001 |
edd_Summer:edd_Winter | 0.00 | 0.000 | 2.48 | 0.014 |
This melanoma model has a fairly small \(R^2\) of 0.23. Residuals and Q-Q plots are available for reference on the Regression page for further assessment.
asth_df <- county_df2 %>%
filter(!is.na(asthma))
aq_asth_df <- asth_df %>%
select(-c(melanoma, `lung cancer`, starts_with("edd"), county, fips))
fit_asth_cty <- lm(asthma ~ .^2, data = aq_asth_df)
step_bic_asth_cty <- step(fit_asth_cty, trace = 0, k = log(nobs(fit_aq)), direction = "backward")
aq_asth_df2 <- aq_asth_df %>%
slice(-c(3,31,185))
fit_asth_cty <- lm(asthma ~ .^2, data = aq_asth_df2)
step_bic_asth_cty <- step(fit_asth_cty, trace = 0, k = log(nobs(fit_aq)), direction = "backward")
#summary(step_bic_asth_cty)
### CHART Asthma MODEL
step_bic_asth_cty %>%
tidy() %>%
mutate(
p.value = scales::pvalue(p.value),
) %>%
kable(
caption = "Coefficient-Level Estimates for a Model Predicting Asthma Hospitalizations as a Function of Air Quality (Cross-Sectional).",
col.names = c("Predictor", "B", "SE", "t", "p"),
digits = c(0, 2, 3, 2, 3),
align = c("l", "r", "r", "r", "r")
)
Predictor | B | SE | t | p |
---|---|---|---|---|
(Intercept) | 523.02 | 56.888 | 9.19 | <0.001 |
stateOH | 42.66 | 76.517 | 0.56 | 0.578 |
statePA | -82.53 | 94.068 | -0.88 | 0.381 |
o3_med_pred_Fall | -0.67 | 2.669 | -0.25 | 0.801 |
o3_med_pred_Spring | -6.75 | 2.229 | -3.03 | 0.003 |
o3_med_pred_Winter | -4.56 | 1.621 | -2.82 | 0.005 |
stateOH:o3_med_pred_Fall | 12.00 | 3.556 | 3.37 | <0.001 |
statePA:o3_med_pred_Fall | -4.70 | 3.818 | -1.23 | 0.220 |
stateOH:o3_med_pred_Winter | -16.80 | 4.367 | -3.85 | <0.001 |
statePA:o3_med_pred_Winter | 7.27 | 2.463 | 2.95 | 0.004 |
This asthma model has a reasonable \(R^2\) of 0.597. Residuals and Q-Q plots are available for reference on the Regression page for further assessment.
Additional visualizations of these models are avalable. Please visit our Regression page, or scroll to our Discussion for an overview of the findings.
Generally, we can conclude from our spatial visualizations that the highest concentrations of lung cancer, skin cancer, and asthma hospitalizations in our three-state observation window all resided in different areas. We observed high asthma hospitalization rates in NYC, where moderate to high levels of particulate matter and ozone were seen in the same year. Melanoma rates in Ohio were similarly high, alongside the reasonably high rates of UV radiation. However, contrary to expectation, we saw the highest rates of lung cancer in areas of Ohio and New York State where rates of ozone and particulate matter were generally low.
More can be found on the regression page, but we can derive much information from our final models, viewable in this report. We found that, through our diagnostic plots for the lung cancer model that the residuals were centered and relatively even with no extreme outliers. When we cross-validated using leave-one-out and Monte-Carlo, our BIC fit was best, though we still suspect overfitting. Additionally, it is important to acknowledge that this data set does not account for many of the relevant predictors for lung cancer, including age, race, genetics, and, perhaps most importantly, smoking.
Additionally, for our melanoma model, the residuals were slightly heteroscedastic and minimally curved but not unreasonable. The Q-Q plot failed to display normality, though, and there were multiple significant outliers with high leverage. Even after removing certain predictors and transforming new models, the BIC fit model was still the most predictive in both leave-one-out and Monte Carlo cross-validation. We, again, suspect overfitting but cannot confirm this without running more tests and expanding our research. Additionally, we did not employ methods specific to longitudinal or time-series data, such as generalized least squares or distributed lag models, nor did we attempt penalized methods such as the lasso.
Our models using cross-sectional, county-level data made far more sense.
For lung cancer, the residuals of the cross-sectional model are much more reasonable. The residuals are well dispersed, the Q-Q plot is far more normal than the longitudinal model, and there are similarly no high outliers. Yet again, the backwards-selected BIC model is our best-fit. We can also see that the strongest term in the model is the interaction between state and ozone, which may indicate potential policy and industry differences by state.
When considering the melanoma cross-sectional model, we see that there are well dispersed residuals, a relatively normal Q-Q plot, and no high leverage points. Additionally, we see significant and important model terms in the form of median EDD values in the winter and spring. This was an interesting finding, as you may expect summer EDD, when it is most likely to be sunny, to have the most effect. However, this relationship may change based on sunscreen use. Individuals may be more likely to use sunscreen in the summer, leaving their skin more vulnerable in the off-season months.
Finally, we consider our asthma cross-sectional model. In the diagnostic plots visible on the Regression page, we see that there are numerous outliers that led us to filter out the points and re-model. We end up with a well-distributed residual plot but notable abnormality. Again, we see that BIC is the best model through cross-validation, with a significant interaction between states and ozone levels.
Although we focused our attention on three neighboring states, availability of surveillance data for chronic health conditions varies greatly across these three states. These variations in data collection make it difficult for comparison across states, as the scales may not be comparable or the metric for measuring incidence in the population may differ between states. For instance, asthma incidence can be reported with respect to acute cases as measured by emergency department visits or more specifically with rates of hospitalization due exclusively to asthma. Consequently, when comparing exposures as potential risk factors for asthma across the states and counties, we were limited to only looking at hospitalizations because publicly available Pennsylvania data only reported hospitalizations and not emergency department visits. Similarly, surveillance data may not collect demographic data that have been shown to be effect measure modifiers of the health outcomes. For instance, gender, race, and ethnicity have been reported in academic literature as important factors influencing differences observed in incidence rates of lung cancer globally.
These important effect measure modifiers have been discussed not only in academic literature (e.g., Schabeth et al. 2016,), Meza et al. (2015), and O’Keeffe et al. (2018) ) but also in reports by health organizations like the American Cancer Society) and in data visualizations by the World Health Organization. However, not all counties or states may report these relevant data as routinely if at all to provide a complete picture of the disease incidence.