My goal for this project was to practice using R for data analysis. This is why I chose to explore a rather obvious hypothesis: more people will visit U.S. national parks when the weather is warm. The following analysis confirms that, when the temperature is warmer, U.S. national parks tend to receive a higher number of visitors. This effect is more highly pronounced in states that have colder average temperatures.

The following project follows the data analysis process taught in the Google Data Analytics Professional Certificate program on Coursera: ask, prepare, process, analyse, share, act. If you only view one section of this project, I recommend that it be the Share section. Please note that portions of the code from this project have been hidden from this page in order to make it more aesthetically pleasing. Please view the .Rmd file for the full source code.

1. Ask

The goal of this project is to determine whether or not temperature affects the number of recreational visits a national park receives in a given month, and whether or not this effect differs from state to state.

2. Prepare

Parks data source:
https://irma.nps.gov/Stats/

Temperature data source:
https://www.kaggle.com/datasets/justinrwong/average-monthly-temperature-by-us-state?resource=download

States data source:
https://worldpopulationreview.com/states/state-abbreviations

# Parks data
parks_data <- read.csv("parks_data.csv", header = TRUE, sep = ",")

# Temperature data
temp_data <- read.csv("temp_data.csv", header = TRUE, sep = ",")

# States data
states_data <- read.csv("states_data.csv", header = TRUE, sep = ",")

3. Process

# Remove unwanted parks_data rows and columns (State == "" for Alaska and national monuments)
parks_data <- parks_data %>% 
  select(c(5:8)) %>% 
  filter(State != "")

# Add StateName column to parks_data
parks_data$StateName <- states_data$state[match(parks_data$State, states_data$code)]
  
# Add Date column to parks_data, format: YYYY-MM
parks_data$Date <- paste(parks_data$Year, parks_data$Month, sep = "-")
parks_data$Date <- ifelse(nchar(parks_data$Date)==6,gsub("-", "-0", parks_data$Date), parks_data$Date)

# Add date column to temp_data, format: YYYY-MM
temp_data$date <- paste(temp_data$year, temp_data$month, sep = "-")
temp_data$date <- ifelse(nchar(temp_data$date)==6,gsub("-", "-0", temp_data$date), temp_data$date)

# Add AvgTemp column to parks_data
parks_data$MatchCode <- paste(parks_data$Date, parks_data$StateName, sep = "-")
temp_data$match_code <- paste(temp_data$date, temp_data$state, sep = "-")
parks_data$AvgTemp <- temp_data$average_temp[match(parks_data$MatchCode, temp_data$match_code)]

# Reorder columns
parks_data <- parks_data[ , c(6,2,3,2,5,1,8,4)]

# Convert RecreationVisits to integer
parks_data$RecreationVisits <- gsub(",","",parks_data$RecreationVisits)
parks_data$RecreationVisits <- as.integer(parks_data$RecreationVisits)

# Remove NA data (Alaska, American Samoa, DC, Guam, Hawaii, Puerto Rico, Virgin Islands)
parks_data <- parks_data[!is.na(parks_data$AvgTemp), ]

# Change Month to abbreviation
parks_data <- parks_data %>% 
  mutate(Month = as.numeric(Month),
         Month = month.abb[Month])

# Aggregate the data so that all rows have a unique combination of date and state
parks_data <- parks_data %>% 
  group_by(Date, Year, Month, State, StateName) %>%
  summarize(Temp = mean(AvgTemp), Visits = sum(RecreationVisits))

# Convert F to C
parks_data$Temp <- round((5/9) * (parks_data$Temp - 32), 0)

4. Analyze

The r and r^2 of Temp ~ Visits are low. This is because there is data from states with many recreational visits (e.g. CA, NC, NY) that towers over the data from states with fewer visits, pulling down the overall trend line. If parks_data is aggregated on Date (so that all Visits from each State are summed and Temp is averaged), the relationship of Temp ~ Visits becomes stronger and more meaningful.

# Plot Temp ~ Visits to show a weak correlation
plot(parks_data$Temp, parks_data$Visits)

# Display r and r^2 of Temp ~ Visits
print(paste0("r: ", cor(parks_data$Temp, parks_data$Visits),
  "    r^2: ", summary(lm(Temp ~ Visits, data = parks_data))$r.squared))
## [1] "r: 0.326560854448697    r^2: 0.106641991658263"
# Display a bar graph to show that a handful of states receive the majority of park visits
parks_data %>%
  ggplot(aes(x = State, y = Visits)) +
  geom_bar(stat = "identity")

# Group data by date
pd_by_date <- parks_data %>%
  group_by(Date) %>% 
  summarise(Visits = sum(Visits), Temp = mean(Temp))

# Plot Temp ~ Visits again
plot(pd_by_date$Temp, pd_by_date$Visits)

# Display r and r^2 of Temp ~ Visits
print(paste0("r: ", cor(pd_by_date$Temp, pd_by_date$Visits),
  "    r^2: ", summary(lm(Temp ~ Visits, data = pd_by_date))$r.squared))
## [1] "r: 0.914125538227189    r^2: 0.835625499639149"

Looking at the following list of Temp ~ Visits correlations by state, it appears that colder states have higher correlations, and hotter states have lower correlations. Plotting the correlation strength (r) of each state against that state’s average temperature shows that r values seem to lower as average state temperature increases (especially so once average state temperature reaches ~17°C).

# Compare Temp ~ Visits correlations by state (4 strongest cor vs 4 weakest cor)
cor_find(parks_data, "StateName") %>%
  filter(min_rank(CorTempVisits) <= 4 | rank(desc(CorTempVisits)) <= 4  ) %>% 
  arrange(desc(CorTempVisits))
##        StateName      CorTempVisits
## 1 North Carolina  0.963064146750479
## 2           Ohio  0.919871960834953
## 3  West Virginia  0.913962045609577
## 4   North Dakota  0.905412905774083
## 5        Alabama   0.47933497730571
## 6          Texas   0.45227354368609
## 7        Florida  0.127684927619436
## 8      Louisiana 0.0274071784206093
# Plot Temp ~ Visits correlations
plot(temp_v_cor$Temp, temp_v_cor$CorTempVisits)

5. Share

It can be seen in the following plots that higher temperatures tend to be associated with more park visits. In the two plots showing data from the hottest four states, this correlation is much weaker (and the trend line slope is much flatter). The correlation is stronger in the colder states. Since the number of visits a park receives can never fall below zero, the visits data hits a floor at a temperature of ~5°C (making the trend line less steep than it probably should be).

# Make a 2X2 grid of the 4 plots
plot_grid(
CorGraph(all_states,"United States"),
CorGraph(moderate_states,"Moderate States"),
CorGraph(hot_states,"Hottest States"),
CorGraph(cold_states,"Coldest States")
)

# Get the legend (which is hidden in the theme)
legend <- get_legend(
    DoubleGraph(all_states,"United States") +
    guides(color = guide_legend(nrow = 1)) +
    theme(legend.position = "bottom", legend.margin = margin(1,1,1,1))
)

# Make a 2X2 grid of the 4 plots
grid <- plot_grid(
  DoubleGraph(all_states,"United States"),
  DoubleGraph(moderate_states,"Moderate States"),
  DoubleGraph(hot_states,"Hottest States"),
  DoubleGraph(cold_states,"Coldest States")
)

# Plot the 2X2 grid with the legend
plot_grid(grid, legend, ncol = 1, rel_heights = c(1, .1))

More vizualizations can be found on this project’s accompanying Tableau dashboard.

6. Act

As outlined above, the reason I completed this project was so that I could practice using R. I had no actionable goal in mind when I chose to analyze this data. If I were to assume what actions the U.S. National Park Service could take based on this analysis, it could be that they would prepare for more recreational visits in warmer months by hiring more seasonal staff. If the National Park Service is interested in increasing the number of visits to their parks, they could do more to entice people to visit during colder months. This could be achieved by offering more winter activities, such as snowshoeing or cross-country skiing.