The data for this application exercise comes from the lterdatasampler package. The mission of the Long Term Ecological Research program (LTER) Network is to “provide the scientific community, policy makers, and society with the knowledge and predictive understanding necessary to conserve, protect, and manage the nation’s ecosystems, their biodiversity, and the services they provide.”
Specifically, we’ll be using data from the North Temperate Lakes LTER (NTL-LTER) site, which is located in the Madison, WI area. We’ll model the relationship between the number of days that a lake is frozen (excluding periods where the lake thaws before refreezing) and annual average temperature.
We will use the tidyverse package for data visualization and wrangling and the tidymodels package for modeling.
The data can be found in the data folder; it’s called icecover.csv.
icecover <- read_csv("data/icecover.csv")
The data dictionary is below:
| Variable Name | Description |
|---|---|
| lakeid | Lake name |
| ice_on | Date of freeze of lake |
| ice_off | Date of ice breakup of lake |
| ice_duration | Number of days between the freeze and breakup dates of each lake |
| year | Year of observation |
| annual_avg_temp | Annual average air temperature (°C) |
We’re going to investigate the relationship between ice_duration and annual_avg_temp.
ggplot(icecover, aes(x = annual_avg_temp, y = ice_duration)) +
  geom_point() +
  labs(
    x = "Annual average temperature (°C)",
    y = "Ice duration (days)",
    title = "Ice duration vs. annual average temperature",
    subtitle = "North Temperate Lakes LTER"
  )
If you were to draw a straight line to best represent the relationship between ice duration and annual average temperature, where would it go? Why?
Now, let R draw the line for you. Refer to the documentation at https://ggplot2.tidyverse.org/reference/geom_smooth.html. Specifically, refer to the method section.
ggplot(icecover, aes(x = annual_avg_temp, y = ice_duration)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(
    x = "Annual average temperature (°C)",
    y = "Ice duration (days)",
    title = "Ice duration vs. annual average temperature",
    subtitle = "North Temperate Lakes LTER"
  )
`geom_smooth()` using formula = 'y ~ x'

What types of questions can this plot help answer?
We can use this line to make predictions. Predict what you think the ice duration would be in a year with annual average temperature of 7, 10, and 12 °C. Which prediction is considered extrapolation?
Fit a model to predict ice duration from annual average temperature.
ice_temp_fit <- linear_reg() |>
  fit(ice_duration ~ annual_avg_temp, data = icecover)

ice_temp_fit
parsnip model object

Call:
stats::lm(formula = ice_duration ~ annual_avg_temp, data = data)

Coefficients:
    (Intercept)  annual_avg_temp
        148.110           -6.182
tidy(ice_temp_fit)
# A tibble: 2 × 5
  term            estimate std.error statistic  p.value
  <chr>              <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)       148.        7.77     19.1  1.38e-53
2 annual_avg_temp    -6.18      1.01     -6.10 3.31e- 9
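Written out as an equation, the fitted line from the output above is (coefficients rounded to two decimals):

$$
\widehat{\text{ice\_duration}} = 148.11 - 6.18 \times \text{annual\_avg\_temp}
$$

Under this model, each additional 1 °C in annual average temperature is associated with, on average, about 6.18 fewer days of ice cover. The intercept, 148.11 days, is the model's predicted ice duration for a year averaging 0 °C, which may lie outside the range of temperatures actually observed.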
Before, you used your eyeballs to predict what the ice duration would be in a year with annual average temperature of 7, 10, and 12 °C. Now let’s have the computer do it.
Use R as an overgrown calculator to compute the model prediction for each of the three temperatures by plugging into your model formula and doing the arithmetic:
148.109981 - 6.181554 * 7
[1] 104.8391
148.109981 - 6.181554 * 10
[1] 86.29444
148.109981 - 6.181554 * 12
[1] 73.93133
Use the predict function to perform the same calculation. You should get the same numbers (aside from rounding error).
We can also assess correlation between two quantitative variables.
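For the predict step above, here is a minimal sketch of one way it might look. To keep the block runnable on its own, it fits the model to a stand-in data set called `toy` (a hypothetical name) whose points are placed exactly on the fitted line reported earlier. These are not the real icecover observations. In the exercise you would pass `ice_temp_fit` to `predict()` instead.

```r
library(tidymodels)

# Stand-in data so the sketch runs on its own: five points placed exactly on
# the fitted line reported above. These are NOT the real icecover observations;
# in the exercise, use the model fit on `icecover` instead.
toy <- tibble(annual_avg_temp = c(5, 6, 7, 8, 9)) |>
  mutate(ice_duration = 148.109981 - 6.181554 * annual_avg_temp)

toy_fit <- linear_reg() |>
  fit(ice_duration ~ annual_avg_temp, data = toy)

# New temperatures must be in a column with the same name as the predictor
new_temps <- tibble(annual_avg_temp = c(7, 10, 12))

# predict() on a parsnip fit returns a tibble with a `.pred` column
preds <- predict(toy_fit, new_data = new_temps)

# Show each temperature next to its predicted ice duration
bind_cols(new_temps, preds)
```

Because the stand-in points lie on the reported line, the `.pred` values should match the hand arithmetic up to rounding.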
What is correlation? What values can correlation take?
What is the correlation between temperature and ice duration?
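As a rough illustration of how a correlation might be computed in R, the sketch below uses `cor()` on a handful of made-up values. The vectors `temp` and `days` are hypothetical toy numbers, not the icecover data. It also checks the link between correlation and the least-squares slope.

```r
# Hypothetical toy values for illustration only (not the icecover data)
temp <- c(5, 6, 7, 8, 9)           # annual average temperature (°C)
days <- c(120, 112, 107, 98, 95)   # ice duration (days)

# Pearson correlation: unitless and always between -1 and 1
r <- cor(temp, days)

# The least-squares slope equals r * sd(y) / sd(x),
# so the correlation and the slope always share a sign
slope_from_r <- r * sd(days) / sd(temp)
slope_from_lm <- unname(coef(lm(days ~ temp))[2])

c(r = r, slope_from_r = slope_from_r, slope_from_lm = slope_from_lm)
```

With the exercise data, the analogous call would be along the lines of `cor(icecover$annual_avg_temp, icecover$ice_duration, use = "complete.obs")`. The `use` argument is included on the assumption that some rows may have missing values.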
We can fit more models than just a straight line. Change the plotting code from earlier to use method = "loess". What is different from the plot created before?
ggplot(icecover, aes(x = annual_avg_temp, y = ice_duration)) +
  geom_point() +
  geom_smooth(method = "loess") +
  labs(
    x = "Annual average temperature (°C)",
    y = "Ice duration (days)",
    title = "Ice duration vs. annual average temperature",
    subtitle = "North Temperate Lakes LTER"
  )
`geom_smooth()` using formula = 'y ~ x'
