
Lecture 22
Duke University
STA 199 Spring 2026
2026-04-13
The line on the left is the 8% quantile of the data, and the line on the right is the 87% quantile of the data. What fraction of the data lives between the two lines?

Scan the QR code or go HERE. Log in with your Duke NetID.
If JZ had to boil statistics down to one main idea, it would be:
quantifying uncertainty to help make decisions.
You make different kinds of decisions if you’re sure versus unsure. Statistics helps you quantify the reliability of your knowledge so that you can determine what sort of decision to make.
Different data give different estimates. How different?
That’s the main idea in a nutshell.
Recall the openintro::loans_full_schema data frame:
10000 rows;
each row is an approved loan applicant;
the columns contain financial info about that person, including…
What would you guess is the direction of association between these two variables?
(I just took logs to make the picture prettier.)
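A hedged sketch of how that picture could be drawn. The slides don't name the two variables, so annual_income and loan_amount from loans_full_schema are assumptions here:

library(tidyverse)
library(openintro)

loans_full_schema |>
  filter(annual_income > 0) |>      # drop zero incomes before taking logs
  ggplot(aes(x = log(annual_income), y = log(loan_amount))) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "lm") +      # best-fit line plus grey uncertainty band
  labs(x = "log(annual income)", y = "log(loan amount)")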
As the sample size grew, the best fit line stabilized;
As the sample size grew, the grey uncertainty band shrank;
As the sample size grew, we observed a larger range of income values, and the computer displayed more of the line;
As the sample size grows, the picture the data paint becomes clearer:
Which would you rather have for your data analysis? 5 people in your dataset or 9947? Why?
We do not know what the “true” line is;
Our estimates are a best guess based on noisy, incomplete, imperfect data;
The more data we have, the more “certain” and “reliable” the estimates are;
What do we mean by “uncertainty” here?
Fact: different data set -> different estimates;
How much would our estimate vary across alternative datasets?
These tiny data sets can't even agree on whether the line should slope up or down. Uncertainty is high, hence the wide bands.
If we repeat the process with a larger sample size, things are more stable.
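Estimates like the ones shown below could be produced by code along these lines. A hedged sketch: log_inc matches the output below, but using log loan amount as the response and drawing 50 rows per sample are assumptions:

library(tidyverse)   # includes dplyr and purrr
library(openintro)
library(broom)

loans <- loans_full_schema |>
  filter(annual_income > 0) |>
  mutate(log_inc = log(annual_income), log_loan = log(loan_amount))

set.seed(1)
# Draw six small samples and refit the same model on each;
# watch how much the estimates jump around from sample to sample.
map(1:6, \(i) {
  samp <- slice_sample(loans, n = 50)
  tidy(lm(log_loan ~ log_inc, data = samp)) |>
    select(term, estimate)
})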

# A tibble: 2 × 2
term estimate
<chr> <dbl>
1 (Intercept) 4.27
2 log_inc 0.553


# A tibble: 2 × 2
term estimate
<chr> <dbl>
1 (Intercept) -4.63
2 log_inc 1.35


# A tibble: 2 × 2
term estimate
<chr> <dbl>
1 (Intercept) -1.14
2 log_inc 1.04


# A tibble: 2 × 2
term estimate
<chr> <dbl>
1 (Intercept) -0.288
2 log_inc 0.960


# A tibble: 2 × 2
term estimate
<chr> <dbl>
1 (Intercept) 4.84
2 log_inc 0.492


# A tibble: 2 × 2
term estimate
<chr> <dbl>
1 (Intercept) 3.20
2 log_inc 0.654



The amount of variation in the histogram tells us something about the uncertainty, and gives us a range of likely values.
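As a sketch of this idea in code (the slopes vector below is a toy stand-in for a collection of slope estimates, one per alternative dataset):

library(ggplot2)

slopes <- rnorm(1000, mean = 1, sd = 0.3)   # toy stand-in for slope estimates

ggplot(data.frame(slope = slopes), aes(x = slope)) +
  geom_histogram(bins = 30) +
  labs(x = "Slope estimate across alternative datasets", y = "Count")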
To do this, we would need multiple datasets. But in practice we only have one. So now what?
We approximate this idea of “alternative, hypothetical datasets I could have observed” by resampling our data with replacement;
We construct a new dataset of the same size by randomly picking rows out of the original one:
Repeat this process hundreds or thousands of times, and observe how the estimates vary as you refit the model on alternative datasets;
This gives you a sense of the sampling variability of your estimates.
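A minimal sketch of one resample-and-refit step, using toy data (slice_sample() with replace = TRUE does the resampling):

library(tidyverse)
library(broom)

set.seed(1)
dat <- tibble(x = rnorm(6), y = rnorm(6))   # toy stand-in for the observed data

# One bootstrap resample: same number of rows, drawn with replacement
boot <- dat |> slice_sample(prop = 1, replace = TRUE)

# Refit the model; these estimates will differ from the original fit's
tidy(lm(y ~ x, data = boot)) |> select(term, estimate)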
Original data
# A tibble: 6 × 3
id x y
<int> <dbl> <dbl>
1 1 0.432 1.53
2 2 -2.01 1.80
3 3 -0.0467 1.43
4 4 -1.05 0.0518
5 5 0.327 0.820
6 6 -0.679 -0.961
Original estimates:
# A tibble: 2 × 2
term estimate
<chr> <dbl>
1 (Intercept) 0.801
2 x 0.0450
Sample with replacement:
# A tibble: 6 × 3
id x y
<int> <dbl> <dbl>
1 5 0.327 0.820
2 6 -0.679 -0.961
3 6 -0.679 -0.961
4 1 0.432 1.53
5 6 -0.679 -0.961
6 1 0.432 1.53
Different data -> new estimates:
# A tibble: 2 × 2
term estimate
<chr> <dbl>
1 (Intercept) 0.462
2 x 2.11
Original data
# A tibble: 6 × 3
id x y
<int> <dbl> <dbl>
1 1 0.432 1.53
2 2 -2.01 1.80
3 3 -0.0467 1.43
4 4 -1.05 0.0518
5 5 0.327 0.820
6 6 -0.679 -0.961
Original estimates:
# A tibble: 2 × 2
term estimate
<chr> <dbl>
1 (Intercept) 0.801
2 x 0.0450
Sample with replacement:
# A tibble: 6 × 3
id x y
<int> <dbl> <dbl>
1 2 -2.01 1.80
2 5 0.327 0.820
3 1 0.432 1.53
4 6 -0.679 -0.961
5 3 -0.0467 1.43
6 2 -2.01 1.80
Different data -> new estimates:
# A tibble: 2 × 2
term estimate
<chr> <dbl>
1 (Intercept) 0.913
2 x -0.236
Original data
# A tibble: 6 × 3
id x y
<int> <dbl> <dbl>
1 1 0.432 1.53
2 2 -2.01 1.80
3 3 -0.0467 1.43
4 4 -1.05 0.0518
5 5 0.327 0.820
6 6 -0.679 -0.961
Original estimates:
# A tibble: 2 × 2
term estimate
<chr> <dbl>
1 (Intercept) 0.801
2 x 0.0450
Sample with replacement:
# A tibble: 6 × 3
id x y
<int> <dbl> <dbl>
1 6 -0.679 -0.961
2 1 0.432 1.53
3 5 0.327 0.820
4 6 -0.679 -0.961
5 6 -0.679 -0.961
6 5 0.327 0.820
Different data -> new estimates:
# A tibble: 2 × 2
term estimate
<chr> <dbl>
1 (Intercept) 0.357
2 x 1.96
openintro::duke_forest

Goal: Use the area (in square feet) to understand variability in the price of houses in Duke Forest.
df_fit <- linear_reg() |>
  fit(price ~ area, data = duke_forest)

tidy(df_fit) |>
  kable(digits = 2) # neatly format table to 2 digits

| term        | estimate  | std.error | statistic | p.value |
|-------------|-----------|-----------|-----------|---------|
| (Intercept) | 116652.33 | 53302.46  | 2.19      | 0.03    |
| area        | 159.48    | 18.17     | 8.78      | 0.00    |
For each additional square foot, we expect the sale price of Duke Forest houses to be higher by $159, on average.
Statistical inference provides methods and tools so that we can use a single observed sample to make valid statements (inferences) about the population it comes from
For our inferences to be valid, the sample should be random and representative of the population we’re interested in
Calculate a confidence interval for the slope, \(\beta_1\) (today)
Conduct a hypothesis test for the slope, \(\beta_1\) (Thursday)
A confidence interval will allow us to make a statement like “For each additional square foot, the model predicts the sale price of Duke Forest houses to be higher, on average, by $159, plus or minus X dollars.”
Should X be $10? $100? $1000?
If we were to take another sample of 98 houses, would we expect the slope calculated from that sample to be exactly $159? Off by $10? $100? $1000?
The answer depends on how variable the sample statistic (the slope) is from one sample to another
We need a way to quantify the variability of the sample statistic for estimation.
so on and so forth…
Fill in the blank: For each additional square foot, the model predicts the sale price of Duke Forest houses to be higher, on average, by $159, plus or minus ___ dollars.
How confident are you that the true slope is between $0 and $250? How about $150 and $170? How about $90 and $210?
Quantiles!
Think IQR! 50% of the bootstrap distribution lives between the 25% quantile on the left and the 75% quantile on the right. But we want more than 50%.
90% of the bootstrap distribution is between the 5% quantile on the left and the 95% quantile on the right;
95% of the bootstrap distribution is between the 2.5% quantile on the left and the 97.5% quantile on the right;
And so on.
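In code, with base R quantile() (boot_slopes below is a toy stand-in for a vector of bootstrap slope estimates):

boot_slopes <- rnorm(1000)   # toy stand-in for bootstrap slope estimates

# Middle 90%: cut off 5% in each tail
quantile(boot_slopes, probs = c(0.05, 0.95))

# Middle 95%: cut off 2.5% in each tail
quantile(boot_slopes, probs = c(0.025, 0.975))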
Go to your ae project in RStudio.
If you haven’t yet done so, make sure all of your changes up to this point are committed and pushed, i.e., there’s nothing left in your Git pane.
If you haven’t yet done so, click Pull to get today’s application exercise file: ae-22-duke-forest-bootstrap.qmd.
Work through the application exercise in class, and render, commit, and push your edits.
Calculate the observed slope:
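With the infer package, a sketch of this step; it produces the observed_fit object that the interval code below expects:

# assumes infer and openintro (for duke_forest) are loaded
observed_fit <- duke_forest |>
  specify(price ~ area) |>
  fit()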
Take 100 bootstrap samples and fit models to each one:
set.seed(1120)
boot_fits <- duke_forest |>
  specify(price ~ area) |>
  generate(reps = 100, type = "bootstrap") |>
  fit()

boot_fits
# A tibble: 200 × 3
# Groups: replicate [100]
replicate term estimate
<int> <chr> <dbl>
1 1 intercept 47819.
2 1 area 191.
3 2 intercept 144645.
4 2 area 134.
5 3 intercept 114008.
6 3 area 161.
7 4 intercept 100639.
8 4 area 166.
9 5 intercept 215264.
10 5 area 125.
# ℹ 190 more rows
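Not from the original slides, but a quick sketch to visualize this bootstrap distribution before taking quantiles (assumes boot_fits from above, with ggplot2 and dplyr loaded):

boot_fits |>
  filter(term == "area") |>
  ggplot(aes(x = estimate)) +
  geom_histogram(bins = 15) +
  labs(x = "Bootstrap slope estimate (area)", y = "Count")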
Percentile method: Compute the 95% CI as the middle 95% of the bootstrap distribution:
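A sketch for the 95% level, mirroring the 90% and 99% calls further down:

## confidence level: 95%
get_confidence_interval(
  boot_fits, point_estimate = observed_fit,
  level = 0.95, type = "percentile"
)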
If we want to be very certain that we capture the population parameter, should we use a wider or a narrower interval? What drawbacks are associated with using a wider interval?

How can we get the best of both worlds: high precision and high accuracy?
How would you modify the following code to calculate a 90% confidence interval? How would you modify it for a 99% confidence interval?
## confidence level: 90%
get_confidence_interval(
  boot_fits, point_estimate = observed_fit,
  level = 0.90, type = "percentile"
)
# A tibble: 2 × 3
term lower_ci upper_ci
<chr> <dbl> <dbl>
1 area 104. 212.
2 intercept -24380. 256730.
## confidence level: 99%
get_confidence_interval(
  boot_fits, point_estimate = observed_fit,
  level = 0.99, type = "percentile"
)
# A tibble: 2 × 3
term lower_ci upper_ci
<chr> <dbl> <dbl>
1 area 56.3 226.
2 intercept -61950. 370395.
Population: Complete set of observations of whatever we are studying, e.g., people, tweets, photographs, etc. (population size = \(N\))
Sample: Subset of the population, ideally random and representative (sample size = \(n\))
Sample statistic \(\ne\) population parameter, but if the sample is good, it can be a good estimate
Statistical inference: Discipline that concerns itself with the development of procedures, methods, and theorems that allow us to extract meaning and information from data that has been generated by a stochastic (random) process
We report the estimate with a confidence interval, and the width of this interval depends on the variability of sample statistics computed from different samples of the population
Since we can’t continue sampling from the population, we bootstrap from the one sample we have to estimate sampling variability