Hypothesis testing

Lecture 23

Author: Josh Lim
Affiliation: Duke University, STA 199 Spring 2026
Published: April 15, 2026

While you wait: Participate 📱💻

Sampling uncertainty refers to…

uncertainty about the accuracy of data collection
uncertainty about whether our statistical model is true
sensitivity of results to human error
variation in conclusions across different models
variation in estimates across alternative datasets

This ain’t Zito

Hi, I’m Josh!

2nd Year PhD Student in Statistical Science
Teaching this course in summer
Fun fact: I’m a professional Saja Boy (Kpop Demon Hunters)

Zito the GOAT

Hypothesis testing

A hypothesis test is a statistical technique used to evaluate competing claims using data.

Null hypothesis, \(H_0\): An assumption about the population. “There is nothing going on.”
Alternative hypothesis, \(H_A\): A research question about the population. “There is something going on.”

. . .

Note: Hypotheses are always at the population level!

Running Example

glimpse(volleyball)

Rows: 25
Columns: 4
$ name   <chr> "Maguilaura Frias", "Maria Elena Aguilera", "Mitzy Natalia Gonz…
$ height <dbl> 186, 153, 168, 173, 162, 190, 180, 184, 188, 195, 186, 175, 183…
$ spike  <dbl> 291, 260, 211, 296, 263, 290, 308, 298, 305, 312, 298, 292, 301…
$ block  <dbl> 280, 240, 209, 286, 253, 285, 231, 295, 295, 300, 285, 281, 288…

Volleyball Dataset on FIVB Athletes
Spike: How high an athlete touches on a spike (cm)
Block: How high an athlete touches on a block (cm)
Height (cm)

Running Example cont

Question: Is an athlete’s spike touch significantly related to their height?

observed_fit <- volleyball |>
  specify(spike ~ height) |>
  fit()

observed_fit

# A tibble: 2 × 2
  term      estimate
  <chr>        <dbl>
1 intercept   0.0540
2 height      1.62
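
For a single predictor, the slope reported above should match the ordinary least-squares slope, which can be checked by hand. The sketch below uses base R only and a small made-up data frame (`toy` is hypothetical, not the real `volleyball` data), so the numbers will not match the slides:

```r
# Hypothetical toy data standing in for the volleyball dataset (values are made up)
toy <- data.frame(
  height = c(160, 165, 170, 175, 180, 185, 190, 195),
  spike  = c(262, 270, 281, 285, 296, 300, 309, 318)
)

# Least-squares fit of spike on height
toy_fit <- lm(spike ~ height, data = toy)
b1_lm <- coef(toy_fit)[["height"]]

# The same slope from the textbook formula: cov(x, y) / var(x)
b1_formula <- cov(toy$height, toy$spike) / var(toy$height)

c(lm = b1_lm, formula = b1_formula)
```

Both entries of the printed vector should agree, which is a quick sanity check that the "slope" is just the usual regression slope.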

Setting hypotheses

Null hypothesis, \(H_0\): “There is nothing going on.” The slope of the model for predicting the spike touch of FIVB athletes from their heights is 0, \(\beta_1 = 0\).
Alternative hypothesis, \(H_A\): “There is something going on.” The slope of the model for predicting the spike touch of FIVB athletes from their heights is different from 0, \(\beta_1 \ne 0\).

Hypothesis testing “mindset”

Assume you live in a world where the null hypothesis is true: \(\beta_1 = 0\).
Ask yourself how likely you are to observe the sample statistic, or something even more extreme, in this world: \(P(b_1 \leq -1.62 \text{ or } b_1 \geq 1.62 \mid \beta_1 = 0) = \,?\)

Hypothesis testing as a court trial

Null hypothesis, \(H_0\): Defendant is innocent
Alternative hypothesis, \(H_A\): Defendant is guilty

. . .

Present the evidence: Collect data

. . .

Judge the evidence: “Could these data plausibly have happened by chance if the null hypothesis were true?”
- Yes: Fail to reject \(H_0\)
- No: Reject \(H_0\)

Hypothesis testing as medical diagnosis

Null hypothesis, \(H_0\): patient is fine
Alternative hypothesis, \(H_A\): patient is sick

. . .

Present the evidence: Collect data

. . .

Judge the evidence: “Could these data plausibly have happened by chance if the null hypothesis were true?”
- Yes: Fail to reject \(H_0\)
- No: Reject \(H_0\)

Hypothesis testing framework

Start with a null hypothesis, \(H_0\), that represents the status quo
Set an alternative hypothesis, \(H_A\), that represents the research question, i.e. what we’re testing for
Conduct a hypothesis test under the assumption that the null hypothesis is true and calculate a p-value (the probability of the observed outcome, or a more extreme one, given that the null hypothesis is true)
- If the data do not provide convincing evidence for the alternative hypothesis, stick with the null hypothesis
- If they do, reject the null hypothesis in favor of the alternative

Calculate observed slope

… which we have already done:

observed_fit <- volleyball |>
  specify(spike ~ height) |>
  fit()

observed_fit

# A tibble: 2 × 2
  term      estimate
  <chr>        <dbl>
1 intercept   0.0540
2 height      1.62

Simulate null distribution

set.seed(24601)
null_dist <- volleyball |>
  specify(spike ~ height) |>
  hypothesize(null = "independence") |>
  generate(reps = 10000, type = "permute") |>
  fit()

Wait, so what’s a null distribution?

Simulating a universe under \(H_0\): the slope of the model for predicting the spike touch of FIVB athletes from their heights is 0, \(\beta_1 = 0\). I.e., no relationship between spike and height.
If I “shuffle” the order of the spike variable, is there a relationship between spike and height? Could height predict spike touch?

Participate 📱💻

If I “shuffle” the order of the spike variable, is there a relationship between spike and height? Could height predict spike touch?

On Average, Yes
On Average, No
Never
I want lunch, stop talking about volleyball

Shuffling is cool, trust

On average, no
Shuffling breaks any link between spike and height (on average)
type = "permute"
Shuffle the spike variable, fit a model and get a slope, record the slope, repeat!
If the null is true, the distribution of these slopes should be centered on zero!
More on this in STA 221L
What if we shuffled height instead? Both?
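
The shuffle, fit, record, repeat recipe above can be sketched in base R. This is a minimal illustration, not what `generate(type = "permute")` does internally: `toy`, `shuffled_slope()`, the seed, and the 1000 repetitions are all assumptions made for the example.

```r
set.seed(199)  # arbitrary seed, only so this sketch is reproducible

# Hypothetical toy data (made-up values, not the volleyball dataset)
toy <- data.frame(
  height = c(160, 165, 170, 175, 180, 185, 190, 195),
  spike  = c(262, 270, 281, 285, 296, 300, 309, 318)
)

# One shuffle: permute spike, refit the model, and return the slope on height
shuffled_slope <- function(data) {
  shuffled <- data
  shuffled$spike <- sample(shuffled$spike)
  coef(lm(spike ~ height, data = shuffled))[["height"]]
}

# Repeat many times to build a simulated null distribution of slopes
null_slopes <- replicate(1000, shuffled_slope(toy))

mean(null_slopes)     # should be close to 0
summary(null_slopes)  # individual shuffles still vary around 0
```

Any single shuffle can produce a positive or negative slope by chance; it is the collection of slopes that centers on zero.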

Every day I’m shuffling

ggplot(volleyball, aes(x = height, y = spike)) +
  geom_point() + 
  geom_smooth(method = "lm")

Every day I’m shuffling

ggplot(volleyball, aes(x = height, y = sample(spike))) +
  geom_point() + 
  geom_smooth(method = "lm")

Wait, what’s that new function

rogues_gallery <- c("Josh Lim", "John Zito", "Sarah Wu", "Katie Solarz")
rogues_gallery

[1] "Josh Lim"     "John Zito"    "Sarah Wu"     "Katie Solarz"

. . .

set.seed(8675309)

sample(rogues_gallery)

[1] "Katie Solarz" "Sarah Wu"     "John Zito"    "Josh Lim"
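
Called with just a vector, `sample()` returns a random permutation of it: the same values in a new order, with nothing added or dropped. A small check of that property, using the first five heights from the `glimpse()` output (the seed is arbitrary):

```r
set.seed(1)  # arbitrary seed for this sketch

heights <- c(186, 153, 168, 173, 162)  # first five heights shown by glimpse()
shuffled <- sample(heights)

shuffled                                  # same values, reordered at random
identical(sort(shuffled), sort(heights))  # TRUE: nothing added or dropped
length(shuffled) == length(heights)       # TRUE
```

This is why shuffling one column leaves its distribution untouched while breaking its pairing with the other column.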

What if this was my shuffle?

# A tibble: 2 × 2
  term        estimate
  <chr>          <dbl>
1 (Intercept)  336.   
2 height        -0.231

What if this was my shuffle instead?

# A tibble: 2 × 2
  term        estimate
  <chr>          <dbl>
1 (Intercept)  389.   
2 height        -0.526

What if this was my shuffle instead?

# A tibble: 2 × 2
  term        estimate
  <chr>          <dbl>
1 (Intercept)  242.   
2 height         0.291

Rinse and repeat 10000 times…

View null distribution

null_dist

# A tibble: 20,000 × 3
# Groups:   replicate [10,000]
   replicate term      estimate
       <int> <chr>        <dbl>
 1         1 intercept 292.    
 2         1 height      0.0132
 3         2 intercept 301.    
 4         2 height     -0.0377
 5         3 intercept 402.    
 6         3 height     -0.594 
 7         4 intercept 366.    
 8         4 height     -0.395 
 9         5 intercept 449.    
10         5 height     -0.853 
# ℹ 19,990 more rows

Visualize null distribution

null_dist |>
  filter(term == "height") |>
  ggplot(aes(x = estimate)) +
  geom_histogram(binwidth = 0.2)

Reminder: observed fit

What was our estimate to begin with?

observed_fit <- volleyball |>
  specify(spike ~ height) |>
  fit()

observed_fit

# A tibble: 2 × 2
  term      estimate
  <chr>        <dbl>
1 intercept   0.0540
2 height      1.62

Visualize null distribution (alternative)

visualize(null_dist) +
  shade_p_value(obs_stat = observed_fit, direction = "two-sided")

Get p-value

null_dist |>
  get_p_value(obs_stat = observed_fit, direction = "two-sided")

# A tibble: 2 × 2
  term      p_value
  <chr>       <dbl>
1 height      0.002
2 intercept   0.002
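
Conceptually, a simulation-based p-value is the share of null slopes that are at least as extreme as the observed one. The sketch below shows two common two-sided conventions on a tiny made-up vector of null slopes. Which convention `get_p_value()` uses is not stated in these slides; my understanding is that it doubles the smaller tail, but treat that as an assumption and check the package documentation.

```r
# Made-up null slopes, standing in for the `height` rows of null_dist
null_slopes <- c(-0.9, -0.6, -0.4, -0.1, 0.0, 0.2, 0.3, 0.5, 0.7, 1.7)
obs_slope <- 1.62  # observed slope from the slides

# Convention 1: share of null slopes at least as far from 0 as the observed one
p_abs <- mean(abs(null_slopes) >= abs(obs_slope))

# Convention 2: double the smaller one-sided tail, capped at 1
left_tail  <- mean(null_slopes <= obs_slope)
right_tail <- mean(null_slopes >= obs_slope)
p_doubled  <- min(1, 2 * min(left_tail, right_tail))

c(p_abs = p_abs, p_doubled = p_doubled)
```

With a roughly symmetric null distribution and many repetitions, the two conventions tend to give similar answers; with only ten made-up slopes they differ noticeably.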

Make a decision

Based on the p-value calculated, what is the conclusion of the hypothesis test?

Recall that the p-value is the probability of observing your data, or something more extreme, given that the null is true.
Since \(p = 0.002\) is very small, we reject the null hypothesis: the data provide evidence of a statistically significant relationship between an FIVB player’s spike touch and height.

How small is small enough?

Pick a threshold \(\alpha\in[0,\,1]\) called the discernibility level and threshold the \(p\)-value:

If \(p\text{-value} < \alpha\), reject the null in favor of the alternative;
If \(p\text{-value} \geq \alpha\), fail to reject the null.
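
The thresholding rule can be written as a tiny helper. `decide()` is a hypothetical function for illustration, and the default of `alpha = 0.05` is an assumption (a common convention, not a value fixed by these slides):

```r
# Hypothetical helper: turn a p-value and a level alpha into a decision
decide <- function(p_value, alpha = 0.05) {
  stopifnot(p_value >= 0, p_value <= 1, alpha >= 0, alpha <= 1)
  if (p_value < alpha) "reject H0" else "fail to reject H0"
}

decide(0.002)                 # p-value from the slides, assumed alpha = 0.05
decide(0.002, alpha = 0.001)  # a stricter level changes the decision
decide(0.05)                  # exactly at the threshold: not strictly below alpha
```

The same p-value can lead to different decisions under different levels, which is why \(\alpha\) should be chosen before looking at the data.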

Sometimes the test will be wrong

Think about the judge

\(H_0\): person is innocent vs. \(H_A\): person is guilty

Think about the doctor

\(H_0\): person is well vs. \(H_A\): person is sick
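
One way the test can be wrong is a false alarm: rejecting the null when it is actually true, like convicting an innocent person. The simulation below is a rough sketch of how often that happens. To keep it fast it uses the t-test p-value from `lm()` rather than the permutation approach in these slides, and the sample size, seed, number of simulations, and `alpha = 0.05` are all assumptions.

```r
set.seed(23)  # arbitrary seed for this sketch

# Simulate one dataset where the null is true (x and y unrelated) and
# return the p-value for the slope from lm()'s t-test
one_null_p_value <- function(n = 25) {
  x <- rnorm(n)
  y <- rnorm(n)
  summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"]
}

p_values <- replicate(2000, one_null_p_value())

alpha <- 0.05
mean(p_values < alpha)  # share of false alarms; should land near alpha
```

In a world where the null is true, the test still rejects in roughly a fraction \(\alpha\) of datasets, so choosing \(\alpha\) amounts to choosing how many false alarms you are willing to tolerate.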

Where does alpha come into this?