Hypothesis testing

Lecture 23

Author
Affiliation

Josh Lim

Duke University
STA 199 Spring 2026

Published

April 15, 2026

This ain’t Zito

Hi, I’m Josh!

  • 2nd Year PhD Student in Statistical Science
  • Teaching this course in summer
  • Fun fact: I’m a professional Saja Boy (Kpop Demon Hunter)

Zito the GOAT

Hypothesis testing

A hypothesis test is a statistical technique used to evaluate competing claims using data.

  • Null hypothesis, \(H_0\): An assumption about the population. “There is nothing going on.”

  • Alternative hypothesis, \(H_A\): A research question about the population. “There is something going on.”

Note: Hypotheses are always at the population level!

Running Example

glimpse(volleyball)
Rows: 25
Columns: 4
$ name   <chr> "Maguilaura Frias", "Maria Elena Aguilera", "Mitzy Natalia Gonz…
$ height <dbl> 186, 153, 168, 173, 162, 190, 180, 184, 188, 195, 186, 175, 183…
$ spike  <dbl> 291, 260, 211, 296, 263, 290, 308, 298, 305, 312, 298, 292, 301…
$ block  <dbl> 280, 240, 209, 286, 253, 285, 231, 295, 295, 300, 285, 281, 288…

  • Volleyball dataset on FIVB athletes
  • Spike: How high an athlete touches on a spike (cm)
  • Block: How high an athlete touches on a block (cm)
  • Height: The athlete’s height (cm)

Running Example (cont.)

Question: Is an athlete’s spike touch significantly related to their height?

observed_fit <- volleyball |>
  specify(spike ~ height) |>
  fit()

observed_fit
# A tibble: 2 × 2
  term      estimate
  <chr>        <dbl>
1 intercept   0.0540
2 height      1.62  

Setting hypotheses

  • Null hypothesis, \(H_0\): “There is nothing going on.” The slope of the model for predicting the spike touch of FIVB athletes from their heights is 0, \(\beta_1 = 0\).

  • Alternative hypothesis, \(H_A\): “There is something going on.” The slope of the model for predicting the spike touch of FIVB athletes from their heights is different from 0, \(\beta_1 \ne 0\).

Hypothesis testing “mindset”

  • Assume you live in a world where the null hypothesis is true: \(\beta_1 = 0\).

  • Ask yourself how likely you are to observe the sample statistic, or something even more extreme, in this world: \(P(b_1 \leq -1.62 \text{ or } b_1 \geq 1.62 \mid \beta_1 = 0) = ?\)

Hypothesis testing as a court trial

  • Null hypothesis, \(H_0\): Defendant is innocent

  • Alternative hypothesis, \(H_A\): Defendant is guilty

  • Present the evidence: Collect data

  • Judge the evidence: “Could these data plausibly have happened by chance if the null hypothesis were true?”
    • Yes: Fail to reject \(H_0\)
    • No: Reject \(H_0\)

Hypothesis testing as medical diagnosis

  • Null hypothesis, \(H_0\): patient is fine

  • Alternative hypothesis, \(H_A\): patient is sick

  • Present the evidence: Collect data

  • Judge the evidence: “Could these data plausibly have happened by chance if the null hypothesis were true?”
    • Yes: Fail to reject \(H_0\)
    • No: Reject \(H_0\)

Hypothesis testing framework

  • Start with a null hypothesis, \(H_0\), that represents the status quo

  • Set an alternative hypothesis, \(H_A\), that represents the research question, i.e. what we’re testing for

  • Conduct a hypothesis test under the assumption that the null hypothesis is true and calculate a p-value (the probability of the observed outcome, or a more extreme one, given that the null hypothesis is true)

    • If the test results suggest that the data do not provide convincing evidence for the alternative hypothesis, stick with the null hypothesis
    • If they do, reject the null hypothesis in favor of the alternative

Calculate observed slope

… which we have already done:

observed_fit <- volleyball |>
  specify(spike ~ height) |>
  fit()

observed_fit
# A tibble: 2 × 2
  term      estimate
  <chr>        <dbl>
1 intercept   0.0540
2 height      1.62  

Simulate null distribution

set.seed(24601)
null_dist <- volleyball |>
  specify(spike ~ height) |>
  hypothesize(null = "independence") |>
  generate(reps = 10000, type = "permute") |>
  fit()

Wait, so what’s a null distribution?

  • Simulating a universe under \(H_0\): the slope of the model for predicting the spike touch of FIVB athletes from their heights is 0, \(\beta_1 = 0\). I.e., no relationship between spike and height.

  • If I “shuffle” the order of the spike variable, is there a relationship between spike and height?

  • No! After shuffling, each spike value is paired with a randomly chosen athlete’s height, so any relationship in the original data is broken.

  • Shuffle the spike variable, fit a model and get a slope, record the slope, repeat!

  • More on this in STA 221L
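
The shuffle-fit-record-repeat recipe above can be sketched in base R. This is only an illustration under stated assumptions: the data are simulated stand-ins (not the volleyball dataset), the helper name `slope()` is made up for this example, and the sketch uses 1,000 shuffles rather than the 10,000 used in the slides.

```r
# Hypothetical stand-in data: NOT the volleyball dataset from the slides.
set.seed(1)
n <- 25
height <- round(rnorm(n, mean = 180, sd = 10))
spike <- 1.6 * height + rnorm(n, mean = 0, sd = 12)

# Helper (name is an assumption): slope of the least-squares line y ~ x.
slope <- function(x, y) unname(coef(lm(y ~ x))[2])

# Slope in the data as observed.
observed_slope <- slope(height, spike)

# One shuffle: sample() with no other arguments permutes spike,
# so each spike value is paired with a random athlete's height.
shuffled_spike <- sample(spike)
one_null_slope <- slope(height, shuffled_spike)

# Shuffle, fit, record the slope, repeat.
reps <- 1000
null_slopes <- replicate(reps, slope(height, sample(spike)))

observed_slope     # slope from the unshuffled data
mean(null_slopes)  # null slopes should be centered near 0
```

Each entry of `null_slopes` plays the role of one `height` row in `null_dist`: a slope from a world where spike and height are unrelated by construction.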

View null distribution

null_dist
# A tibble: 20,000 × 3
# Groups:   replicate [10,000]
   replicate term      estimate
       <int> <chr>        <dbl>
 1         1 intercept 292.    
 2         1 height      0.0132
 3         2 intercept 301.    
 4         2 height     -0.0377
 5         3 intercept 402.    
 6         3 height     -0.594 
 7         4 intercept 366.    
 8         4 height     -0.395 
 9         5 intercept 449.    
10         5 height     -0.853 
# ℹ 19,990 more rows

Visualize null distribution

null_dist |>
  filter(term == "height") |>
  ggplot(aes(x = estimate)) +
  geom_histogram(binwidth = 0.2)

Visualize null distribution (alternative)

visualize(null_dist) +
  shade_p_value(obs_stat = observed_fit, direction = "two-sided")

Get p-value

null_dist |>
  get_p_value(obs_stat = observed_fit, direction = "two-sided")
# A tibble: 2 × 2
  term      p_value
  <chr>       <dbl>
1 height      0.002
2 intercept   0.002
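
To see what a two-sided p-value measures, it can help to compute one by hand from a small set of null slopes. The sketch below uses ten made-up null slopes (hypothetical values, not output from the slides) and shows two common conventions. Software packages differ in which convention they use, so treat this as an illustration rather than a description of what `get_p_value()` does internally.

```r
# Hypothetical null slopes, chosen by hand for illustration.
null_slopes <- c(-1.8, -0.9, -0.4, -0.1, 0.0, 0.2, 0.3, 0.7, 1.1, 1.7)
observed_slope <- 1.62

# Convention A: share of null slopes at least as far from 0
# as the observed slope, in either direction.
p_abs <- mean(abs(null_slopes) >= abs(observed_slope))

# Convention B: twice the smaller one-sided tail proportion, capped at 1.
p_right <- mean(null_slopes >= observed_slope)
p_left <- mean(null_slopes <= observed_slope)
p_doubled <- min(1, 2 * min(p_left, p_right))

p_abs      # 0.2: two of the ten null slopes (-1.8 and 1.7) are as extreme
p_doubled  # 0.2: the right tail holds one of ten, doubled
```

The two conventions agree here because the made-up null slopes are roughly symmetric around 0; with a skewed null distribution they can differ.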

Make a decision

Based on the p-value calculated, what is the conclusion of the hypothesis test?

Since the p-value for the slope is very small (\(p = 0.002\)), we reject the null hypothesis: the data provide convincing evidence of a relationship between an FIVB player’s spike touch and height.

Sometimes the test will be wrong

Think about the judge

\(H_0\): person is innocent vs. \(H_A\): person is guilty.

Think about the doctor

\(H_0\): person is well vs. \(H_A\): person is sick.

How do we negotiate the trade-off?

Pick a threshold \(\alpha\in[0,\,1]\), called the discernibility level, and compare the \(p\)-value against it:

  • If \(p\text{-value} < \alpha\), reject the null in favor of the alternative.
  • If \(p\text{-value} \geq \alpha\), fail to reject the null.
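
The threshold rule, and the sense in which the test is sometimes wrong, can be sketched as follows. Several things here are assumptions made for illustration: the helper name `decide()`, the choice \(\alpha = 0.05\), the simulated data, and the use of `cor.test()` (a theory-based test, swapped in for the permutation test only because it is fast to repeat).

```r
# Helper (name is an assumption): apply the threshold rule to a p-value.
decide <- function(p_value, alpha = 0.05) {
  ifelse(p_value < alpha, "reject H0", "fail to reject H0")
}

decide(0.002)  # "reject H0"
decide(0.20)   # "fail to reject H0"

# Simulate many worlds where H0 is true: x and y are generated
# independently, so there is truly no relationship between them.
set.seed(199)
p_values <- replicate(2000, {
  x <- rnorm(25)
  y <- rnorm(25)
  cor.test(x, y)$p.value  # theory-based p-value for "no linear relationship"
})

# Share of these true-null worlds in which the rule rejects anyway.
false_rejection_rate <- mean(decide(p_values, alpha = 0.05) == "reject H0")
false_rejection_rate  # expected to land close to alpha
```

A smaller \(\alpha\) makes rejecting a true null rarer, but also makes it harder to reject a null that is actually false; that is the trade-off being negotiated.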
