Logistic regression 3

Lecture 21

John Zito

Duke University
STA 199 Spring 2026

2026-04-06

While you wait: Participate 📱💻

Consider a fitted logistic regression:

\[ \log\left(\frac{\hat{p}}{1-\hat{p}}\right)=b_0+b_1x. \]

Which of the following is a correct description of how the predictions of a logistic regression model change when the predictor changes from \(x\) to \(x+1\)?

\(\hat{p}^{(new)} = \hat{p}^{(old)}\times e^{b_1}\)
\(\hat{y}^{(new)} = \hat{y}^{(old)} + b_1\)
\(\widehat{log~odds}^{(new)} = \widehat{log~odds}^{(old)}\times e^{b_1}\)
\(\widehat{odds}^{(new)} = \widehat{odds}^{(old)}\times e^{b_1}\)
\(\hat{y}^{(new)} = \hat{y}^{(old)} \times b_1\)


JZ plans your weekend for you

Wednesday: SSMU Talk

The old broads in the stats mafia host an even older broad from the tidy mafia to scare you about your future
Wednesday April 8 @ 4:30 pm
Old Chem 116

Wednesday: Duke Symphony Concert

Duke Symphony Orchestra
Wednesday April 8 @ 7:30 pm
Baldwin Auditorium (East Campus)
WANANA classmates and TAs!

Thursday: Six Characters in Search of an Author

Play by Luigi Pirandello
April 9 - 11 @ 8 pm
April 12 @ 2pm
Sheafer Lab Theater here in Bryan Center

Friday: Duke Chinese Dance

Insta: @dukechinesedance
Friday April 10 @ 7pm
Page Auditorium
Classmates! TAs! Josh!

Friday: Momentum Showcase

Insta: @momentum_duke
Friday April 10 @ 7pm
Reynolds Theatre
Lily! Katarina!

Saturday: DBBH’s Spring Business Conference

Duke Business Behind Health (DBBH)
Saturday April 11
In the business school
Great if you’re pre-med, pre-biotech, etc
Rub elbows with the muckety-mucks!

Saturday: Devils en Pointe and Embodiment

Devils en Pointe
Embodiment
Saturday April 11 @ 7pm
Reynolds Industries Theater
Classmates! More Shelly Han!

Saturday: Nakisai Showcase

Insta: @dukenakisaiade
Saturday April 11 @ 7pm
Page Auditorium
#DOMOREBELIT

Saturday: Pureun Showcase

Insta: @duke_pureun
Saturday April 11 @ 8:15pm
Page Auditorium
Makky! Mia!

Sunday: Ishq Showcase

Insta: @duke.ishq
Sunday April 12 @ 3pm
Reynolds Theater
Maaany WANANA alumni

Sunday: Chinese Music Ensemble

Duke Chinese Music Ensemble
Sunday April 12 @ 5pm
Nelson Music Room (East Campus)
Eric!

Midterm 2

Midterm Exam 2

Worth 20% of your final grade; consists of two parts:

In-class: worth 80% of the Midterm 2 grade;
- All multiple choice;
- Wednesday April 8 11:45 AM - 1:00 PM in this room;
Take-home: worth 20% of the Midterm 2 grade.
- Like a mini HW;
- Completely open resource, but citation policies apply, and collaboration of any kind is forbidden;
- Released Thursday April 9 at 6:00 pm;
- Due Sunday April 12 @ 11:59 pm.

If you take every exam and do better on the final than you did on this, we replace it.

FYIs

Very similar in length, style, and format to Midterm 1;
The emphasis is placed heavily on the modeling material introduced after spring break, but you don’t get to forget the coding tools from before (we’ll see an example today);
- (That said, studying pivots and joins is not the best use of your time);
OH are canceled while the take-home is live. Seek help by posting privately on Ed;

Study advice

Study guide
Work on your cheat sheet;
Correct old labs and homeworks;
Old AEs: complete tasks we didn’t get to and compare with key;
Code along: watch these videos specifically;
Odd-numbered exercises in the back of IMS Chs. 7 - 9.

Cheat sheet advice

Pictures pictures pictures;
Instead of just lists of function names with verbal descriptions, give yourself detailed examples where you can see the input, see the code, see the output, and you annotate precisely how and why the code did what it did.

Misconduct reminder

Inappropriate collaboration will result in a zero on the entire take-home and be referred to the conduct office;
That zero will not be dropped or replaced;
If a conduct violation of any kind is discovered, your final letter grade in the course will be permanently reduced (A- down to B+, B+ down to B, etc);
If folks share solutions, all students involved will be penalized equally, the sharer the same as the recipient.

Misconduct reminder

You initial this on the take-home document:

I hereby state that I have not communicated with or gained information in any way from my classmates or any other humans other than JZ during this exam, that all work is my own, and that I have properly cited any non-course resources I have used.

Searching for ambiguity in this statement after the fact is not a winning strategy.

Logistic wrap-up

`forested` data

7107 rows and 20 columns;
Each observation (row) is a plot of land;
Variables include geographical and meteorological information about each plot, as well as a binary indicator forested (“Yes” or “No”);
Given information about a plot that is easy (and cheap) to collect remotely, can we use a model to predict if a plot is forested without actually visiting it (which could be difficult and costly)?

`forested` data

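library(tidyverse)   # glimpse(), ggplot(), count(), mutate(), ...
library(tidymodels)  # initial_split(), logistic_reg(), tidy(), augment(), ...
library(forested)
glimpse(forested)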

Rows: 7,107
Columns: 20
$ forested         <fct> Yes, Yes, No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes,…
$ year             <dbl> 2005, 2005, 2005, 2005, 2005, 2005, 2005, 2005, 2005,…
$ elevation        <dbl> 881, 113, 164, 299, 806, 736, 636, 224, 52, 2240, 104…
$ eastness         <dbl> 90, -25, -84, 93, 47, -27, -48, -65, -62, -67, 96, -4…
$ northness        <dbl> 43, 96, 53, 34, -88, -96, 87, -75, 78, -74, -26, 86, …
$ roughness        <dbl> 63, 30, 13, 6, 35, 53, 3, 9, 42, 99, 51, 190, 95, 212…
$ tree_no_tree     <fct> Tree, Tree, Tree, No tree, Tree, Tree, No tree, Tree,…
$ dew_temp         <dbl> 0.04, 6.40, 6.06, 4.43, 1.06, 1.35, 1.42, 6.39, 6.50,…
$ precip_annual    <dbl> 466, 1710, 1297, 2545, 609, 539, 702, 1195, 1312, 103…
$ temp_annual_mean <dbl> 6.42, 10.64, 10.07, 9.86, 7.72, 7.89, 7.61, 10.45, 10…
$ temp_annual_min  <dbl> -8.32, 1.40, 0.19, -1.20, -5.98, -6.00, -5.76, 1.11, …
$ temp_annual_max  <dbl> 12.91, 15.84, 14.42, 15.78, 13.84, 14.66, 14.23, 15.3…
$ temp_january_min <dbl> -0.08, 5.44, 5.72, 3.95, 1.60, 1.12, 0.99, 5.54, 6.20…
$ vapor_min        <dbl> 78, 34, 49, 67, 114, 67, 67, 31, 60, 79, 172, 162, 70…
$ vapor_max        <dbl> 1194, 938, 754, 1164, 1254, 1331, 1275, 944, 892, 549…
$ canopy_cover     <dbl> 50, 79, 47, 42, 59, 36, 14, 27, 82, 12, 74, 66, 83, 6…
$ lon              <dbl> -118.6865, -123.0825, -122.3468, -121.9144, -117.8841…
$ lat              <dbl> 48.69537, 47.07991, 48.77132, 45.80776, 48.07396, 48.…
$ land_type        <fct> Tree, Tree, Tree, Tree, Tree, Tree, Non-tree vegetati…
$ county           <fct> Ferry, Thurston, Whatcom, Skamania, Stevens, Stevens,…

Read the documentation:

?forested

Goal

Use the data we’ve already seen to predict if a yet-to-be-observed plot of land is forested;
We want a model that does well on data it has never seen before;
“Out-of-sample” predictions on new data are more useful than “in-sample” predictions on old data;

Training versus testing data

To mimic this “out-of-sample” idea, we randomly split the data into two parts:

training data: this is what the model gets to see when we fit it;
test data: withheld. We assess how well the trained model can predict on this data it hasn’t seen before.

Randomly split data into training and test sets

By default it’s a 75%/25% training/test split.

set.seed(8675309)

forested_split <- initial_split(forested)

forested_train <- training(forested_split)
forested_test <- testing(forested_split)

The split is random, but we want the results to be reproducible, so we “freeze the random numbers in time” by setting a seed. If we don’t tell you exactly what seed to use on an assignment, you can pick any positive integer you want.
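To see what "freezing" the random numbers means, here is a minimal base-R sketch. It uses `sample()` on row indices rather than `initial_split()`, so it will not pick the same rows as the split above; the variable names are made up for illustration.

```r
# Illustrative only: mimic a 75%/25% split on row indices with base R.
n <- 7107                                  # number of rows in forested

set.seed(8675309)
train_idx_a <- sample(n, size = floor(0.75 * n))

set.seed(8675309)                          # same seed, same "random" draw
train_idx_b <- sample(n, size = floor(0.75 * n))

identical(train_idx_a, train_idx_b)        # TRUE: the split is reproducible

test_idx <- setdiff(seq_len(n), train_idx_a)
length(train_idx_a)                        # 5330
length(test_idx)                           # 1777
```

Rerunning the second `sample()` call without resetting the seed would almost surely give a different set of rows.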

Explore: forested or not

ggplot(forested_train, aes(x = lon, y = lat, color = forested)) +
  geom_point(alpha = 0.7) +
  scale_color_manual(values = c("Yes" = "forestgreen", "No" = "gold2")) +
  theme_minimal()

Explore: annual precipitation

ggplot(forested_train, aes(x = lon, y = lat, color = precip_annual)) +
  geom_point(alpha = 0.7) +
  labs(color = "annual\nprecipitation\n(mm × 100)") +
  theme_minimal()

FYI: the response variable must be a factor

forested already comes as a factor, so we’re lucky:

class(forested$forested)

[1] "factor"

levels(forested$forested)

[1] "Yes" "No"

But if it didn’t, things would not work:

logistic_reg() |>
  fit(as.numeric(forested) ~ precip_annual, data = forested_train)

Error in `check_outcome()`:
! For a classification model, the outcome should be a <factor>, not a
  double vector.

FYI: the base level is treated as “failure” (0)

The base level here is “Yes”, so “No” is treated as “success” (1):

levels(forested$forested)

[1] "Yes" "No"

As a result, this code:

logistic_reg() |>
  fit(forested ~ precip_annual, data = forested_train)

Corresponds to this model:

\[ \widehat{\text{Prob}( \texttt{forested = "No"} \mid x)} = \frac{e^{b_0+b_1 x}}{1 + e^{b_0+b_1 x}}. \]

This is not a problem, but it means that in order to interpret the output correctly, you need to understand how your factors are leveled.
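To check or change which level is the base, base R's `levels()` and `relevel()` are enough. A sketch on a toy factor (`y` is made up for illustration and only mimics the level order of `forested$forested`):

```r
# Toy outcome with the same level order as forested$forested.
y <- factor(c("Yes", "No", "Yes", "No", "No"), levels = c("Yes", "No"))

levels(y)[1]         # "Yes" is the base level, treated as 0 ("failure")
as.integer(y) - 1    # 0 1 0 1 1, so "No" plays the role of 1 ("success")

# Make "No" the base level, so that "Yes" becomes the modeled "success".
y_flipped <- relevel(y, ref = "No")
levels(y_flipped)    # "No" "Yes"
```

The data are unchanged by `relevel()`; only the order of the levels, and therefore the meaning of the fitted probabilities, changes.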

Fitting a logistic regression model

Similar syntax to linear regression:

forested_precip_fit <- logistic_reg() |>
  fit(forested ~ precip_annual, data = forested_train)

tidy(forested_precip_fit)

# A tibble: 2 × 5
  term          estimate std.error statistic   p.value
  <chr>            <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)    1.57    0.0557         28.2 6.89e-175
2 precip_annual -0.00190 0.0000602     -31.6 5.98e-219

\[ \log\left(\frac{\hat{p}}{1-\hat{p}}\right) = 1.57 - 0.0019 \times precip. \]

Three equivalent representations

Version	Equation	Friendly LHS?	Friendly RHS?
Probability: \((0,\,1)\)	\[\hat{p}=\frac{e^{b_0+b_1x}}{1+e^{b_0+b_1x}}\]	✅	❌
Odds: \((0,\,\infty)\)	\[\frac{\hat{p}}{1-\hat{p}}=e^{b_0+b_1x}\]	🥴	🥴
Log-odds: \((-\infty,\,\infty)\)	\[\log\left(\frac{\hat{p}}{1-\hat{p}}\right)=b_0+b_1x\]	❌	✅

Take your pick depending on what you’re trying to do.

Probability to log-odds

\(p = 0 \to \text{log-odds}=-\infty\);
\(p = 1/2 \to \text{log-odds}=0\);
\(p = 1 \to \text{log-odds}=\infty\);
And everything in between.

The log-odds transformation takes probabilities between 0 and 1 and streeetches them out to numbers between \(-\infty\) and \(\infty\), for which the linear model is appropriate.
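The three scales can be converted by hand or with base R's `qlogis()` (probability to log-odds) and `plogis()` (log-odds back to probability). A quick sketch with a few arbitrary probabilities:

```r
p <- c(0.1, 0.5, 0.9)

odds     <- p / (1 - p)            # probability -> odds
log_odds <- log(odds)              # odds -> log-odds

# The built-in functions agree with the hand calculation.
all.equal(log_odds, qlogis(p))     # TRUE
all.equal(plogis(log_odds), p)     # TRUE

qlogis(c(0, 0.5, 1))               # -Inf 0 Inf
```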

Interpreting the intercept

If precip = 0, then…

\[ \widehat{\text{Prob}( \texttt{forested = "No"} \mid precip = 0)} = \frac{e^{1.57}}{1 + e^{1.57}}\approx 0.83 \]

So when \(precip = 0\), the model predicts that the probability of forested = "No" is about 83%.

Note

We didn’t interpret the intercept \(b_0=1.57\) directly. We transformed it into an interpretable quantity using the probability version of the logistic regression model.
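The same arithmetic in R, as a sketch that plugs in the rounded intercept from the `tidy()` output rather than the fitted model object:

```r
b0 <- 1.57                  # rounded intercept from the tidy() output

exp(b0) / (1 + exp(b0))     # probability version by hand, about 0.83
plogis(b0)                  # same calculation with the built-in inverse logit
```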

Interpreting the slope

Odds at \(precip\):

\[ \frac{\hat{p}}{1-\hat{p}} = {\color{blue}{e^{1.57 - 0.0019 \times precip}}} \]

Odds at \(precip + 1\):

\[ \begin{aligned} \frac{\hat{p}}{1-\hat{p}} &= e^{1.57 - 0.0019 \times (precip + 1)} \\ &= e^{1.57 - 0.0019 \times precip - 0.0019} \\ &= {\color{blue}{e^{1.57 - 0.0019 \times precip}}} \times \color{red}{e^{-0.0019}} \end{aligned} \]

If \(precip\) increases by one unit, the model predicts that the odds that forested = "No" are multiplied by \(e^{-0.0019}\approx 0.998\), a decrease of about 0.2%.
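A numerical check of this derivation, using the rounded coefficients. The helper `odds_no()` and the precipitation values 500 and 2000 are arbitrary choices for illustration:

```r
b0 <- 1.57
b1 <- -0.0019

# Predicted odds that forested = "No" at a given precipitation value.
odds_no <- function(precip) exp(b0 + b1 * precip)

# The ratio of odds one unit apart is the same wherever you start.
odds_no(501) / odds_no(500)      # equals exp(b1)
odds_no(2001) / odds_no(2000)    # equals exp(b1)
exp(b1)                          # about 0.998

# One unit is a tiny change on this scale; 100 units multiplies the odds by exp(100 * b1).
exp(100 * b1)                    # about 0.83
```

The multiplicative factor applies to the odds, not to the probability, which is why the calculation is done on the odds scale.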

Generate predictions for the test data

Augment the test data frame with three new columns on the left that include model predictions (classifications and probabilities) for each row:

forested_precip_aug <- augment(forested_precip_fit, forested_test)
forested_precip_aug

# A tibble: 1,777 × 23
   .pred_class .pred_Yes .pred_No forested  year elevation eastness northness
   <fct>           <dbl>    <dbl> <fct>    <dbl>     <dbl>    <dbl>     <dbl>
 1 Yes             0.711  0.289   No        2005       164      -84        53
 2 Yes             0.968  0.0319  Yes       2003      1031      -49        86
 3 Yes             0.992  0.00806 Yes       2005      1330       99         7
 4 No              0.266  0.734   No        2014       507       44       -89
 5 No              0.263  0.737   No        2014       542      -32       -94
 6 No              0.267  0.733   No        2014       759       -2       -99
 7 No              0.232  0.768   No        2014       119        0         0
 8 No              0.241  0.759   No        2014       419       86       -49
 9 No              0.336  0.664   Yes       2014       569      -97       -21
10 No              0.279  0.721   No        2014       340      -54        83
# ℹ 1,767 more rows
# ℹ 15 more variables: roughness <dbl>, tree_no_tree <fct>, dew_temp <dbl>,
#   precip_annual <dbl>, temp_annual_mean <dbl>, temp_annual_min <dbl>,
#   temp_annual_max <dbl>, temp_january_min <dbl>, vapor_min <dbl>,
#   vapor_max <dbl>, canopy_cover <dbl>, lon <dbl>, lat <dbl>, land_type <fct>,
#   county <fct>

How did the model perform?

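Each prediction is either right or wrong, and the truth is either "Yes" or "No", which gives four possibilities: true positive, false negative, false positive, and true negative.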

Our test data have the truth in the forested column;
We can compare the predictions in .pred_class to the true values and see how we did.
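As a small illustration of that comparison, here are the first ten rows of the augmented output typed in by hand and cross-tabulated with base R's `table()`. This is a sketch on ten rows only, not the full test set:

```r
# forested (truth) and .pred_class (prediction) for the first ten test rows.
truth <- factor(c("No", "Yes", "Yes", "No", "No", "No", "No", "No", "Yes", "No"),
                levels = c("Yes", "No"))
pred  <- factor(c("Yes", "Yes", "Yes", "No", "No", "No", "No", "No", "No", "No"),
                levels = c("Yes", "No"))

# Cross-tabulate truth (rows) against prediction (columns).
tab <- table(truth = truth, pred = pred)
tab

mean(truth == pred)   # share of these ten rows classified correctly: 0.8
```

The diagonal of the table holds the correct classifications and the off-diagonal cells hold the two kinds of error.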

Getting the error rates

Count how many predictions fell into the four bins:

forested_precip_aug |>
  count(forested, .pred_class)

# A tibble: 4 × 3
  forested .pred_class     n
  <fct>    <fct>       <int>
1 Yes      Yes           683
2 Yes      No            290
3 No       Yes           140
4 No       No            664

Still need all the tools from before spring break!

Getting the error rates

Compute the error rates:

forested_precip_aug |>
  count(forested, .pred_class) |>
  group_by(forested) |>
  mutate(
    p = n / sum(n)
  )

# A tibble: 4 × 4
# Groups:   forested [2]
  forested .pred_class     n     p
  <fct>    <fct>       <int> <dbl>
1 Yes      Yes           683 0.702
2 Yes      No            290 0.298
3 No       Yes           140 0.174
4 No       No            664 0.826

Still need all the tools from before spring break!

Getting the error rates

Label the rows so we don’t go crazy:

forested_precip_aug |>
  count(forested, .pred_class) |>
  group_by(forested) |>
  mutate(
    p = n / sum(n),
    decision = case_when(
      forested == "Yes" & .pred_class == "Yes" ~ "sensitivity",
      forested == "Yes" & .pred_class == "No" ~ "false negative",
      forested == "No" & .pred_class == "Yes" ~ "false positive",
      forested == "No" & .pred_class == "No" ~ "specificity",
    )
  )

# A tibble: 4 × 5
# Groups:   forested [2]
  forested .pred_class     n     p decision      
  <fct>    <fct>       <int> <dbl> <chr>         
1 Yes      Yes           683 0.702 sensitivity   
2 Yes      No            290 0.298 false negative
3 No       Yes           140 0.174 false positive
4 No       No            664 0.826 specificity

FYI: error rates sum to 1 within the levels of the true outcome

For these two, the denominator is the number of truly “zero” cases:
- TNR: proportion of “zero” cases correctly classified as “zero”;
- FPR: proportion of “zero” cases incorrectly classified as “one”;
For these two, the denominator is the number of truly “one” cases:
- FNR: proportion of “one” cases incorrectly classified as “zero”;
- TPR: proportion of “one” cases correctly classified as “one”.

So TNR + FPR = 1 and FNR + TPR = 1.

That’s why we did group_by(forested).
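To make the role of the grouping concrete, here is a base-R sketch that recomputes the rates from the four counts reported earlier. The data frame `counts` is typed in by hand, and `ave()` stands in for `group_by()` plus `mutate()`:

```r
# Counts from the count(forested, .pred_class) output.
counts <- data.frame(
  forested    = c("Yes", "Yes", "No", "No"),
  .pred_class = c("Yes", "No", "Yes", "No"),
  n           = c(683, 290, 140, 664)
)

# Divide each count by the total for its own forested group.
counts$p <- counts$n / ave(counts$n, counts$forested, FUN = sum)
round(counts$p, 3)                        # 0.702 0.298 0.174 0.826

tapply(counts$p, counts$forested, sum)    # each group sums to 1

# Without grouping, the denominator is all 1,777 test rows instead,
# and the four proportions sum to 1 overall rather than within each group.
round(counts$n / sum(counts$n), 3)
```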

FYI: the default threshold is 50%

The model produces probabilities: .pred_Yes and .pred_No;
The concrete classifications in the .pred_class column come from applying a 50% threshold to these probabilities:

\[ \widehat{\texttt{forested}}= \begin{cases} \texttt{"No"} & \texttt{.pred\_Yes} \leq 0.5\\ \texttt{"Yes"} & \texttt{.pred\_Yes} > 0.5. \end{cases} \]

If you want to override that default, you must do so manually.
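One way to do that by hand is sketched below in base R. The helper `classify()` and the 0.75 threshold are made up for illustration; the probabilities are the `.pred_Yes` values from the first five test rows shown earlier.

```r
# .pred_Yes for the first five test rows.
pred_yes <- c(0.711, 0.968, 0.992, 0.266, 0.263)

# Hypothetical helper: predict "Yes" only when .pred_Yes exceeds the threshold.
classify <- function(p_yes, threshold = 0.5) {
  factor(ifelse(p_yes > threshold, "Yes", "No"), levels = c("Yes", "No"))
}

as.character(classify(pred_yes))                    # "Yes" "Yes" "Yes" "No" "No"
as.character(classify(pred_yes, threshold = 0.75))  # "No"  "Yes" "Yes" "No" "No"
```

Raising the threshold makes the model more reluctant to predict "Yes", which trades one kind of error for the other.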