Logistic regression 3

Lecture 21

John Zito

Duke University
STA 199 Spring 2026

2026-04-06

While you wait: Participate 📱💻

Consider a fitted logistic regression:

\[ \log\left(\frac{\hat{p}}{1-\hat{p}}\right)=b_0+b_1x. \]

Which of the following is a correct description of how the predictions of a logistic regression model change when the predictor changes from \(x\) to \(x+1\)?

  • \(\hat{p}^{(new)} = \hat{p}^{(old)}\times e^{b_1}\)
  • \(\hat{y}^{(new)} = \hat{y}^{(old)} + b_1\)
  • \(\widehat{log~odds}^{(new)} = \widehat{log~odds}^{(old)}\times e^{b_1}\)
  • \(\widehat{odds}^{(new)} = \widehat{odds}^{(old)}\times e^{b_1}\)
  • \(\hat{y}^{(new)} = \hat{y}^{(old)} \times b_1\)

Scan the QR code or go HERE. Log in with your Duke NetID.

JZ plans your weekend for you

Wednesday: SSMU Talk

  • The old broads in the stats mafia host an even older broad from the tidy mafia to scare you about your future
  • Wednesday April 8 @ 4:30 pm
  • Old Chem 116

Wednesday: Duke Symphony Concert

Thursday: Six Characters in Search of an Author

Friday: Duke Chinese Dance

  • Insta: @dukechinesedance
  • Friday April 10 @ 7pm
  • Page Auditorium
  • Classmates! TAs! Josh!

Friday: Momentum Showcase

  • Insta: @momentum_duke
  • Friday April 10 @ 7pm
  • Reynolds Theatre
  • Lily! Katarina!

Saturday: DBBH’s Spring Business Conference

  • Duke Business Behind Health (DBBH)
  • Saturday April 11
  • In the business school
  • Great if you’re pre-med, pre-biotech, etc
  • Rub elbows with the muckety-mucks!

Saturday: Devils en Pointe and Embodiment

  • Devils en Pointe
  • Embodiment
  • Saturday April 11 @ 7pm
  • Reynolds Industries Theater
  • Classmates! More Shelly Han!

Saturday: Nakisai Showcase

  • Insta: @dukenakisaiade
  • Saturday April 11 @ 7pm
  • Page Auditorium
  • #DOMOREBELIT

Saturday: Pureun Showcase

  • Insta: @duke_pureun
  • Saturday April 11 @ 8:15pm
  • Page Auditorium
  • Makky! Mia!

Sunday: Ishq Showcase

  • Insta: @duke.ishq
  • Sunday April 12 @ 3pm
  • Reynolds Theater
  • Maaany WANANA alumni

Sunday: Chinese Music Ensemble

  • Duke Chinese Music Ensemble
  • Sunday April 12 @ 5pm
  • Nelson Music Room (East Campus)
  • Eric!

Midterm 2

Midterm Exam 2

Worth 20% of your final grade; consists of two parts:

  • In-class: worth 80% of the Midterm 2 grade;

    • All multiple choice;
    • Wednesday April 8 11:45 AM - 1:00 PM in this room;
  • Take-home: worth 20% of the Midterm 2 grade.

    • Like a mini HW;
    • Completely open resource, but citation policies apply, and collaboration of any kind is forbidden;
    • Released Thursday April 9 at 6:00 pm;
    • Due Sunday April 12 @ 11:59 pm.

If you take every exam and do better on the final than you did on this, we replace it.

FYIs

  • Very similar in length, style, and format to Midterm 1;
  • The emphasis is placed heavily on the modeling material introduced after spring break, but you don’t get to forget the coding tools from before (we’ll see example today);
    • (That said, studying pivots and joins is not the best use of your time);
  • OH are canceled while the take-home is live. Seek help by posting privately on Ed;

Study advice

  • Study guide
  • Work on your cheat sheet;
  • Correct old labs and homeworks;
  • Old AEs: complete tasks we didn’t get to and compare with key;
  • Code along: watch these videos specifically;
  • Odd-numbered exercises in the back of IMS Chs. 7 - 9.

Cheat sheet advice

  • Pictures pictures pictures;
  • Instead of just lists of function names with verbal descriptions, give yourself detailed examples where you can see the input, see the code, see the output, and you annotate precisely how and why the code did what it did.

Misconduct reminder

  • Inappropriate collaboration will result in a zero on the entire take-home and be referred to the conduct office;
  • That zero will not be dropped or replaced;
  • If a conduct violation of any kind is discovered, your final letter grade in the course will be permanently reduced (A- down to B+, B+ down to B, etc);
  • If folks share solutions, all students involved will be penalized equally, the sharer the same as the recipient.

Misconduct reminder

You initial this on the take-home document:

I hereby state that I have not communicated with or gained information in any way from my classmates or any other humans other than JZ during this exam, that all work is my own, and that I have properly cited any non-course resources I have used.

Searching for ambiguity in this statement after the fact is not a winning strategy.

Logistic wrap-up

forested data

  • 7107 rows and 20 columns;

  • Each observation (row) is a plot of land;

  • Variables include geographical and meteorological information about each plot, as well as a binary indicator forested (“Yes” or “No”);

  • Given information about a plot that is easy (and cheap) to collect remotely, can we use a model to predict if a plot is forested without actually visiting it (which could be difficult and costly)?

forested data

library(forested)
glimpse(forested)
Rows: 7,107
Columns: 20
$ forested         <fct> Yes, Yes, No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes,…
$ year             <dbl> 2005, 2005, 2005, 2005, 2005, 2005, 2005, 2005, 2005,…
$ elevation        <dbl> 881, 113, 164, 299, 806, 736, 636, 224, 52, 2240, 104…
$ eastness         <dbl> 90, -25, -84, 93, 47, -27, -48, -65, -62, -67, 96, -4…
$ northness        <dbl> 43, 96, 53, 34, -88, -96, 87, -75, 78, -74, -26, 86, …
$ roughness        <dbl> 63, 30, 13, 6, 35, 53, 3, 9, 42, 99, 51, 190, 95, 212…
$ tree_no_tree     <fct> Tree, Tree, Tree, No tree, Tree, Tree, No tree, Tree,…
$ dew_temp         <dbl> 0.04, 6.40, 6.06, 4.43, 1.06, 1.35, 1.42, 6.39, 6.50,…
$ precip_annual    <dbl> 466, 1710, 1297, 2545, 609, 539, 702, 1195, 1312, 103…
$ temp_annual_mean <dbl> 6.42, 10.64, 10.07, 9.86, 7.72, 7.89, 7.61, 10.45, 10…
$ temp_annual_min  <dbl> -8.32, 1.40, 0.19, -1.20, -5.98, -6.00, -5.76, 1.11, …
$ temp_annual_max  <dbl> 12.91, 15.84, 14.42, 15.78, 13.84, 14.66, 14.23, 15.3…
$ temp_january_min <dbl> -0.08, 5.44, 5.72, 3.95, 1.60, 1.12, 0.99, 5.54, 6.20…
$ vapor_min        <dbl> 78, 34, 49, 67, 114, 67, 67, 31, 60, 79, 172, 162, 70…
$ vapor_max        <dbl> 1194, 938, 754, 1164, 1254, 1331, 1275, 944, 892, 549…
$ canopy_cover     <dbl> 50, 79, 47, 42, 59, 36, 14, 27, 82, 12, 74, 66, 83, 6…
$ lon              <dbl> -118.6865, -123.0825, -122.3468, -121.9144, -117.8841…
$ lat              <dbl> 48.69537, 47.07991, 48.77132, 45.80776, 48.07396, 48.…
$ land_type        <fct> Tree, Tree, Tree, Tree, Tree, Tree, Non-tree vegetati…
$ county           <fct> Ferry, Thurston, Whatcom, Skamania, Stevens, Stevens,…

Read the documentation:

?forested

Goal

  • Use the data we’ve already seen to predict if a yet-to-be-observed plot of land is forested;

  • We want a model that does well on data it has never seen before;

  • “Out-of-sample” predictions on new data are more useful than “in-sample” predictions on old data;

Training versus testing data

To mimic this “out-of-sample” idea, we randomly split the data into two parts:

  • training data: this is what the model gets to see when we fit it;
  • test data: withheld. We assess how well the trained model can predict on this data it hasn’t seen before.

Randomly split data into training and test sets

By default it’s a 75%/25% training/test split.

set.seed(8675309)

forested_split <- initial_split(forested)

forested_train <- training(forested_split)
forested_test <- testing(forested_split)

The split is random, but we want the results to be reproducible, so we “freeze the random numbers in time” by setting a seed. If we don’t tell you exactly what seed to use on an assignment, you can pick any positive integer you want.

Explore: forested or not

ggplot(forested_train, aes(x = lon, y = lat, color = forested)) +
  geom_point(alpha = 0.7) +
  scale_color_manual(values = c("Yes" = "forestgreen", "No" = "gold2")) +
  theme_minimal()

Explore: annual precipitation

ggplot(forested_train, aes(x = lon, y = lat, color = precip_annual)) +
  geom_point(alpha = 0.7) +
  labs(color = "annual\nprecipitation\n(mm × 100)") +
  theme_minimal()

FYI: the response variable must be a factor

forested already comes as a factor, so we’re lucky:

class(forested$forested)
[1] "factor"
levels(forested$forested)
[1] "Yes" "No" 

But if it didn’t, things would not work:

logistic_reg() |>
  fit(as.numeric(forested) ~ precip_annual, data = forested_train)
Error in `check_outcome()`:
! For a classification model, the outcome should be a <factor>, not a
  double vector.

FYI: the base level is treated as “failure” (0)

The base level here is “Yes”, so “No” is treated as “success” (1):

levels(forested$forested)
[1] "Yes" "No" 

As a result, this code:

logistic_reg() |>
  fit(forested ~ precip_annual, data = forested_train)

Corresponds to this model:

\[ \widehat{\text{Prob}( \texttt{forested = "No"} \mid x)} = \frac{e^{b_0+b_1 x}}{1 + e^{b_0+b_1 x}}. \]

This is not a problem, but it means that in order to interpret the output correctly, you need to understand how your factors are leveled.

Fitting a logistic regression model

Similar syntax to linear regression:

forested_precip_fit <- logistic_reg() |>
  fit(forested ~ precip_annual, data = forested_train)

tidy(forested_precip_fit)
# A tibble: 2 × 5
  term          estimate std.error statistic   p.value
  <chr>            <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)    1.57    0.0557         28.2 6.89e-175
2 precip_annual -0.00190 0.0000602     -31.6 5.98e-219

\[ \log\left(\frac{\hat{p}}{1-\hat{p}}\right) = 1.57 - 0.0019 \times precip. \]

Three equivalent representations

Version equation Friendly LHS? Friendly RHS?
Probability: (0, 1) \[\hat{p}=\frac{e^{b_0+b_1x}}{1+e^{b_0+b_1x}}\]
Odds: \((0,\,\infty)\) \[\frac{\hat{p}}{1-\hat{p}}=e^{b_0+b_1x}\] 🥴 🥴
Log-odds: \((-\infty,\,\infty)\) \[\log\left(\frac{\hat{p}}{1-\hat{p}}\right)=b_0+b_1x\]

Take your pick depending on what you’re trying to do.

Probability to log-odds

  • \(p = 0 \to \text{log-odds}=-\infty\);
  • \(p = 1/2 \to \text{log-odds}=0\);
  • \(p = 1 \to \text{log-odds}=\infty\);
  • And everything in between.

The log-odds transformation takes probabilities between 0 and 1 and streeetches them out to numbers between \(-\infty\) and \(\infty\), for which the linear model is appropriate.

Interpreting the intercept

If precip = 0, then…

\[ \widehat{\text{Prob}( \texttt{forested = "No"} \mid precip = 0)} = \frac{e^{1.57}}{1 + e^{1.57}}\approx 0.83 \]

So when \(precip = 0\), the model predicts that the probability of forested = "No" is about 83%.

Note

We didn’t interpret the intercept \(b_0=1.57\) directly. We transformed it into an interpretable quantity using the probability version of the logsitic regression model.

Interpreting the slope

Odds at \(precip\):

\[ \frac{\hat{p}}{1-\hat{p}} = {\color{blue}{e^{1.57 - 0.0019 \times precip}}} \]

Odds at \(precip + 1\):

\[ \begin{aligned} \frac{\hat{p}}{1-\hat{p}} &= e^{1.57 - 0.0019 \times (precip + 1)} \\ &= e^{1.57 - 0.0019 \times precip - 0.0019} \\ &= {\color{blue}{e^{1.57 - 0.0019 \times precip}}} \times \color{red}{e^{-0.0019}} \end{aligned} \]

If \(precip\) increases by one unit, the model predicts a decrease in the odds that forested = "No" by a multiplicative factor of \(e^{-0.0019}\approx 0.99\).

Generate predictions for the test data

Augment the test data frame with three new columns on the left that include model predictions (classifications and probabilities) for each row:

forested_precip_aug <- augment(forested_precip_fit, forested_test)
forested_precip_aug
# A tibble: 1,777 × 23
   .pred_class .pred_Yes .pred_No forested  year elevation eastness northness
   <fct>           <dbl>    <dbl> <fct>    <dbl>     <dbl>    <dbl>     <dbl>
 1 Yes             0.711  0.289   No        2005       164      -84        53
 2 Yes             0.968  0.0319  Yes       2003      1031      -49        86
 3 Yes             0.992  0.00806 Yes       2005      1330       99         7
 4 No              0.266  0.734   No        2014       507       44       -89
 5 No              0.263  0.737   No        2014       542      -32       -94
 6 No              0.267  0.733   No        2014       759       -2       -99
 7 No              0.232  0.768   No        2014       119        0         0
 8 No              0.241  0.759   No        2014       419       86       -49
 9 No              0.336  0.664   Yes       2014       569      -97       -21
10 No              0.279  0.721   No        2014       340      -54        83
# ℹ 1,767 more rows
# ℹ 15 more variables: roughness <dbl>, tree_no_tree <fct>, dew_temp <dbl>,
#   precip_annual <dbl>, temp_annual_mean <dbl>, temp_annual_min <dbl>,
#   temp_annual_max <dbl>, temp_january_min <dbl>, vapor_min <dbl>,
#   vapor_max <dbl>, canopy_cover <dbl>, lon <dbl>, lat <dbl>, land_type <fct>,
#   county <fct>

How did the model perform?

These are the four possibilities:

  • Our test data have the truth in the forested column;
  • We can compare the predictions in .pred_class to the true values and see how we did.

Getting the error rates

Count how many predictions fell into the four bins:

forested_precip_aug |>
  count(forested, .pred_class)
# A tibble: 4 × 3
  forested .pred_class     n
  <fct>    <fct>       <int>
1 Yes      Yes           683
2 Yes      No            290
3 No       Yes           140
4 No       No            664

Still need all the tools from before spring break!

Getting the error rates

Compute the error rates:

forested_precip_aug |>
  count(forested, .pred_class) |>
  group_by(forested) |>
  mutate(
    p = n / sum(n)
  )
# A tibble: 4 × 4
# Groups:   forested [2]
  forested .pred_class     n     p
  <fct>    <fct>       <int> <dbl>
1 Yes      Yes           683 0.702
2 Yes      No            290 0.298
3 No       Yes           140 0.174
4 No       No            664 0.826

Still need all the tools from before spring break!

Getting the error rates

Label the rows so we don’t go crazy:

forested_precip_aug |>
  count(forested, .pred_class) |>
  group_by(forested) |>
  mutate(
    p = n / sum(n),
    decision = case_when(
      forested == "Yes" & .pred_class == "Yes" ~ "sensitivity",
      forested == "Yes" & .pred_class == "No" ~ "false negative",
      forested == "No" & .pred_class == "Yes" ~ "false positive",
      forested == "No" & .pred_class == "No" ~ "specificity",
    )
  )
# A tibble: 4 × 5
# Groups:   forested [2]
  forested .pred_class     n     p decision      
  <fct>    <fct>       <int> <dbl> <chr>         
1 Yes      Yes           683 0.702 sensitivity   
2 Yes      No            290 0.298 false negative
3 No       Yes           140 0.174 false positive
4 No       No            664 0.826 specificity   

FYI: error rates sum to 1 within the levels f the true outcome

  • For these two, the denominator is the number of truly “zero” cases:
    • TNR: proportion of “zero” cases correctly classified as “zero;”
    • FPR: proportion of “zero” cases incorrectly classified as “one;”
  • For these two, the denominator is the number of truly “one” cases:
    • FNR: proportion of “one” cases incorrectly classified as “zero,”
    • TPR: proportion of “one” cases correctly classified as “one.”

So TNR + FPR = 1 and FNR + TPR = 1.

That’s why we did group_by(forested).

FYI: the default threshold is 50%

  • The model produces probabilities: .pred_Yes and .pred_No;

  • The concrete classifications in the .pred_class column come from applying a 50% threshold to these probabilities:

\[ \widehat{\texttt{forested}}= \begin{cases} \texttt{"No"} & \texttt{.pred\_Yes} \leq 0.5\\ \texttt{"Yes"} & \texttt{.pred\_Yes} > 0.5. \end{cases} \]

  • If you want to override that default, you must do so manually.

Change threshold to 0.00

New threshold > New classifications > New error rates

forested_precip_aug |>
  mutate(
    .pred_class = if_else(.pred_Yes <= 0.0, "No", "Yes")
  ) |>
  count(forested, .pred_class) |>
  group_by(forested) |>
  mutate(
    p = n / sum(n),
  )
# A tibble: 2 × 4
# Groups:   forested [2]
  forested .pred_class     n     p
  <fct>    <chr>       <int> <dbl>
1 Yes      Yes           973     1
2 No       Yes           804     1

Change threshold to 0.25

New threshold > New classifications > New error rates

forested_precip_aug |>
  mutate(
    .pred_class = if_else(.pred_Yes <= 0.25, "No", "Yes")
  ) |>
  count(forested, .pred_class) |>
  group_by(forested) |>
  mutate(
    p = n / sum(n),
  )
# A tibble: 3 × 4
# Groups:   forested [2]
  forested .pred_class     n     p
  <fct>    <chr>       <int> <dbl>
1 Yes      Yes           973 1    
2 No       No            179 0.223
3 No       Yes           625 0.777

Change threshold to 0.50

New threshold > New classifications > New error rates

forested_precip_aug |>
  mutate(
    .pred_class = if_else(.pred_Yes <= 0.50, "No", "Yes")
  ) |>
  count(forested, .pred_class) |>
  group_by(forested) |>
  mutate(
    p = n / sum(n),
  )
# A tibble: 4 × 4
# Groups:   forested [2]
  forested .pred_class     n     p
  <fct>    <chr>       <int> <dbl>
1 Yes      No            290 0.298
2 Yes      Yes           683 0.702
3 No       No            664 0.826
4 No       Yes           140 0.174

Change threshold to 0.75

New threshold > New classifications > New error rates

forested_precip_aug |>
  mutate(
    .pred_class = if_else(.pred_Yes <= 0.75, "No", "Yes")
  ) |>
  count(forested, .pred_class) |>
  group_by(forested) |>
  mutate(
    p = n / sum(n),
  )
# A tibble: 4 × 4
# Groups:   forested [2]
  forested .pred_class     n      p
  <fct>    <chr>       <int>  <dbl>
1 Yes      No            474 0.487 
2 Yes      Yes           499 0.513 
3 No       No            731 0.909 
4 No       Yes            73 0.0908

Change threshold to 1.00

New threshold > New classifications > New error rates

forested_precip_aug |>
  mutate(
    .pred_class = if_else(.pred_Yes <= 1.00, "No", "Yes")
  ) |>
  count(forested, .pred_class) |>
  group_by(forested) |>
  mutate(
    p = n / sum(n),
  )
# A tibble: 2 × 4
# Groups:   forested [2]
  forested .pred_class     n     p
  <fct>    <chr>       <int> <dbl>
1 Yes      No            973     1
2 No       No            804     1

Picture how errors change with threshold (th)

Picture how errors change with threshold (th)

But that was tedious

Let’s do “all” of the thresholds and connect the dots:

forested_precip_roc <- roc_curve(forested_precip_aug, 
                                 truth = forested, 
                                 .pred_Yes, 
                                 event_level = "first")
forested_precip_roc
# A tibble: 1,180 × 3
   .threshold specificity sensitivity
        <dbl>       <dbl>       <dbl>
 1   -Inf         0                 1
 2      0.224     0                 1
 3      0.225     0.00124           1
 4      0.228     0.00249           1
 5      0.228     0.00498           1
 6      0.229     0.00622           1
 7      0.229     0.00746           1
 8      0.230     0.00995           1
 9      0.230     0.0124            1
10      0.231     0.0137            1
# ℹ 1,170 more rows

Unpacking roc_curve

Recall the augmented data frame:

forested_precip_aug |>
  select(.pred_class:forested)
# A tibble: 1,777 × 4
   .pred_class .pred_Yes .pred_No forested
   <fct>           <dbl>    <dbl> <fct>   
 1 Yes             0.711  0.289   No      
 2 Yes             0.968  0.0319  Yes     
 3 Yes             0.992  0.00806 Yes     
 4 No              0.266  0.734   No      
 5 No              0.263  0.737   No      
 6 No              0.267  0.733   No      
 7 No              0.232  0.768   No      
 8 No              0.241  0.759   No      
 9 No              0.336  0.664   Yes     
10 No              0.279  0.721   No      
# ℹ 1,767 more rows

Recall how forested is leveled:

levels(forested_precip_aug$forested)
[1] "Yes" "No" 
roc_curve(forested_precip_aug, 
          truth = forested, 
          .pred_Yes, 
          event_level = "first")
  1. Supply augmented data frame;
  2. Tell it which column contains true classifications;
  3. Tell it which column contains the predicted probabilities for the level you care about
  4. Tell it which level of truth is the one you care about

Warning

Those last two arguments need to match!

The ROC curve

ggplot(forested_precip_roc, aes(x = 1 - specificity, y = sensitivity)) +
  geom_path() +
  geom_abline(lty = 3) +
  coord_equal() + 
  theme_minimal()

The ROC curve

  • ROC stands for receiver operating characteristic;

  • This visualizes the classification accuracy across a range of thresholds;

  • The more “up and to the left” it is, the better.

  • We can quantify “up and to the left” with the area under the curve (AUC).

The ROC curve

AUC = 1

This is the best we could possibly do:

AUC = 1 / 2

Don’t waste time fitting a model. Just flip a coin:

AUC = 0

This is the worst we could possibly do:

AUC for the model we fit

roc_auc(
  forested_precip_aug, 
  truth = forested, 
  .pred_Yes, 
  event_level = "first"
)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.879

Not bad!

The area under the ROC curve

  • This is a “quality score” for a logistic regression model;

  • When you compute it for a test data set that you set aside, the AUC is a measure of out-of-sample prediction accuracy;

  • AUC is a number between 0 and 1, where 0 is awful and 1 is great, similar to \(R^2\) for linear regression.

  • The function roc_auc computes it for you, and it takes the same set of arguments as roc_curve.

New commands

  • logistic_reg
  • augment
  • roc_curve
  • roc_auc
  • set.seed
  • initial_split, training, test
  • geom_path

Application exercise

Data

Data: chicago_taxi

chicago_taxi <- read_csv("data/chicago-taxi.csv")
chicago_taxi
# A tibble: 2,000 × 7
   tip   distance company                      local dow   month  hour
   <chr>    <dbl> <chr>                        <chr> <chr> <chr> <dbl>
 1 no        0.4  other                        yes   Fri   Mar      17
 2 no        0.96 Taxicab Insurance Agency Llc no    Mon   Apr       8
 3 no        1.07 Sun Taxi                     no    Fri   Feb      15
 4 no        1.13 other                        no    Sat   Feb      14
 5 no       10.8  Sun Taxi                     no    Sat   Apr      14
 6 no        3.6  Chicago Independents         no    Wed   Mar      12
 7 no        1.08 Flash Cab                    yes   Wed   Mar      13
 8 no        0.85 Taxicab Insurance Agency Llc no    Tue   Mar       8
 9 no       17.9  City Service                 no    Tue   Jan      20
10 no        0    City Service                 yes   Sun   Apr      16
# ℹ 1,990 more rows

Data: chicago_taxi

glimpse(chicago_taxi)
Rows: 2,000
Columns: 7
$ tip      <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", "…
$ distance <dbl> 0.40, 0.96, 1.07, 1.13, 10.81, 3.60, 1.08, 0.85, 17.92, 0.00,…
$ company  <chr> "other", "Taxicab Insurance Agency Llc", "Sun Taxi", "other",…
$ local    <chr> "yes", "no", "no", "no", "no", "no", "yes", "no", "no", "yes"…
$ dow      <chr> "Fri", "Mon", "Fri", "Sat", "Sat", "Wed", "Wed", "Tue", "Tue"…
$ month    <chr> "Mar", "Apr", "Feb", "Feb", "Apr", "Mar", "Mar", "Mar", "Jan"…
$ hour     <dbl> 17, 8, 15, 14, 14, 12, 13, 8, 20, 16, 20, 8, 15, 15, 12, 0, 1…

Data prep

chicago_taxi <- chicago_taxi |>
  mutate(
    tip = fct_relevel(tip, "no", "yes"),
    local = fct_relevel(local, "no", "yes"),
    dow = fct_relevel(dow, "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"),
    month = fct_relevel(month, "Jan", "Feb", "Mar", "Apr")
  )

Outcome and predictors

  • Outcome - tip: Whether rider left a tip – a factor w/ levels “yes” and “no”
  • (Potential) predictors:
    • Numerical:
      • distance: The trip distance, in odometer miles.
      • hour: The hour of the day in which the trip began.
    • Categorical:
      • company: The taxi company – companies that occurred few times were binned as “other”.
      • local: Whether the trip started in the same community area as it began.
      • dow: The day of the week in which the trip began.
      • month: The month in which the trip began.

Should we include a predictor?

To determine whether we should include a predictor in a model, we should start by asking:

  • Is it ethical to use this variable? (Or even legal?)

  • Will this variable be available at prediction time?

  • Does this variable contribute to explainability? How do we evaluate that?

The initial split

set.seed(20260406)
chicago_taxi_split <- initial_split(chicago_taxi)
chicago_taxi_split
<Training/Testing/Total>
<1500/500/2000>

Accessing the data

chicago_taxi_train <- training(chicago_taxi_split)
chicago_taxi_test <- testing(chicago_taxi_split)

The training set

chicago_taxi_train
# A tibble: 1,500 × 7
   tip   distance company                   local dow   month  hour
   <fct>    <dbl> <chr>                     <fct> <fct> <fct> <dbl>
 1 no        1.75 Sun Taxi                  no    Thu   Mar      10
 2 no        2.85 Flash Cab                 no    Thu   Mar       9
 3 yes       0.7  Taxi Affiliation Services yes   Tue   Jan      15
 4 yes       1.5  other                     no    Sat   Apr      13
 5 no        8.99 Sun Taxi                  no    Tue   Feb      20
 6 no       18.4  Sun Taxi                  no    Sat   Feb      14
 7 yes       1.09 Chicago Independents      no    Wed   Jan      16
 8 yes       1.61 other                     yes   Fri   Feb      20
 9 yes       4.77 Chicago Independents      no    Sat   Apr      14
10 yes       0.54 City Service              no    Mon   Jan      15
# ℹ 1,490 more rows

The testing data

🙈

Initial questions

  • What’s the distribution of the outcome, tip?

  • What’s the distribution of numeric variables like distance or hour?

  • How does the distribution of tip differ across the categorical and numerical variables?

tip

What’s the distribution of the outcome, tip?

chicago_taxi_train |>
  count(tip) |>
  mutate(prop = n / sum(n))
# A tibble: 2 × 3
  tip       n  prop
  <fct> <int> <dbl>
1 no      589 0.393
2 yes     911 0.607

distance

What’s the distribution of distance?

ggplot(chicago_taxi_train, aes(x = distance)) +
  geom_histogram(binwidth = 3)

tip and distance

ggplot(chicago_taxi_train, aes(x = distance, fill = tip, group = tip)) +
  geom_histogram(binwidth = 3, show.legend = FALSE) +
  scale_fill_manual(values = c("yes" = "darkolivegreen4", "no" = "darkgray")) +
  facet_wrap(~tip, ncol = 1, labeller = label_both, scales = "free_y")

tip and hour

ggplot(chicago_taxi_train, aes(x = hour, fill = tip, group = tip)) +
  geom_histogram(binwidth = 1, show.legend = FALSE) +
  scale_fill_manual(values = c("yes" = "darkolivegreen4", "no" = "darkgray")) +
  scale_x_continuous(breaks = seq(0, 18, by = 6)) +
  facet_wrap(~tip, ncol = 1, labeller = label_both, scales = "free_y")

tip and local

ggplot(chicago_taxi_train, aes(x = local, fill = tip)) +
  geom_bar(position = "fill", color = "white") +
  scale_fill_manual(values = c("yes" = "darkolivegreen4", "no" = "darkgray")) +
  labs(y = "Proportion") +
  theme(legend.position = "top")

tip and dow

ggplot(chicago_taxi_train, aes(x = dow, fill = tip)) +
  geom_bar(position = "fill", color = "white") +
  scale_fill_manual(values = c("yes" = "darkolivegreen4", "no" = "darkgray")) +
  labs(y = "Proportion", x = "Day of week") +
  theme(legend.position = "top")