More data types and classes

Lecture 9

Author: John Zito
Affiliation: Duke University, STA 199 Spring 2026

Published: February 11, 2026

Warm up

While you wait: Participate 📱💻

If you create a factor variable, but you do not explicitly give the levels of your factor an order, what default does R apply?

  • chronological;
  • the order in which the levels first appear in the data frame;
  • alphabetical;
  • no default; it will throw an error.

Scan the QR code or go HERE. Log in with your Duke NetID.

Midterm 1

It’s two weeks from today!

I’ll say more next week when I post the study guide.

Project starts next Thursday

  • 15% of your final course grade;
  • The instructor randomly assigns teams within each lab, taking into account the domain interests (econ, health, sports, etc.) you expressed on the “Getting to Know You” survey;
  • Important dates:
    • Thu Feb 19 - how to collaborate with Git/GitHub, start proposal;
    • Sun Feb 22 - submit proposal;
    • Thu Feb 26 - work period;
    • Thu Mar 5 - peer review, work period;
    • Thu Mar 26 - final presentation;
    • Sun Mar 29 - final submission.
  • You will evaluate your peers throughout, and good collaboration is part of the project grade.

GitHub workflow

Render, Commit, and Push early and often! We actually assign points to this on lab and HW:

  • At least 3 commits to .qmd file: (out of 3 points)
  • At least 1 commit to .pdf file: (out of 2 points)
  • Git email correctly configured: (out of 2 points)

GitHub workflow

As a courtesy to your teammates, practice good habits!

  • Render: make sure your stuff actually runs without error. People get mad if you send them errors;
  • Commit: leave a well-documented trail of breadcrumbs so that teammates (and your future self!) know what you changed and why;
  • Push: back up your work in the cloud (GitHub) so it isn’t lost.

Code smell

One way to look at smells is with respect to principles and quality: “Smells are certain structures in the code that indicate violation of fundamental design principles and negatively impact design quality”. Code smells are usually not bugs; they are not technically incorrect and do not prevent the program from functioning. Instead, they indicate weaknesses in design that may slow down development or increase the risk of bugs or failures in the future.

Code style

Follow the Tidyverse style guide:

  • Spaces before and line breaks after each + when building a ggplot

  • Spaces before and line breaks after each |> in a data transformation pipeline

  • Proper indentation

  • Spaces around = signs and spaces after commas

  • Lines should not span more than 80 characters; break up long lines by putting each argument on its own line
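
As a rough illustration of the rules above, here is a short pipeline and plot laid out in that style. It is only a sketch: mtcars is a dataset that ships with R, used here as a stand-in, and the names heavy_cars, kpl, and fuel_plot are made up for this example.

```r
library(dplyr)
library(ggplot2)

# mtcars ships with R and is only a stand-in dataset for illustration
heavy_cars <- mtcars |>
  filter(wt > 3) |>
  mutate(
    kpl = mpg * 0.425, # approximate conversion from miles/gallon to km/liter
    cyl = factor(cyl)
  )

# One layer per line, with a space before each + and a line break after it
fuel_plot <- ggplot(heavy_cars, aes(x = wt, y = kpl, color = cyl)) +
  geom_point() +
  labs(
    x = "Weight (1000 lbs)",
    y = "Fuel economy (km per liter, approximate)",
    color = "Cylinders"
  )

fuel_plot
```

Notice that every |> and + ends a line, arguments inside mutate() and labs() each get their own indented line, and there is a space around every = and after every comma.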

FAQ

Quotes VS no quotes VS backticks

df <- tibble(
  x = c(-2, -0.5, 0.5, 1, 2),
  `2011` = c(-2, -0.5, 0.5, 1, 2),
  `my var` = c(-2, -1, 0, 1, 2)
)
df
# A tibble: 5 × 3
      x `2011` `my var`
  <dbl>  <dbl>    <dbl>
1  -2     -2         -2
2  -0.5   -0.5       -1
3   0.5    0.5        0
4   1      1          1
5   2      2          2

Quotes VS no quotes VS backticks

df <- tibble(
  x = c(-2, -0.5, 0.5, 1, 2),
  `2011` = c(-2, -0.5, 0.5, 1, 2),
  `my var` = c(-2, -1, 0, 1, 2)
)

Referencing a column in a pipeline:

df |>
  filter("x" > 0)
# A tibble: 5 × 3
      x `2011` `my var`
  <dbl>  <dbl>    <dbl>
1  -2     -2         -2
2  -0.5   -0.5       -1
3   0.5    0.5        0
4   1      1          1
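5   2      2          2

"x" means the literal character string, not the column. The comparison "x" > 0 evaluates to a single TRUE (the 0 is coerced to the string "0" and the two strings are compared), so filter keeps every row.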

df |>
  filter(x > 0)
# A tibble: 3 × 3
      x `2011` `my var`
  <dbl>  <dbl>    <dbl>
1   0.5    0.5        0
2   1      1          1
3   2      2          2

x means the column name in df.

df |>
  filter(`x` > 0)
# A tibble: 3 × 3
      x `2011` `my var`
  <dbl>  <dbl>    <dbl>
1   0.5    0.5        0
2   1      1          1
3   2      2          2

`x` also means the column name in df.

Quotes VS no quotes VS backticks

df <- tibble(
  x = c(-2, -0.5, 0.5, 1, 2),
  `2011` = c(-2, -0.5, 0.5, 1, 2),
  `my var` = c(-2, -1, 0, 1, 2)
)

Referencing a column in a pipeline:

df |>
  filter("2011" > 0)
# A tibble: 5 × 3
      x `2011` `my var`
  <dbl>  <dbl>    <dbl>
1  -2     -2         -2
2  -0.5   -0.5       -1
3   0.5    0.5        0
4   1      1          1
5   2      2          2

"2011" means the literal character string, so the condition is a single TRUE and no rows are dropped.

df |>
  filter(2011 > 0)
# A tibble: 5 × 3
      x `2011` `my var`
  <dbl>  <dbl>    <dbl>
1  -2     -2         -2
2  -0.5   -0.5       -1
3   0.5    0.5        0
4   1      1          1
5   2      2          2

2011 means the literal number. 2011 > 0 is always TRUE, so every row is kept.

df |>
  filter(`2011` > 0)
# A tibble: 3 × 3
      x `2011` `my var`
  <dbl>  <dbl>    <dbl>
1   0.5    0.5        0
2   1      1          1
3   2      2          2

`2011` means the column name in df.
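
To see why the first two versions keep all five rows, it can help to evaluate each condition on its own, outside of filter. This is a minimal sketch that rebuilds a two-column version of df; nothing here goes beyond the examples above.

```r
library(dplyr)

df <- tibble(
  x = c(-2, -0.5, 0.5, 1, 2),
  `2011` = c(-2, -0.5, 0.5, 1, 2)
)

# One value: 0 is coerced to "0" and the two strings are compared
"2011" > 0

# One value: an ordinary comparison between two numbers
2011 > 0

# Five values: one comparison per row of the column named 2011
df$`2011` > 0

# filter() recycles a single TRUE across all rows, so nothing is dropped
nrow(filter(df, "2011" > 0))

# With backticks, the condition is checked row by row
nrow(filter(df, `2011` > 0))
```

Only the backtick version produces one TRUE or FALSE per row, which is what filter needs in order to actually drop anything.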

Quotes VS no quotes VS backticks

df <- tibble(
  x = c(-2, -0.5, 0.5, 1, 2),
  `2011` = c(-2, -0.5, 0.5, 1, 2),
  `my var` = c(-2, -1, 0, 1, 2)
)

Referencing a column in a pipeline:

df |>
  filter("my var" > 0)
# A tibble: 5 × 3
      x `2011` `my var`
  <dbl>  <dbl>    <dbl>
1  -2     -2         -2
2  -0.5   -0.5       -1
3   0.5    0.5        0
4   1      1          1
5   2      2          2

"my var" means the literal character string, so the condition is a single TRUE and no rows are dropped.

df |>
  filter(my var > 0)
Error in parse(text = input): <text>:2:13: unexpected symbol
1: df |>
2:   filter(my var
               ^

my var means nothing: a name with a space in it is not valid R syntax, so the code cannot even be parsed.

df |>
  filter(`my var` > 0)
# A tibble: 2 × 3
      x `2011` `my var`
  <dbl>  <dbl>    <dbl>
1     1      1        1
2     2      2        2

`my var` means the column name in df.
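
If you are unsure whether a column name needs backticks, one quick check (a sketch, not the only way) is base R's make.names(), which rewrites a string into a syntactically valid name. If the string comes back unchanged, it was already valid and backticks are optional. The object names below are made up for this example.

```r
column_names <- c("x", "2011", "my var")

# make.names() rewrites each string into a syntactically valid R name
valid_versions <- make.names(column_names)
valid_versions

# A name that survives unchanged is already valid and needs no backticks
needs_backticks <- valid_versions != column_names
setNames(needs_backticks, column_names)
```

Here x comes back unchanged, while 2011 and my var are altered, which matches the three slides above: backticks are optional for x and required for the other two.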

Why %in% instead of ==?

Consider adding a season column:

durham_climate
# A tibble: 12 × 4
   month     avg_high_f avg_low_f precipitation_in
   <chr>          <dbl>     <dbl>            <dbl>
 1 January           49        28             4.45
 2 February          53        29             3.7 
 3 March             62        37             4.69
 4 April             71        46             3.43
 5 May               79        56             4.61
 6 June              85        65             4.02
 7 July              89        70             3.94
 8 August            87        68             4.37
 9 September         81        60             4.37
10 October           71        47             3.7 
11 November          62        37             3.39
12 December          53        30             3.43

Why %in% instead of ==?

Consider adding a season column:

durham_climate |>
  mutate(
    season = if_else(
      month ????? c("December", "January", "February"),
      "Winter",
      "Not Winter"
    )
  )

Why %in% instead of ==?

Consider adding a season column:

durham_climate |>
  mutate(
    season = if_else(
      month %in% c("December", "January", "February"),
      "Winter",
      "Not Winter"
    )
  )
# A tibble: 12 × 5
   month     avg_high_f avg_low_f precipitation_in season    
   <chr>          <dbl>     <dbl>            <dbl> <chr>     
 1 January           49        28             4.45 Winter    
 2 February          53        29             3.7  Winter    
 3 March             62        37             4.69 Not Winter
 4 April             71        46             3.43 Not Winter
 5 May               79        56             4.61 Not Winter
 6 June              85        65             4.02 Not Winter
 7 July              89        70             3.94 Not Winter
 8 August            87        68             4.37 Not Winter
 9 September         81        60             4.37 Not Winter
10 October           71        47             3.7  Not Winter
11 November          62        37             3.39 Not Winter
12 December          53        30             3.43 Winter    

Why %in% instead of ==?

Consider adding a season column:

durham_climate |>
  mutate(
    season = if_else(
      month == c("December", "January", "February"),
      "Winter",
      "Not Winter"
    )
  )
# A tibble: 12 × 5
   month     avg_high_f avg_low_f precipitation_in season    
   <chr>          <dbl>     <dbl>            <dbl> <chr>     
 1 January           49        28             4.45 Not Winter
 2 February          53        29             3.7  Not Winter
 3 March             62        37             4.69 Not Winter
 4 April             71        46             3.43 Not Winter
 5 May               79        56             4.61 Not Winter
 6 June              85        65             4.02 Not Winter
 7 July              89        70             3.94 Not Winter
 8 August            87        68             4.37 Not Winter
 9 September         81        60             4.37 Not Winter
10 October           71        47             3.7  Not Winter
11 November          62        37             3.39 Not Winter
12 December          53        30             3.43 Not Winter

Why %in% instead of ==?

"January" == c("December", "January", "February")
[1] FALSE  TRUE FALSE
"January" %in% c("December", "January", "February")
[1] TRUE

Note: Punchline

Inside if_else or case_when, your condition needs to result in a single value of TRUE or FALSE for each row. If you compare with == against a vector of several values, R recycles that vector and compares position by position instead of asking "is this row's value any of these?". You will not necessarily get an error or even a warning, but unexpected things could happen.
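
The all "Not Winter" result from the == version can be reproduced without the data frame. The sketch below uses month.name, R's built-in vector of month names in calendar order, in place of the month column; winter and is_winter_long are names made up for this example.

```r
winter <- c("December", "January", "February")

# == recycles winter four times down the twelve months and compares
# position by position: January vs December, February vs January,
# March vs February, April vs December, and so on
month.name == winter

# %in% asks, for each month, whether it appears anywhere in winter
month.name %in% winter

# The == version that does work needs one comparison per winter month
is_winter_long <- month.name == "December" |
  month.name == "January" |
  month.name == "February"
is_winter_long
```

Because 12 is a multiple of 3, the recycling happens silently, and no month ever lines up with its recycled partner, so every comparison is FALSE. %in% gives the same answer as the long chain of == comparisons joined with |, just with less typing.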

Picking up from last time

AE 08:

  • Go to your ae project in RStudio;

  • Open ae-08-durham-climate-factors.qmd.