Data Science Wrap-up

Lecture 12

John Zito

Duke University
STA 199 Spring 2026

2026-02-23

While you wait: Participate 📱💻

Did your container stop working over the weekend?

  • Yes;
  • No;
  • idk. I didn’t use it this weekend.

Scan the QR code or go HERE. Log in with your Duke NetID.

Reminders

Schedule changes

All of the containers should be back online. Even so…

  • Project proposals due Tue Feb 24 @ 11:59 pm;
  • No lab Thu Feb 26;
  • Proposal feedback returned Mon Mar 2;
  • Make progress by Wed Mar 4 @ 11:59 pm;
  • Peer review in lab Thu Mar 5;
  • Spring break!

Midterm Exam 1

Worth 20% of your final grade; consists of two parts:

  • In-class: worth 80% of the Midterm 1 grade;

    • Wednesday, February 25, 11:45 AM–1:00 PM in this room;
  • Take-home: worth 20% of the Midterm 1 grade.

    • Released Thursday Feb 26 @ 6:00 pm;
    • Due Sunday March 1 @ 11:59 pm.

If you take every exam and do better on the final than you did on this one, we replace this score with your final exam score.

In-class

  • All multiple choice;

  • There were 35 questions last spring, some of which I included on the study guide;

  • You get both sides of one 8.5” x 11” note sheet created by you and you alone;

    • You can create it however you want (written, typed, iPad, etc);
  • If you need testing accommodations but haven’t documented them with the SDAO and made a Testing Center appointment, it may be too late, but reach out ASAP.

What should I put on my cheat sheet?

Ask one of our undergrad TAs! They took the class. I didn’t.

  • description of common functions;
  • description of different visualizations: how to interpret, and what to use when;
  • cute doodles;
  • words of affirmation.

Warning

Don’t waste space on the details of any specific applications or datasets we’ve seen (penguins, Bechdel, gerrymandering, midwest, etc). Anything we want you to know about a particular application will be introduced from scratch within the exam.

Take-home

  • It will be just like a homework, only shorter;
  • Completely open-resource, but citation policies apply;
  • Absolutely no collaboration of any kind;
  • Seek help by posting privately on Ed;
    • OH are canceled while the take-home is live;
  • Submit your final PDF to Gradescope in the usual way.

Things you can do to study

In order of importance:

  • Study guide;
  • Work on your cheat sheet;
  • Correct old labs and homeworks;
  • Old AEs: complete tasks we didn’t get to and compare with key;
  • Old HWs: complete any unfinished “Feedback from AI” problems;
  • Code along: watch these videos specifically;
  • Textbook: odd-numbered exercises in the back of Chs. 1, 4, 5, 6.

Our conduct policies

  • Inappropriate collaboration will result in a zero on the entire take-home and a referral to the conduct office;
  • That zero will not be dropped or replaced;
  • If a conduct violation of any kind is discovered, your final letter grade in the course will be permanently reduced (A- down to B+, B+ down to B, etc);
  • If folks share solutions, all students involved will be penalized equally, the sharer the same as the recipient.

Duke’s conduct policies

Let’s say we catch you:

  • If you are a first-time offender, your case goes to “faculty-student resolution”;
  • If you sign your name, you admit to wrongdoing, and agree to accept the policies outlined in my syllabus;
  • If you do not re-offend, Duke promises that this first offense will not appear on your transcript and is not reportable outside the university.

Except…

  • Medical school applicants are compelled to self-report any prior cases of misconduct, even if they are unreportable and do not appear on a transcript;

  • This is what the AAMC calls an “Institutional Action:”

    If you were ever the recipient of any institutional action by any college or medical school for…a conduct violation, you must answer Yes…even if such action did not: interrupt your enrollment, require you to withdraw, or appear on your official transcripts.

  • Duke prehealth advising has a whole page on this.

Let’s zoom out for a sec…

Data science and statistical thinking

Before spring break…

  • Data science: the real-world art of transforming messy, imperfect, incomplete data into knowledge;

After spring break…

  • Statistics: the mathematical discipline of quantifying our uncertainty about that knowledge.

Data science

Data science

  1. Collection: we won’t seriously study this!

    • for us: data importing (read_csv etc) or webscraping;
    • but really: domain-specific issues of measurement, survey design, experimental design, etc;

Data collection

I sent out my lil’ survey with Google Forms, downloaded the responses in a CSV, and read that sucker in:

survey <- read_csv("data/survey-2026-02-09.csv")
survey
# A tibble: 276 × 3
   Timestamp         How many classes do you have on Tues…¹ `What year are you?`
   <chr>             <chr>                                  <chr>               
 1 2/9/2026 11:03:46 3                                      First-year          
 2 2/9/2026 11:29:24 2                                      Sophomore           
 3 2/9/2026 11:33:44 2                                      Sophomore           
 4 2/9/2026 11:33:48 2                                      Sophomore           
 5 2/9/2026 11:33:56 1                                      First-year          
 6 2/9/2026 11:33:56 3                                      First-year          
 7 2/9/2026 11:33:58 3                                      Sophomore           
 8 2/9/2026 11:34:07 3                                      Sophomore           
 9 2/9/2026 11:34:13 2                                      First-year          
10 2/9/2026 11:34:20 3                                      Junior              
# ℹ 266 more rows
# ℹ abbreviated name: ¹​`How many classes do you have on Tuesdays?`

Data science

  1. Collection: we won’t seriously study this!

    • for us: data importing (read_csv) or webscraping;
    • but really: domain-specific issues of measurement, survey design, experimental design, etc;
  2. Preparation: cleaning, wrangling, and otherwise tidying the data so we can actually work with it.

    • keywords: mutate, fct_relevel, pivot_*, *_join

Data preparation

survey <- survey |>
  rename(
    tue_classes = `How many classes do you have on Tuesdays?`,
    year = `What year are you?`
  ) |>
  mutate(
    tue_classes = case_when(
      tue_classes == "1 class, 1 lab, 1 volunteering session" ~ "2",
      tue_classes == "One" ~ "1",
      tue_classes == "Three" ~ "3",
      tue_classes == "Two" ~ "2",
      tue_classes == "Two in both days" ~ "2",
      tue_classes == "two" ~ "2",
      .default = tue_classes
    ),
    tue_classes = as.numeric(tue_classes),
    year = fct_relevel(year, "First-year", "Sophomore", "Junior", "Senior")
  ) |>
  select(tue_classes, year)
survey
# A tibble: 276 × 2
   tue_classes year      
         <dbl> <fct>     
 1           3 First-year
 2           2 Sophomore 
 3           2 Sophomore 
 4           2 Sophomore 
 5           1 First-year
 6           3 First-year
 7           3 Sophomore 
 8           3 Sophomore 
 9           2 First-year
10           3 Junior    
# ℹ 266 more rows
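
The survey data never needed reshaping or merging, so pivot_* and *_join don’t appear above. Here’s a tiny sketch of both with made-up tibbles (all names here are hypothetical, not from the survey):

```r
library(tidyverse)

# hypothetical wide data: one row per student, one column per day
wide <- tibble(student = c("a", "b"), mon = c(2, 1), tue = c(3, 2))

# pivot_longer: reshape to one row per student-day
long <- wide |>
  pivot_longer(cols = c(mon, tue), names_to = "day", values_to = "classes")

# left_join: attach each student's year from a second table,
# matching rows on the shared student column
years <- tibble(student = c("a", "b"), year = c("First-year", "Junior"))
long |> left_join(years, by = "student")
```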

Data science

  1. Collection: we won’t seriously study this!

    • for us: data importing (read_csv) or webscraping;
    • but really: domain-specific issues of measurement, survey design, experimental design, etc;
  2. Preparation: cleaning, wrangling, and otherwise tidying the data so we can actually work with it.

    • keywords: mutate, fct_relevel, pivot_*, *_join
  3. Analysis: finally transform the data into knowledge

    • pictures: ggplot, geom_*, etc
    • numerical summaries: summarize, group_by, count, mean, median, sd, quantile, IQR, cor, etc

Data analysis

A human being can learn nothing from staring at this box:

survey
# A tibble: 276 × 2
   tue_classes year      
         <dbl> <fct>     
 1           3 First-year
 2           2 Sophomore 
 3           2 Sophomore 
 4           2 Sophomore 
 5           1 First-year
 6           3 First-year
 7           3 Sophomore 
 8           3 Sophomore 
 9           2 First-year
10           3 Junior    
# ℹ 266 more rows

Data analysis

Picture!

ggplot(survey, aes(x = tue_classes, fill = year)) + 
  geom_bar(position = "dodge")

Data analysis

Better picture?

ggplot(survey, aes(x = tue_classes, fill = year)) + 
  geom_bar(position = "fill")

Data analysis

Numbers!

survey |>
  count(tue_classes, year) |>
  group_by(tue_classes) |>
  mutate(prop = n / sum(n))
# A tibble: 19 × 4
# Groups:   tue_classes [5]
   tue_classes year           n   prop
         <dbl> <fct>      <int>  <dbl>
 1           0 First-year     4 0.308 
 2           0 Sophomore      5 0.385 
 3           0 Junior         2 0.154 
 4           0 Senior         2 0.154 
 5           1 First-year    28 0.438 
 6           1 Sophomore     23 0.359 
 7           1 Junior        11 0.172 
 8           1 Senior         2 0.0312
 9           2 First-year    58 0.464 
10           2 Sophomore     46 0.368 
11           2 Junior        16 0.128 
12           2 Senior         5 0.04  
13           3 First-year    30 0.448 
14           3 Sophomore     21 0.313 
15           3 Junior        14 0.209 
16           3 Senior         2 0.0299
17           4 First-year     3 0.429 
18           4 Sophomore      1 0.143 
19           4 Junior         3 0.429 

Data science

  1. Collection: we won’t seriously study this!

    • for us: data importing (read_csv) or webscraping;
    • but really: domain-specific issues of measurement, survey design, experimental design, etc;
  2. Preparation: cleaning, wrangling, and otherwise tidying the data so we can actually work with it.

    • keywords: mutate, fct_relevel, pivot_*, *_join
  3. Analysis: finally transform the data into knowledge

    • pictures: ggplot, geom_*, etc
    • numerical summaries: summarize, group_by, count, mean, median, sd, quantile, IQR, cor, etc

The pictures and the summaries need to work together!

A cautionary tale: Anscombe’s quartet

Dataset I

    x     y
1  10  8.04
2   8  6.95
3  13  7.58
4   9  8.81
5  11  8.33
6  14  9.96
7   6  7.24
8   4  4.26
9  12 10.84
10  7  4.82
11  5  5.68

Dataset II

    x    y
1  10 9.14
2   8 8.14
3  13 8.74
4   9 8.77
5  11 9.26
6  14 8.10
7   6 6.13
8   4 3.10
9  12 9.13
10  7 7.26
11  5 4.74

Dataset III

    x     y
1  10  7.46
2   8  6.77
3  13 12.74
4   9  7.11
5  11  7.81
6  14  8.84
7   6  6.08
8   4  5.39
9  12  8.15
10  7  6.42
11  5  5.73

Dataset IV

    x     y
1   8  6.58
2   8  5.76
3   8  7.71
4   8  8.84
5   8  8.47
6   8  7.04
7   8  5.25
8  19 12.50
9   8  5.56
10  8  7.91
11  8  6.89
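
The slides don’t show how anscombe_tidy was built. One way (a sketch; the actual construction may differ) starts from base R’s built-in anscombe data frame, whose columns are x1…x4 and y1…y4, and uses pivot_longer’s ".value" feature:

```r
library(tidyverse)

# split each column name like "x3" into a value prefix (x or y)
# and a set number (1-4), stacking the four datasets
anscombe_tidy <- anscombe |>
  mutate(obs = row_number()) |>
  pivot_longer(
    cols = -obs,
    names_to = c(".value", "set"),
    names_pattern = "(.)(.)"
  ) |>
  mutate(set = c("I", "II", "III", "IV")[as.integer(set)])
```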

A cautionary tale: Anscombe’s quartet

ggplot(anscombe_tidy, aes(x, y)) +
  geom_point() +
  facet_wrap(~ set) 

A cautionary tale: Anscombe’s quartet

ggplot(anscombe_tidy, aes(x, y)) +
  geom_point() +
  facet_wrap(~ set) +
  geom_smooth(method = "lm", se = FALSE)

If you only looked at summary statistics…

anscombe_tidy |>
  group_by(set) |>
  summarize(
    xbar = mean(x),
    ybar = mean(y),
    sx = sd(x),
    sy = sd(y),
    r = cor(x, y)
  )
# A tibble: 4 × 6
  set    xbar  ybar    sx    sy     r
  <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 I         9  7.50  3.32  2.03 0.816
2 II        9  7.50  3.32  2.03 0.816
3 III       9  7.5   3.32  2.03 0.816
4 IV        9  7.50  3.32  2.03 0.817

Our motto: ABV!

No, not alcohol by volume…

  • Always!
  • Be!
  • Visualizing!

Study guide

Can you match these…

…with these?

Matching scatterplots and histograms

Dataframe with two numerical variables:

df
# A tibble: 100 × 2
        x       y
    <dbl>   <dbl>
 1 -0.554 -0.885 
 2  1.58   0.497 
 3  1.68  -0.135 
 4  1.69  -0.175 
 5 -0.652  0.331 
 6  1.59   0.0341
 7 -2.23  -0.174 
 8  0.571 -1.00  
 9  1.30  -1.09  
10  1.62   0.722 
# ℹ 90 more rows

Histograms for each:

ggplot(df, aes(x = x)) + 
  geom_histogram()

ggplot(df, aes(x = y)) + 
  geom_histogram()

Scatterplot for both:

ggplot(df, aes(x = x, y = y)) + 
  geom_point()

2 goes with e

Everything is symmetric and concentrating in one place:

7 goes with f

x and y each concentrate in two places:

3 goes with d

x concentrates in two places, but y only concentrates in one:

6 goes with c

Points are evenly spread everywhere:

4 goes with b

Asymmetry in y but not in x:

5 goes with a

This one’s just weird. By process of elimination…

Can you match these…

…with these?

Yes you can

If you know how the box is drawn:

  • The line inside the box is the median (50% quantile);
  • The left edge is the 25% quantile;
  • The right edge is the 75% quantile;
  • The box spans the middle 50% of the data;
  • The whiskers reach the most extreme points within 1.5 × IQR of the box; dots beyond them mark outliers. Together they span the entire range of the data;

Boxplots display center, spread, and (a)symmetry, but not modality.
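
Those landmarks are just quantiles, so you can compute them yourself. A quick sketch with simulated data (rnorm here is a stand-in for any numeric variable):

```r
set.seed(1)        # reproducible fake data
x <- rnorm(100)

quantile(x, probs = c(0.25, 0.50, 0.75))  # box edges and median line
IQR(x)             # length of the box: 75% quantile - 25% quantile

# ggplot's default whiskers reach the most extreme points
# within 1.5 * IQR of the box; anything beyond is drawn as a dot
```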

Whence the box?

The middle of the box is the median. 50% of the data are below, and 50% are above:

Whence the box?

The lower edge of the box is the 25% quantile. 25% of the data are below, and 75% are above:

Whence the box?

The upper edge of the box is the 75% quantile. 75% of the data are below, and 25% are above:

13 goes with d

The only one concentrating near 2:

12 goes with c

Concentrating around -2 and symmetric:

9 goes with a

Concentrating around -2 and asymmetric:

8 goes with f

Multimodality is a distraction. Pay attention to the whiskers.

10 goes with b

Also centered at 0, but more spread out than 8:

11 goes with e

Also centered at 0, but how wide must you make the box to swallow the middle 50%? Pretty darn wide: