Data Science Wrap-up

Lecture 12

John Zito

Duke University
STA 199 Spring 2026

2026-02-23

While you wait: Participate 📱💻

Did your container stop working over the weekend?

  • Yes;
  • No;
  • idk. I didn’t use it this weekend.

Scan the QR code or go HERE. Log in with your Duke NetID.

Reminders

Schedule changes

All of the containers should be back online. Even so…

  • Project proposals due Tue Feb 24 @ 11:59 pm;
  • No lab Thu Feb 26;
  • Proposal feedback returned Mon Mar 2;
  • Make progress by Wed Mar 4 @ 11:59 pm;
  • Peer review in lab Thu Mar 5;
  • Spring break!

Midterm Exam 1

Worth 20% of your final grade; consists of two parts:

  • In-class: worth 80% of the Midterm 1 grade;

    • Wednesday, February 25, 11:45 AM–1:00 PM in this room;
  • Take-home: worth 20% of the Midterm 1 grade.

    • Released Thursday Feb 26 @ 6:00 pm;
    • Due Sunday March 1 @ 11:59 pm.

If you take every exam and do better on the final than you did on this one, we replace this score with your final exam score.

In-class

  • All multiple choice;

  • There were 35 questions last spring, some of which I included on the study guide;

  • You get both sides of one 8.5” x 11” note sheet created by you and you alone;

    • You can create it however you want (written, typed, iPad, etc);
  • If you need testing accommodations but haven’t documented them with the SDAO and made a Testing Center appointment, it may be too late, but reach out ASAP.

What should I put on my cheat sheet?

Ask one of our undergrad TAs! They took the class. I didn’t.

  • description of common functions;
  • description of different visualizations: how to interpret, and what to use when;
  • cute doodles;
  • words of affirmation.

Warning

Don’t waste space on the details of any specific applications or datasets we’ve seen (penguins, Bechdel, gerrymandering, midwest, etc). Anything we want you to know about a particular application will be introduced from scratch within the exam.

Take-home

  • It will be just like a homework, only shorter;
  • Completely open-resource, but citation policies apply;
  • Absolutely no collaboration of any kind;
  • Seek help by posting privately on Ed;
    • OH are canceled while the take-home is live;
  • Submit your final PDF to Gradescope in the usual way.

Things you can do to study

In order of importance:

  • Study guide;
  • Work on your cheat sheet;
  • Correct old labs and homeworks;
  • Old AEs: complete tasks we didn’t get to and compare with key;
  • Old HWs: complete any unfinished “Feedback from AI” problems;
  • Code along: watch these videos specifically;
  • Textbook: odd-numbered exercises in the back of Chs. 1, 4, 5, 6.

Our conduct policies

  • Inappropriate collaboration will result in a zero on the entire take-home and a referral to the conduct office;
  • That zero will not be dropped or replaced;
  • If a conduct violation of any kind is discovered, your final letter grade in the course will be permanently reduced (A- down to B+, B+ down to B, etc);
  • If folks share solutions, all students involved will be penalized equally, the sharer the same as the recipient.

Duke’s conduct policies

Let’s say we catch you:

  • If you are a first-time offender, your case goes to “faculty-student resolution”;
  • If you sign your name, you admit to wrongdoing, and agree to accept the policies outlined in my syllabus;
  • If you do not re-offend, Duke promises that this first offense will not appear on your transcript and is not reportable outside the university.

Except…

  • Medical school applicants are compelled to self-report any prior cases of misconduct, even if they are unreportable and do not appear on a transcript;

  • This is what the AAMC calls an “Institutional Action:”

    If you were ever the recipient of any institutional action by any college or medical school for…a conduct violation, you must answer Yes…even if such action did not: interrupt your enrollment, require you to withdraw, or appear on your official transcripts.

  • Duke prehealth advising has a whole page on this.

Let’s zoom out for a sec…

Data science and statistical thinking

Before spring break…

  • Data science: the real-world art of transforming messy, imperfect, incomplete data into knowledge;

After spring break…

  • Statistics: the mathematical discipline of quantifying our uncertainty about that knowledge.

Data science

Data science

  1. Collection: we won’t seriously study this!

    • for us: data importing (read_csv etc) or webscraping;
    • but really: domain-specific issues of measurement, survey design, experimental design, etc;

Data collection

I sent out my lil’ survey with Google Forms, downloaded the responses in a CSV, and read that sucker in:

survey <- read_csv("data/survey-2026-02-09.csv")
survey
# A tibble: 276 × 3
   Timestamp         How many classes do you have on Tues…¹ `What year are you?`
   <chr>             <chr>                                  <chr>               
 1 2/9/2026 11:03:46 3                                      First-year          
 2 2/9/2026 11:29:24 2                                      Sophomore           
 3 2/9/2026 11:33:44 2                                      Sophomore           
 4 2/9/2026 11:33:48 2                                      Sophomore           
 5 2/9/2026 11:33:56 1                                      First-year          
 6 2/9/2026 11:33:56 3                                      First-year          
 7 2/9/2026 11:33:58 3                                      Sophomore           
 8 2/9/2026 11:34:07 3                                      Sophomore           
 9 2/9/2026 11:34:13 2                                      First-year          
10 2/9/2026 11:34:20 3                                      Junior              
# ℹ 266 more rows
# ℹ abbreviated name: ¹​`How many classes do you have on Tuesdays?`

Data science

  1. Collection: we won’t seriously study this!

    • for us: data importing (read_csv) or webscraping;
    • but really: domain-specific issues of measurement, survey design, experimental design, etc;
  2. Preparation: cleaning, wrangling, and otherwise tidying the data so we can actually work with it.

    • keywords: mutate, fct_relevel, pivot_*, *_join

Data preparation

survey <- survey |>
  rename(
    tue_classes = `How many classes do you have on Tuesdays?`,
    year = `What year are you?`
  ) |>
  mutate(
    tue_classes = case_when(
      tue_classes == "1 class, 1 lab, 1 volunteering session" ~ "2",
      tue_classes == "One" ~ "1",
      tue_classes == "Three" ~ "3",
      tue_classes == "Two" ~ "2",
      tue_classes == "Two in both days" ~ "2",
      tue_classes == "two" ~ "2",
      .default = tue_classes
    ),
    tue_classes = as.numeric(tue_classes),
    year = fct_relevel(year, "First-year", "Sophomore", "Junior", "Senior")
  ) |>
  select(tue_classes, year)
survey
# A tibble: 276 × 2
   tue_classes year      
         <dbl> <fct>     
 1           3 First-year
 2           2 Sophomore 
 3           2 Sophomore 
 4           2 Sophomore 
 5           1 First-year
 6           3 First-year
 7           3 Sophomore 
 8           3 Sophomore 
 9           2 First-year
10           3 Junior    
# ℹ 266 more rows
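
The survey data never needed reshaping or merging, so pivot_* and *_join don’t appear above. Here’s a tiny sketch of both with made-up tibbles (all names here are hypothetical, not from the survey):

```r
library(tidyverse)

# hypothetical wide data: one row per student, one column per day
wide <- tibble(student = c("a", "b"), mon = c(2, 1), tue = c(3, 2))

# pivot_longer: reshape to one row per student-day
long <- wide |>
  pivot_longer(cols = c(mon, tue), names_to = "day", values_to = "classes")

# left_join: attach each student's year from a second table,
# matching rows on the shared student column
years <- tibble(student = c("a", "b"), year = c("First-year", "Junior"))
long |> left_join(years, by = "student")
```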

Data science

  1. Collection: we won’t seriously study this!

    • for us: data importing (read_csv) or webscraping;
    • but really: domain-specific issues of measurement, survey design, experimental design, etc;
  2. Preparation: cleaning, wrangling, and otherwise tidying the data so we can actually work with it.

    • keywords: mutate, fct_relevel, pivot_*, *_join
  3. Analysis: finally transform the data into knowledge

    • pictures: ggplot, geom_*, etc
    • numerical summaries: summarize, group_by, count, mean, median, sd, quantile, IQR, cor, etc

Data analysis

A human being can learn nothing from staring at this box:

survey
# A tibble: 276 × 2
   tue_classes year      
         <dbl> <fct>     
 1           3 First-year
 2           2 Sophomore 
 3           2 Sophomore 
 4           2 Sophomore 
 5           1 First-year
 6           3 First-year
 7           3 Sophomore 
 8           3 Sophomore 
 9           2 First-year
10           3 Junior    
# ℹ 266 more rows

Data analysis

Picture!

ggplot(survey, aes(x = tue_classes, fill = year)) + 
  geom_bar(position = "dodge")

Data analysis

Better picture?

ggplot(survey, aes(x = tue_classes, fill = year)) + 
  geom_bar(position = "fill")

Data analysis

Numbers!

survey |>
  count(tue_classes, year) |>
  group_by(tue_classes) |>
  mutate(prop = n / sum(n))
# A tibble: 19 × 4
# Groups:   tue_classes [5]
   tue_classes year           n   prop
         <dbl> <fct>      <int>  <dbl>
 1           0 First-year     4 0.308 
 2           0 Sophomore      5 0.385 
 3           0 Junior         2 0.154 
 4           0 Senior         2 0.154 
 5           1 First-year    28 0.438 
 6           1 Sophomore     23 0.359 
 7           1 Junior        11 0.172 
 8           1 Senior         2 0.0312
 9           2 First-year    58 0.464 
10           2 Sophomore     46 0.368 
11           2 Junior        16 0.128 
12           2 Senior         5 0.04  
13           3 First-year    30 0.448 
14           3 Sophomore     21 0.313 
15           3 Junior        14 0.209 
16           3 Senior         2 0.0299
17           4 First-year     3 0.429 
18           4 Sophomore      1 0.143 
19           4 Junior         3 0.429 

Data science

  1. Collection: we won’t seriously study this!

    • for us: data importing (read_csv) or webscraping;
    • but really: domain-specific issues of measurement, survey design, experimental design, etc;
  2. Preparation: cleaning, wrangling, and otherwise tidying the data so we can actually work with it.

    • keywords: mutate, fct_relevel, pivot_*, *_join
  3. Analysis: finally transform the data into knowledge

    • pictures: ggplot, geom_*, etc
    • numerical summaries: summarize, group_by, count, mean, median, sd, quantile, IQR, cor, etc

The pictures and the summaries need to work together!

A cautionary tale: Anscombe’s quartet

Dataset I

    x     y
1  10  8.04
2   8  6.95
3  13  7.58
4   9  8.81
5  11  8.33
6  14  9.96
7   6  7.24
8   4  4.26
9  12 10.84
10  7  4.82
11  5  5.68

Dataset II

    x    y
1  10 9.14
2   8 8.14
3  13 8.74
4   9 8.77
5  11 9.26
6  14 8.10
7   6 6.13
8   4 3.10
9  12 9.13
10  7 7.26
11  5 4.74

Dataset III

    x     y
1  10  7.46
2   8  6.77
3  13 12.74
4   9  7.11
5  11  7.81
6  14  8.84
7   6  6.08
8   4  5.39
9  12  8.15
10  7  6.42
11  5  5.73

Dataset IV

    x     y
1   8  6.58
2   8  5.76
3   8  7.71
4   8  8.84
5   8  8.47
6   8  7.04
7   8  5.25
8  19 12.50
9   8  5.56
10  8  7.91
11  8  6.89
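
The slides don’t show how anscombe_tidy was built. One way (a sketch; the actual construction may differ) starts from base R’s built-in anscombe data frame, whose columns are x1…x4 and y1…y4, and uses pivot_longer’s ".value" feature:

```r
library(tidyverse)

# split each column name like "x3" into a value prefix (x or y)
# and a set number (1-4), stacking the four datasets
anscombe_tidy <- anscombe |>
  mutate(obs = row_number()) |>
  pivot_longer(
    cols = -obs,
    names_to = c(".value", "set"),
    names_pattern = "(.)(.)"
  ) |>
  mutate(set = c("I", "II", "III", "IV")[as.integer(set)])
```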

A cautionary tale: Anscombe’s quartet

ggplot(anscombe_tidy, aes(x, y)) +
  geom_point() +
  facet_wrap(~ set) 

A cautionary tale: Anscombe’s quartet

ggplot(anscombe_tidy, aes(x, y)) +
  geom_point() +
  facet_wrap(~ set) +
  geom_smooth(method = "lm", se = FALSE)

If you only looked at summary statistics…

anscombe_tidy |>
  group_by(set) |>
  summarize(
    xbar = mean(x),
    ybar = mean(y),
    sx = sd(x),
    sy = sd(y),
    r = cor(x, y)
  )
# A tibble: 4 × 6
  set    xbar  ybar    sx    sy     r
  <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 I         9  7.50  3.32  2.03 0.816
2 II        9  7.50  3.32  2.03 0.816
3 III       9  7.5   3.32  2.03 0.816
4 IV        9  7.50  3.32  2.03 0.817

Our motto: ABV!

No, not alcohol by volume…

  • Always!
  • Be!
  • Visualizing!

Study guide

Can you match these…

…with these?

Matching scatterplots and histograms

Dataframe with two numerical variables:

df
# A tibble: 100 × 2
        x       y
    <dbl>   <dbl>
 1 -0.554 -0.885 
 2  1.58   0.497 
 3  1.68  -0.135 
 4  1.69  -0.175 
 5 -0.652  0.331 
 6  1.59   0.0341
 7 -2.23  -0.174 
 8  0.571 -1.00  
 9  1.30  -1.09  
10  1.62   0.722 
# ℹ 90 more rows

Histograms for each:

ggplot(df, aes(x = x)) + 
  geom_histogram()

ggplot(df, aes(x = y)) + 
  geom_histogram()

Scatterplot for both:

ggplot(df, aes(x = x, y = y)) + 
  geom_point()

2 goes with e

Everything is symmetric and concentrating in one place:

7 goes with f

x and y each concentrate in two places:

3 goes with d

x concentrates in two places, but y only concentrates in one:

6 goes with c

Points are evenly spread everywhere:

4 goes with b

Asymmetry in y but not in x:

5 goes with a

This one’s just weird. By process of elimination…

Can you match these…

…with these?

Yes you can

If you know how the box is drawn:

  • The line inside the box is the median (50% quantile);
  • The left edge is the 25% quantile;
  • The right edge is the 75% quantile;
  • The box spans the middle 50% of the data;
  • The whiskers reach the most extreme points within 1.5 × IQR of the box; dots beyond them mark outliers. Together they span the entire range of the data;

Boxplots display center, spread, and (a)symmetry, but not modality.
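
Those landmarks are just quantiles, so you can compute them yourself. A quick sketch with simulated data (rnorm here is a stand-in for any numeric variable):

```r
set.seed(1)        # reproducible fake data
x <- rnorm(100)

quantile(x, probs = c(0.25, 0.50, 0.75))  # box edges and median line
IQR(x)             # length of the box: 75% quantile - 25% quantile

# ggplot's default whiskers reach the most extreme points
# within 1.5 * IQR of the box; anything beyond is drawn as a dot
```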

Whence the box?

The middle of the box is the median. 50% of the data are below, and 50% are above:

Whence the box?

The lower edge of the box is the 25% quantile. 25% of the data are below, and 75% are above:

Whence the box?

The upper edge of the box is the 75% quantile. 75% of the data are below, and 25% are above:

13 goes with d

The only one concentrating near 2:

12 goes with c

Concentrating around -2 and symmetric:

9 goes with a

Concentrating around -2 and asymmetric:

8 goes with f

Multimodality is a distraction. Pay attention to the whiskers.

10 goes with b

Also centered at 0, but more spread out than 8:

11 goes with e

Also centered at 0, but how wide must you make the box to swallow the middle 50%? Pretty darn wide: