Welcome to STA 199!

Lecture 0

John Zito

Duke University
STA 199 Spring 2026

2026-01-07

Teaching team: a glamorous assemblage

Bethany Akinola

Federico Arboleda

Jannis Bolik

Arijit Dey

Cael Elmore

Oliver Gao

Natasha Harris

Dwija Kakkad

Abuzar Khudaverdiyeva

Hyunjin Lee

Liane Ma

Chelsea Nguyen

Max Niu

Tory Norton

Patrick Pham

Kenna Roberts

Katie Solarz

Sarah Wu

Edward Zhang

Lisa Zhang

Mary Knox

John Zito

Office hours begin Monday, January 12.

What are we studying?

First half:

Data science

  • Transforming messy, incomplete, imperfect data into knowledge;
  • Knowledge often takes the form of pictures and a concise set of numerical summaries.

Second half:

Statistical thinking

Quantifying our uncertainty about that knowledge.

Imagine this dialog

Campaign manager: What is the probability that our candidate wins the election?

(A flurry of analysis takes place.)

Data scientist: Our best guess is 54%.

Campaign manager: How reliable is that estimate? How confident are we in that? What’s the margin of error?

Parallel Universe 1

Data scientist: It’s 54% give or take 3%.

Parallel Universe 2

Data scientist: It’s 54% give or take 20%.

It’s all about decision-making under uncertainty

The manager is going to make wildly different decisions about campaign strategy and spending depending on how uncertain the environment is.
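
To make the two parallel universes concrete, here is a minimal sketch that turns "54% give or take 3%" and "54% give or take 20%" into ranges. The variable names are made up for illustration, and reading "give or take" as a symmetric plus-or-minus interval is an assumption, not a definition we have covered yet.

```r
# Point estimate from the dialog, as a proportion.
estimate <- 0.54

# "Give or take" amounts from the two parallel universes.
margin <- c(universe_1 = 0.03, universe_2 = 0.20)

# Assumed interpretation: a symmetric range around the estimate.
intervals <- data.frame(
  lower = estimate - margin,
  upper = estimate + margin
)

# Does the plausible range dip below even odds (50%)?
intervals$below_even_odds <- intervals$lower < 0.5

intervals
```

In Universe 1 the whole range (51% to 57%) sits above even odds. In Universe 2 the range (34% to 74%) includes both comfortable wins and likely losses, which is why the manager's decisions would differ so much.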

The shape of the course

Questions?

Syllabus highlights

Homepage

https://sta199-s26.github.io

  • All course materials
  • Links to Canvas, GitHub, RStudio containers, etc.

Course toolkit

All linked from the course website:

Activities

  • Introduce new content and prepare for lectures by watching the videos and completing the readings
  • Attend and actively participate in lectures (and answer questions for participation credit) and labs, office hours, team meetings
  • Practice applying statistical concepts and computing with application exercises during lecture
  • Put together what you’ve learned to analyze real-world data
    • Lab assignments
    • Homework assignments
    • Exams
    • Term project completed in teams

Grading

Category                                Percentage
Lectures (attendance + participation)   5%
Labs                                    8%
HW                                      12%
Project                                 15%
Midterm 1                               20%
Midterm 2                               20%
Final                                   20%

See course syllabus for how the final letter grade will be determined.

Wiggle room

  • After Drop/Add, you can miss 4 of 22 lectures and still receive full attendance credit;
  • We drop the two lowest labs;
  • We drop the lowest homework;
  • We replace the lowest midterm score with your final exam score (if it’s better).
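
As a rough illustration of how the weights and the midterm replacement rule combine, here is a sketch with made-up scores. The function name `course_score` and the example numbers are hypothetical, and it assumes each category score (out of 100) has already had its dropped items removed. The syllabus, not this sketch, determines the final letter grade.

```r
# Category weights from the grading table (they sum to 1).
weights <- c(
  lectures = 0.05, labs = 0.08, hw = 0.12, project = 0.15,
  midterm1 = 0.20, midterm2 = 0.20, final = 0.20
)

# Hypothetical helper: weighted course score out of 100.
course_score <- function(scores) {
  midterms <- c("midterm1", "midterm2")
  # Find the lower of the two midterm scores...
  lowest <- midterms[which.min(scores[midterms])]
  # ...and replace it with the final exam score only if the final is better.
  scores[lowest] <- max(scores[lowest], scores["final"])
  sum(weights * scores[names(weights)])
}

# Made-up scores for one student.
scores <- c(
  lectures = 100, labs = 90, hw = 85, project = 88,
  midterm1 = 70, midterm2 = 82, final = 80
)

course_score(scores)
```

Here Midterm 1 (70) is the lowest midterm and the final (80) is better, so Midterm 1 counts as 80 and the weighted score comes out to 84.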

Attendance and participation

  • Daily in lecture

  • Tracked for credit based on participation only, not correctness (the questions are often designed to make you think and might not have a single right answer!)

Application exercises

  • Lecture meetings will typically involve me babbling for 45 - 60 minutes, and then we work through a guided activity where you try out the latest material for yourself;

  • Not graded, but tracked for feedback on workflow

Labs

  • Hands-on practice with data analysis

  • A single exercise per lab, graded based on being there and turning in something reasonable + correctness

  • Completed in-person, in lab, in teams

  • Teams randomized each week until project teams assigned

  • Developed collaboratively, but turned in individually by the end of the lab session

  • 8 throughout semester, two lowest scores dropped

  • No late work accepted

Homework

  • Hands-on practice with data analysis

  • Some questions for practice with instant feedback by AI

  • Some questions to be graded by humans for correctness

  • Can start in lab if time permits, but completed at home

  • Can consult with course team and peers, but completed and turned in individually by the end of the week

  • 7 throughout semester, lowest score dropped

  • Accepted up to 3 days late (-5% per day); no late work accepted after that

  • One-time late penalty waiver: Can be used on any homework assignment, no questions asked, must be requested from Dr. Knox before the deadline
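
A quick sketch of the late-penalty arithmetic. The function `late_adjusted` is hypothetical, and it makes two assumptions the slide does not spell out: that "-5% per day" means 5 percentage points off per day late, and that the waiver removes the penalty but does not extend the 3-day window. Check the syllabus for the actual rule.

```r
# Hypothetical helper: homework score (out of 100) after the late policy.
late_adjusted <- function(score, days_late, waiver = FALSE) {
  # More than 3 days late: not accepted at all.
  if (days_late > 3) {
    return(0)
  }
  # Assumption: the one-time waiver cancels the penalty within the window.
  if (waiver) {
    return(score)
  }
  # Assumption: 5 percentage points off per day late, never below zero.
  max(0, score - 5 * days_late)
}

late_adjusted(90, days_late = 2)                 # two days late
late_adjusted(90, days_late = 2, waiver = TRUE)  # waiver used
late_adjusted(90, days_late = 4)                 # past the window
```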

Exams

  • Two midterm exams during the semester, each consisting of two parts:

    • (80%) In-class: 75-minute exam. Closed book, one sheet of notes;

    • (20%) Take-home: Follows from the in-class exam and focuses on the analysis of a dataset introduced in the take-home exam.

  • Final, in-class only: Closed book, one sheet of notes;

  • Notes for exams: Both sides of a single 8.5” x 11” sheet prepared by you and you alone;

  • No extensions or make-ups.

Caution

Exam dates cannot be changed and no make-up exams will be given. If you can’t take the exams on these dates, you should drop this class.

If you need testing accommodations

Make sure I get a letter, and make your appointments in the Testing Center now.

Project

  • Dataset of your choice, method of your choice

  • Teamwork

  • Interim deadlines throughout semester

  • Final milestone: Presentation in lab and write-up

  • Must be in lab, in-person to present

  • Peer review between teams for content, peer evaluation within teams for contribution

  • Some lab sessions allocated to project progress

Caution

Project due date cannot be changed. You must complete the project to pass this class.

Teams

  • Randomized at first for weekly labs

  • Then, for the project and remaining labs, selected by you, or assigned by me if you don’t express a preference

  • Expectations and roles

    • Everyone is expected to contribute equal effort
    • Everyone is expected to understand all code turned in
    • Individual contribution evaluated by peer evaluation, commits, etc.
  • For the project: Peer evaluation during teamwork and after completion

Questions?

A game

How old is this person?

 

Ethel Merman

Born January 16, 1908
Died February 15, 1984
Age 76
Claim to fame JZ’s favorite singer

How old is this person?

 

Megan Pete

Megan Thee Stallion

Born February 15, 1995
Age 30
Claim to fame Rapper

How old is this person?

 

봉준호

Bong Joon-ho

Born September 14, 1969
Age 56
Claim to fame Directed Parasite, Snowpiercer, etc.

Now do it with pictures…

When the picture was taken, how old was the person?

Let’s see how you did

The data science life cycle

Load the data in

library(tidyverse)
# Read the class's age guesses from the CSV file, then print the tibble
age_guesses <- read_csv("data/age_guesses.csv")
age_guesses
# A tibble: 299 × 6
   celeb1 celeb2 celeb3 celeb4 celeb5 celeb6
    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1     32     26     73     68     50     78
 2     22     42     75     55     35     77
 3     22     31     79     40     43     68
 4     30     28     65     40     45     70
 5     35     39     67     54     49     60
 6     30     48     67     37     45     70
 7     49     42     78     47     62     75
 8     28     35     70     50     60     70
 9     31     36     62     43     41     68
10     26     33     67     60     42     71
# ℹ 289 more rows

Anyone know who this is?

Yuja Wang

Born 2/10/1987
Age in pic 36
Claim to fame Classical pianist

The data science life cycle

Baby’s first graphic

ggplot(age_guesses, aes(x = celeb1))

Baby’s first graphic

ggplot(age_guesses, aes(x = celeb1)) + 
  geom_histogram()

Baby’s first graphic

ggplot(age_guesses, aes(x = celeb1)) + 
  geom_histogram() + 
  labs(title = "sta199-s26 students guess Yuja Wang's age",
       x = "Guess")

Baby’s first graphic

ggplot(age_guesses, aes(x = celeb1)) + 
  geom_histogram() + 
  labs(title = "sta199-s26 students guess Yuja Wang's age",
       x = "Guess") + 
  geom_vline(xintercept = 36, color = "red")

Concise numerical summaries

age_guesses |>
  summarize(
    average = mean(celeb1),
    sdev = sd(celeb1)
  )
# A tibble: 1 × 2
  average  sdev
    <dbl> <dbl>
1    29.1  5.01
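
The same summary can be computed for all six columns at once. The sketch below is one possible way to do it, using only the ten rows printed on the earlier slide (retyped by hand here, so the numbers will not match the full-class values above). The names `first_ten` and `summary_by_celeb` are made up for illustration.

```r
library(tidyverse)

# Only the ten rows printed on the slide, not the full 299-row file.
first_ten <- tibble(
  celeb1 = c(32, 22, 22, 30, 35, 30, 49, 28, 31, 26),
  celeb2 = c(26, 42, 31, 28, 39, 48, 42, 35, 36, 33),
  celeb3 = c(73, 75, 79, 65, 67, 67, 78, 70, 62, 67),
  celeb4 = c(68, 55, 40, 40, 54, 37, 47, 50, 43, 60),
  celeb5 = c(50, 35, 43, 45, 49, 45, 62, 60, 41, 42),
  celeb6 = c(78, 77, 68, 70, 60, 70, 75, 70, 68, 71)
)

# Stack the six columns into (celeb, guess) pairs, then summarize per celeb.
summary_by_celeb <- first_ten |>
  pivot_longer(everything(), names_to = "celeb", values_to = "guess") |>
  group_by(celeb) |>
  summarize(
    average = mean(guess),
    sdev = sd(guess)
  )

summary_by_celeb
```

Reshaping to one row per guess means a single `summarize()` call handles every column, instead of writing `mean()` and `sd()` six times.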

Anyone know who this is?

Joan Crawford

A secret she took to her grave:

Born 3/23/(1904 - 1908)
Died 5/10/1977
Age in pic 38 - 42
Claim to fame Oscar-winning actor

Anyone know who this is?

Eubie Blake

His actual birthday was not known at the time:

Born 2/7/1887
Died 2/12/1983
Age in pic 82
Claim to fame Composer

Watch out for data quality!

Raghuram Rajan

Born 2/3/1963
Age in pic 48
Claim to fame UChicago economist, RBI governor

Anyone know who this is?

Vincent Price

Born 5/27/1911
Died 10/25/1993
Age in pic 38 - 39
Claim to fame Horror actor

Anyone know who this is?

Celia Cruz

Born 10/21/1925
Died 7/16/2003
Age in pic 76
Claim to fame Queen of Salsa

Wrap up

Silly exercise, serious themes

  • Domain knowledge and modeling assumptions: data do not speak for themselves. You need some subject-matter expertise about what you’re studying, as well as an interpretive lens;

    • Not all models or assumptions are created equal!
  • Are you asking questions the data can actually answer?

  • Uncertainty has many sources, and in some cases, it may be simply irreducible, no matter how hard you try;

  • Data quality and data cleaning: Data are not gospel. There could be noise and mistakes. Then what?

  • Wisdom of crowds: aggregating many imperfect guesses can do better than any one individual guess.
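
The wisdom-of-crowds idea can be checked on a small scale. This is only a sketch: it uses the ten `celeb1` guesses printed on the earlier slide (Yuja Wang, age 36 in the picture) rather than the full class, and the variable names are made up for illustration.

```r
# First ten guesses for celeb1, as printed on the slide.
guesses <- c(32, 22, 22, 30, 35, 30, 49, 28, 31, 26)
true_age <- 36

# How far off is the crowd's average guess?
crowd_error <- abs(mean(guesses) - true_age)

# How far off is each individual guess?
individual_errors <- abs(guesses - true_age)

# Share of individuals who did worse than the crowd average.
share_beaten <- mean(individual_errors > crowd_error)

crowd_error
mean(individual_errors)
share_beaten
```

For these ten guesses the crowd average misses by 5.5 years, while the typical individual misses by 8.1 years, and the average beats 7 of the 10 guessers. It does not beat everyone (one guess of 35 was closer), which is a reminder that aggregating helps on the whole, not in every single case.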

This week’s tasks

  • Attend Lab 0 tomorrow (very important!):
    • Computational setup;
    • Getting to know you survey;
  • Read the syllabus and ask questions on Ed;
  • Complete readings and videos for next class;
  • Send me SDAO letters and schedule testing center appointments.

Questions?