Grammar of data transformation

Lecture 3

Author: John Zito

Affiliation: Duke University, STA 199 Spring 2026

Published: January 21, 2026

Alison Bechdel

The Bechdel Test

(Dykes to Watch Out For - 1985)

A film passes if…

it has at least two female characters,
who talk to each other,
about something besides a man.

What’s the last movie you saw in a theater?

library(tidyverse)  # provides read_csv(), dplyr, forcats, and ggplot2
films <- read_csv("data/films.csv")

Rows: 329 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Films

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

films |>
  # Replace missing responses with an explicit label
  mutate(Films = if_else(is.na(Films), "(I can't recall)", Films)) |>
  # Lump films with fewer than 6 responses into "Other", then drop that level
  mutate(Films = fct_lump_min(Films, min = 6, other_level = "Other")) |>
  filter(Films != "Other") |>
  # Order the bars by how often each film was named
  mutate(Films = fct_infreq(Films)) |>
  ggplot(aes(y = Films)) +
  geom_bar(fill = "skyblue") +
  theme_minimal() +
  labs(y = "",
       title = "What's the last film you saw in a theater?",
       subtitle = "From the sta199-s26 survey",
       caption = "(Films with fewer than 6 responses were omitted for readability.)")

The last ten new releases I saw in a theater

Title	Year	JZ’s review	Bechdel
Living	2023	⭐⭐⭐⭐⭐	❌
The Boy and the Heron	2023	⭐⭐⭐⭐⭐	❌
Dune 2	2024	⭐⭐⭐	❌
Conclave	2024	⭐⭐	❌
Wicked 1	2024	⭐⭐	✅
Nosferatu	2024	⭐⭐⭐⭐	❌
Naked Gun	2025	⭐⭐⭐	❌
Bugonia	2025	⭐⭐⭐	❌
Wicked 2	2025	⭐	✅
Marty Supreme	2026	⭐⭐⭐⭐	❌

Can we reproduce this claim?

From FiveThirtyEight

“We did a statistical analysis of films to test two claims: first, that films that pass the Bechdel test — featuring women in stronger roles — see a lower return on investment, and second, that they see lower gross profits. We found no evidence to support either claim.”

`ae-03-bechdel-dataviz`

Let’s go:

Open your container;
Switch to your ae- project if you’re not already there;
If there are any un-committed changes in the Git tab (upper right), commit them;
The Git tab should now be completely empty;
Lastly, pull the new files I’ve distributed to you.

If that all went smoothly, open the new ae-03-bechdel-dataviz.qmd file.

Recap

Code cells (aka code chunks)

Cell labels are helpful for describing what the code is doing, for jumping between code cells in the editor, and for troubleshooting
message: false hides any messages emitted by the code in your rendered document
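
As a minimal sketch of these two options, a code cell might look like the one below. The label `greet-reader` is a hypothetical name chosen for illustration. The `#|` lines are cell options that Quarto reads; R itself treats them as ordinary comments.

```r
#| label: greet-reader
#| message: false

# message() emits a message rather than regular output, so with
# `message: false` this line leaves no trace in the rendered document.
message("Loading data...")

# Regular output is unaffected by `message: false` and still appears.
x <- 1 + 1
x
```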

Describing distributions and relationships

Talking about one numerical variable

center: what is the “typical” value (mean, median, mode) the data are concentrating around?
spread: how concentrated are the data around a typical value?
shape: does the distribution have one peak, or many? is it symmetric or skewed?
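
A rough sketch of how these three ideas translate into base R, using a small made-up sample (the vector `x` is hypothetical, not course data):

```r
# Hypothetical right-skewed sample, made up for illustration.
x <- c(1, 2, 2, 3, 3, 3, 4, 5, 9, 28)

# Center: the mean is pulled toward the long right tail; the median is not.
center_mean <- mean(x)      # 6
center_median <- median(x)  # 3

# Spread: standard deviation and interquartile range.
spread_sd <- sd(x)
spread_iqr <- IQR(x)

# Shape: a mean well above the median hints at right skew.
# A histogram of x would show the single peak and the long tail directly.
center_mean > center_median
```

Here the two large values drag the mean up to twice the median, which is one reason the choice of "typical" value depends on the shape.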

Interaction between shape and center

If there is only one peak:

Histograms provide more detail…

Boxplots hide multi-modality.

…but boxplots are nicer for side-by-side comparisons, especially with many groups

This is a little too busy IMO

Talking about two numerical variables

direction: positive or negative
shape: linear or nonlinear
strength: how close are points to the “trend”
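
One common numeric summary of direction and strength, assumed here for illustration, is the correlation coefficient from base R's `cor()`. The sketch below uses made-up data; all variable names are hypothetical. It also shows why shape has to be checked separately: correlation only measures linear association.

```r
# Hypothetical data, made up for illustration.
x <- 1:10
noise <- c(1, -1, 1, -1, 1, -1, 1, -1, 1, -1)

y_positive <- 2 * x + noise   # positive, roughly linear
y_negative <- -y_positive     # same strength, opposite direction
y_curved <- (x - 5.5)^2       # strong relationship, but U-shaped

cor(x, y_positive)  # close to 1
cor(x, y_negative)  # close to -1
cor(x, y_curved)    # essentially 0, even though y depends entirely on x
```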

Strength and direction of linear relationships

Nonlinear relationships

Data transformation

A quick reminder

bechdel |>                                                         # 1
  filter(roi > 400) |>                                             # 2
  select(title, roi, budget_2013, gross_2013, year, clean_test)    # 3

1: Start with the bechdel data frame
2: Filter for movies with roi greater than 400 (gross is more than 400 times budget)
3: Select the columns title, roi, budget_2013, gross_2013, year, and clean_test

# A tibble: 3 × 6
  title                     roi budget_2013 gross_2013  year clean_test
  <chr>                   <dbl>       <dbl>      <dbl> <dbl> <chr>     
1 Paranormal Activity      671.      505595  339424558  2007 dubious   
2 The Blair Witch Project  648.      839077  543776715  1999 ok        
3 El Mariachi              583.       11622    6778946  1992 nowomen
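
The note on step 2 suggests that roi is gross divided by budget. As a quick check under that assumption, the sketch below rebuilds the three rows above by hand and recomputes the ratio. The names `top_roi`, `roi_check`, and `checked` are hypothetical.

```r
library(dplyr)

# Values copied from the output above.
top_roi <- tibble(
  title = c("Paranormal Activity", "The Blair Witch Project", "El Mariachi"),
  budget_2013 = c(505595, 839077, 11622),
  gross_2013 = c(339424558, 543776715, 6778946)
)

# Recompute the ratio and apply the same filter as step 2.
checked <- top_roi |>
  mutate(roi_check = gross_2013 / budget_2013) |>
  filter(roi_check > 400)

checked
```

The recomputed values round to the same roi figures shown in the tibble, so all three rows survive the filter.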

The pipe `|>`

The pipe operator passes what comes before it into the function that comes after it as the first argument in that function.

sum(1, 2)

[1] 3

1 |> 
  sum(2)

[1] 3

select(filter(bechdel, roi > 400), title)

# A tibble: 3 × 1
  title                  
  <chr>                  
1 Paranormal Activity    
2 The Blair Witch Project
3 El Mariachi

bechdel |>
  filter(roi > 400) |>
  select(title)

# A tibble: 3 × 1
  title                  
  <chr>                  
1 Paranormal Activity    
2 The Blair Witch Project
3 El Mariachi
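
The same equivalence holds for any chain of functions, not just dplyr verbs. A small base R sketch (the native pipe requires R 4.1 or later; the names `nested` and `piped` are hypothetical):

```r
# Nested calls read inside-out: sqrt() runs first, then sum(), then round().
nested <- round(sum(sqrt(c(1, 4, 9))), digits = 1)

# The piped version reads top to bottom, in the order the steps run.
# Each result becomes the first argument of the next function.
piped <- c(1, 4, 9) |>
  sqrt() |>
  sum() |>
  round(digits = 1)

identical(nested, piped)
```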

Code style tip

In data transformation pipelines, always
- put a space before |>
- put a line break after |>
- indent the next line of code

In data visualization layers, always
- put a space before +
- put a line break after +
- indent the next line of code
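
A short sketch of both conventions side by side. The `mini` data frame is a hypothetical stand-in built from a few rows of the bechdel data shown later in this lecture, and `passing` and `p` are hypothetical names.

```r
library(dplyr)
library(ggplot2)

# Hypothetical mini data set; values copied from the bechdel output.
mini <- tibble(
  title = c("Dredd 3D", "About Time", "Admission", "47 Ronin"),
  roi = c(1.21, 8.55, 2.77, 0.819),
  binary = c("PASS", "PASS", "PASS", "FAIL")
)

# Pipeline: space before |>, line break after it, next line indented.
passing <- mini |>
  filter(binary == "PASS") |>
  arrange(desc(roi))

# Plot layers: space before +, line break after it, next line indented.
p <- ggplot(passing, aes(x = roi, y = title)) +
  geom_col(fill = "skyblue") +
  theme_minimal()
```

Typing `p` on its own line would display the plot.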

The pipe, in action

Find movies that pass the Bechdel test and display their titles and ROIs in descending order of ROI.

Start with the bechdel data frame:

bechdel

# A tibble: 1,615 × 7
   title                   year gross_2013 budget_2013    roi binary clean_test
   <chr>                  <dbl>      <dbl>       <dbl>  <dbl> <chr>  <chr>     
 1 21 & Over               2013   67878146    13000000  5.22  FAIL   notalk    
 2 Dredd 3D                2012   55078343    45658735  1.21  PASS   ok        
 3 12 Years a Slave        2013  211714070    20000000 10.6   FAIL   notalk    
 4 2 Guns                  2013  208105475    61000000  3.41  FAIL   notalk    
 5 42                      2013  190040426    40000000  4.75  FAIL   men       
 6 47 Ronin                2013  184166317   225000000  0.819 FAIL   men       
 7 A Good Day to Die Hard  2013  371598396    92000000  4.04  FAIL   notalk    
 8 About Time              2013  102648667    12000000  8.55  PASS   ok        
 9 Admission               2013   36014634    13000000  2.77  PASS   ok        
10 After Earth             2013  304895295   130000000  2.35  FAIL   notalk    
# ℹ 1,605 more rows

Filter for rows where binary is equal to "PASS":

bechdel |>
  filter(binary == "PASS")

# A tibble: 753 × 7
   title                 year gross_2013 budget_2013   roi binary clean_test
   <chr>                <dbl>      <dbl>       <dbl> <dbl> <chr>  <chr>     
 1 Dredd 3D              2012   55078343    45658735  1.21 PASS   ok        
 2 About Time            2013  102648667    12000000  8.55 PASS   ok        
 3 Admission             2013   36014634    13000000  2.77 PASS   ok        
 4 American Hustle       2013  397915817    40000000  9.95 PASS   ok        
 5 August: Osage County  2013   87609748    25000000  3.50 PASS   ok        
 6 Beautiful Creatures   2013   75392809    50000000  1.51 PASS   ok        
 7 Blue Jasmine          2013  101793664    18000000  5.66 PASS   ok        
 8 Carrie                2013  120268278    30000000  4.01 PASS   ok        
 9 Despicable Me 2       2013 1338831390    76000000 17.6  PASS   ok        
10 Elysium               2013  379242208   120000000  3.16 PASS   ok        
# ℹ 743 more rows

Arrange the rows in descending order of roi:

bechdel |>
  filter(binary == "PASS") |>
  arrange(desc(roi))

# A tibble: 753 × 7
   title                     year gross_2013 budget_2013   roi binary clean_test
   <chr>                    <dbl>      <dbl>       <dbl> <dbl> <chr>  <chr>     
 1 The Blair Witch Project   1999  543776715      839077 648.  PASS   ok        
 2 The Devil Inside          2012  157289709     1014639 155.  PASS   ok        
 3 My Big Fat Greek Wedding  2002  768922942     6475896 119.  PASS   ok        
 4 Chasing Amy               1997   39417963      362810 109.  PASS   ok        
 5 Slacker                   1991    4200140       39349 107.  PASS   ok        
 6 Insidious                 2010  164379554     1602348 103.  PASS   ok        
 7 Paranormal Activity 2     2010  280159759     3204696  87.4 PASS   ok        
 8 Paranormal Activity 3     2011  322170936     5178454  62.2 PASS   ok        
 9 The Last Exorcism         2010  118787648     1922817  61.8 PASS   ok        
10 Cinderella                1997  246710482     4208591  58.6 PASS   ok        
# ℹ 743 more rows

Select columns title and roi:

bechdel |>
  filter(binary == "PASS") |>
  arrange(desc(roi)) |>
  select(title, roi)

# A tibble: 753 × 2
   title                      roi
   <chr>                    <dbl>
 1 The Blair Witch Project  648. 
 2 The Devil Inside         155. 
 3 My Big Fat Greek Wedding 119. 
 4 Chasing Amy              109. 
 5 Slacker                  107. 
 6 Insidious                103. 
 7 Paranormal Activity 2     87.4
 8 Paranormal Activity 3     62.2
 9 The Last Exorcism         61.8
10 Cinderella                58.6
# ℹ 743 more rows
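
The order of the steps matters, because each verb only sees what the previous one hands it. The sketch below illustrates this on a hypothetical four-row stand-in for bechdel (`mini`, `ok`, and `bad` are made-up names; the values are copied from the output earlier in this lecture).

```r
library(dplyr)

# Hypothetical mini data set; values copied from the bechdel output.
mini <- tibble(
  title = c("Dredd 3D", "About Time", "Admission", "47 Ronin"),
  roi = c(1.21, 8.55, 2.77, 0.819),
  binary = c("PASS", "PASS", "PASS", "FAIL")
)

# Works: binary is still a column when filter() runs.
ok <- mini |>
  filter(binary == "PASS") |>
  arrange(desc(roi)) |>
  select(title, roi)

# Fails: select() has already dropped binary, so filter() cannot find it.
# tryCatch() is used here only so the script keeps running past the error.
bad <- tryCatch(
  mini |>
    select(title, roi) |>
    filter(binary == "PASS"),
  error = function(e) "error: binary is no longer a column"
)

ok
bad
```

Filtering before selecting keeps every column available for as long as it is needed.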

In this class, you will…

Build cakes (ggplot)

Stack dolls (pipe |>)

Master these constructs, and everything will be coming up roses!
