Tidying data

Lecture 6

John Zito

Duke University
STA 199 Spring 2026

2026-02-02

Warm-up

While you wait…

Prepare for today’s application exercise: ae-06-majors-tidy

  • Go to your ae project in RStudio.

  • Make sure all of your changes up to this point are committed and pushed, i.e., there’s nothing left in your Git pane.

  • Click Pull to get today’s application exercise file: ae-06-majors-tidy.qmd.

  • Wait till the you’re prompted to work on the application exercise during class before editing the file.

Miscellany: logical operators

Generally useful in a filter() but will come up in various other places as well…

operator definition
< is less than?
<= is less than or equal to?
> is greater than?
>= is greater than or equal to?
== is exactly equal to?
!= is not equal to?

Miscellany: logical operators (cont.)

Generally useful in a filter() but will come up in various other places as well…

operator definition
x & y is x AND y?
x | y is x OR y?
is.na(x) is x NA?
!is.na(x) is x not NA?
x %in% y is x in y?
!(x %in% y) is x not in y?
!x is not x? (only makes sense if x is TRUE or FALSE)

Miscellany: assignment

Let’s make a tiny data frame to use as an example:

df <- tibble(x = c(1, 2, 3, 4, 5), y = c("a", "a", "b", "c", "c"))
df
# A tibble: 5 × 2
      x y    
  <dbl> <chr>
1     1 a    
2     2 a    
3     3 b    
4     4 c    
5     5 c    

Miscellany: assignment

Suppose you run the following and then you inspect df, will the x variable have values 1, 2, 3, 4, 5 or 2, 4, 6, 8, 10?

df |>
  mutate(x = x * 2)
# A tibble: 5 × 2
      x y    
  <dbl> <chr>
1     2 a    
2     4 a    
3     6 b    
4     8 c    
5    10 c    
df

Miscellany: assignment

Suppose you run the following and then you inspect df, will the x variable have values 1, 2, 3, 4, 5 or 2, 4, 6, 8, 10?

df |>
  mutate(x = x * 2)
# A tibble: 5 × 2
      x y    
  <dbl> <chr>
1     2 a    
2     4 a    
3     6 b    
4     8 c    
5    10 c    
df
# A tibble: 5 × 2
      x y    
  <dbl> <chr>
1     1 a    
2     2 a    
3     3 b    
4     4 c    
5     5 c    

Do something and show me

Miscellany: assignment

Suppose you run the following and then you inspect df, will the x variable has values 1, 2, 3, 4, 5 or 2, 4, 6, 8, 10?

df <- df |>
  mutate(x = x * 2)
df

Miscellany: assignment

Suppose you run the following and then you inspect df, will the x variable has values 1, 2, 3, 4, 5 or 2, 4, 6, 8, 10?

df <- df |>
  mutate(x = x * 2)
df
# A tibble: 5 × 2
      x y    
  <dbl> <chr>
1     2 a    
2     4 a    
3     6 b    
4     8 c    
5    10 c    

Do something and save result

Miscellany: assignment

Do something, save result, overwriting original

df <- tibble(
  x = c(1, 2, 3, 4, 5), 
  y = c("a", "a", "b", "c", "c")
)
df <- df |>
  mutate(x = x * 2)
df
# A tibble: 5 × 2
      x y    
  <dbl> <chr>
1     2 a    
2     4 a    
3     6 b    
4     8 c    
5    10 c    

Do something, save result, not overwriting original

df <- tibble(
  x = c(1, 2, 3, 4, 5), 
  y = c("a", "a", "b", "c", "c")
)
df_new <- df |>
  mutate(x = x * 2)
df_new
# A tibble: 5 × 2
      x y    
  <dbl> <chr>
1     2 a    
2     4 a    
3     6 b    
4     8 c    
5    10 c    

Miscellany: assignment

Do something, save result, overwriting original when you shouldn’t

df <- tibble(
  x = c(1, 2, 3, 4, 5), 
  y = c("a", "a", "b", "c", "c")
)
df <- df |>
  group_by(y) |>
  summarize(mean_x = mean(x))
df
# A tibble: 3 × 2
  y     mean_x
  <chr>  <dbl>
1 a        1.5
2 b        3  
3 c        4.5

Do something, save result, not overwriting original when you shouldn’t

df <- tibble(
  x = c(1, 2, 3, 4, 5), 
  y = c("a", "a", "b", "c", "c")
)
df_summary <- df |>
  group_by(y) |>
  summarize(mean_x = mean(x))
df_summary
# A tibble: 3 × 2
  y     mean_x
  <chr>  <dbl>
1 a        1.5
2 b        3  
3 c        4.5

Miscellany: assignment

Do something, save result, overwriting original
data frame

df <- tibble(
  x = c(1, 2, 3, 4, 5), 
  y = c("a", "a", "b", "c", "c")
)
df <- df |>
  mutate(z = x + 2)
df
# A tibble: 5 × 3
      x y         z
  <dbl> <chr> <dbl>
1     1 a         3
2     2 a         4
3     3 b         5
4     4 c         6
5     5 c         7

Do something, save result, overwriting original
column

df <- tibble(
  x = c(1, 2, 3, 4, 5), 
  y = c("a", "a", "b", "c", "c")
)
df <- df |>
  mutate(x = x + 2)
df
# A tibble: 5 × 2
      x y    
  <dbl> <chr>
1     3 a    
2     4 a    
3     5 b    
4     6 c    
5     7 c    

Last time

We had to mutate before plotting. Two methods:

Do it all at once and don’t store:

gerrymander |>
  mutate(flip18 = as_factor(flip18)) |>
  ggplot(aes(x = gerry, fill = flip18)) +
  geom_bar(position = "fill")

Mutate, store change, then plot

gerrymander <- gerrymander |>
  mutate(flip18 = as_factor(flip18))

ggplot(gerrymander, aes(x = gerry, fill = flip18)) +
  geom_bar(position = "fill")

A common mistake

Why is this a travesty?

gerrymander <- gerrymander |>
  mutate(flip18 = as_factor(flip18)) |>
  ggplot(aes(x = gerry, fill = flip18)) +
  geom_bar(position = "fill")

The data are lost!

gerrymander is no longer a data frame. It’s a plot…thing:

gerrymander |>
  filter(state == "TX")
Error in UseMethod("filter"): no applicable method for 'filter' applied to an object of class "c('ggplot2::ggplot', 'ggplot', 'ggplot2::gg', 'S7_object', 'gg')"

Lesson

You can overwrite an object if its identity is preserved; if the rows or columns now mean something totally different, or if it’s an entirely different class of object, then give it a new name.

Data tidying

Tidy data

“Tidy datasets are easy to manipulate, model and visualise, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table.”

Tidy Data, https://vita.had.co.nz/papers/tidy-data.pdf

Note: “easy to manipulate” = “straightforward to manipulate”

Goal

Visualize StatSci majors over the years!

Data

statsci
# A tibble: 4 × 16
  degree_type `2011` `2012` `2013` `2014` `2015` `2016` `2017` `2018` `2019`
  <chr>        <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1 AB2              0      1      0      0      4      4      1      0      0
2 AB               2      2      4      1      3      6      3      4      4
3 BS2              2      6      1      0      5      6      6      8      8
4 BS               5      9      4     13     10     17     24     21     26
# ℹ 6 more variables: `2020` <dbl>, `2021` <dbl>, `2022` <dbl>, `2023` <dbl>,
#   `2024` <dbl>, `2025` <dbl>
  • The first column (variable) is the degree:

    • BS (Bachelor of Science)
    • BS2 (Bachelor of Science, 2nd major)
    • AB (Bachelor of Arts)
    • AB2 (Bachelor of Arts, 2nd major)
  • The remaining columns show the number of students graduating with that major in a given academic year from 2011 to 2025.

Let’s plan!

Review the goal plot and sketch the data frame needed to create it. What would go inside aes when we call ggplot?

The Goal

We want to write code that starts something like this:

ggplot(statsci, aes(x = year, y = n, color = degree_type)) + 
  ...


But our data are not in the right format :(

statsci
# A tibble: 4 × 16
  degree_type `2011` `2012` `2013` `2014` `2015` `2016` `2017` `2018` `2019`
  <chr>        <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1 AB2              0      1      0      0      4      4      1      0      0
2 AB               2      2      4      1      3      6      3      4      4
3 BS2              2      6      1      0      5      6      6      8      8
4 BS               5      9      4     13     10     17     24     21     26
# ℹ 6 more variables: `2020` <dbl>, `2021` <dbl>, `2022` <dbl>, `2023` <dbl>,
#   `2024` <dbl>, `2025` <dbl>

The Challenge

How do we go from this ….

# A tibble: 4 x 16
  degree_type 2011 2012 2013 2014 2015 2016 2017 2018 2019
1         AB2    0    1    0    0    4    4    1    0    0
2          AB    2    2    4    1    3    6    3    4    4
3         BS2    2    6    1    0    5    6    6    8    8
4          BS    5    9    4   13   10   17   24   21   26

…. to this??

# A tibble: 60 x 3
   degree_type year n
1          AB2 2011 0
2          AB2 2012 1
3          AB2 2013 0
4          AB2 2014 0
5          AB2 2015 4
6          AB2 2016 4
7          AB2 2017 1
8          AB2 2018 0
9          AB2 2019 0
10         AB2 2020 1
11         AB2 2021 2
12         AB2 2022 0
13         AB2 2023 3
14         AB2 2024 1
15         AB2 2025 0
16          AB 2011 2
17          AB 2012 2

Pivot

pivot_longer()

# A tibble: 4 x 16
  degree_type 2011 2012 2013 2014 2015 2016 2017 2018 2019
1         AB2    0    1    0    0    4    4    1    0    0
2          AB    2    2    4    1    3    6    3    4    4
3         BS2    2    6    1    0    5    6    6    8    8
4          BS    5    9    4   13   10   17   24   21   26

Pivot the statsci data frame longer such that each row represents a degree type / year combination.

year and number of graduates for that year are columns in the result data frame.

# A tibble: 60 x 3
   degree_type year n
1          AB2 2011 0
2          AB2 2012 1
3          AB2 2013 0
4          AB2 2014 0
5          AB2 2015 4
6          AB2 2016 4
7          AB2 2017 1
8          AB2 2018 0
9          AB2 2019 0
10         AB2 2020 1
11         AB2 2021 2
12         AB2 2022 0
13         AB2 2023 3
14         AB2 2024 1
15         AB2 2025 0
16          AB 2011 2
17          AB 2012 2

pivot_longer()

# A tibble: 4 x 16
  degree_type 2011 2012 2013 2014 2015 2016 2017 2018 2019
1         AB2    0    1    0    0    4    4    1    0    0
2          AB    2    2    4    1    3    6    3    4    4
3         BS2    2    6    1    0    5    6    6    8    8
4          BS    5    9    4   13   10   17   24   21   26
statsci |>
  pivot_longer(
    
    cols = ___________________ ,
    
    names_to = _______________ ,
    
    values_to = ______________
  )

# A tibble: 60 x 3
   degree_type year n
1          AB2 2011 0
2          AB2 2012 1
3          AB2 2013 0
4          AB2 2014 0
5          AB2 2015 4
6          AB2 2016 4
7          AB2 2017 1
8          AB2 2018 0
9          AB2 2019 0
10         AB2 2020 1
11         AB2 2021 2
12         AB2 2022 0
13         AB2 2023 3
14         AB2 2024 1
15         AB2 2025 0
16          AB 2011 2
17          AB 2012 2

year

statsci |> 
  pivot_longer(
    cols = -degree_type,
    values_to = "n",
    names_to = "year"
  )
# A tibble: 60 × 3
   degree_type year      n
   <chr>       <chr> <dbl>
 1 AB2         2011      0
 2 AB2         2012      1
 3 AB2         2013      0
 4 AB2         2014      0
 5 AB2         2015      4
 6 AB2         2016      4
 7 AB2         2017      1
 8 AB2         2018      0
 9 AB2         2019      0
10 AB2         2020      1
# ℹ 50 more rows

What is the type of the year variable? Why? What should it be?

  • It’s a character (chr) variable since the information came from the columns of the original data frame.

  • R cannot know that these character strings represent years.

  • The variable type should be numeric.

pivot_longer() again

This time, also make sure year is a numerical variable in the resulting data frame.

statsci |>
  pivot_longer(
    cols = -degree_type,
    values_to = "n",
    names_to = "year"
  )

pivot_longer() again

This time, also make sure year is a numerical variable in the resulting data frame.

statsci |>
  pivot_longer(
    cols = -degree_type,
    values_to = "n",
    names_to = "year",
    names_transform = as.numeric
  )
# A tibble: 60 × 3
   degree_type  year     n
   <chr>       <dbl> <dbl>
 1 AB2          2011     0
 2 AB2          2012     1
 3 AB2          2013     0
 4 AB2          2014     0
 5 AB2          2015     4
 6 AB2          2016     4
 7 AB2          2017     1
 8 AB2          2018     0
 9 AB2          2019     0
10 AB2          2020     1
# ℹ 50 more rows

ae-06-majors-tidy

  • Go to your ae project in RStudio.

  • If you haven’t yet done so, make sure all of your changes up to this point are committed and pushed, i.e., there’s nothing left in your Git pane.

  • If you haven’t yet done so, click Pull to get today’s application exercise file: ae-06-majors-tidy.qmd.

  • Work through the application exercise in class, and render, commit, and push your edits by the end of class.

Pivot Wider

We pivotted longer… what about wider?

# A tibble: 4 x 16
  degree_type 2011 2012 2013 2014 2015 2016 2017 2018 2019
1         AB2    0    1    0    0    4    4    1    0    0
2          AB    2    2    4    1    3    6    3    4    4
3         BS2    2    6    1    0    5    6    6    8    8
4          BS    5    9    4   13   10   17   24   21   26

Can we go the other direction?

# A tibble: 60 x 3
   degree_type year n
1          AB2 2011 0
2          AB2 2012 1
3          AB2 2013 0
4          AB2 2014 0
5          AB2 2015 4
6          AB2 2016 4
7          AB2 2017 1
8          AB2 2018 0
9          AB2 2019 0
10         AB2 2020 1
11         AB2 2021 2
12         AB2 2022 0
13         AB2 2023 3
14         AB2 2024 1
15         AB2 2025 0
16          AB 2011 2
17          AB 2012 2

We pivotted longer… what about wider?

# A tibble: 4 x 16
  degree_type 2011 2012 2013 2014 2015 2016 2017 2018 2019
1         AB2    0    1    0    0    4    4    1    0    0
2          AB    2    2    4    1    3    6    3    4    4
3         BS2    2    6    1    0    5    6    6    8    8
4          BS    5    9    4   13   10   17   24   21   26
statsci_longer |> 
  pivot_wider(
    names_from = ____________ ,
    
    values_from = ___________ , 
  )

# A tibble: 60 x 3
   degree_type year n
1          AB2 2011 0
2          AB2 2012 1
3          AB2 2013 0
4          AB2 2014 0
5          AB2 2015 4
6          AB2 2016 4
7          AB2 2017 1
8          AB2 2018 0
9          AB2 2019 0
10         AB2 2020 1
11         AB2 2021 2
12         AB2 2022 0
13         AB2 2023 3
14         AB2 2024 1
15         AB2 2025 0
16          AB 2011 2
17          AB 2012 2

Recap: Pivot

  • When should you pivot? If all of the data you need is in your data frame, but the columns you need don’t exist, there is a good chance it’s time to pivot!
  • Wide and long: Data sets can’t be labeled as wide or long but they can be made wider or longer for a certain analysis that requires a certain format
  • Pivot longer - data type: When pivoting longer, variable names that turn into values are characters by default. If you need them to be in another format, you need to explicitly make that transformation, which you can do so within the pivot_longer() function.

Recap: Plotting

  • You can tweak a plot forever, but at some point the tweaks are likely not very productive.

  • However, you should always be critical of defaultsand see if you can improve the plot to better portray your data / results / what you want to communicate.