Data types and classes

Lecture 8

Author: John Zito

Affiliation: Duke University, STA 199 Spring 2026

Published: February 9, 2026

Warm-up

While you wait…

Prepare for today’s application exercise: ae-08-durham-climate-factors

Go to your ae project in RStudio.
Make sure all of your changes up to this point are committed and pushed, i.e., there’s nothing left in your Git pane.
Click Pull to get today’s application exercise file: ae-08-durham-climate-factors.qmd.
Wait until you’re prompted to work on the application exercise during class before editing the file.

Data types

How many classes do you have on Tuesdays?

survey

# A tibble: 276 × 3
   Timestamp         How many classes do you have on Tues…¹ `What year are you?`
   <chr>             <chr>                                  <chr>               
 1 2/9/2026 11:03:46 3                                      First-year          
 2 2/9/2026 11:29:24 2                                      Sophomore           
 3 2/9/2026 11:33:44 2                                      Sophomore           
 4 2/9/2026 11:33:48 2                                      Sophomore           
 5 2/9/2026 11:33:56 1                                      First-year          
 6 2/9/2026 11:33:56 3                                      First-year          
 7 2/9/2026 11:33:58 3                                      Sophomore           
 8 2/9/2026 11:34:07 3                                      Sophomore           
 9 2/9/2026 11:34:13 2                                      First-year          
10 2/9/2026 11:34:20 3                                      Junior              
# ℹ 266 more rows
# ℹ abbreviated name: ¹`How many classes do you have on Tuesdays?`

`rename()` variables

To make them easier to work with…

survey <- survey |>
  rename(
    tue_classes = `How many classes do you have on Tuesdays?`,
    year = `What year are you?`
  )

Variable types

What type of variable is tue_classes?

survey

# A tibble: 276 × 3
   Timestamp         tue_classes year      
   <chr>             <chr>       <chr>     
 1 2/9/2026 11:03:46 3           First-year
 2 2/9/2026 11:29:24 2           Sophomore 
 3 2/9/2026 11:33:44 2           Sophomore 
 4 2/9/2026 11:33:48 2           Sophomore 
 5 2/9/2026 11:33:56 1           First-year
 6 2/9/2026 11:33:56 3           First-year
 7 2/9/2026 11:33:58 3           Sophomore 
 8 2/9/2026 11:34:07 3           Sophomore 
 9 2/9/2026 11:34:13 2           First-year
10 2/9/2026 11:34:20 3           Junior    
# ℹ 266 more rows

Variable types

Why isn’t the tue_classes column numeric?

survey |>
  count(tue_classes)

# A tibble: 11 × 2
   tue_classes                                n
   <chr>                                  <int>
 1 0                                         13
 2 1                                         62
 3 1 class, 1 lab, 1 volunteering session     1
 4 2                                        118
 5 3                                         65
 6 4                                          7
 7 One                                        2
 8 Three                                      2
 9 Two                                        4
10 Two in both days                           1
11 two                                        1

Let’s clean it up

It’s a huge pain in the rear:

survey <- survey |>
  mutate(
    tue_classes = case_when(
      tue_classes == "1 class, 1 lab, 1 volunteering session" ~ "2",
      tue_classes == "One" ~ "1",
      tue_classes == "Three" ~ "3",
      tue_classes == "Two" ~ "2",
      tue_classes == "Two in both days" ~ "2",
      tue_classes == "two" ~ "2",
      .default = tue_classes
    ),
    tue_classes = as.numeric(tue_classes)
  )

survey |>
  count(tue_classes)

# A tibble: 5 × 2
  tue_classes     n
        <dbl> <int>
1           0    13
2           1    64
3           2   125
4           3    67
5           4     7
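
Why recode before converting? A minimal sketch, using a made-up vector `raw` in place of the real survey column: as.numeric() turns any string it can’t read as a number into NA, so skipping the case_when() step would quietly lose responses like "Two".

```r
# Hypothetical responses standing in for the tue_classes column
raw <- c("2", "Two", "3", "two")

# Words can't be parsed as numbers, so they become NA
# (R also emits a warning, silenced here to keep the output short)
converted <- suppressWarnings(as.numeric(raw))
converted                  # 2 NA 3 NA

# Number of responses that would be lost without recoding first
n_lost <- sum(is.na(converted))
n_lost                     # 2
```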

Data types

Data types in R

logical
double
integer
character
and some more, but we won’t be focusing on those

Logical & character

logical - Boolean values TRUE and FALSE

typeof(TRUE)

[1] "logical"

character - character strings

typeof("First-year")

[1] "character"

Double & integer

double - floating point numerical values (default numerical type)

typeof(2.5)

[1] "double"

typeof(3)

[1] "double"

integer - integer numerical values (indicated with an L)

typeof(3L)

[1] "integer"

typeof(1:3)

[1] "integer"
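
A small sketch of how the two numeric types relate, with made-up objects `x_int` and `x_dbl`: the values can be equal even though the types differ.

```r
x_int <- 3L   # integer, because of the L
x_dbl <- 3    # double, the default for numbers

is.integer(x_int)          # TRUE
is.integer(x_dbl)          # FALSE
x_int == x_dbl             # TRUE: the values are equal
identical(x_int, x_dbl)    # FALSE: the types are not the same

# Mixing the two in arithmetic gives a double
mixed <- x_int + 0.5
typeof(mixed)              # "double"
```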

Concatenation

Vectors can be constructed using the c() function.

Numeric vector:

c(1, 2, 3)

[1] 1 2 3

Character vector:

c("Hello", "World!")

[1] "Hello"  "World!"

Vector made of vectors:

c(c("hi", "hello"), c("bye", "jello"))

[1] "hi"    "hello" "bye"   "jello"
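
A short sketch of what “vector made of vectors” means in practice, using made-up objects `a`, `b`, and `nested`: c() always flattens its inputs into one vector.

```r
a <- c(1, 2, 3)

# Appending a value is just another concatenation
b <- c(a, 4)
b                  # 1 2 3 4

# Nesting does not survive: the result has four elements, not two
nested <- c(c("hi", "hello"), c("bye", "jello"))
length(nested)     # 4

# A single value is already a vector of length 1
length(5)          # 1
```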

Converting between types

with intention…

x <- 1:3
x

[1] 1 2 3

typeof(x)

[1] "integer"

y <- as.character(x)
y

[1] "1" "2" "3"

typeof(y)

[1] "character"

Converting between types

with intention…

x <- c(TRUE, FALSE)
x

[1]  TRUE FALSE

typeof(x)

[1] "logical"

y <- as.numeric(x)
y

[1] 1 0

typeof(y)

[1] "double"

Converting between types

without intention…

c(2, "Just this one!")

[1] "2"              "Just this one!"

R will happily convert between various types without complaint when different types of data are concatenated in a vector, and that’s not always a great thing!

Converting between types

without intention…

c(FALSE, 3L)

[1] 0 3

c(1.2, 3L)

[1] 1.2 3.0

c(2L, "two")

[1] "2"   "two"
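
These results follow one pattern: when types are mixed in c(), everything is converted to the most flexible type present, in the order logical, integer, double, character. A quick sketch to check this yourself:

```r
typeof(c(TRUE, 1L))              # "integer"
typeof(c(1L, 2.5))               # "double"
typeof(c(2.5, "a"))              # "character"

# With all four types together, character wins
all_mixed <- c(TRUE, 1L, 2.5, "a")
typeof(all_mixed)                # "character"
all_mixed                        # "TRUE" "1" "2.5" "a"
```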

Explicit vs. implicit coercion

Explicit coercion:

When you call a function like as.logical(), as.numeric(), as.integer(), as.double(), or as.character().

Implicit coercion:

Happens when you use a vector in a specific context that expects a certain type of vector.
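
A minimal sketch of the difference, using a made-up logical vector `late`: sum() and mean() expect numbers, so R converts TRUE/FALSE to 1/0 on its own, while as.integer() does the same conversion because we asked for it.

```r
late <- c(TRUE, FALSE, TRUE, TRUE)   # hypothetical data

# Implicit: the function's context triggers the conversion
n_late <- sum(late)
n_late                   # 3
prop_late <- mean(late)
prop_late                # 0.75

# Explicit: we request the conversion ourselves
late_int <- as.integer(late)
late_int                 # 1 0 1 1

# Implicit again: paste() expects text, so the number becomes a string
label <- paste("Classes:", 3)
label                    # "Classes: 3"
```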

You’ve seen explicit coercion before

statsci |> 
  pivot_longer(
    cols = -degree_type,
    values_to = "n",
    names_to = "year"
  )

# A tibble: 60 × 3
   degree_type year      n
   <chr>       <chr> <dbl>
 1 AB2         2011      0
 2 AB2         2012      1
 3 AB2         2013      0
 4 AB2         2014      0
 5 AB2         2015      4
 6 AB2         2016      4
 7 AB2         2017      1
 8 AB2         2018      0
 9 AB2         2019      0
10 AB2         2020      1
# ℹ 50 more rows

statsci |>
  pivot_longer(
    cols = -degree_type,
    values_to = "n",
    names_to = "year",
    names_transform = as.numeric
  )

# A tibble: 60 × 3
   degree_type  year     n
   <chr>       <dbl> <dbl>
 1 AB2          2011     0
 2 AB2          2012     1
 3 AB2          2013     0
 4 AB2          2014     0
 5 AB2          2015     4
 6 AB2          2016     4
 7 AB2          2017     1
 8 AB2          2018     0
 9 AB2          2019     0
10 AB2          2020     1
# ℹ 50 more rows

Data classes

Vectors are like Lego building blocks
We stick them together to build more complicated constructs, e.g. representations of data
The class attribute relates to the S3 class of an object, which determines its behaviour
- You don’t need to worry about what S3 classes really mean, but you can read more about them if you’re curious
Examples: factors, dates, and data frames

Factors

R uses factors to handle categorical variables, variables that have a fixed and known set of possible values

class_years <- factor(
  c("First-year", "Sophomore", "Sophomore", "Senior", "Junior")
)
class_years

[1] First-year Sophomore  Sophomore  Senior     Junior    
Levels: First-year Junior Senior Sophomore

typeof(class_years)

[1] "integer"

class(class_years)

[1] "factor"

More on factors

We can think of factors like a character vector (level labels) and an integer vector (level numbers) glued together

glimpse(class_years)

 Factor w/ 4 levels "First-year","Junior",..: 1 4 4 3 2

as.integer(class_years)

[1] 1 4 4 3 2
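
One consequence worth seeing once, sketched with a made-up factor `f` whose labels look like numbers: as.integer() returns the level numbers, not the values printed on screen. Converting to character first recovers the values.

```r
f <- factor(c("10", "20", "10"))
levels(f)                       # "10" "20"

# Level numbers, not the values you see printed
codes <- as.integer(f)
codes                           # 1 2 1

# Go through character to get the actual numbers back
values <- as.numeric(as.character(f))
values                          # 10 20 10
```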

Dates

today <- as.Date("2024-09-24")
today

[1] "2024-09-24"

typeof(today)

[1] "double"

class(today)

[1] "Date"

More on dates

We can think of dates like an integer (the number of days since the origin, 1 Jan 1970) and an integer (the origin) glued together

as.integer(today)

[1] 19990

as.integer(today) / 365 # roughly 55 yrs

[1] 54.76712
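
Because a date is a day count underneath, ordinary arithmetic works on it. A short sketch using the same date as above; `next_week`, `since_origin`, and `rebuilt` are names made up for illustration.

```r
today <- as.Date("2024-09-24")

# Strip the class to see the stored number
unclass(today)                       # 19990

# Adding a number adds that many days
next_week <- today + 7
next_week                            # "2024-10-01"

# Subtracting two dates gives a difference in days
since_origin <- today - as.Date("1970-01-01")
as.numeric(since_origin)             # 19990

# And the day count plus the origin rebuilds the date
rebuilt <- as.Date(19990, origin = "1970-01-01")
rebuilt                              # "2024-09-24"
```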

Data frames

We can think of data frames like vectors of equal length glued together

df <- data.frame(x = 1:2, y = 3:4)
df

  x y
1 1 3
2 2 4

typeof(df)

[1] "list"

class(df)

[1] "data.frame"

Lists

Lists are a generic vector container; vectors of any type can go in them

l <- list(
  x = 1:4,
  y = c("hi", "hello", "jello"),
  z = c(TRUE, FALSE)
)
l

$x
[1] 1 2 3 4

$y
[1] "hi"    "hello" "jello"

$z
[1]  TRUE FALSE
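
A sketch of a few ways to look inside the list above (the list is rebuilt here so the example runs on its own): `$` and `[[ ]]` pull out the vector itself, while single brackets return a smaller list.

```r
l <- list(
  x = 1:4,
  y = c("hi", "hello", "jello"),
  z = c(TRUE, FALSE)
)

l$y                  # "hi" "hello" "jello"
l[["x"]]             # 1 2 3 4

# Single brackets keep the list wrapper
class(l["x"])        # "list"

# Each element keeps its own type and length
types <- sapply(l, typeof)
types                # x: "integer", y: "character", z: "logical"
sizes <- lengths(l)
sizes                # x: 4, y: 3, z: 2
```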

Lists and data frames

A data frame is a special list containing vectors of equal length

df

  x y
1 1 3
2 2 4

When we use the pull() function, we extract a vector from the data frame

df |>
  pull(y)

[1] 3 4
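
Since a data frame is a list, the list tools work on it too. A base R sketch (no dplyr needed) with the same `df`; `df$y` and `df[["y"]]` do the same job here as pull(y). The object `bad` is made up to show what happens when the lengths don’t match.

```r
df <- data.frame(x = 1:2, y = 3:4)

is.list(df)                    # TRUE
length(df)                     # 2: one element per column

# Two base R ways to extract a column as a vector
df$y                           # 3 4
identical(df$y, df[["y"]])     # TRUE

# Columns of length 2 and 3 can't be glued together: this is an error
bad <- try(data.frame(x = 1:2, y = 1:3), silent = TRUE)
inherits(bad, "try-error")     # TRUE
```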

Working with factors

Read data in as character strings

survey

# A tibble: 276 × 3
   Timestamp         tue_classes year      
   <chr>                   <dbl> <chr>     
 1 2/9/2026 11:03:46           3 First-year
 2 2/9/2026 11:29:24           2 Sophomore 
 3 2/9/2026 11:33:44           2 Sophomore 
 4 2/9/2026 11:33:48           2 Sophomore 
 5 2/9/2026 11:33:56           1 First-year
 6 2/9/2026 11:33:56           3 First-year
 7 2/9/2026 11:33:58           3 Sophomore 
 8 2/9/2026 11:34:07           3 Sophomore 
 9 2/9/2026 11:34:13           2 First-year
10 2/9/2026 11:34:20           3 Junior    
# ℹ 266 more rows

But coerce when plotting

ggplot(survey, mapping = aes(x = year)) +
  geom_bar()

Use forcats to reorder levels

survey |>
  mutate(
    year = fct_relevel(year, "First-year", "Sophomore", "Junior", "Senior")
  ) |>
  ggplot(mapping = aes(x = year)) +
  geom_bar()
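
Why the reordering is needed, sketched in base R with a made-up vector `year` rather than the survey data: by default, factor levels are sorted alphabetically, which puts Junior and Senior before Sophomore. Setting the levels by hand, which is what fct_relevel() does for us above, fixes the order that tables and bar plots follow.

```r
# Hypothetical responses
year <- c("Sophomore", "First-year", "Junior", "Sophomore", "Senior")

# Default: alphabetical levels
default_levels <- levels(factor(year))
default_levels   # "First-year" "Junior" "Senior" "Sophomore"

# Levels set by hand, in the order we want
year_fct <- factor(
  year,
  levels = c("First-year", "Sophomore", "Junior", "Senior")
)
levels(year_fct) # "First-year" "Sophomore" "Junior" "Senior"

# Counts now come out in level order
counts <- table(year_fct)
counts
```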

A peek into forcats

Reordering levels by:

fct_relevel(): hand
fct_infreq(): frequency
fct_reorder(): sorting along another variable
fct_rev(): reversing

…

Changing level values by:

fct_lump(): lumping uncommon levels together into “other”
fct_other(): manually replacing some levels with “other”

…
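
A quick sketch of a few of these functions on a toy factor `f` (made up for illustration, and assuming the forcats package is installed), looking only at how the levels change:

```r
library(forcats)

f <- factor(c("a", "b", "b", "c", "c", "c"))
levels(f)                           # "a" "b" "c"

# Most frequent level first
levels(fct_infreq(f))               # "c" "b" "a"

# Reverse the current order
levels(fct_rev(f))                  # "c" "b" "a"

# Move one level to the front by hand
levels(fct_relevel(f, "b"))         # "b" "a" "c"

# Keep "c" and replace everything else with "Other"
levels(fct_other(f, keep = "c"))    # "c" "Other"
```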
