AE 10: sales + import

Getting started

Packages

We will use the following two packages in this application exercise.

  • tidyverse: For data import, wrangling, and visualization.
  • readxl: For importing data from Excel.

Sales

Sales data are stored in an Excel file that looks like the following:

Task 1: Read in the Excel file called sales.xlsx from the data folder such that it looks like the following.

sales_raw <- read_excel(
  "data/sales.xlsx", 
  skip = 3,
  col_names = c("id", "n")
  )

sales_raw
# A tibble: 9 × 2
  id      n    
  <chr>   <chr>
1 Brand 1 n    
2 1234    8    
3 8721    2    
4 1822    3    
5 Brand 2 n    
6 3333    1    
7 2156    3    
8 3987    6    
9 3216    5    

Task 2: Manipulate the sales data such such that it looks like the following.

sales <- sales_raw |>
  mutate(
    is_brand_name = str_detect(id, "Brand"),
    brand = if_else(is_brand_name, id, NA)
  ) |>
  fill(brand) |>
  filter(!is_brand_name) |>
  select(brand, id, n) |>
  mutate(
    id = as.numeric(id),
    n = as.numeric(n)
  )

sales
# A tibble: 7 × 3
  brand      id     n
  <chr>   <dbl> <dbl>
1 Brand 1  1234     8
2 Brand 1  8721     2
3 Brand 1  1822     3
4 Brand 2  3333     1
5 Brand 2  2156     3
6 Brand 2  3987     6
7 Brand 2  3216     5

Why should we bother with writing code for reading the data in by skipping columns and assigning variable names as well as cleaning it up in multiple steps instead of opening the Excel file and editing the data in there to prepare it for a clean import?

Answer: Because the code allows us to struggle once and re-use for future datasets and leaves a transparent trail of our modifications while manipulating the data in Excel directly is neither reproducible nor reusable.

Break it down

The pipeline for Task 2 is subtle. Here is all of the intermediate output broken down step-by-step. Go through it and make sure you understand what each step did and why it was necessary.

1. Which rows are the funky ones?

sales <- sales_raw |>
  mutate(
    is_brand_name = str_detect(id, "Brand")
  )

sales
# A tibble: 9 × 3
  id      n     is_brand_name
  <chr>   <chr> <lgl>        
1 Brand 1 n     TRUE         
2 1234    8     FALSE        
3 8721    2     FALSE        
4 1822    3     FALSE        
5 Brand 2 n     TRUE         
6 3333    1     FALSE        
7 2156    3     FALSE        
8 3987    6     FALSE        
9 3216    5     FALSE        

2. Initialize the brand column

sales <- sales_raw |>
  mutate(
    is_brand_name = str_detect(id, "Brand"),
    brand = if_else(is_brand_name, id, NA)
  )

sales
# A tibble: 9 × 4
  id      n     is_brand_name brand  
  <chr>   <chr> <lgl>         <chr>  
1 Brand 1 n     TRUE          Brand 1
2 1234    8     FALSE         <NA>   
3 8721    2     FALSE         <NA>   
4 1822    3     FALSE         <NA>   
5 Brand 2 n     TRUE          Brand 2
6 3333    1     FALSE         <NA>   
7 2156    3     FALSE         <NA>   
8 3987    6     FALSE         <NA>   
9 3216    5     FALSE         <NA>   

3. Fill empty cells with the value above

sales <- sales_raw |>
  mutate(
    is_brand_name = str_detect(id, "Brand"),
    brand = if_else(is_brand_name, id, NA)
  ) |>
  fill(brand)

sales
# A tibble: 9 × 4
  id      n     is_brand_name brand  
  <chr>   <chr> <lgl>         <chr>  
1 Brand 1 n     TRUE          Brand 1
2 1234    8     FALSE         Brand 1
3 8721    2     FALSE         Brand 1
4 1822    3     FALSE         Brand 1
5 Brand 2 n     TRUE          Brand 2
6 3333    1     FALSE         Brand 2
7 2156    3     FALSE         Brand 2
8 3987    6     FALSE         Brand 2
9 3216    5     FALSE         Brand 2

4. Throw out the funky rows

sales <- sales_raw |>
  mutate(
    is_brand_name = str_detect(id, "Brand"),
    brand = if_else(is_brand_name, id, NA)
  ) |>
  fill(brand) |>
  filter(!is_brand_name) 

sales
# A tibble: 7 × 4
  id    n     is_brand_name brand  
  <chr> <chr> <lgl>         <chr>  
1 1234  8     FALSE         Brand 1
2 8721  2     FALSE         Brand 1
3 1822  3     FALSE         Brand 1
4 3333  1     FALSE         Brand 2
5 2156  3     FALSE         Brand 2
6 3987  6     FALSE         Brand 2
7 3216  5     FALSE         Brand 2

5. Throw out the unnecessary columns (and re-order)

sales <- sales_raw |>
  mutate(
    is_brand_name = str_detect(id, "Brand"),
    brand = if_else(is_brand_name, id, NA)
  ) |>
  fill(brand) |>
  filter(!is_brand_name) |>
  select(brand, id, n) 

sales
# A tibble: 7 × 3
  brand   id    n    
  <chr>   <chr> <chr>
1 Brand 1 1234  8    
2 Brand 1 8721  2    
3 Brand 1 1822  3    
4 Brand 2 3333  1    
5 Brand 2 2156  3    
6 Brand 2 3987  6    
7 Brand 2 3216  5    

6. Make sure id and n are actually numbers

sales <- sales_raw |>
  mutate(
    is_brand_name = str_detect(id, "Brand"),
    brand = if_else(is_brand_name, id, NA)
  ) |>
  fill(brand) |>
  filter(!is_brand_name) |>
  select(brand, id, n) |>
  mutate(
    id = as.numeric(id),
    n = as.numeric(n)
  )

sales
# A tibble: 7 × 3
  brand      id     n
  <chr>   <dbl> <dbl>
1 Brand 1  1234     8
2 Brand 1  8721     2
3 Brand 1  1822     3
4 Brand 2  3333     1
5 Brand 2  2156     3
6 Brand 2  3987     6
7 Brand 2  3216     5