Lecture 4
Duke University
STA 199 Spring 2026
2026-01-26
Prepare for today’s application exercise: ae-04-gerrymander-explore-I
Switch to your ae project in RStudio.
Make sure all of your changes up to this point are committed (i.e., there’s nothing left in your Git pane).
Click Pull to get today’s application exercise file: ae-04-gerrymander-explore-I.qmd.
Then push. So: Render > Commit > Pull > Push.
(To see the new file, you may have to refresh the Files pane by clicking “Home”.)
Wait until you’re prompted to work on the application exercise during class before editing the file.
| Data set | Occasion | Source | A row was a… |
|---|---|---|---|
| age_guesses | Lecture 0 | file | STAAWANANA student |
| penguins | Lecture 1 | package | penguin |
| unvotes | Lecture 2 | package | year-issue-country |
| nc_county | Lab, HW | file | NC county |
| bechdel | Lecture 3 | file | film |
| gerrymander | Lecture 4 | package | district |
gerrymander
# A tibble: 435 × 12
district last_name first_name party16 clinton16 trump16 dem16 state party18
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr> <chr>
1 AK-AL Young Don R 37.6 52.8 0 AK R
2 AL-01 Byrne Bradley R 34.1 63.5 0 AL R
3 AL-02 Roby Martha R 33 64.9 0 AL R
4 AL-03 Rogers Mike D. R 32.3 65.3 0 AL R
5 AL-04 Aderholt Rob R 17.4 80.4 0 AL R
6 AL-05 Brooks Mo R 31.3 64.7 0 AL R
7 AL-06 Palmer Gary R 26.1 70.8 0 AL R
8 AL-07 Sewell Terri D 69.8 28.6 1 AL D
9 AR-01 Crawford Rick R 30.2 65 0 AR R
10 AR-02 Hill French R 41.7 52.4 0 AR R
# ℹ 425 more rows
# ℹ 3 more variables: dem18 <dbl>, flip18 <dbl>, gerry <fct>
gerrymander
What is a good first function to use to get to know a dataset?
Rows: 435
Columns: 12
$ district <chr> "AK-AL", "AL-01", "AL-02", "AL-03", "AL-04", "AL-05", "AL-0…
$ last_name <chr> "Young", "Byrne", "Roby", "Rogers", "Aderholt", "Brooks", "…
$ first_name <chr> "Don", "Bradley", "Martha", "Mike D.", "Rob", "Mo", "Gary",…
$ party16 <chr> "R", "R", "R", "R", "R", "R", "R", "D", "R", "R", "R", "R",…
$ clinton16 <dbl> 37.6, 34.1, 33.0, 32.3, 17.4, 31.3, 26.1, 69.8, 30.2, 41.7,…
$ trump16 <dbl> 52.8, 63.5, 64.9, 65.3, 80.4, 64.7, 70.8, 28.6, 65.0, 52.4,…
$ dem16 <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0,…
$ state <chr> "AK", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AR", "AR",…
$ party18 <chr> "R", "R", "R", "R", "R", "R", "R", "D", "R", "R", "R", "R",…
$ dem18 <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0,…
$ flip18 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,…
$ gerry <fct> mid, high, high, high, high, high, high, high, mid, mid, mi…
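The overview above is the kind of output `glimpse()` produces. A minimal sketch, assuming dplyr is installed; `gerry_small` is a hypothetical stand-in typed by hand from the first three rows of the printout, not the real `gerrymander` data frame:

```r
library(dplyr)

# Hypothetical stand-in: first three rows and four columns of the printout
gerry_small <- tibble(
  district  = c("AK-AL", "AL-01", "AL-02"),
  party16   = c("R", "R", "R"),
  clinton16 = c(37.6, 34.1, 33.0),
  trump16   = c(52.8, 63.5, 64.9)
)

# One line per column: name, type, and the first few values
glimpse(gerry_small)

# Other quick checks: dimensions and column names
nrow(gerry_small)
ncol(gerry_small)
names(gerry_small)
```

With the full data, the same call would be `glimpse(gerrymander)`.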
gerrymander
Rows: Congressional districts
Columns:
Congressional district and state
2016 election: winning party, % for Clinton, % for Trump, whether a Democrat won the House election, name of election winner
2018 election: winning party, whether a Democrat won the 2018 House election
Whether a Democrat flipped the seat in the 2018 election
Prevalence of gerrymandering: low, mid, and high
gerry
| Variable | Type |
|---|---|
| district | categorical, ID |
| last_name | categorical, ID |
| first_name | categorical, ID |
| party16 | categorical |
| clinton16 | numerical, continuous |
| trump16 | numerical, continuous |
| dem16 | categorical |
| state | categorical |
| party18 | categorical |
| dem18 | categorical |
| flip18 | categorical |
| gerry | categorical, ordinal |
Analyzing a single variable:
Numerical: histogram, box plot, density plot, etc.
Categorical: bar plot, pie chart, etc.
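A rough sketch of one plot for each variable type, assuming dplyr and ggplot2 are installed. `gerry_small` is a hypothetical stand-in holding only the first ten rows shown earlier, and the bin width is an arbitrary choice:

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in: first ten rows of two columns from the printout
gerry_small <- tibble(
  party16 = c("R", "R", "R", "R", "R", "R", "R", "D", "R", "R"),
  trump16 = c(52.8, 63.5, 64.9, 65.3, 80.4, 64.7, 70.8, 28.6, 65.0, 52.4)
)

# Numerical variable: histogram (binwidth = 10 is an arbitrary choice)
p_hist <- ggplot(gerry_small, aes(x = trump16)) +
  geom_histogram(binwidth = 10)

# Categorical variable: bar plot of counts per party
p_bar <- ggplot(gerry_small, aes(x = party16)) +
  geom_bar()

p_hist
p_bar
```

Swapping `geom_histogram()` for `geom_boxplot()` or `geom_density()` gives the other single-variable plots for a numerical variable.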
The middle of the box is the median. 50% of the data are below, and 50% are above:
The lower edge of the box is the 25% quantile. 25% of the data are below, and 75% are above:
The upper edge of the box is the 75% quantile. 75% of the data are below, and 25% are above:
From the documentation (?geom_boxplot): The upper whisker extends from the hinge to the largest value no further than 1.5 * IQR from the hinge (where IQR is the inter-quartile range, or distance between the first and third quartiles). The lower whisker extends from the hinge to the smallest value at most 1.5 * IQR of the hinge. Data beyond the end of the whiskers are called “outlying” points and are plotted individually.
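The whisker rule can be checked by hand. This is a sketch in base R using the ten `trump16` values printed earlier (not the full data), and it assumes the hinges are the quartiles returned by `quantile()` with its default settings:

```r
# First ten trump16 values from the printout (illustrative subset only)
trump16 <- c(52.8, 63.5, 64.9, 65.3, 80.4, 64.7, 70.8, 28.6, 65.0, 52.4)

# Hinges: the 25% and 75% quantiles (default quantile() definition, an assumption)
q <- quantile(trump16, c(0.25, 0.75))
iqr <- q[[2]] - q[[1]]

# Fences sit 1.5 * IQR beyond each hinge
lower_fence <- q[[1]] - 1.5 * iqr
upper_fence <- q[[2]] + 1.5 * iqr

# Whiskers end at the most extreme observations still inside the fences
inside <- trump16 >= lower_fence & trump16 <= upper_fence
lower_whisker <- min(trump16[inside])
upper_whisker <- max(trump16[inside])

# Anything outside the fences would be plotted as an individual point
outliers <- trump16[!inside]

lower_whisker
upper_whisker
outliers
```

Note that the whiskers end at actual data values, not at the fences themselves.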
Same box plot:
Very different distributions:
Prettier. Smooths out the lumps and bumps. There are still defaults you could learn to override.
# A tibble: 1 × 6
mean_trump_perc median_trump_perc sd iqr q25 q75
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 45.9 48.7 16.8 23.3 34.8 58.1
# A tibble: 1 × 2
gimme_my_mean gimme_my_median
<dbl> <dbl>
1 45.9 48.7
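Tables like the two above can come from a call along these lines. This is a sketch, assuming dplyr is installed; `gerry_small` is a hypothetical stand-in with only the first ten `trump16` values, so the numbers will not match the full-data output:

```r
library(dplyr)

# Hypothetical stand-in: first ten trump16 values from the printout
gerry_small <- tibble(
  trump16 = c(52.8, 63.5, 64.9, 65.3, 80.4, 64.7, 70.8, 28.6, 65.0, 52.4)
)

# Left of `=`: a column name you choose; right of `=`: the R code that computes it
trump_summary <- gerry_small |>
  summarize(
    mean_trump_perc   = mean(trump16),
    median_trump_perc = median(trump16),
    sd                = sd(trump16),
    iqr               = IQR(trump16),
    q25               = quantile(trump16, 0.25),  # quantile() needs the probability
    q75               = quantile(trump16, 0.75)
  )

trump_summary
```

The column names are free-form, which is why `gimme_my_mean` works just as well as `mean_trump_perc`.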
summarize creates a new data frame that stores the summaries.
The summaries are computed by R code. You must use the correct command names (case sensitive): mean, median, quantile, sd, var, etc. (see, e.g., ?quantile).
Describe the distribution of the percent of the vote received by Trump in the 2016 Presidential Election across Congressional Districts.
Shape: The distribution of votes for Trump in the 2016 election from Congressional Districts is unimodal and left-skewed.
Center: The percent of the vote received by Trump in the 2016 Presidential Election in a typical Congressional District is 48.7%.
Spread: In the middle 50% of Congressional Districts, 34.8% to 58.1% of voters voted for Trump in the 2016 Presidential Election.
Unusual observations: -
Analyzing the relationship between two variables:
Numerical + numerical: scatterplot
Numerical + categorical: side-by-side box plots, violin plots, etc.
Categorical + categorical: stacked bar plots
Using an aesthetic (e.g., fill, color, shape, etc.) or facets to represent the second variable in any plot
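One possible plot for each pairing, sketched under the assumption that dplyr and ggplot2 are installed. `gerry_small` is a hypothetical stand-in built from the first ten rows shown earlier, so the `low` level of `gerry` has no observations here:

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in: first ten rows of four columns from the printout
gerry_small <- tibble(
  clinton16 = c(37.6, 34.1, 33.0, 32.3, 17.4, 31.3, 26.1, 69.8, 30.2, 41.7),
  trump16   = c(52.8, 63.5, 64.9, 65.3, 80.4, 64.7, 70.8, 28.6, 65.0, 52.4),
  party16   = c("R", "R", "R", "R", "R", "R", "R", "D", "R", "R"),
  gerry     = factor(
    c("mid", rep("high", 7), "mid", "mid"),
    levels = c("low", "mid", "high")
  )
)

# Numerical + numerical: scatterplot, with color as an aesthetic for a third variable
p_scatter <- ggplot(gerry_small, aes(x = clinton16, y = trump16, color = party16)) +
  geom_point()

# Numerical + categorical: side-by-side box plots
p_box <- ggplot(gerry_small, aes(x = gerry, y = trump16)) +
  geom_boxplot()

# Categorical + categorical: stacked bar plot
p_stack <- ggplot(gerry_small, aes(x = gerry, fill = party16)) +
  geom_bar()

p_scatter
p_box
p_stack
```

Adding `facet_wrap(~ gerry)` to the scatterplot would be the faceting alternative to mapping a variable to color.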
# A tibble: 3 × 6
gerry min q25 median q75 max
<fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 low 4.9 36.3 48.4 54.7 74.9
2 mid 6.8 34.8 48.0 57.9 79.9
3 high 9.2 33.5 50.5 60.8 80.4
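A grouped summary like the one above can be sketched with `group_by()` followed by `summarize()`, assuming dplyr is installed. `gerry_small` is again a hypothetical ten-row stand-in, so the result has no `low` row and the numbers differ from the full-data table:

```r
library(dplyr)

# Hypothetical stand-in: first ten rows of two columns from the printout
gerry_small <- tibble(
  trump16 = c(52.8, 63.5, 64.9, 65.3, 80.4, 64.7, 70.8, 28.6, 65.0, 52.4),
  gerry   = factor(
    c("mid", rep("high", 7), "mid", "mid"),
    levels = c("low", "mid", "high")
  )
)

# One row of summaries per level of gerry that appears in the data
by_gerry <- gerry_small |>
  group_by(gerry) |>
  summarize(
    min    = min(trump16),
    q25    = quantile(trump16, 0.25),
    median = median(trump16),
    q75    = quantile(trump16, 0.75),
    max    = max(trump16)
  )

by_gerry
```

The rows come out in the order of the factor levels, which is why the ordinal variable reads low, mid, high in the full table.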
Analyzing the relationship between multiple variables:
In general, one variable is identified as the outcome of interest
The remaining variables are predictors or explanatory variables
Plots for exploring multivariate relationships are the same as those for bivariate relationships, but conditional on one or more variables (e.g., y vs. x1, colored by x2, faceted by x3)
Summary statistics for exploring multivariate relationships are the same as those for bivariate relationships, but conditional on one or more variables (e.g., y and x1, grouped by levels of x2 and x3)
Go to your ae project in RStudio.
If you haven’t yet done so, make sure all of your changes up to this point are committed and pushed, i.e., there’s nothing left in your Git pane.
If you haven’t yet done so, click Pull to get today’s application exercise file: ae-04-gerrymander-explore-I.qmd.
Work through the application exercise in class, and render, commit, and push your edits by the end of class.