Lecture 11
Duke University
STA 199 Spring 2026
2026-02-18
How old were you on February 11 2013?
Scan the QR code or go HERE. Log in with your Duke NetID.
For today!
Download Chrome web browser if you don’t have it already, and add the SelectorGadget to the bookmarks bar.
Measuring how long it takes for something to happen:

How do I…get this?
Because the data are already formatted as an honest-to-goodness table, you can successfully highlight the whole table with your cursor, copy it, paste it into an empty Excel spreadsheet, and then clean it by hand.
BUT:
Increasing amount of data is available on the web
These data are provided in an unstructured format: you can always copy&paste, but it’s time-consuming and prone to errors
Web scraping is the process of extracting this information automatically and transform it into a structured dataset
Two different scenarios:
Screen scraping: extract data from source code of website, with html parser (easy) or regular expression matching (less easy).
Web APIs (application programming interface): website offers a set of structured http requests that return JSON or XML files.
Most of the data on the web is still largely available as HTML - while it is structured (hierarchical) it often is not available in a form useful for analysis (flat / tidy).
<html>
<head>
<title>This is a title</title>
</head>
<body>
<p align="center">Hello world!</p>
<br/>
<div class="name" id="first">John</div>
<div class="name" id="last">Doe</div>
<div class="contact">
<div class="home">555-555-1234</div>
<div class="home">555-555-2345</div>
<div class="work">555-555-9999</div>
<div class="fax">555-555-8888</div>
</div>
</body>
</html>|>
Core functions:
read_html() - read HTML data from a url or character string.
html_elements() - select specified elements from the HTML document using CSS selectors (or xpath).
html_element() - select a single element from the HTML document using CSS selectors (or xpath).
html_table() - parse an HTML table into a data frame.
html_text() / html_text2() - extract tag’s text content.
html_name - extract a tag/element’s name(s).
html_attrs - extract all attributes.
html_attr - extract attribute value(s) by name.
html <-
'<html>
<head>
<title>This is a title</title>
</head>
<body>
<p align="center">Hello world!</p>
<br/>
<div class="name" id="first">John</div>
<div class="name" id="last">Doe</div>
<div class="contact">
<div class="home">555-555-1234</div>
<div class="home">555-555-2345</div>
<div class="work">555-555-9999</div>
<div class="fax">555-555-8888</div>
</div>
</body>
</html>'{xml_nodeset (7)}
[1] <div class="name" id="first">John</div>
[2] <div class="name" id="last">Doe</div>
[3] <div class="contact">\n <div class="home">555-555-1234</div>\n ...
[4] <div class="home">555-555-1234</div>
[5] <div class="home">555-555-2345</div>
[6] <div class="work">555-555-9999</div>
[7] <div class="fax">555-555-8888</div>
| Selector | Example | Description |
|---|---|---|
| .class | .title |
Select all elements with class=“title” |
| #id | #name |
Select all elements with id=“name” |
| element | p |
Select all <p> elements |
| element element | div p |
Select all <p> elements inside a <div> element |
| element>element | div > p |
Select all <p> elements with <div> as a parent |
| [attribute] | [class] |
Select all elements with a class attribute |
| [attribute=value] | [class=title] |
Select all elements with class=“title” |
{xml_nodeset (2)}
[1] <div class="name" id="first">John</div>
[2] <div class="name" id="last">Doe</div>
html_text() vs. html_text2()
[1] " \n This is the first sentence in the paragraph.\n This is the second sentence that should be on the same line as the first sentence.This third sentence should start on a new line.\n "
[1] "This is the first sentence in the paragraph. This is the second sentence that should be on the same line as the first sentence.\nThis third sentence should start on a new line."
html_table()
This is the easy part!
{html_document}
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://ogp.me/ns/fb#">
[1] <head>\n<link rel="home" href="https://www.catholic-hierarchy.org/">\n<me ...
[2] <script src="/cdn-cgi/scripts/7d0fa10a/cloudflare-static/rocket-loader.mi ...
[3] <body bgcolor="#DCFFFF" text="#000000" link="#0000FF" alink="#FF0000" vli ...
Already pretty good!
Rows: 268
Columns: 12
$ `#` <chr> "#", "267", "266", "265", "264", "263", "262", "261", "…
$ PapalName <chr> "PapalName", "Leo XIV", "Francis", "Benedict XVI", "Joh…
$ BirthDate <chr> "BirthDate", "14 Sep 1955", "17 Dec 1936", "16 Apr 1927…
$ ConclaveStart <chr> "ConclaveStart", "7 May 2025", "12 Mar 2013", "18 Apr 2…
$ Elected <chr> "Date", "8 May 2025", "13 Mar 2013", "19 Apr 2005", "16…
$ Elected <chr> "Age", "69.6", "76.2", "78.0", "58.4", "65.8", "65.7", …
$ Installed <chr> "Date", "18 May 2025", "19 Mar 2013", "24 Apr 2005", "2…
$ Installed <chr> "Age", "69.6", "76.2", "78.0", "58.4", "65.8", "65.7", …
$ `End of Reign` <chr> "Date", "( Age: <!--\r\nwriteage(1955,9,14);\r\n//-->70…
$ `End of Reign` <chr> "Age", "( Age: <!--\r\nwriteage(1955,9,14);\r\n//-->70.…
$ Length <chr> "Elected", "<!--\r\nwriteage(2025,5,8);\r\n//-->.7", "1…
$ Length <chr> "Installed", "<!--\r\nwriteage(2025,5,18);\r\n//-->.7",…
names(popes_table) <- make.unique(names(popes_table))
popes_clean <- popes_table |>
rename(
papal_number = `#`,
papal_name = `PapalName`,
birth_date = `BirthDate`,
conclave_start = `ConclaveStart`,
date_elected = `Elected`,
age_elected = `Elected.1`,
date_installed = `Installed`,
age_installed = `Installed.1`,
date_end = `End of Reign`,
age_end = `End of Reign.1`,
years_elected = `Length`,
years_installed = `Length.1`,
)
glimpse(popes_clean)Rows: 268
Columns: 12
$ papal_number <chr> "#", "267", "266", "265", "264", "263", "262", "261", …
$ papal_name <chr> "PapalName", "Leo XIV", "Francis", "Benedict XVI", "Jo…
$ birth_date <chr> "BirthDate", "14 Sep 1955", "17 Dec 1936", "16 Apr 192…
$ conclave_start <chr> "ConclaveStart", "7 May 2025", "12 Mar 2013", "18 Apr …
$ date_elected <chr> "Date", "8 May 2025", "13 Mar 2013", "19 Apr 2005", "1…
$ age_elected <chr> "Age", "69.6", "76.2", "78.0", "58.4", "65.8", "65.7",…
$ date_installed <chr> "Date", "18 May 2025", "19 Mar 2013", "24 Apr 2005", "…
$ age_installed <chr> "Age", "69.6", "76.2", "78.0", "58.4", "65.8", "65.7",…
$ date_end <chr> "Date", "( Age: <!--\r\nwriteage(1955,9,14);\r\n//-->7…
$ age_end <chr> "Age", "( Age: <!--\r\nwriteage(1955,9,14);\r\n//-->70…
$ years_elected <chr> "Elected", "<!--\r\nwriteage(2025,5,8);\r\n//-->.7", "…
$ years_installed <chr> "Installed", "<!--\r\nwriteage(2025,5,18);\r\n//-->.7"…
Rows: 267
Columns: 12
$ papal_number <chr> "267", "266", "265", "264", "263", "262", "261", "260"…
$ papal_name <chr> "Leo XIV", "Francis", "Benedict XVI", "John Paul II", …
$ birth_date <chr> "14 Sep 1955", "17 Dec 1936", "16 Apr 1927", "18 May 1…
$ conclave_start <chr> "7 May 2025", "12 Mar 2013", "18 Apr 2005", "14 Oct 19…
$ date_elected <chr> "8 May 2025", "13 Mar 2013", "19 Apr 2005", "16 Oct 19…
$ age_elected <chr> "69.6", "76.2", "78.0", "58.4", "65.8", "65.7", "76.9"…
$ date_installed <chr> "18 May 2025", "19 Mar 2013", "24 Apr 2005", "22 Oct 1…
$ age_installed <chr> "69.6", "76.2", "78.0", "58.4", "65.8", "65.7", "76.9"…
$ date_end <chr> "( Age: <!--\r\nwriteage(1955,9,14);\r\n//-->70.3\r\n)…
$ age_end <chr> "( Age: <!--\r\nwriteage(1955,9,14);\r\n//-->70.3\r\n)…
$ years_elected <chr> "<!--\r\nwriteage(2025,5,8);\r\n//-->.7", "12.1", "7.8…
$ years_installed <chr> "<!--\r\nwriteage(2025,5,18);\r\n//-->.7", "12.0", "7.…
popes_clean <- popes_clean |>
mutate(
papal_number = as.integer(papal_number),
age_elected = as.numeric(age_elected),
age_installed = as.numeric(age_installed),
age_end = as.numeric(age_end),
years_elected = as.numeric(years_elected),
years_installed = as.numeric(years_installed),
conclave_start = as.Date(conclave_start, format = "%d %b %Y"),
date_elected = as.Date(date_elected, format = "%d %b %Y"),
birth_date = as.Date(birth_date, format = "%d %b %Y"),
date_end = ifelse(papal_name == "Leo XIV", NA, date_end),
resigned = str_detect(date_end, "#"),
date_end = str_remove_all(date_end, "#"),
date_end = as.Date(date_end, format = "%d %b %Y"),
not_bishop = str_detect(date_installed, "\\*"),
date_installed = str_remove_all(date_installed, "\\*"),
date_installed = as.Date(date_installed, format = "%d %b %Y")
)
glimpse(popes_clean)Rows: 267
Columns: 14
$ papal_number <int> 267, 266, 265, 264, 263, 262, 261, 260, 259, 258, 257,…
$ papal_name <chr> "Leo XIV", "Francis", "Benedict XVI", "John Paul II", …
$ birth_date <date> 1955-09-14, 1936-12-17, 1927-04-16, 1920-05-18, 1912-…
$ conclave_start <date> 2025-05-07, 2013-03-12, 2005-04-18, 1978-10-14, 1978-…
$ date_elected <date> 2025-05-08, 2013-03-13, 2005-04-19, 1978-10-16, 1978-…
$ age_elected <dbl> 69.6, 76.2, 78.0, 58.4, 65.8, 65.7, 76.9, 63.0, 64.6, …
$ date_installed <date> 2025-05-18, 2013-03-19, 2005-04-24, 1978-10-22, 1978-…
$ age_installed <dbl> 69.6, 76.2, 78.0, 58.4, 65.8, 65.7, 76.9, 63.0, 64.6, …
$ date_end <date> NA, 2025-04-21, 2013-02-28, 2005-04-02, 1978-09-28, 1…
$ age_end <dbl> NA, 88.3, 85.8, 84.8, 65.9, 80.8, 81.5, 82.6, 81.6, 67…
$ years_elected <dbl> NA, 12.1, 7.8, 26.4, 0.0, 15.1, 4.5, 19.6, 17.0, 7.3, …
$ years_installed <dbl> NA, 12.0, 7.8, 26.4, 0.0, 15.1, 4.5, 19.5, 16.9, 7.3, …
$ resigned <lgl> NA, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FA…
$ not_bishop <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
Let’s measure the length of each conclave in days:
Rows: 267
Columns: 15
$ conclave_length_days <int> 1, 1, 1, 2, 1, 2, 3, 1, 4, 3, 4, 2, 2, 50, 35, 26…
$ papal_number <int> 267, 266, 265, 264, 263, 262, 261, 260, 259, 258,…
$ papal_name <chr> "Leo XIV", "Francis", "Benedict XVI", "John Paul …
$ birth_date <date> 1955-09-14, 1936-12-17, 1927-04-16, 1920-05-18, …
$ conclave_start <date> 2025-05-07, 2013-03-12, 2005-04-18, 1978-10-14, …
$ date_elected <date> 2025-05-08, 2013-03-13, 2005-04-19, 1978-10-16, …
$ age_elected <dbl> 69.6, 76.2, 78.0, 58.4, 65.8, 65.7, 76.9, 63.0, 6…
$ date_installed <date> 2025-05-18, 2013-03-19, 2005-04-24, 1978-10-22, …
$ age_installed <dbl> 69.6, 76.2, 78.0, 58.4, 65.8, 65.7, 76.9, 63.0, 6…
$ date_end <date> NA, 2025-04-21, 2013-02-28, 2005-04-02, 1978-09-…
$ age_end <dbl> NA, 88.3, 85.8, 84.8, 65.9, 80.8, 81.5, 82.6, 81.…
$ years_elected <dbl> NA, 12.1, 7.8, 26.4, 0.0, 15.1, 4.5, 19.6, 17.0, …
$ years_installed <dbl> NA, 12.0, 7.8, 26.4, 0.0, 15.1, 4.5, 19.5, 16.9, …
$ resigned <lgl> NA, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALS…
$ not_bishop <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
# A tibble: 1 × 3
mean_length median_length sd_length
<dbl> <dbl> <dbl>
1 30.6 7.5 43.4
# A tibble: 32 × 3
conclave_length_days n prop
<int> <int> <dbl>
1 1 9 0.141
2 2 7 0.109
3 3 8 0.125
4 4 4 0.0625
5 5 2 0.0312
6 6 1 0.0156
7 7 1 0.0156
8 8 3 0.0469
9 12 1 0.0156
10 13 1 0.0156
# ℹ 22 more rows
How often do you read The Chronicle?
Scan the QR code or go HERE. Log in with your Duke NetID.
What do you think is the most common word in the titles of The Chronicle opinion pieces?
How do you think the sentiments in opinion pieces in The Chronicle compare across authors? Roughly the same? Wildly different? Somewhere in between?
(mostly) with tools you already know!
Code for the earlier plot:
stop_words <- read_csv("data/stop-words.csv")
chronicle |>
tidytext::unnest_tokens(word, title) |>
mutate(word = str_replace_all(word, "’", "'")) |>
anti_join(stop_words) |>
count(word, sort = TRUE) |>
filter(word != "duke's") |>
slice_head(n = 20) |>
mutate(word = fct_reorder(word, n)) |>
ggplot(aes(y = word, x = n, fill = log(n))) +
geom_col(show.legend = FALSE) +
theme_minimal(base_size = 16) +
labs(
x = "Number of mentions",
y = "Word",
title = "The Chronicle - Opinion pieces",
subtitle = "Common words in the 500 most recent opinion piece titles",
caption = "Source: Data scraped from The Chronicle on Oct 6, 2025"
) +
theme(
plot.title.position = "plot",
plot.caption = element_text(color = "gray30")
)Code for the earlier plot:
afinn_sentiments <- read_csv("data/afinn-sentiments.csv")
chronicle |>
tidytext::unnest_tokens(word, title, drop = FALSE) |>
mutate(word = str_replace_all(word, "’", "'")) |>
anti_join(stop_words) |>
left_join(afinn_sentiments) |>
group_by(author, title) |>
summarize(total_sentiment = sum(value, na.rm = TRUE), .groups = "drop") |>
group_by(author) |>
summarize(
n_articles = n(),
avg_sentiment = mean(total_sentiment, na.rm = TRUE),
) |>
filter(n_articles > 2 & !is.na(author)) |>
arrange(desc(avg_sentiment)) |>
slice(c(1:10, 30:39)) |>
mutate(
author = fct_reorder(author, avg_sentiment),
neg_pos = if_else(avg_sentiment < 0, "neg", "pos"),
label_position = if_else(neg_pos == "neg", 0.25, -0.25)
) |>
ggplot(aes(y = author, x = avg_sentiment)) +
geom_col(aes(fill = neg_pos), show.legend = FALSE) +
geom_text(
aes(x = label_position, label = author, color = neg_pos),
hjust = c(rep(1, 10), rep(0, 10)),
show.legend = FALSE,
fontface = "bold"
) +
geom_text(
aes(label = round(avg_sentiment, 1)),
hjust = c(rep(1.25, 10), rep(-0.25, 10)),
color = "white",
fontface = "bold"
) +
scale_fill_manual(values = c("neg" = "#4d4009", "pos" = "#FF4B91")) +
scale_color_manual(values = c("neg" = "#4d4009", "pos" = "#FF4B91")) +
coord_cartesian(xlim = c(-2, 2)) +
labs(
x = "negative ← Average sentiment score (AFINN) → positive",
y = NULL,
title = "The Chronicle - Opinion pieces\nAverage sentiment scores of titles by author",
subtitle = "Top 10 average positive and negative scores",
caption = "Source: Data scraped from The Chronicle on Sep 30, 2024"
) +
theme_void(base_size = 16) +
theme(
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(
hjust = 0.5,
margin = unit(c(0.5, 0, 1, 0), "lines")
),
axis.title.x = element_text(color = "gray30", size = 12),
plot.caption = element_text(color = "gray30", size = 10)
)# A tibble: 500 × 7
title author date_time month day column url
<chr> <chr> <dttm> <chr> <dbl> <chr> <chr>
1 The United States is a f… Noor … 2025-10-07 10:00:00 Oct 7 Opini… http…
2 The problem with censors… Harri… 2025-10-07 10:00:00 Oct 7 Campu… http…
3 The 'Duke Difference' we… Gabri… 2025-10-06 14:30:00 Oct 6 Campu… http…
4 Death ain’t nothin’ but … Luke … 2025-10-06 10:00:00 Oct 6 Campu… http…
5 Hazing ban forces Greek … Monda… 2025-10-06 04:00:00 Oct 6 Campu… http…
6 Duke’s hold on Durham Lucas… 2025-10-04 10:00:00 Oct 4 Campu… http…
7 The world needs more Jan… Leo G… 2025-10-03 10:00:00 Oct 3 Campu… http…
8 We’ve grown the future o… Kayle… 2025-10-02 14:00:00 Oct 2 Opini… http…
9 How Duke introduces you … Neel … 2025-10-01 10:00:00 Oct 1 Campu… http…
10 Why aren’t we all excite… Ryan … 2025-10-01 10:00:00 Oct 1 Campu… http…
# ℹ 490 more rows
SelectorGadget (selectorgadget.com) is a javascript based tool that helps you interactively build an appropriate CSS selector for the content you are interested in.
Go to https://www.dukechronicle.com/section/opinion?page=1&per_page=500.
How many articles are on the page?

When working in a Quarto document, your analysis is re-run each time you knit
If web scraping in a Quarto document, you’d be re-scraping the data each time you knit, which is undesirable (and not nice)!
An alternative workflow:
Go to your ae project in RStudio.
If you haven’t yet done so, make sure all of your changes up to this point are committed and pushed, i.e., there’s nothing left in your Git pane.
If you haven’t yet done so, click Pull to get today’s application exercise file: ae-11-chronicle-scrape.qmd and chronicle-scrape.R.