HW 6

Due: Mon, Apr 6, 11:59 pm

Introduction

All questions are AI-enabled

This homework covers topics that will appear on next week’s midterm. Naturally, you will not get human-graded feedback before then, so all of the questions are AI-enabled to give you something to study from. Later, the TAs will go in and grade the second half as usual.

This is a two-part homework assignment:

  • Part 1 – 🤖 Feedback from AI: Not graded; for practice. You get immediate feedback from AI, based on rubrics designed by the course instructor. Complete it in hw-6-part-1.qmd; no submission required.

  • Part 2 – 🧑🏽‍🏫 Feedback from AI, and then Humans later: First you get immediate feedback from AI, based on rubrics designed by the course instructor. Then, in about a week, you get graded feedback from the course instructional team. Complete it in hw-6-part-2.qmd and submit on Gradescope.

By now you should be familiar with how to get started with a homework assignment by cloning the GitHub repo for the assignment.

Click to expand if you need a refresher on how to get started with a homework assignment.
  • Go to https://cmgr.oit.duke.edu/containers and log in with your Duke NetID and password.
  • Click STA199 under My reservations to log into your container. You should now see the RStudio environment.
  • Go to the course organization on GitHub at github.com/sta199-s26. Click on the repo with the prefix hw-6. It contains the starter documents you need to complete the homework.
  • Click on the green CODE button, select Use SSH. Click on the clipboard icon to copy the repo URL.
  • In RStudio, go to File > New Project > Version Control > Git.
  • Copy and paste the URL of your assignment repo into the Repository URL dialog box. Again, please make sure SSH is highlighted under Clone when you copy the address.
  • Click Create Project, and the files from your GitHub repo will be displayed in the Files pane in RStudio.

By now you should also be familiar with guidelines for formatting your code and plots as well as your Git and Gradescope workflow.

Click to expand if you need a refresher on assignment guidelines.

Code

Code should follow the tidyverse style. In particular,

  • there should be spaces before and line breaks after each + when building a ggplot,
  • there should also be spaces before and line breaks after each |> in a data transformation pipeline,
  • code should be properly indented,
  • there should be spaces around = signs and spaces after commas.

Additionally, all code should be visible in the PDF output, i.e., it should not run off the page. Long lines should be split across multiple lines with line breaks.
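As an illustration (using the built-in starwars data frame from dplyr, not assignment data), a short plot pipeline formatted to these rules looks like:

```r
library(tidyverse)

# spaces around `=`, spaces after commas, a line break after each `|>`
# and `+`, and consistent indentation throughout
starwars |>
  filter(!is.na(height), !is.na(mass)) |>
  ggplot(aes(x = height, y = mass)) +
  geom_point(alpha = 0.5) +
  labs(
    title = "Mass vs. height of Star Wars characters",
    x = "Height (cm)",
    y = "Mass (kg)"
  )
```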

Plots

  • Plots should have an informative title and, if needed, also a subtitle.
  • Axes and legends should be labeled with both the variable name and its units (if applicable).
  • Careful consideration should be given to aesthetic choices.

Workflow

Continuing to develop a sound workflow for reproducible data analysis is important as you complete this homework and other assignments in this course.

  • You should have at least 3 commits with meaningful commit messages by the end of the assignment.
  • Final versions of both your .qmd file and the rendered PDF should be pushed to GitHub.

Packages

In this assignment you will work with the

  • tidyverse package for doing data analysis in a “tidy” way,
  • tidymodels package for modeling in a “tidy” way, and
  • openintro package for the spam email dataset.

Part 1 – Feedback from AI

Your answers to the questions in this part should go in the file hw-6-part-1.qmd.

Instructions

Write your answer to each question in the appropriate section of the hw-6-part-1.qmd file. Then, highlight your answer to a question and click on Addins > AIFEEDR > Get feedback. In the app that opens, select the appropriate homework number (6) and question number, then click Get Feedback. Please be patient; feedback generation can take a few seconds. Once you read the feedback, you can go back to your Quarto document to improve your answer based on it. You will then need to click the red X in the top left corner of the Viewer pane to stop the feedback app before you can re-render your Quarto document.

Click to expand if you want to review the video that demonstrates how to use the AI feedback tool.

Do you even lift?

In this part, you will be working with data from www.openpowerlifting.org. This data was sourced from Tidy Tuesday and contains international powerlifting records from various meets. At each meet, each lifter gets three attempts at lifting max weight on three lifts: the bench press, squat, and deadlift. For all of the following exercises, you should include units in your axis labels, e.g., “Bench press (lbs)”, “Bench press (kg)”, “Age (years)”, etc. This is good practice.

Question 1

  a. Let’s begin by taking a look at the squat lifting records.

    • Read in the ipf.csv file that is in your data folder and save it as ipf.
    • First, remove any observations that are negative for squat.
    • Next, create a new column called best3_squat_lbs that converts the record from kilograms to pounds.
    • Save your data frame as ipf_squat.
    • Report the number of rows and columns of this new data frame.
Note

You may need to google the conversion formula.
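For reference, one kilogram is approximately 2.20462 pounds, so the conversion is a single mutate(). A minimal sketch with a toy data frame and a hypothetical weight_kg column (the real column names in ipf will differ):

```r
library(tidyverse)

# toy data standing in for the squat records, measured in kilograms
lifts <- tibble(weight_kg = c(100, 150.5, 227.5))

# 1 kg is approximately 2.20462 lbs
lifts |>
  mutate(weight_lbs = weight_kg * 2.20462)
```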

  b. Using the ipf_squat data frame you created in part (a), create a scatter plot to investigate the relationship between squat (in lbs) and age.

    • Age should be on the x-axis.
    • Adjust the alpha level of your points to get a better sense of the density of the data.
    • Add a linear trend-line.
    • Summarize the trend you observe in at most 4 sentences.
    • Write down the linear population model to predict squat (in lbs) from age.
    • Fit the linear model, and save it as age_fit.
    • Re-write your previous equation replacing the population parameters with the numeric estimates. This is called the “fitted” linear model.
    • Interpret each estimate of \(\beta\), and comment on whether the interpretations are sensible.

Question 2

  1. Building on your ipf_squat data frame, create a new column called age2 that takes the age of each lifter and squares it. Save it to your data frame ipf_squat. Next, plot squat in lbs vs age2 and add a linear best fit line. How does the fit of the model compare to the one from earlier?

  2. One metric to assess the fit of a model is the correlation squared, also known as \(R^2\). Fit the age\(^2\) model and save the object as age2_fit. Compare \(R^2\) of the new model (squat vs. age\(^2\)) to the \(R^2\) of the earlier model (squat vs. age). Which has a higher \(R^2\)?

General Social Survey

The General Social Survey (GSS) gathers data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviours, and attributes. Hundreds of trends have been tracked since 1972. In addition, since the GSS adopted questions from earlier surveys, trends can be followed for up to 70 years. The GSS contains a standard core of demographic, behavioural, and attitudinal questions, plus topics of special interest. Among the topics covered are civil liberties, crime and violence, intergroup tolerance, morality, national spending priorities, psychological well-being, social mobility, and stress and traumatic events. In this part you will work with variables from the 2022 General Social Survey.

Question 3

  1. Read in the gss22.csv file that is in your data folder and save it as gss22. Report its number of rows and columns.

  2. Create a new data frame called gss22_advfront that only contains the variables advfront, educ, and polviews. Then, use the drop_na() function to remove rows that contain NAs from this new data frame. Report the number of rows and columns of gss22_advfront. Additionally, report what percent of the observations were discarded at this step.

  3. Re-level the advfront variable such that it has two levels: “Strongly agree” and “Agree” combined into a new level called “Agree” and the remaining levels combined into “Not agree”. Then, re-order the levels in the following order: “Agree” and “Not agree”. Finally, count() how many times each new level appears in the advfront variable.

Tip

You can do this in various ways. One option is to use the str_detect() function to detect whether certain words are present. Note that the words sometimes show up with a lowercase first letter and sometimes with an uppercase first letter. To detect either in str_detect(), you can use the pattern “[Aa]gree”. That said, solve the problem however you like; this is just one option.
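As a toy illustration of this pattern (made-up responses, not the actual GSS levels), “[Aa]gree” matches the word whether it starts with an upper- or lowercase letter:

```r
library(tidyverse)

responses <- c("Agree", "agree strongly", "Neither", "No opinion")

# "[Aa]gree" matches either capitalization of the first letter
str_detect(responses, "[Aa]gree")
#> [1]  TRUE  TRUE FALSE FALSE

# combined with if_else(), this can collapse responses into two levels
if_else(str_detect(responses, "[Aa]gree"), "Agree", "Not agree")
```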

  4. Combine the levels of the polviews variable such that levels that have the word “liberal” in them are lumped into a level called “Liberal” and those that have the word “conservative” in them are lumped into a level called “Conservative”. Then, re-order the levels in the following order: “Conservative”, “Moderate”, and “Liberal”. Finally, count() how many times each new level appears in the polviews variable.

Question 4

a. Fit a logistic regression model that predicts advfront from educ. Report the tidy output of the model.

b. Write out the estimated model in proper notation.

c. Using your estimated model, predict the probability of agreeing with the statement “Even if it brings no immediate benefits, scientific research that advances the frontiers of knowledge is necessary and should be supported by the federal government.” (Agree in advfront) for someone with 7 years of education.

Question 5

a. Fit a model that adds the additional explanatory variable of polviews to your model from Question 4. Report the tidy output of the model.

b. Now, predict the probability of agreeing with the following statement “Even if it brings no immediate benefits, scientific research that advances the frontiers of knowledge is necessary and should be supported by the federal government.” (Agree in advfront) for a Conservative person with 7 years of education.

Part 2 – Feedback from AI, and then Humans later

Your answers to the questions in this part should go in the file hw-6-part-2.qmd.

Gapminder, again

Gapminder is a “fact tank” that uses publicly available world data to produce data visualizations and teaching resources on global development. We will use an excerpt of their data to explore relationships among world health metrics across countries and regions between the years 2000 and 2023. The data set is called gapminder and it’s in your HW repository’s data folder.

The next two questions rely on work you did back on HW 4, where the gapminder data frame is read in and transformed into gapminder_23. This data frame can be re-created with the code below:

# read in the raw data
gapminder_raw <- read_csv("data/gapminder.csv")

# keep only the observations from 2023
gapminder_raw_23 <- gapminder_raw |>
  filter(year == 2023)

# gdp_percap is read in as text; values containing a "k" denote thousands,
# so strip the "k" and multiply by 1000, and convert the rest directly
gapminder_23 <- gapminder_raw_23 |>
  mutate(
    gdp_percap = if_else(
      str_detect(gdp_percap, "k"),
      as.numeric(str_replace(gdp_percap, "k", "")) * 1000,
      as.numeric(gdp_percap)
    )
  )

Question 6

  1. Data prep: What happens when you take the natural log of 0? Remove rows of your data set where life_expectancy is 0, justifying why this is helpful.

  2. Model fitting: Fit a linear model predicting the natural log of life expectancy from gross domestic product. Display the tidy summary.

  3. Model evaluation:

    • Calculate the R-squared of the model using two methods and confirm that the values match: first method is using the glance() function and the other method is based on the value of the correlation coefficient between the two variables.

    • Interpret R-squared in the context of the data and the research question.

Question 7

Next, we want to examine if the relationship between life expectancy and GDP that we observed in the previous exercise holds across all continents in our data. We’ll continue to work with logged life expectancy (life_exp_log) and data from 2023.

  1. Justification: Create a scatter plot of life_exp_log vs. gdp_percap, where the points are colored by continent. Do you think the trend between life_exp_log and gdp_percap is different for different continents? Justify your answer with specific features of the plot.

  2. Model fitting and interpretation:

    • Regardless of your answer in part (a), fit an additive model (main effects) that predicts life_exp_log from GDP per capita and continent (with Africa as the baseline level). Display a tidy summary of the model output.

    • Interpret the intercept of the model, making sure that your interpretation is in the units of the original data (not on log scale).

    • Interpret the slope of the model, making sure that your interpretation is in the units of the original data (not on log scale).

  3. Prediction: Predict the life expectancy of a country in Asia where the average GDP per capita is $70,000. Do this using R functions, not by manually plugging in numbers.

Spam spam spam spam

The data come from incoming emails in David Diez’s (one of the authors of OpenIntro textbooks) Gmail account for the first three months of 2012. All personally identifiable information has been removed. The dataset is called email and it’s in the openintro package.

The outcome variable is spam, which takes the value 1 if the email is spam, 0 otherwise.

Question 8

  1. What type of variable is spam? What percent of the emails are spam?

  2. What type of variable is dollar - number of times a dollar sign or the word “dollar” appeared in the email? Visualize and describe its distribution, supporting your description with the appropriate summary statistics.

  3. Fit a logistic regression model predicting spam from dollar. Then, display the tidy output of the model.

  4. Using this model and the predict() function, predict the probability the email is spam if it contains 5 dollar signs. Based on this probability, how does the model classify this email?

Note

To obtain the predicted probability, you can set the type argument in predict() to “prob”.
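The note above can be sketched as follows with made-up toy data (not the actual email dataset). For a logistic regression fit with tidymodels, predict() with type = "prob" returns one .pred_ column per outcome level:

```r
library(tidymodels)

# toy data standing in for the real email data;
# the outcome must be a factor for logistic regression
toy <- tibble(
  spam   = factor(c(0, 0, 0, 0, 1, 1, 1, 1)),
  dollar = c(0, 2, 5, 1, 3, 8, 0, 10)
)

spam_fit <- logistic_reg() |>
  fit(spam ~ dollar, data = toy)

# predicted probabilities for a hypothetical email with 5 dollar signs;
# the result has columns .pred_0 and .pred_1
predict(spam_fit, new_data = tibble(dollar = 5), type = "prob")
```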

Question 9

  1. Fit a logistic regression model predicting spam from dollar, winner (indicating whether “winner” appeared in the email), and urgent_subj (indicating whether the word “urgent” is in the subject of the email). Then, display the tidy output of the model. Include the predictors using an additive model, not an interaction model.

  2. Using this model, predict spam / not spam for all emails in the email dataset with augment(). Store the resulting data frame with an appropriate name and display the data frame as well.

  3. Using your data frame from the previous part, determine, in a single pipeline, and using count(), the numbers of emails:

    • that are labelled as spam that are actually spam
    • that are not labelled as spam that are actually spam
    • that are labelled as spam that are actually not spam
    • that are not labelled as spam that are actually not spam

    Store the resulting data frame with an appropriate name.

  4. In a single pipeline, and using mutate(), calculate the false positive and false negative rates. In addition to these numbers showing in your R output, you must write a sentence that explicitly states and identifies the two rates.

Question 10

  1. Fit a logistic regression model predicting spam from dollar and another variable you think would be a good predictor. Again, include the multiple predictors with an additive model. No interactions. Provide a 1-sentence justification for why you chose this variable. Display the tidy output of the model.

  2. Using this model, predict spam / not spam for all emails in the email dataset with augment(). Store the resulting data frame with an appropriate name.

  3. Using your data frame from the previous part, determine, in a single pipeline, and using count(), the numbers of emails:

    • that are labelled as spam that are actually spam
    • that are not labelled as spam that are actually spam
    • that are labelled as spam that are actually not spam
    • that are not labelled as spam that are actually not spam

    Store the resulting data frame with an appropriate name.

  4. In a single pipeline, and using mutate(), calculate the false positive and false negative rates. In addition to these numbers showing in your R output, you must write a sentence that explicitly states and identifies the two rates.

  5. Based on the false positive and false negative rates of this model, comment, in 1-2 sentences, on which model (the one from Question 9 or Question 10) is preferable and why.

Wrap-up

Warning

Before you wrap up the assignment, make sure that you render, commit, and push one final time so that the final versions of both your .qmd file and the rendered PDF are pushed to GitHub and your Git pane is empty. We will be checking these to make sure you have been practicing how to commit and push changes.

Submission

Submit your PDF document to Gradescope by the deadline to be considered “on time”:

  • Go to http://www.gradescope.com and click Log in in the top right corner.
  • Click School Credentials \(\rightarrow\) Duke NetID and log in using your NetID credentials.
  • Click on your STA 199 course.
  • Click on the assignment, and you’ll be prompted to submit it.
  • Mark all the pages associated with each question. All the pages of your homework should be associated with at least one question (i.e., should be “checked”).
Checklist

Make sure you have:

  • attempted all questions
  • rendered your Quarto document
  • committed and pushed everything to your GitHub repository such that the Git pane in RStudio is empty
  • uploaded your PDF to Gradescope

Grading and feedback

  • Questions 1-5 are not graded, but you should complete them to get practice.

  • Questions 6-10 are graded, and you will receive feedback on Gradescope from the course instructional team in about a week.

    • Questions will be graded for accuracy and completeness.
    • Partial credit will be given where appropriate.
    • There are also workflow points for:
      • committing at least three times as you work through your homework,
      • having your final version of .qmd and .pdf files in your GitHub repository, and
      • overall organization.