Final Study Guide Solutions

Concepts and main ideas

Question 1

tidy data - every row is an observation and every column is a variable;
computational reproducibility - given the data and the code from an analysis, I can perfectly reproduce all of the tables and figures. That doesn’t mean the tables and figures are necessarily correct, nor does it mean that the results will replicate on further investigation (see the next item), but it means that we at least know exactly where the tables and figures came from. It’s a low bar, but you’d be shocked how often research fails to clear it;
scientific replicability - different researchers independently conduct the same analysis using different data and reach broadly the same conclusions;
observational data - data collected from the world without the researchers intervening to impose experimental controls;
quantile - for a numerical variable, its quantiles are the cutoff points on the number line that have a certain percentage of the observations to the left or right. So the 36% quantile is the cutoff that has 36% of the values to the left and the remaining 64% to the right;
ROC curve - visualizes the classification accuracy of a logistic regression model across different classification thresholds. The false positive (FP) rate goes on the horizontal axis, the true positive (TP) rate on the vertical axis, and the points along the curve give you the (FP rate, TP rate) pair for “all” thresholds from 0 to 1;
discernibility level - the user-defined cutoff that determines whether the p-value is small enough to reject the null in favor of the alternative;
sampling uncertainty - variation in estimates across alternative datasets;
bootstrap distribution - a distribution that describes how your estimates might vary over alternative datasets. Those alternative datasets are simulated using the bootstrap, which resamples the original data with replacement;
null distribution - a distribution that describes how your estimates might vary over alternative datasets if the null hypothesis were true;
p-value - the probability of getting an estimate as extreme as or more extreme than what you observed if the null is true;
type 1 error - rejecting the null hypothesis when it is true. In other words, a false positive;
type 2 error - failing to reject the null when it is false. In other words, a false negative.

A helpful mnemonic courtesy of the redoubtable Leah Johnson: “type 1, jump the gun; type 2, avoid the new.”
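
A minimal sketch of the quantile, bootstrap distribution, null distribution, and p-value definitions above, written in base R rather than any particular package. The data in x are simulated, and the null hypothesis of a zero mean, the recentering trick used to simulate under it, and all object names are assumptions made for illustration, not part of the original question.

```r
set.seed(1)  # fix the random draws so the sketch is repeatable

# Hypothetical data: 40 simulated measurements (not from the study guide).
x <- rnorm(40, mean = 1, sd = 2)

# Quantile: the 36% quantile has roughly 36% of the values at or below it.
q36 <- quantile(x, probs = 0.36)
share_below <- mean(x <= q36)

# Bootstrap distribution of the sample mean: resample the original data
# with replacement and recompute the estimate each time.
boot_means <- replicate(
  2000,
  mean(sample(x, size = length(x), replace = TRUE))
)

# A 90% percentile interval is the middle 90% of the bootstrap estimates.
interval_90 <- quantile(boot_means, probs = c(0.05, 0.95))

# Null distribution for the assumed null "true mean = 0": shift the data so
# the null holds exactly, then bootstrap the shifted data.
x_null <- x - mean(x)
null_means <- replicate(
  2000,
  mean(sample(x_null, size = length(x_null), replace = TRUE))
)

# Two-sided p-value: share of null estimates at least as extreme as the
# estimate actually observed.
p_value <- mean(abs(null_means) >= abs(mean(x)))
```

The bootstrap distribution is centered near the observed estimate, while the null distribution is centered at the null value; the p-value asks how far out in the null distribution the observed estimate sits.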

Visual understanding

Question 2

b - symmetric, centered at zero;
c - right-skewed, so mean is above the median;
d - symmetric, centered at -3;
e - right-skewed, so mean is above the median, but more spread out than plot ii;
f - symmetric, centered at 2;
a - left-skewed, so mean is below the median.

Recall this.

Question 3

feature	quantity
I	k
II	c
III	d
IV	f
V	a
VI	g
VII	b
VIII	b
IX	b

Code doctor

The common variable name that you’re joining on is a different data type in the two data frames;
For geom_bar, you map your categorical variable (age in this case) to either the x or y aesthetic, and then leave the other one blank. Because the counts are already precomputed in the n column, you want geom_col here;
Inside full_join you want by = join_by(state == state_name);
There are two rows labeled Alice and Quiz1, so there are two values (90 and 95) that could go in the Alice/Quiz1 entry of the wide data frame, and the computer doesn’t know what you want;
The model doesn’t suck. We just screwed up the code. The “event” here is that the e-mail is spam, which is the second level of spam. Inside roc_curve, we accidentally supplied .pred_0, which are the probabilities for the base level (0, or not spam). This flips the graph backasswards from what it should be. You should supply .pred_1 instead so that it’s compatible with event_level = "second".
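
A hedged sketch of the join, bar chart, and pivot fixes described above, assuming dplyr 1.1.0 or later (for join_by), tidyr, and ggplot2. Every data frame and value below is made up for illustration, apart from the column names state, state_name, age, and n and the Alice/Quiz1 scores of 90 and 95, which come from the answers. Averaging the duplicate scores is only one possible resolution and is an assumption about what the analyst wants.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical data frames: the key columns have different names
# (state vs. state_name) and year has a different type in each.
left  <- tibble(state = c("NC", "VA"), year = c("2020", "2020"), x = c(1, 2))
right <- tibble(state_name = c("NC", "VA"), year = c(2020, 2020), y = c(3, 4))

# Joining on year as-is would fail because it is character on one side and
# numeric on the other, so convert it first. join_by() then matches the
# differently named state columns and the shared year column.
joined <- left |>
  mutate(year = as.numeric(year)) |>
  full_join(right, by = join_by(state == state_name, year))

# Counts already computed in n: geom_col() takes both x and y,
# whereas geom_bar() would count the rows itself.
age_counts <- tibble(age = c("18-29", "30-49", "50+"), n = c(12, 30, 18))
p <- ggplot(age_counts, aes(x = age, y = n)) +
  geom_col()

# Two rows share the Alice/Quiz1 label, so pivot_wider() needs to be told
# how to combine them. Here the duplicates are averaged (an assumption).
scores <- tibble(
  student    = c("Alice", "Alice", "Bob"),
  assessment = c("Quiz1", "Quiz1", "Quiz1"),
  score      = c(90, 95, 80)
)
wide <- scores |>
  pivot_wider(names_from = assessment, values_from = score, values_fn = mean)
```

Without values_fn, pivot_wider() does not guess: it warns that the values are not uniquely identified and returns a list-column, which is the behavior the answer above is describing.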

Stat doctor

First, they got the variables flipped. 69.66 is a prediction for speed when defense is zero. Second, our models deliver predictions, not guarantees. A defense of 0 is not guaranteed to yield a speed of 69.66. That’s just the prediction on average;
Holding attack and defense constant, a one unit increase in hp predicts an increase in the odds of being legendary by a multiplicative factor of \(e^{0.0274}\approx 1.028\). On average. Maybe. We think. IDFK;
This is just nonsense. 0.85 is the area under the ROC curve for a logistic regression model. It is not the \(R^2\) for a linear regression;
90% (not 95%) of the bootstrap estimates lie between -0.065 and 0.046. Not the data values;
The p-value is not the probability that the null is true. It is the probability of an estimate as extreme as or more extreme than the one you observed, assuming the null is true;
A discernibility level of 50% is just crazy. You would never do that. Even if you did, the conclusion is worded strangely. You don’t reject the alternative hypothesis; you fail to reject the null. A large p-value is saying “the jury’s out.” The data are not speaking loudly enough to distinguish between the competing claims. So you aren’t coming down firmly on either side if the p-value is large.
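
The multiplicative factor in the hp interpretation above can be derived from the form of the model. This assumes the fitted model is a logistic regression of legendary status on hp, attack, and defense, and that 0.0274 is the estimated hp coefficient; the symbols \(p\) (the predicted probability of being legendary) and \(\beta_j\) (the estimated coefficients) are introduced here for illustration:

\[
\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1\,\text{hp} + \beta_2\,\text{attack} + \beta_3\,\text{defense}
\]

Raising hp by one unit with attack and defense held fixed adds \(\beta_1\) to the log-odds, so the odds are multiplied by

\[
\frac{\text{odds}(\text{hp} + 1)}{\text{odds}(\text{hp})} = e^{\beta_1} = e^{0.0274} \approx 1.028,
\]

that is, a predicted increase of about 2.8% in the odds, not in the probability.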

Duke Forest homes

c
a
c
a
e
a
b, c, d
a, b, e
a, b, e

Holiday movies

b, e
b, c, d
d
e
c, e
a, b, c
a, b
e
a
e
b
a, c

GSS

Miscellaneous

b, d, e
a, e
b
b, c
a, b, c, e
d
a, b, d
b