
Midterm 1 Study Guide Solutions
Rapid-fire:
- every row is an observation and every column is a variable;
- given the data and the code, we can exactly reproduce all of the figures and tables in the analysis;
- both lists and vectors are ways of bundling together multiple values in R. But a vector consists of values of all the same type, whereas a list can bundle different types;
- STAAAAAAAWANANAAAAA;
- It takes the object on the left and feeds it forward as the first argument to the command on the right;
- It tells the computer what columns in the data frame to use and how to use them;
- `if_else` handles two conditions; `case_when` handles multiple conditions;
- `filter` discards rows and `select` discards columns;
- adds a “grouping” attribute to a data frame so that subsequent operations are performed within the levels of a categorical variable;
- it’s the cutoff point on the number line that has 38% of your data values below and 62% of the values above;
- yes I could;
- smaller than.
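Several of the rapid-fire answers above can be checked in a few lines of R. This is a minimal sketch, not part of the exam: the `scores` data frame and its column names are made up for illustration, and it assumes the dplyr package and the native pipe `|>` (R 4.1 or later).

```r
library(dplyr)

# Vectors hold a single type, so mixing types coerces everything to one type;
# a list keeps each element's own type.
v <- c(1, "a", TRUE)     # coerced to a character vector
l <- list(1, "a", TRUE)  # numeric, character, logical

# Hypothetical toy data, made up for illustration only
scores <- data.frame(
  student = c("A", "B", "C", "D", "E"),
  section = c("x", "x", "y", "y", "y"),
  score   = c(55, 72, 88, 91, 64)
)

# The pipe feeds `scores` forward as the first argument of mutate().
# if_else() handles two outcomes; case_when() handles several.
labelled <- scores |>
  mutate(
    passed = if_else(score >= 60, "yes", "no"),
    grade  = case_when(
      score >= 90 ~ "A",
      score >= 70 ~ "B",
      TRUE        ~ "C or below"
    )
  )

# filter() discards rows; select() discards columns
high <- labelled |>
  filter(score >= 70) |>
  select(student, grade)

# group_by() makes summarize() operate within each level of `section`
by_section <- labelled |>
  group_by(section) |>
  summarize(mean_score = mean(score))

# The 38th percentile: the cutoff with about 38% of values below it
q38 <- quantile(scores$score, 0.38)
```

Printing `labelled`, `high`, and `by_section` in turn shows what each verb changed.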
e
d
b
a
c
f
f
a
b
e
c
d
- Missing `+` between `ggplot` and `geom_point`;
- `vs` is numeric type, but the `shape` aesthetic requires a factor;
- `mtcars` needs to be overwritten to save the new variable `heavy`;
- There’s a hanging comma inside the `values` argument to `scale_color_manual`;
- `color` and `shape` have different labels. `shape` is defaulting to the column name “species”, and `color` is using the user-specified “Species”. It’s case sensitive. Add `shape = "Species"` inside `labs`, and the legends are consolidated;
- `n_people` is not a column in the summary table. `age_sum` is a column in the summary table.
- The wider data frame has a column titled `Remain Against The Law`, which is improper because of the spaces. To reference that column by name, you need to put it in backticks, not quotes;
- Inside the `if_else`, we are mixing types: characters and numbers;
- We are supplying `ggplot` with the data twice, by piping it in and supplying the data frame directly. It needs to be one or the other;
- We are assigning the entire pipeline to the variable name `us_uk_tr_votes`. The pipeline ends with the creation of a plot, so this new variable is not a data frame, it’s a plot object. You can’t filter that;
b, c, f, g
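Two of the error explanations above, column names with spaces and mixed types inside `if_else`, are easy to reproduce. The following is a sketch under assumptions: the `survey` data frame and its values are hypothetical, and only the column name `Remain Against The Law` comes from the answer key. It assumes the dplyr package.

```r
library(dplyr)

# Hypothetical data; check.names = FALSE keeps the spaces in the column name
survey <- data.frame(
  region = c("north", "south"),
  `Remain Against The Law` = c(120, 95),
  check.names = FALSE
)

# Backticks (not quotes) refer to a column whose name contains spaces
with_share <- survey |>
  mutate(share = `Remain Against The Law` / sum(`Remain Against The Law`))

# Mixing a character outcome with a numeric outcome makes if_else() fail
mixed <- tryCatch(
  if_else(survey$`Remain Against The Law` > 100, "high", 0),
  error = function(e) "error"
)

# Using the same type for both outcomes fixes it
fixed <- survey |>
  mutate(level = if_else(`Remain Against The Law` > 100, "high", "low"))
```

Here `mixed` captures the failure rather than stopping the script, which makes the type mismatch easy to inspect.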
- The `blizzard_salary` dataset has 409 rows.
- The `percent_incr` variable is numerical and continuous.
- The `salary_type` variable is categorical.
Figure 1 - A shared x-axis makes it easier to compare summary statistics for the variable on the x-axis.
c - It’s a value higher than the median for hourly but lower than the mean for salaried.
b - There is more variability around the mean compared to the hourly distribution.
a, b, e - Pie charts and waffle charts are for visualizing distributions of categorical data only. Scatterplots are for visualizing the relationship between two numerical variables.
c - `mutate()` is used to create or modify a variable.
a - `"Poor", "Successful", "High", "Top"`
b - Option 2. The plot in Option 1 shows the number of employees with a given performance rating for each salary type while the plot in Option 2 gives the proportion of employees with a given performance rating for each salary type. In order to assess the relationship between these variables (e.g., how much more likely is a Top rating among Salaried vs. Hourly workers), we need the proportions, not the counts.
There may be some `NA`s in these two variables that are not visible in the plot.
The proportions under Hourly would go in the Hourly bar, and those under Salaried would go in the Salaried bar.
c - `filter(salary_type != "Hourly" & performance_rating == "Poor")` - There are 5 observations for “not Hourly” “and” Poor.
a - `arrange()` - The result is arranged in increasing order of `annual_salary`, which is the default for `arrange()`.
c, d, e, f.
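The `filter()`, `arrange()`, and counts-versus-proportions answers above can be tried on a small stand-in table. This is only a sketch: `toy_salary` is a hypothetical five-row data frame that borrows the column names from the questions, not the real `blizzard_salary` data, so its counts will not match the exam's. It assumes the dplyr package.

```r
library(dplyr)

# Hypothetical stand-in rows; NOT the real blizzard_salary data
toy_salary <- data.frame(
  salary_type        = c("Hourly", "Salaried", "Salaried", "Hourly", "Salaried"),
  performance_rating = c("Poor", "Poor", "Top", "High", "Poor"),
  annual_salary      = c(40000, 90000, 120000, 52000, 70000)
)

# Keep rows that are both "not Hourly" and "Poor";
# arrange() sorts in increasing order by default
not_hourly_poor <- toy_salary |>
  filter(salary_type != "Hourly" & performance_rating == "Poor") |>
  arrange(annual_salary)

# Proportions of each rating within each salary type (they sum to 1 per type),
# which is what a comparison across salary types needs, rather than raw counts
props <- toy_salary |>
  count(salary_type, performance_rating) |>
  group_by(salary_type) |>
  mutate(prop = n / sum(n)) |>
  ungroup()
```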
Part 1: The following should be fixed:
- There should be a `|` after `#` before `label`
- There should be a `:` after label, not `=`
- There shouldn’t be a space in the chunk label, it should be `plot-blizzard`
- There should be spaces after commas in the code
- There should be spaces on both sides of `=` in the code
- There should be a space before `+`
- `geom_boxplot()` should be on the next line and indented
- There should be a `+` at the end of the `geom_boxplot()` line
- `labs()` should be indented
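A chunk body that follows all of the style fixes listed above might look like the sketch below. The exam's original code is not reproduced here, so the data frame, aesthetics, and axis labels are assumptions made up for illustration; only the chunk label `plot-blizzard` and the use of `geom_boxplot()` and `labs()` come from the answer key. It assumes the ggplot2 package.

```r
#| label: plot-blizzard

library(ggplot2)

# Hypothetical stand-in rows; not the real data from the exam
toy_salary <- data.frame(
  salary_type   = c("Hourly", "Hourly", "Salaried", "Salaried"),
  annual_salary = c(40000, 52000, 90000, 120000)
)

# One layer per line, each continued line indented, `+` at the end of the line
p <- ggplot(toy_salary, aes(x = salary_type, y = annual_salary)) +
  geom_boxplot() +
  labs(x = "Salary type", y = "Annual salary")
```

Printing `p` draws the plot. In a Quarto document, the `#|` line sets the chunk label; in a plain R script it is just a comment.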
Part 2: The warning is caused by `NA`s in the data. It means that 39 observations were `NA`s and are not plotted/represented on the plot.
Part 1:
- Render: Run all of the code and render all of the text in the document and produce an output.
- Commit: Take a snapshot of your changes in Git with an appropriate message.
- Push: Send your changes off to GitHub.
Part 2: c - Rendering or committing isn’t sufficient to send your changes to your GitHub repository; a push is needed. A pull is also not needed to view the changes in the browser.
d
a, d
a, d
c
b
a
b
a, d
b, c, e
a
a, c, e
a, b, c, d
a, d
