Individual submissions due: October 2, 2018 @ 11:59pm
Obtain the GitHub repository for homework 1. It contains a starter R Markdown file named homework_1.Rmd, which you will use for your work and write-up as you complete the questions below. Be sure to save, commit, and push (upload) frequently to GitHub so that you have incremental snapshots of your work. When you’re done, follow the How to submit section below to set up a Pull Request, which will be used for feedback.
Individual rough draft versus the group draft
Remember that you will be working through the homework assignment individually first (this is what is due by the end of the day on October 2nd), and later you will be assigned to a group where you will clean up the draft. As written in the course syllabus:
Homework assignments will be completed in two stages. The first is the “rough draft” stage that each student must complete on their own. Grades for these submissions will be primarily based on the correctness of your answers to each question. The second is the “group draft” stage, which will begin approximately 3 days after the rough drafts are due. You will be assigned into groups where you will discuss the questions and assemble a final draft together. You will not know these groups beforehand, and they will change with each assignment. Grades for the group-submitted final drafts will, in addition to correctness, be based on document formatting, visualization quality, writing quality, and code style.
Both the individual rough draft and group report homework submissions are to be completed using the R Markdown format, and they must successfully knit to HTML and PDF in a clean RStudio environment. In the individual submissions, full sentences are required when the question asks for a written response. In the group reports, full sentences with proper grammar and punctuation are to be used throughout the report. The group reports should explain what you are doing with each chunk of code and interpret the meaning of what you calculate so that a person who is not familiar with the problem could follow your logic.
Important! Several of these questions ask for explanations after obtaining the correct answer. For these, you do not have to write very long responses; just write enough to express the correct answer. When you are assigned into groups, that is when you will need to write everything out in full sentences.
The rail trail dataset
For this homework assignment, you will be working through a set of visualization problems based on the rail_trail dataset. The rail_trail dataset was collected by the Pioneer Valley Planning Commission (PVPC) and counts the number of people who passed a sensor on a rail trail over ninety days of recording. A rail trail is a retired or abandoned railway that was converted into a walking trail. The data were collected between April 5, 2005 and November 15, 2005 using a laser sensor placed at a location north of Chestnut Street in Florence, MA.
The dataset contains the following variables:
||daily high temperature (in degrees Fahrenheit)|
||daily low temperature (in degrees Fahrenheit)|
||average of daily low and daily high temperature (in degrees Fahrenheit)|
||indicates whether the season was Spring, Summer, or Fall|
||measure of cloud cover (in oktas)|
||measure of precipitation (in inches)|
||estimated number of trail users that day (number of breaks recorded)|
||indicator of whether the day was a non-holiday weekday|
How to describe your visualizations
When describing the contents of a visualization, follow the ideas discussed in these resources:
In the rail_trail dataset, how many rows are there? How many columns? Which variables in the dataset are continuous/numerical and which are categorical?
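As a minimal sketch of the kind of inspection this question calls for, the snippet below uses R's built-in iris data frame as a stand-in (an assumption for illustration; it is not the rail_trail dataset). The names n_rows, n_cols, and col_types are hypothetical.

```r
# Stand-in data: the built-in iris data frame, NOT the rail_trail dataset.
df <- iris

n_rows <- nrow(df)  # number of observations (rows)
n_cols <- ncol(df)  # number of variables (columns)

# str() prints each column with its storage type and a few sample values.
str(df)

# First class of each column, as a named character vector.
col_types <- vapply(df, function(col) class(col)[1], character(1))
print(col_types)
```

Storage type is only a hint. A column stored as a number can still describe a category, such as a 0/1 indicator, so check each variable's description before classifying it.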
Create a histogram of the variable volume using the following code:
ggplot(data = rail_trail) + geom_histogram(mapping = aes(x = volume))
Describe the shape and center of the distribution. Afterward, try adjusting the size of the histogram bins by adding the binwidth input. To start with, use binwidth = 21. If you need help with where to place binwidth, read the documentation by running ?geom_histogram in your Console window. Then, find a binwidth that’s too narrow and another one that’s too wide to produce a meaningful histogram.
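To see how bin width changes a histogram, here is a hedged sketch on R's built-in faithful data frame (a stand-in chosen for illustration, not the rail_trail dataset). The two bin widths and the names p_narrow and p_wide are illustrative assumptions, and suitable values for another variable will differ.

```r
library(ggplot2)

# Stand-in data: the built-in faithful data frame, NOT the rail_trail dataset.
# Very narrow bins: many bars, mostly noise.
p_narrow <- ggplot(data = faithful) +
  geom_histogram(mapping = aes(x = eruptions), binwidth = 0.05)

# Very wide bins: a few bars that hide the two peaks in the data.
p_wide <- ggplot(data = faithful) +
  geom_histogram(mapping = aes(x = eruptions), binwidth = 2)

p_narrow
p_wide
```

A useful bin width usually sits between two extremes like these: wide enough to smooth out noise, narrow enough to keep the overall shape visible.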
Choosing a proper bin width for a histogram can be tricky, and for that reason it’s preferable to use visualizations that avoid using bin widths when possible. An easy-to-use alternative to the histogram is geom_density, which creates a density plot. Use geom_density to create a density plot of the variable
Create a density plot for each of the remaining numerical variables, and describe the shape and center of each distribution. Are there any distributions that are similar in shape to each other?
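When several numerical variables need the same kind of plot, one option is to build the plots in a loop rather than copying code. The sketch below assumes the built-in faithful data frame as a stand-in (not the rail_trail dataset). The names numeric_cols and density_plots are hypothetical.

```r
library(ggplot2)

# Stand-in data: the built-in faithful data frame, NOT the rail_trail dataset.
# Names of all numeric columns in the data frame.
numeric_cols <- names(faithful)[vapply(faithful, is.numeric, logical(1))]

# One density plot per numeric column; .data[[col]] looks a column up by name.
density_plots <- lapply(numeric_cols, function(col) {
  ggplot(data = faithful) +
    geom_density(mapping = aes(x = .data[[col]])) +
    labs(x = col)
})
names(density_plots) <- numeric_cols

# Display a single plot from the list.
density_plots[["waiting"]]
```

Writing one geom_density() call per variable by hand works just as well and may be easier to read in a short report.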
Use geom_point() to create a scatterplot that plots
season. Why is this plot not useful?
Create a geom_count() plot (an alternative to a mosaic plot) using the same variables you considered in question 5:
ggplot(data = rail_trail) + geom_count(mapping = aes(x = season, y = weekday))
Which circle in the plot takes up the most area? Explain the meaning of the different size circles in the plot and what information it contains that is missing in the previous scatter plot.
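The link between circle size and the underlying counts can be checked against a two-way table. This is a sketch on ggplot2's built-in mpg data frame (a stand-in for illustration, not the rail_trail dataset). The names p_count and combo_counts are hypothetical.

```r
library(ggplot2)

# Stand-in data: ggplot2's built-in mpg data frame, NOT the rail_trail dataset.
# Each circle marks one (drv, class) combination; its area scales with the
# number of rows that share that combination.
p_count <- ggplot(data = mpg) +
  geom_count(mapping = aes(x = drv, y = class))

p_count

# The same counts as a two-way table, for comparison with the circle sizes.
combo_counts <- table(mpg$drv, mpg$class)
print(combo_counts)
```

A plain scatterplot of two categorical variables draws all of these rows on top of each other, which is why the counts cannot be read from it.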
Run ?geom_bar in the Console window and read the documentation for geom_bar(), and then look at the entry for it on the ggplot2 cheatsheet. Use geom_bar() to reproduce the following bar chart:
After reproducing the plot, explain what the height of each bar means.
Starting from the code snippet you deduced in question 7, create two more bar charts:
Create a bar chart by supplying the input position = "dodge" to
Create a bar chart by supplying the input position = "fill" to
After creating the visualizations, describe the feature that
Create a bar chart that maps its aesthetic
precip > 0. Interpret what this bar chart means.
Create a scatter plot of
geom_point(). Describe any trends that you see.
Take the code snippet you wrote for question 10 and map the
color. Then create a second plot where, instead of mapping
color, you facet over
facet_grid(). Discuss the advantages and disadvantages of faceting instead of mapping to the color aesthetic. How might the balance change if you had a larger dataset?
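To compare the two approaches side by side, the sketch below draws the same scatterplot both ways on ggplot2's built-in mpg data frame (a stand-in for illustration, not the rail_trail dataset). The variable choices and the names p_color and p_facet are assumptions.

```r
library(ggplot2)

# Stand-in data: ggplot2's built-in mpg data frame, NOT the rail_trail dataset.
# Option 1: one panel, with the grouping variable mapped to color.
p_color <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = drv))

# Option 2: one panel per level of the grouping variable.
p_facet <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(. ~ drv)

p_color
p_facet
```

In the single-panel version the groups share one set of axes, so they are easy to compare directly but can overlap. In the faceted version each group is isolated, at the cost of smaller panels.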
Take the code snippet that you wrote down in question 11 that faceted over weekday and create a model for each facet panel using geom_smooth(). Discuss the trends in the number of rail trail users that
Copy the code snippet you deduced in question 12 and use the input se = FALSE for geom_smooth(). What does the se input option for
How to submit
When you are ready to submit, be sure to save, commit, and push your final result so that everything is synchronized to GitHub. To lock in your submission time, knit your R Markdown document to PDF, download the file from RStudio Server, and upload it to the R Markdown mini-assignment posting on Blackboard.
You must also open a Pull Request on GitHub so that comments can be left directly on your R Markdown source files. After uploading to Blackboard, navigate to your copy of the GitHub repository you used for this assignment. You should see your repository, along with the updated files that you synchronized to GitHub. Do the following:
Click the Pull Requests tab near the top of the page.
Click the green button that says “New pull request”.
Click the dropdown menu button labeled “base:”, and select the option grading.
Confirm that the dropdown menu button labeled “compare:” is set to master.
Click the green button that says “Create pull request”.
Give the pull request the following title: Submission: Homework 1, FirstName LastName, replacing FirstName and LastName with your actual first and last name.
In the message box, write My submission is ready for grading @jkglasbrenner.
Click “Create pull request” to lock in your submission.
You are encouraged to review and keep the following cheatsheets handy while working on this assignment: