CDS-101: Introduction to Computational and Data Sciences

Github Classroom repo for Final Project

Due Dates

List of project questions: December 7, 2018 @ 11:59pm
Final Project Report: December 14, 2018 @ 11:59pm

Instructions

For the final project, you will be assigned into groups that will conduct an exploratory data analysis using the skills you’ve developed over the semester and write a report. Each group will motivate their exploration by formulating interesting questions that can be answered with the dataset, which are then answered by computing summary statistics as well as wrangling the data into a form that allows you to visualize it. The key words are summary statistics and visualize; the answer to each question must involve the presentation and discussion of visual and quantitative information. If desired, groups may also bring in additional data to strengthen the analysis, but please note that any additional data must be documented, which includes describing how you obtained it and how you’ve integrated it with the main dataset to further your analysis.

The Dataset

All groups will be working with the College Scorecard dataset started by The Obama Administration in September 2015. The dataset is included in the starter repository for your group and it is loaded by default (the variable name is college) in the setup block of the R Markdown file where you write your group report. The source link for the dataset is found here: https://collegescorecard.ed.gov/data/.

The data code-book, which tells you what the columns mean, is available at https://collegescorecard.ed.gov/assets/CollegeScorecardDataDictionary.xlsx. You will have to look through the code-book to understand the meaning of the variables, and this should be your starting point before you start running an analysis on the dataset.

For further information about the dataset, consult the documentation pages at https://collegescorecard.ed.gov/data/documentation/.

This is a large dataset that that contains millions of individual cells. As such, there is no one right way to approach this project. There are many different avenues that you can take, so have fun with it!

Grade

Grades for the final project will be based on the correctness and readability of your R code, how well your report is written (the report should be structured, coherent, and follow the standard rules of spelling and grammar), and how well the exploratory questions are answered. You will be graded as an individual even though this is a group project, meaning that any group members that are judged to have not sufficiently contributed to the final product will have their grade penalized.

As stated in the class syllabus, this project is worth 25% of your class grade.

The report

The group report is built around interesting questions you can ask of the dataset. Each group member will construct and answer 1 question about the dataset. So, for example, a group of four members will have 4 questions in total. These questions must be decided upon first before you begin putting your report together.

The report is to have two sections, Preprocessing the dataset and Exploratory data analysis. Preprocessing the dataset is where you the group will work together to organize and prepare into a common format that everyone will start from when answering his or her question. Exploratory data analysis is where you specify and then answer each question.

The subsections below give more detail about what is expected from your questions and what to put in the report.

Choosing your questions

There are some restrictions on what I will consider a valid question for this final project. Here is a short list of criteria that your question must meet:

Your question cannot be a simple lookup query, like “how many students go to this school?”
The question must be about comparing relationships and trends between two or more variables in the dataset, either by comparing two or more columns or by comparing well-defined groups of values within a single column.
Answering the question must involve creating one or more visualizations and the calculation of one or more summary statistics.
Answering the question must require data aggregation (group_by/summarize) or identifying a trend between 2 or more variables.

Your group will submit a preliminary list of questions by the end of the day on December 7 on Blackboard for my review and comment. For each question, you should provide the following:

The question itself written as a complete sentence.
A list of the columns you plan to use to answer the question.
A one or two sentence explanation for why you think it’s a worthwhile question to ask.

I reserve the right to adjust or veto any questions that do not meet the outlined criteria or that cannot be appropriately justified. Submitting these on time to Blackboard will be counted as part of your final project grade.

Preprocessing the dataset section

This dataset is structured and mostly clean, but there is still some data preprocessing that needs to be done before you can begin analysis. At a minimum, there are three clear tasks to complete and document in this first section of the R Markdown file before you continue on to the Exploratory data analysis section:

After your group has decided on the questions you will answer, figure out which columns in the dataset your group needs to answer the questions. Then, extract those columns using select and save the reduced dataset to another variable, for example college_reduced.
The column names are shortened abbreviations, and should be made more human-readable using the rename function. Use the data code-book, https://collegescorecard.ed.gov/assets/CollegeScorecardDataDictionary.xlsx, to help you figure out what the abbreviations mean.
Categorical variables that are not easy to understand, for example the integer categories under the REGION column, should be relabeled using the recode function.

Additional preprocessing steps may be necessary, depending on the questions your group is trying to answer. Regardless of what you do, these steps should be documented in the usual way, where code blocks are accompanied with written explanations.

Each question in the Exploratory data analysis section must be answered starting from the preprocessed dataset you end up with at the end of this section!

Exploratory data analysis section

This section is where each group member is responsible for answering his or her respective question about the dataset. Each question should be clearly stated and each answer should start from the preprocessed dataset obtained in the previous section. The procedure for obtaining the answer needs to be documented in the form of both code blocks and plain text. Calculated summary statistics and visualizations need to be discussed and interpreted for the reader. For example, if you present a distribution, discuss it’s shape, center, and spread, which you should relate to the computed summary statistics. If you present a scatter plot, what is the trend of the points? After analyzing the various outputs, synthesize it and provide a formal answer to your stated question.

Here is a quick checklist to apply to each question and answer:

Is the question clearly stated?
Does the code used to find the answer contain at least one dplyr-based data transformation, one calculation of a summary statistic, and one ggplot2 data visualization?
Are the results obtained from the code properly explained and interpreted for the reader?
Is a final answer clearly given for the question?

How to edit and merge each group member’s content into the final report

Like in the group submissions for the homework assignments, collaboration on the GitHub repository will work best if you create your own versions of the final_project.Rmd file where you make a copy and rename it to include your last name. For example, if your last name is Smith, then you should make a file copy and rename it to final_project_smith.Rmd. Eventually, the group will need to merge in everyone’s answers into the final_project.Rmd document. The following checklist may help with this:

Select an editor to be in charge of merging everyone’s answers into the final document final_project.Rmd.
The editor should ensure that everyone has committed and pushed their answers to GitHub so they can copy and paste in each contribution.
The editor needs to make sure that the final submission reads like a coherent document and that the writing style and code style are uniform across all the answers. In other words, it should read like a single person answered all the questions, not a group of four people.
The editor will be in charge of of committing and pushing the final R Markdown file to GitHub, knitting to PDF, and uploading the PDF file on Blackboard.

Additional guidelines

The following are additional guidelines for your Final Project submission:

Your group’s final project will only be graded if a PDF of your final report is submitted on Blackboard and the final versions of the group’s R Markdown files are pushed to GitHub.
The final R Markdown report file in the group’s GitHub repository must knit to the PDF format without error.
You must use the tidyverse functions in your work, I should not see any “base R” functions used (subset, for example) that were not shown in the lecture slides or the textbook.
Each group member must contribute substantive commits to the group repository on Github that reflect his/her own work.
The report represents your group’s final results and should only contain the methods used to obtain them. Do not include questions in the report that you cannot answer.
Your R code should be clean and readable, which includes what it looks like after knitting. Code blocks should not run off the side of the page when knitted to PDF!
The report’s tone should be professional and should not read like a social media feed or personal blog. Refrain from editorializing about the project as a whole or about a specific question, as this is not an opinion paper. Do not speculate, instead support your claims and explanations using data and analysis. Avoid self-narration or writing about how you felt or what you were thinking as you complete each question, instead write as if you are constructing a step-by-step tutorial for others to use.
Late submissions for the final project will not be accepted, no exceptions.

How to submit

When you are ready to submit, be sure that everyone in your group has committed and pushed their R Markdown files to Github. The editor should do the following in RStudio Server:

If it is not loaded already, open the RStudio project for the group homework assignment.
Click on the Git tab in the upper-right window, then click “Pull” to make sure you have the latest version of all the files.
Press Ctrl+Shift+B to knit the final submission to the PDF format.
Download the PDF file to your computer.
On Blackboard, find the Final Project submission link for your group. Submit the PDF for your group to lock in your submission time.

The editor should confirm that the files for their group’s repository webpage on GitHub are fully up-to-date, and that if I were to knit the PDF file myself, it should be identical to what is uploaded to Blackboard.