Homework 3
Submissions due: November 27, 2018 @ 11:59pm
Instructions
IMPORTANT! DO THIS FIRST!
You must have the most up-to-date version of infer installed on RStudio Server, otherwise the functions shade_p_value()
and shade_confidence_interval()
will not work. To get the most up-to-date version of infer, run the following code in the Console window of RStudio Server:
remotes::install_github("tidymodels/infer", ref = "develop")
Obtain the Github repository you will use to complete homework 3, which contains a starter RMarkdown file named homework_3.Rmd, which you will use to do your work and write-up when completing the questions below. Be sure to save, commit, and push (upload) frequently to Github so that you have incremental snapshots of your work. When you’re done, follow the How to submit section below to setup a Pull Request, which will be used for feedback.
Remember that the point of us using RMarkdown documents is to combine code and writeups! Each block of R code should have some sort of explanation or justification using full sentences.
Your grade will take into account your code, your explanations, and whether your document looks nice when “knitted” to PDF.
About the dataset
This homework uses the National Survey of Family Growth, Cycle 6 dataset in the file 2002FemPreg.rds, published by the National Center for Health Statistics. Complete descriptions of all the variables can be found in the NSFG Cycle 6: Female Pregnancy File Codebook. Below are selected descriptions of variables that will be used for this homework assignment:
Variable | Description |
---|---|
caseid | integer ID of the respondent |
prglngth | integer duration of the pregnancy in weeks |
outcome | integer code for the outcome of the pregnancy, with a 1 indicating a live birth |
birthord | serial number for live births; the code for a respondent’s first child is 1, and so on. For outcomes other than live birth, this field is blank |
Questions
This homework assignment revolves around answering the following question using this dataset:
Do first born children either arrive early or late when compared with non-first-borns?
The questions in this assignment will guide you through the process of answering this question using statistical inference.
Addressing the question “do first born children either arrive early or late when compared with non-first-borns?” means that we should only consider live births in the dataset. Filter the dataset so that it only contains outcomes with live births and assign this result to the variable
live_births
.Next, we need to label all births in
live_births
as either “first” or “other” so that we can easily find the children that are first borns and the ones that are not. There are a couple of different ways that you can label the births:Split the dataset into two parts, a
first_births
dataset and another_births
dataset. Do this by applying a filter to extract the first births, then usemutate()
to create a new column calledbirth_order
that labels these rows as “first”. Then repeat this process, except apply a filter to extract all other births and label those as “other” inbirth_order
. To recombine the datasets into one, usebind_rows()
.Use
if_else()
withmutate()
to create thebirth_order
column and the “first” and “other” labels. You can also try usingcase_when()
instead ofif_else()
to accomplish this. If you do it this way, you won’t need to usebind_rows()
.
After labeling the births, remove the extraneous variables from the data frame leaving just the
prglngth
andbirth_order
columns. Assign the resulting data frame to a variable namedpregnancy_length
.Take the
pregnancy_length
dataset and plot a probability mass function (PMF) histogram of the pregnancy length in weeks that shows “first” births and “other” births on the same plot. Choose a reasonable binwidth for the histogram and addcoord_cartesian(xlim = combine(27, 46))
to your plot so that the window focuses where most of the data is. After creating the plot, describe the shape, center, and spread of the two distributions. Based on the visualization, do you think the data looks like it supports the statement that “first born children either arrive early or arrive late when compared with non-first-borns”?Group the variable
prglngth
into “first” and “other” births and compute the summary statistics (mean, median, standard deviation, inter-quartile range, minimum, maximum) for each group. How do the different summary statistics compare between the two distributions? Does it look like there may be a notable difference between the two distributions? Explain.Plot the cumulative distribution functions (CDFs) of the pregnancy lengths for “first” and “other” births. The CDF for both distributions should be on the same figure so that we can directly compare them (hint: they should be two different colors and partially transparent). How do the distributions compare? Does it look like there is there a meaningful difference between the two distributions?
If we want to determine whether or not the difference between two distributions is statistically significant, we need to run a hypothesis test. Formalize your analysis by writing down the null hypothesis for the question of whether first babies arrive early or arrive late when compared with non-first-borns. It should be clear from how you write your null hypothesis whether you’re conducting a one-sided or two-sided hypothesis test. You should also restate what the observed value is (this will be a number you compute using the data in
prglngth
) that you will be comparing against the null distribution.Use a simulation to generate the null distribution so that you can perform the hypothesis test. Do this using the functions provided in the
infer
package,specify()
,hypothesize()
,generate()
, andcalculate()
. To collect enough statistics, you should set the simulation to repeat 10,000 times.Once you’ve obtained the null distribution, get the p-value for your hypothesis test using
get_pvalue()
. Assuming a significance level of \(\alpha = 0.05\), state whether we can we reject the null hypothesis.Finally, visualize the simulated null distribution using
visualize()
and the meaning of the p-value usingshade_p_value()
.Use a bootstrap simulation to calculate the 95% confidence interval for your observed value. Do this using the following functions from the
infer
package,specify()
, `generate()
, andcalculate()
. To collect enough statistics, you should set the bootstrap simulation to repeat 10,000 times.Once you’ve obtained the bootstrap distribution, use
get_confidence_interval()
to find the upper and lower bounds of the 95% confidence interval. Does the value corresponding to the null result fall within the range of the 95% confidence interval?Finally, visualize the bootstrap distribution using
visualize()
and show the 95% confidence interval usingshade_confidence_interval()
.In addition to hypothesis tests and confidence intervals, we should also consider the effect size, which measures the relative difference between two distributions. The effect size helps us better know how important a given result actually is, not just whether or not we can reject the null hypothesis. One measure of the effect size is called Cohen’s d (https://en.wikipedia.org/wiki/Effect_size#Cohen.27s_d), which we will use to compute the effect size between the pregnancy lengths for “first” and “other” births. The different ranges of d can be interpreted using the following table:
Effect size d Very small 0.01 Small 0.20 Medium 0.50 Large 0.80 Very large 1.20 Huge 2.00 The following set of functions should also be preloaded for you:
cohens_d_bootstrap()
,bootstrap_report()
, andplot_ci()
. These functions will use bootstrap simulations to compute the confidence interval for the Cohen’s d parameter. Run the bootstrap simulation as follows (you can just copy and paste this code):cohens_d_bootstrap(data = pregnancy_length, model = prglngth ~ birth_order)
Be sure to assign the results to a variable, for example
bootstrap_results
.To print a report for the bootstrap simulation, run:
bootstrap_report(bootstrap_results)
To visualize the bootstrap distribution and confidence interval, run:
plot_ci(bootstrap_results)
Using the provided table, report how large the effect size is for the difference in pregnancy lengths for “first” and “other” births.
How to submit
When you are ready to submit, be sure to save, commit, and push your final result so that everything is synchronized to GitHub. Then, knit your R Markdown document to PDF, export (download) the file from RStudio Server, and upload it to the Homework 3 posting on Blackboard.
You are to also open a Pull Request on GitHub so that comments can be directly left on your R Markdown source files. After uploading to Blackboard, navigate to your copy of the [GitHub repository][github-classroom] you used for this assignment. You should see your repository, along with the updated files that you synchronized to GitHub. Do the following:
Click the Pull Requests tab near the top of the page.
Click the green button that says “New pull request”.
Click the dropdown menu button labeled “base:”, and select the option
grading
.Confirm that the dropdown menu button labeled “compare:” is set to
master
.Click the green button that says “Create pull request”.
Give the pull request the following title: Submission: Homework 3, FirstName LastName, replacing FirstName and LastName with your actual first and last name.
In the messagebox, write: My homework submission is ready for grading @jkglasbrenner.
Click “Create pull request” to lock in your submission.
Cheatsheets
You are encouraged to review and keep the following cheatsheets handy while working on this assignment: