Part 1: Learning the Basics Using Arbuthnot Data

This lab is broken up into two parts. In the first part, you will load data into the workspace and do a variety of activities with the data, which you will then repeat on your own in Part 2 with a different set of data for submission. You will NOT use the RMarkdown file for Part 1. Instead, you will simply work within the workspace, primarily in the Console. In Part 2, you will download the RMarkdown file and do the required activities within the RMarkdown file for submission.

Navigating the RStudio Workspace

When you open RStudio, you should see the basic starting screen like in the following image.

The panel in the upper right contains your workspace as well as a history of the commands that you've previously entered. Any plots that you generate will show up in the panel in the lower right corner.

The panel on the left is where the action happens. It's called the console. Every time you launch RStudio, it will have the same text at the top of the console telling you the version of R that you're running. Below that information is the prompt (>). As its name suggests, this prompt is really a request, a request for a command.

Loading and Exploring Data

To get started, open RStudio (it should be an icon in your taskbar at this point). Enter the following command at the R prompt (i.e., right after > on the console). You can either type it in manually or copy and paste it from this document.

source("http://www.openintro.org/data/R/arbuthnot.R")

This command instructs R to access the OpenIntro website and fetch some data: the Arbuthnot baptism counts for boys and girls. You should see that the workspace area in the upper right-hand corner of the RStudio window (make sure you select the Environment tab) now lists a data set called arbuthnot that has 82 observations on 3 variables. As you interact with R, you will create a series of objects. Sometimes you load them as we have done here, and sometimes you create them yourself as the byproduct of a computation or some analysis you have performed.

The Data: Dr. Arbuthnot's Baptism Records

The Arbuthnot data set refers to Dr. John Arbuthnot, an 18th-century physician, writer, and mathematician. He was interested in the ratio of newborn boys to newborn girls, so he gathered the baptism records for children born in London for every year from 1629 to 1710. We can take a look at the data by typing its name into the console.

arbuthnot

What you should see are four columns of numbers, each row representing a different year: the first entry in each row is simply the row number (an index we can use to access the data from individual years if we want), the second is the year, and the third and fourth are the numbers of boys and girls baptized that year, respectively. Use the scrollbar on the right side of the console window to examine the complete data set.

Note that the row numbers in the first column are not part of Arbuthnot's data. R adds them as part of its printout to help you make visual comparisons. You can think of them as the index that you see on the left side of a spreadsheet. In fact, the comparison to a spreadsheet will generally be helpful. R has stored Arbuthnot's data in a kind of spreadsheet or table called a data frame.

You can see the dimensions of this data frame by typing:

dim(arbuthnot)

[1] 82  3

This command should output [1] 82 3, indicating that there are 82 rows and 3 columns (we'll get to what the [1] means in a bit), just as it says next to the object in your workspace. You can see the names of these columns (or variables) by typing:

names(arbuthnot)

[1] "year"  "boys"  "girls"

You should see that the data frame contains the columns year, boys, and girls. At this point, you might notice that many of the commands in R look a lot like functions from math class; that is, invoking R commands means supplying a function with some number of arguments. The dim and names commands, for example, each took a single argument, the name of a data frame.

One advantage of RStudio is that it comes with a built-in data viewer. Click on the name arbuthnot in the Environment pane (upper right window) that lists the objects in your workspace. This will bring up an alternative display of the data set in the Data Viewer (upper left window). You can close the data viewer by clicking on the x in the upper left-hand corner.
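To see how dim and names behave on a data frame whose size you already know, here is a minimal sketch. The object toy and every value in it are made up for illustration; they are not part of Arbuthnot's data.

```r
# Hypothetical stand-in data frame; the values are invented for illustration.
toy <- data.frame(
  year  = c(2001, 2002, 2003, 2004),
  boys  = c(12, 15, 11, 14),
  girls = c(10, 16, 9, 13)
)

toy_dim   <- dim(toy)    # number of rows first, then number of columns
toy_names <- names(toy)  # the column (variable) names

print(toy_dim)
print(toy_names)
```

Because toy has four rows and three columns, dim reports 4 and then 3, in the same rows-then-columns order that it uses for arbuthnot.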

Let's start to examine the data a little more closely. We can access the data in a single column of a data frame separately using a command like

arbuthnot$boys

This command will only show the number of boys baptized each year.

Notice that the way R has printed these data is different. When we looked at the complete data frame, we saw 82 rows, one on each line of the display. These data are no longer structured in a table with other variables, so they are displayed one right after another. Objects that print out in this way are called vectors; they represent a set of numbers. R has added numbers in [brackets] along the left side of the printout to indicate locations within the vector. For example, 5218 follows [1], indicating that 5218 is the first entry in the vector. And if [43] starts a line, then the first number on that line is the 43rd entry in the vector.

If you want to see the number of boys baptized in the fifth year of the data set, you can use the following command:

arbuthnot$boys[5]

The [5] is called the index. The command says that we want the fifth number in the vector or list.
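Indexing works the same way on any vector, and the index can itself be a range or a set of positions. The sketch below uses a made-up vector named boys (hypothetical values, not the Arbuthnot counts) to show three common forms.

```r
# Hypothetical vector of counts; the values are invented for illustration.
boys <- c(12, 15, 11, 14, 18)

fifth     <- boys[5]        # a single entry: the fifth value
first_two <- boys[1:2]      # a range of entries: the first and second values
picked    <- boys[c(1, 5)]  # several chosen entries: the first and the fifth

print(fifth)
print(first_two)
print(picked)
```

Note that R counts positions starting from 1, so boys[1] is the first entry.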

Plotting Data

R has some powerful functions for making graphics. We can create a simple plot of the number of girls baptized per year with the command

plot(x = arbuthnot$year, y = arbuthnot$girls)

By default, R creates a scatterplot with each x,y pair indicated by an open circle. The plot itself should appear under the Plots tab of the lower right panel of RStudio. Notice that the command above again looks like a function, this time with two arguments separated by a comma. The first argument in the plot function specifies the variable for the x-axis and the second for the y-axis. If we wanted to connect the data points with lines, we could add a third argument, the letter l for line.

plot(x = arbuthnot$year, y = arbuthnot$girls, type = "l")

You might wonder how you are supposed to know that it was possible to add that third argument. Thankfully, R documents all of its functions extensively. To read what a function does and learn the arguments that are available to you, just type in a question mark followed by the name of the function that you're interested in. Try the following.

?plot

Notice that the help file replaces the plot in the lower right panel. You can toggle between plots and help files using the tabs at the top of that panel.

Mathematical Operations

Now, suppose we want to plot the total number of baptisms. To compute this, we could use the fact that R is really just a big calculator. We can type in mathematical expressions like

5218 + 4683

to see the total number of baptisms in 1629. We could repeat this once for each year, but there is a faster way. If we add the vector for baptisms for boys and girls, R will compute all sums simultaneously.

arbuthnot$boys + arbuthnot$girls

What you will see are 82 numbers (in that packed display, because we aren't looking at a data frame here), each one representing the sum we're after. Take a look at a few of them and verify that they are right. Therefore, we can make a plot of the total number of baptisms per year with the command

plot(arbuthnot$year, arbuthnot$boys + arbuthnot$girls, type = "l")

This time, note that we left out the names of the first two arguments. We can do this because the help file shows that the default for plot is for the first argument to be the x-variable and the second argument to be the y-variable.
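Besides reading the help file, you can ask R for a function's argument names directly. The sketch below inspects plot.default, the function that plot hands ordinary numeric x and y input to; the variable names arg_names and first_three are chosen here for illustration.

```r
# plot() passes plain numeric x/y input on to plot.default().
# formals() returns that function's arguments in the order they are declared.
arg_names   <- names(formals(plot.default))
first_three <- arg_names[1:3]

print(first_three)
```

The first two names are x and y, which is why unnamed arguments are matched to the x-variable and the y-variable in that order. When you do name the arguments, their order no longer matters, so plot(y = arbuthnot$girls, x = arbuthnot$year) draws the same plot as the named command used earlier.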

Similarly, if we wanted to know the ratio of boys to girls in 1629, we could do this with the divide symbol:

5218 / 4683

or we can act on the complete vectors with the expression to find the ratio for every year:

arbuthnot$boys / arbuthnot$girls
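The vector commands above work element by element: the first entry of one vector is combined with the first entry of the other, the second with the second, and so on. A minimal sketch with two short made-up vectors (hypothetical values, not the Arbuthnot counts) makes this easy to check by hand.

```r
# Hypothetical counts; the values are invented for illustration.
boys  <- c(12, 15, 11, 14)
girls <- c(10, 16, 9, 13)

total <- boys + girls  # 12+10, 15+16, 11+9, 14+13
ratio <- boys / girls  # 12/10, 15/16, 11/9, 14/13

print(total)
print(ratio)
```

Each result has the same length as the inputs, one value per position.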

Tip: If you use the up and down arrow keys, you can scroll through your previous commands, your so-called command history. You can also access it by clicking on the history tab in the upper right panel. This will save you a lot of typing in the future.

Likewise, we could find the proportion of newborns that are boys in 1629:

5218 / (5218 + 4683)

or this may also be computed for all years simultaneously:

arbuthnot$boys / (arbuthnot$boys + arbuthnot$girls)
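The ratio and the proportion are two views of the same comparison, and the parentheses in the proportion matter. The sketch below uses the 1629 counts quoted in this lab; the variable names are chosen here for illustration.

```r
# Counts for 1629 as quoted in the lab text.
boys_1629  <- 5218
girls_1629 <- 4683

ratio      <- boys_1629 / girls_1629                 # boys per girl
proportion <- boys_1629 / (boys_1629 + girls_1629)   # share of all baptisms

# The two quantities are linked: proportion = ratio / (1 + ratio).
linked <- isTRUE(all.equal(proportion, ratio / (1 + ratio)))

# Without parentheses, R divides first and then adds, giving 1 + 4683.
no_parens <- boys_1629 / boys_1629 + girls_1629

print(round(ratio, 3))
print(round(proportion, 3))
print(linked)
print(no_parens)
```

A ratio above 1 and a proportion above 0.5 say the same thing: more boys than girls were baptized that year.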

Logical Operations

Finally, in addition to simple mathematical operators like subtraction and division, you can ask R to make comparisons like greater than, >, less than, <, and equality, ==. For example, we can ask if boys outnumber girls in each year with the expression:

arbuthnot$boys > arbuthnot$girls

This command returns 82 values, each either TRUE if that year had more boys than girls or FALSE if that year did not (the answer may surprise you). This output shows a different kind of data than we have considered so far. In the arbuthnot data frame, our values are numerical (the year, the number of boys and girls). Here, we've asked R to create logical data, data where the values are either TRUE or FALSE. In general, data analysis will involve many different kinds of data types, and one reason for using R is that it is able to represent and compute with different types.

This seems like a fair bit for your first lab, so let's stop here. To exit RStudio, click the x in the upper right corner of the whole window. You will be prompted to save your workspace. If you click Save, RStudio will save the history of your commands and all the objects in your workspace so that the next time you launch RStudio, you will see arbuthnot and will have access to the commands you typed in your previous session. For now, click Don't Save, since you want to clear everything out for Part 2 of the lab.
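Logical vectors can be summarized as well as printed. The sketch below uses made-up counts (hypothetical values, not the Arbuthnot data) to show a comparison and two common summaries, sum and all.

```r
# Hypothetical counts; the values are invented for illustration.
boys  <- c(12, 15, 11, 14)
girls <- c(10, 16, 9, 13)

more_boys <- boys > girls   # one TRUE or FALSE per position
n_true    <- sum(more_boys) # TRUE counts as 1 and FALSE as 0
every     <- all(more_boys) # TRUE only if every comparison is TRUE

print(more_boys)
print(n_true)
print(every)
```

Here boys exceed girls in three of the four positions, so sum gives 3 and all gives FALSE.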

Part 2 For Submission: CDC Data

In the previous part, you created some displays and preliminary analyses of Arbuthnot's baptism data. Your assignment involves repeating these steps but for present-day birth records in the United States. These data come from a report by the Centers for Disease Control: http://www.cdc.gov/nchs/data/nvsr/nvsr53/nvsr53_20.pdf. Check it out if you would like to read more about an analysis of sex ratios at birth in the United States.

Getting Started

Open RStudio (choose File > Close Project if you have a project already open).

Create a new project (you might want to use Week 2 Wrapup Lab).

Choose File > New Project, then New Directory > New Project. Provide a directory name and location (you might want to use Week 2 Wrapup Lab). Use Browse to find where you want the folder created. Then click the Create Project button to complete the process.

Make sure all history is deleted before you get started.

Download the RMarkdown file into your project folder, and then open it in RStudio. File > Open File. Browse to the folder and select the appropriate template file with the .Rmd extension.

Type your name and date into the RMarkdown file.

We begin by loading the data into the R workspace.

Then go through and add code in the code blocks to do the following four exercises.

Exercise 1:

In the first code block, start out by loading up the present-day data with the following command.

source("http://www.openintro.org/data/R/present.R")

The data are stored in a data frame called present.

In the RMD file, the following questions are provided before the code block. Type the commands in the code block which will answer these questions. The answers to questions a and b do not require anything other than running the code. For question c, in addition to running the code, please write out the answer to the question in the white space below the code block, based on the code that you run. You can type the code or commands into the code block and then run the code one time using the green arrow in the top right of the code block (this button is particular, and you have to hit it in the right spot for the code to run).

a. What are the dimensions of the data frame?

b. What are the variable or column names?

c1. Display the years included in the data set.

c2. What years are included in this data set? (Type the answer after the code block.)

Exercise 2:

In the code block for Exercise 2, write and run the code to do the following tasks a and b. Answer question c in the white space below the code block.

a. Find the total number of births by year.

b. Plot the total number of births by year with year on the x-axis.

c. What observations do you make based on the plot?

NOTE: Sometimes when running a code block, there are multiple outputs that show as mini-windows under the code block. You can click on the mini-windows to see their outputs. For Exercise 2, you will have an output for a and an output for b, so you will see the two mini-windows below the code block as shown here:

Exercise 3:

a. Make a plot that displays the boy-to-girl ratio for every year in the data set.

b. What do you conclude based on the plot?

Exercise 4:

a. What was the greatest number of births in the U.S. over the years in the dataset? Hints: The code for this is not found in the lab, but there is a way to figure out what you need to do. Let's break this down. First of all, the number of births is the total of girls and boys born each year. You should have done this in Exercise 2. The 'greatest' of this vector of total births is a 'maximum.' You can use the R reference card to find helpful commands: http://cran.r-project.org/doc/contrib/Short-refcard.pdf.

For instance, if we wanted to find the mean number of boys born during the time frame of our data, we see under the Math section that the command mean(x) will give us that number, where 'x' is the vector. So we would type the command: mean(present$boys). If we wanted to know the average or mean of all births, we could type: mean(present$boys + present$girls).
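The pattern of passing a vector, or an expression that produces a vector, to a summary command can be checked by hand on a small example. The vectors below are made up for illustration and are not the present data.

```r
# Hypothetical counts; the values are invented for illustration.
boys  <- c(12, 15, 11, 14)
girls <- c(10, 16, 9, 13)

mean_boys  <- mean(boys)          # (12 + 15 + 11 + 14) / 4
mean_total <- mean(boys + girls)  # the sums are computed first, then averaged

print(mean_boys)
print(mean_total)
```

Other summary commands on the reference card are called in the same way, with the vector as the argument inside the parentheses.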

Now use the reference card to find a command for maximum and use the code block provided for part a.

b. In what year was the total number of births in the U.S. the greatest? This will require two lines of code in the code block for part b. First, find the command on the reference card which gives you the index where the maximum occurs. You can find this command under the section entitled Data Selection and Manipulation. Run that to get the answer. This gives you the index, the place in the list, where the total is maximum.

Once you have that number, you can use it as the index with present$year to find out the year. (Hint: Review the indexing example just before the Plotting Data section.) Add that code into the code block and run the code block again.
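The idea of reusing one index across columns can be sketched on a small stand-in data frame. Everything here is hypothetical: the data frame toy, its values, and the index i, which is picked by hand rather than computed.

```r
# Hypothetical stand-in data frame; the values are invented for illustration.
toy <- data.frame(
  year  = c(2001, 2002, 2003, 2004),
  total = c(22, 31, 20, 27)
)

i <- 3  # a position chosen by hand for this illustration

total_at_i <- toy$total[i]  # the total stored in row i
year_at_i  <- toy$year[i]   # the year stored in the same row

print(total_at_i)
print(year_at_i)
```

Because the columns of a data frame are lined up row by row, the same index picks out matching entries in every column.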

Knit to HTML and History

Now you need to Knit to HTML to save your RMD file and create the HTML file. Also, save the History to the Rhist file as explained in the Midweek Lab.
