Question:

Another useful step is to calculate some quick statistical summaries of the data. As you learned in the interactive tutorial, we can easily do this with the group_by and summarize functions. It is helpful to do this because it can warn us of potential issues with the data, such as missing data or strange values.

i. Calculate summary statistics (the number of observations, mean, median, standard deviation, interquartile range, minimum value, and maximum value) of the fertility variable for each continent (i.e. group_by the continent variable first).

  • Because the column contains missing data, you will need to provide the na.rm = TRUE argument to the summary functions, e.g. mean = mean(fertility, na.rm = TRUE).

  • Remember to use the IQR() function for the interquartile range.

  • You can calculate the number of observations using n(), and the number of missing observations using sum(is.na(fertility)). The fraction missing is the latter divided by the former.

    How does is.na() work?

    is.na() is a function that returns a Boolean value (TRUE/FALSE) depending on whether the input argument is missing (i.e. NA). If the input is a vector or a column, then the output will be a vector of TRUEs and FALSEs for each item in the input.

    In R, as in many programming languages, the Boolean value TRUE is treated as 1 and FALSE as 0 when used in arithmetic. Therefore, we can add up a vector of Boolean values (using R's sum() function) to count the number of TRUEs in the vector. As this vector is the output of is.na() in our code, this gives us the number of missing values in that column!

  • Make sure this table doesn't overrun the right margin when you knit to a PDF! (You might have to shorten any long column names.)
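
The hints for part i can be put together as in the minimal sketch below. It is an illustration only: the toy data frame and its values are hypothetical stand-ins for the course data set, and the short column names (n_obs, n_miss, f_miss, and so on) are assumptions chosen to keep the table narrow. Only the continent and fertility column names come from the question.

```r
library(dplyr)

# Hypothetical toy data standing in for the course data set;
# only the column names continent and fertility come from the question.
toy <- tibble(
  continent = c("Africa", "Africa", "Africa", "Europe", "Europe", "Europe"),
  fertility = c(6.0, 5.0, NA, 2.0, 1.5, 1.0)
)

# is.na() returns one TRUE/FALSE per element of the column,
# and sum() counts the TRUEs because TRUE is treated as 1.
is.na(toy$fertility)
n_missing_total <- sum(is.na(toy$fertility))

# Grouped summary: one row per continent, short names to fit a PDF page.
fert_summary <- toy %>%
  group_by(continent) %>%
  summarize(
    n_obs  = n(),                               # rows in the group
    n_miss = sum(is.na(fertility)),             # missing values in the group
    f_miss = n_miss / n_obs,                    # fraction missing
    mean   = mean(fertility, na.rm = TRUE),
    median = median(fertility, na.rm = TRUE),
    sd     = sd(fertility, na.rm = TRUE),
    iqr    = IQR(fertility, na.rm = TRUE),
    min    = min(fertility, na.rm = TRUE),
    max    = max(fertility, na.rm = TRUE)
  )

fert_summary
```

With the real data, replace toy with the name of your own data frame. Without na.rm = TRUE, any group containing an NA would return NA for that statistic.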

ii. We cannot calculate the mean, etc., of a categorical variable, but we can still calculate the number of observations in each group and how many are missing.

First group_by() the continent variable, and then use summarize() to calculate the number of observations, and the fraction of the region column that is missing. (You should find that there are no missing values for this variable!)
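
A minimal sketch of part ii follows, again on a hypothetical toy data frame. The region values are made up, and one NA is included on purpose so that the fraction missing is visibly non-zero for one group; the real region column should give zero everywhere. The names n_obs, n_miss, and f_miss are assumptions, not names required by the exercise.

```r
library(dplyr)

# Hypothetical toy data; one region is deliberately NA for illustration.
toy <- tibble(
  continent = c("Africa", "Africa", "Europe", "Europe", "Europe"),
  region    = c("Eastern Africa", "Western Africa",
                "Northern Europe", NA, "Southern Europe")
)

# For a categorical column we only count rows and missing values.
region_summary <- toy %>%
  group_by(continent) %>%
  summarize(
    n_obs  = n(),                     # rows in the group
    n_miss = sum(is.na(region)),      # missing region values
    f_miss = n_miss / n_obs           # fraction missing
  )

region_summary
```

Because TRUE counts as 1, mean(is.na(region)) gives the same fraction missing in a single step.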
