Complete the following R code using RStudio.

## 1pt: Comments entered correctly above (you removed the '' text) ----

## 2pts: Install and Load the tidyverse and fivethirtyeight packages ----

install.packages("")

install.packages("")

library()

library()

## 1pt: Create object called flying_df using the built in flying dataset in the fivethirtyeight package ----

<- flying

## 1pt: view the R Documentation (help page) for flying ----

?

## 1pt: preview flying_df in the spreadsheet view (invoke a data viewer) ----

(flying_df)

## 2pts: How was this data collected and what does this data represent (what do the rows represent)? ----

#

#

## We are interested in exploring this data, and could predict many different outputs based on the...

##...columns we are given here.

## 2pts: Look at the given columns and discuss the top two/three things you think would be... ----

##...most interesting as the target of a predictive model. Give reasons why you chose those fields.

#

#

## 1pt: view a summary of the flying_df ----

(flying_df)

## The output of the above line shows us the statistics and counts per column (depending on the data type).

## Notice the NA counts in certain fields. These are missing values in our data.

## Also, some columns only return the length, class, and mode (for Character fields).

## Looking at the spreadsheet view of the data from before, it seems that all the values in...

##...each column were an option (not open-ended responses), so it would make more sense if the...

##...columns that are stored as Char are converted over to Factors instead.

## 3pts: convert all character fields into factors ----

## Note: this code is a copy/paste/tweak from: https://gist.github.com/ramhiser/93fe37be439c480dc26c4bed8aab03dd

flying_df <- %>%

mutate_if(sapply(flying_df, is.), as.)
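The conversion described above can be sketched in base R on a small made-up data frame. `toy_df` and its columns are hypothetical stand-ins, not the real `flying_df`; the dplyr pipeline in the template should give the same result once its blanks are filled with the character test and the factor conversion.

```r
# Hypothetical toy data standing in for flying_df
toy_df <- data.frame(
  seat = c("Yes", "No", NA, "Yes"),
  height_cm = c(170, 182, 165, NA),
  stringsAsFactors = FALSE
)

# Flag the character columns, then convert only those to factors
is_chr <- sapply(toy_df, is.character)
toy_df[is_chr] <- lapply(toy_df[is_chr], as.factor)

# seat is now a factor; height_cm is left as numeric
str(toy_df)
```

After this, `summary(toy_df)` reports level counts for `seat` instead of only length, class, and mode.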

## 1pt: Check that the data was successfully converted ----

summary(flying_df)

## Now let's count the number of missing values in each column

## 3pts: complete the sapply() to find the missing counts ----

sapply(, function(x) (is.(x)))

#for other ways of doing this visit: https://sebastiansauer.github.io/sum-isna/
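As a rough illustration of the counting pattern, here is the same idea on a made-up data frame (`toy_df` is hypothetical). `is.na()` returns a logical vector and `sum()` counts its `TRUE` values, so applying that function to every column gives one count per column.

```r
# Hypothetical toy data with a few missing values
toy_df <- data.frame(
  age = c(25, NA, 40, NA),
  gender = c("F", "M", NA, "F"),
  fly = c("Often", "Never", "Often", "Never")
)

# One missing-value count per column
na_counts <- sapply(toy_df, function(x) sum(is.na(x)))
print(na_counts)
```

`colSums(is.na(toy_df))` is a shorter base R way to get the same counts.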

## 1pt: what is the percent missing per column? ----

sapply(, function(x) round((is.(x))/nrow()*100,1))

## 4pts: calculate the % missing per row and save it in a new column "NA_per_row" ----

flying_df$ <- apply(, MARGIN = 1, function() round(sum(is.na(x))/ncol()*100,1))
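A minimal sketch of the per-row calculation, again on hypothetical data (`toy_df` is made up). With `MARGIN = 1`, `apply()` passes each row to the function, so `x` is one respondent's answers and the denominator is the number of columns.

```r
# Hypothetical toy data: 4 rows, 4 columns
toy_df <- data.frame(
  a = c(1, NA, NA, 4),
  b = c("x", NA, "y", "z"),
  c = c(TRUE, NA, NA, FALSE),
  d = c(NA, NA, 3, 4)
)

# Percent of missing values in each row, rounded to 1 decimal place
toy_df$NA_per_row <- apply(
  toy_df,
  MARGIN = 1,
  function(x) round(sum(is.na(x)) / ncol(toy_df) * 100, 1)
)

print(toy_df$NA_per_row)
```

Note that `ncol(toy_df)` is evaluated before the new column is attached, so the percentage is based on the original four columns. Re-running the line afterwards would count `NA_per_row` as a fifth column.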

## View the summary of the data again

summary(flying_df)

## 5pts: create bar chart for the count (on the y) of % missing per row (on the x). be sure to give it a title ----

%>%

group_by(NA_per_row) %>%

summarise(count = ) %>%

ggplot(aes(as.factor(),)) +

geom_col(aes(fill=count)) +

geom_text(aes(label = count),size=3.5, position = position_stack(vjust = 0.5)) +

labs(title = , subtitle="Raw Data", x = "NA per row (%)")
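The `group_by()` plus `summarise()` step in the chart code boils down to counting how many rows share each `NA_per_row` value (in dplyr, `n()` is the usual counting helper). A base R sketch of that counting step, using a made-up vector of percentages:

```r
# Hypothetical per-row missing percentages
na_per_row <- c(0, 0, 25, 25, 25, 100)

# Number of rows at each percentage; these are the bar heights
counts <- table(na_per_row)
print(counts)
```

The table names (`"0"`, `"25"`, `"100"`) play the role of the discrete x axis, which is why the ggplot template wraps the x variable in `as.factor()`.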

## If a row is mostly blank, we can't learn much about that person...so let's clean up (by removing)...

##...any row that has more than 70% missing values

## 3pts: Create new version of flying_df, called flying_clean, ... ----

##...that only keeps rows with less than or equal to 70% missing data

flying_clean <- %>%

filter( <= )

## 1pt: How many rows were just removed? How many rows remain? ----

# rows removed

# rows remain
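One way to sketch the filtering and the before/after row counts in base R, on hypothetical data (`toy_df` and `toy_clean` are made-up names). The 70 threshold comes from the instructions above; a dplyr `filter()` with the same `<=` condition should keep the same rows.

```r
# Hypothetical toy data with a precomputed NA_per_row column
toy_df <- data.frame(
  id = 1:5,
  NA_per_row = c(0, 33.3, 70, 85.2, 100)
)

# Keep rows with at most 70% missing values
toy_clean <- toy_df[toy_df$NA_per_row <= 70, ]

# Compare row counts before and after
rows_removed <- nrow(toy_df) - nrow(toy_clean)
rows_remaining <- nrow(toy_clean)
cat("rows removed:", rows_removed, "\n")
cat("rows remain:", rows_remaining, "\n")
```

The row at exactly 70% is kept because the condition is "less than or equal to".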

## 1pt: View the graph again for counts (y) of % missing per row (x), but now of the flying_clean object ----

%>%

group_by(NA_per_row) %>%

summarise(count = ) %>%

ggplot(aes(as.factor(),)) +

geom_col(aes(fill = count)) +

geom_text(aes(label = count),size=3.5, position = position_stack(vjust = 0.5)) +

labs(title = ,subtitle="Clean Data", x = "NA per row (%)")

## Now let's see which columns have the most missing values still so we can come up with a cleaning plan

## 3pts: output a sorted vector (from high to low) of % missing values per column in flying_clean ----

sort(sapply(, function(x) round(sum(is.na())/nrow()*100,1)), decreasing = T)
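A small sketch of the sorted per-column percentages on hypothetical data (`toy_df` is made up). The inner function divides each column's missing count by the number of rows, and `sort(..., decreasing = TRUE)` puts the worst column first while keeping the column names attached.

```r
# Hypothetical toy data: columns deliberately not in sorted order
toy_df <- data.frame(
  seat = c("Yes", "No", "Yes", "No"),
  income = c(NA, NA, 50000, NA),
  location = c("West", NA, "East", "South")
)

# Percent missing per column, then sorted from high to low
na_pct <- sapply(toy_df, function(x) round(sum(is.na(x)) / nrow(toy_df) * 100, 1))
na_pct_sorted <- sort(na_pct, decreasing = TRUE)
print(na_pct_sorted)
```

Writing `TRUE` in full rather than `T` is slightly safer, since `T` is an ordinary variable that can be reassigned.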

## 1pt: which column has the majority of the missing values? ----

#

## It seems like the personal/demographic information is what we are missing the most.

## Given that this data is from a survey, let's clean up the survey question responses first.

## We're going to assume that there is actually some useful information to be had about those who...

##...skip questions. So instead of replacing the NAs with "artificial" data, let's label the NA's...

##...as "No Response" to preserve the data while still handling the NA problem.

## 3pts: Create the clean_qs object where NAs are replaced with "No Response" in every column EXCEPT: ----

##...household_income, education, location, gender, age, and children_under_18

clean_qs<- flying_clean %>%

select(!c(, , , , , )) %>%

select(names(.[sapply(., anyNA)])) %>% #this line is selecting any column that has an NA

mutate_all(as.character) %>% # in order to add our new level we need to first convert these cols to text

replace(., is.na(.), ) %>% # replace any NAs with "No Response"

mutate_all(as.) # convert the columns back to factors
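The character round trip in the pipeline above is needed because a factor only accepts values that are already among its levels. A minimal sketch on a single hypothetical factor (`answers` is made up) shows the same three steps:

```r
# Hypothetical survey answers with skipped questions
answers <- factor(c("Yes", NA, "No", NA, "Yes"))

# 1. Convert to text so a new value can be introduced
answers_chr <- as.character(answers)

# 2. Label the missing answers instead of imputing them
answers_chr[is.na(answers_chr)] <- "No Response"

# 3. Convert back to a factor; "No Response" is now a level
answers_clean <- as.factor(answers_chr)
print(table(answers_clean))
```

Assigning `"No Response"` directly to the original factor would instead produce a warning about an invalid factor level and leave the `NA`s in place.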

## 2pts: overwrite the matching columns in flying_clean with our clean_qs columns ----

[,colnames(clean_qs)] <-
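Overwriting by column name can be sketched like this, with hypothetical data frames (`full_df` and `fixed_qs` are made up). Indexing the left-hand side with `colnames()` of the cleaned data frame replaces only the columns whose names match and leaves the rest untouched.

```r
# Hypothetical full data and a cleaned subset of its columns
full_df <- data.frame(
  id = 1:3,
  q1 = c("a", NA, "b"),
  q2 = c(NA, "c", "d")
)
fixed_qs <- data.frame(
  q1 = c("a", "No Response", "b"),
  q2 = c("No Response", "c", "d")
)

# Replace the matching columns in place; id is not affected
full_df[, colnames(fixed_qs)] <- fixed_qs
print(full_df)
```

This relies on both data frames having the same number of rows in the same order, which holds in the assignment because `clean_qs` is derived from `flying_clean` without filtering rows.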

## Check the % missing now with these changes (same code as before) ----

sort(sapply(, function(x) round(sum(is.na())/nrow()*100,1)), decreasing = T)

## For our other columns, let's impute (assign/replace) new values based on the other columns we have.

## We're going to do this because we'd like to assume the values aren't just missing completely at random,

## and we might be able to figure out what the values should be using the other complete cases we have.

## (See MAR example here: https://uvastatlab.github.io/2019/05/01/getting-started-with-multiple-imputation-in-r/)

## To do this, we're going to use the mice package (Multivariate Imputation by Chained Equations)

## 1pt: install and load the mice package ----

install.packages()

library()

## 2pts: use the mice() function to run multivariate imputation by chained equations on the flying_clean object ----

flying_mice <- (, m = 3)#note: this may take 5-10 min to run

## 2pts: use the complete() function on the flying_mice object to extract the completely clean dataframe ----

flying_mice_df <- (, 1)
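For reference, the impute-then-extract pattern looks roughly like this on `nhanes`, a small example dataset that ships with the mice package. This sketch assumes mice is already installed and is skipped otherwise; the `seed` and `printFlag` arguments are additions for reproducibility and quieter output, not part of the assignment.

```r
# Runs only if the mice package is available
if (requireNamespace("mice", quietly = TRUE)) {
  library(mice)

  # Create 3 imputed versions of the example data
  imp <- mice(nhanes, m = 3, seed = 123, printFlag = FALSE)

  # Extract the first completed dataset as a regular data frame
  nhanes_complete <- mice::complete(imp, 1)

  # Missing values per column after imputation
  print(colSums(is.na(nhanes_complete)))
}
```

Writing `mice::complete()` with the package prefix avoids picking up `tidyr::complete()`, which has the same name and is attached by the tidyverse.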

## Output a sorted vector of % NAs per column of the flying_mice_df object ----

sort(sapply(, function(x) round(sum(is.na())/nrow()*100,1)), decreasing = T) #should return all 0s

# No more missing data!

summary(flying_mice_df)

## From here we could explore our data with more visuals and ultimately build a predictive model!

## Though we'll save that for next time!

## Up to 10 pts extra credit for any exploratory graphs using the clean data -----
