Question: Complete the following R code using RStudio.
## 1pt: Comments entered correctly above (you removed the ' ' text) ----
## 2pts: Install and Load the tidyverse and fivethirtyeight packages ----
install.packages("
install.packages("
library(
library(
## 1pt: Create object called flying_df using the built in flying dataset in the fivethirtyeight package ----
## 1pt: view the R Documentation (help page) for flying ----
?
## 1pt: preview flying_df in the spreadsheet view (invoke a data viewer) ----
## 2pts: How was this data collected and what does this data represent (what do the rows represent)? ----
#
#
## We are interested in exploring this data, and could predict many different outputs based on the...
##...columns we are given here.
## 2pts: Look at the given columns and discuss the top two/three things you think would be... ----
##...most interesting as the target of a predictive model. Give reasons why you chose those fields.
#
#
## 1pt: view a summary of the flying_df ----
## The output of the above line shows us the statistics and counts per column (depending on the data type).
## Notice the NA counts in certain fields. These are missing values in our data.
## Also, some columns only return the length, class, and mode (for Character fields).
## Looking at the spreadsheet view of the data from before, it seems that all the values in...
##...each column were an option (not open-ended responses), so it would make more sense if the...
##...columns that are stored as Char are converted over to Factors instead.
## 3pts: convert all character fields into factors ----
## Note: this code is a copy/paste/tweak from: https://gist.github.com/ramhiser/93fe37be439c480dc26c4bed8aab03dd
flying_df <-
mutate_if(sapply(flying_df, is.
## 1pt: Check that the data was successfully converted ----
summary(flying_df)
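## A minimal sketch of the character-to-factor conversion, run on a small made-up data frame
## (toy_df and its columns are hypothetical, not the real flying data). It uses across(where(...)),
## which is the current dplyr idiom; the mutate_if() call in the skeleton above should behave the same way.

```r
library(dplyr)

# Hypothetical toy survey data (not the real flying dataset)
toy_df <- tibble(
  respondent_id = 1:4,
  gender = c("Male", "Female", NA, "Female"),
  recline = c("Never", "Always", "Never", NA)
)

# Convert every character column to a factor; other columns are left untouched
toy_factors <- toy_df %>%
  mutate(across(where(is.character), as.factor))

# summary() now reports per-level counts (and NA counts) for the factor columns
summary(toy_factors)
```

## Checking with summary() or str() afterwards confirms which columns changed type.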
## Now let's count the number of missing values in each column
## 3pts: complete the sapply() to find the missing counts ----
sapply(
# For other ways of doing this, visit: https://sebastiansauer.github.io/sum-isna/
## 1pt: what is the percent missing per column? ----
sapply(
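## A small sketch of counting missing values per column with sapply(), shown on a made-up
## data frame (toy_df is hypothetical). The anonymous function is applied to each column in turn;
## taking mean() of is.na() gives the share missing, so multiplying by 100 gives a percentage.

```r
# Hypothetical toy data with a few missing values
toy_df <- data.frame(
  age = c(25, NA, 40, 31),
  gender = c("Male", NA, NA, "Female"),
  seat = c("Aisle", "Window", "Aisle", "Window")
)

# Number of NAs in each column
na_counts <- sapply(toy_df, function(col) sum(is.na(col)))

# Percent of NAs in each column
na_percent <- sapply(toy_df, function(col) mean(is.na(col)) * 100)

na_counts
na_percent
```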
## 4pts: calculate the % missing per row and save it in a new column "NA_per_row" ----
flying_df$
## View the summary of the data again
summary(flying_df)
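## One possible way to get the percent missing per row is rowMeans() over the logical matrix
## returned by is.na(). The sketch below uses a made-up data frame (toy_df is hypothetical).
## Note that the percentage is computed before the new column exists; re-running the line
## afterwards would include NA_per_row itself in the denominator.

```r
# Hypothetical toy data: four columns, with increasingly empty rows
toy_df <- data.frame(
  age = c(25, NA, 40, NA),
  gender = c("Male", NA, NA, NA),
  seat = c("Aisle", "Window", "Aisle", NA),
  height = c(170, 165, NA, NA)
)

# is.na() gives a TRUE/FALSE matrix; rowMeans() turns each row into a share missing
toy_df$NA_per_row <- rowMeans(is.na(toy_df)) * 100

toy_df
```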
## 5pts: Create a bar chart of the count (on the y-axis) of % missing per row (on the x-axis). Be sure to give it a title ----
group_by(NA_per_row) %>%
summarise(count =
ggplot(aes(as.factor(
geom_col(aes(fill=count)) +
geom_text(aes(label = count),size=3.5, position = position_stack(vjust = 0.5)) +
labs(title =
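## A sketch of the same group_by / summarise / ggplot pattern on made-up values
## (toy_df, the title, and the axis labels are assumptions for illustration only).
## The counts are computed first, then geom_col() draws one bar per distinct percentage.

```r
library(dplyr)
library(ggplot2)

# Hypothetical per-row missing percentages
toy_df <- tibble(NA_per_row = c(0, 0, 0, 25, 25, 50, 100))

missing_plot <- toy_df %>%
  group_by(NA_per_row) %>%
  summarise(count = n()) %>%                      # rows per distinct percentage
  ggplot(aes(x = as.factor(NA_per_row), y = count)) +
  geom_col(aes(fill = count)) +                   # bar height = count
  geom_text(aes(label = count), size = 3.5,
            position = position_stack(vjust = 0.5)) +  # label centred in each bar
  labs(title = "Rows by percent of missing values (toy data)",
       x = "% missing per row", y = "Number of rows")

missing_plot
```

## Wrapping NA_per_row in as.factor() makes the x-axis discrete, so each percentage gets its own bar.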
## If a row is mostly blank, we can't learn much about that person...so let's clean up (by removing)...
##...any row that has more than 70% missing values
## 3pts: Create new version of flying_df, called flying_clean, ... ----
##...that only keeps rows with less than or equal to 70% missing data
flying_clean <-
filter(
## 1pt: How many rows were just removed? How many rows remain? ----
#
#
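## A minimal sketch of keeping only rows at or below a missingness threshold and counting
## what was dropped, using a made-up data frame (toy_df is hypothetical). Base R subsetting
## is used here; dplyr::filter(toy_df, NA_per_row <= 70) would be an equivalent alternative.

```r
# Hypothetical data with a precomputed percent-missing column
toy_df <- data.frame(
  respondent_id = 1:5,
  NA_per_row = c(0, 25, 70, 80, 100)
)

# Keep rows with 70% or less missing data
toy_clean <- toy_df[toy_df$NA_per_row <= 70, ]

# Compare row counts before and after filtering
rows_removed <- nrow(toy_df) - nrow(toy_clean)
rows_remaining <- nrow(toy_clean)

rows_removed
rows_remaining
```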
## 1pt: View the graph again for counts (y) of % missing per row (x), but now of the flying_clean object ----
group_by(NA_per_row) %>%
summarise(count =
ggplot(aes(as.factor(
geom_col(aes(fill = count)) +
geom_text(aes(label = count),size=3.5, position = position_stack(vjust = 0.5)) +
labs(title =
## Now let's see which columns have the most missing values still so we can come up with a cleaning plan
## 3pts: output a sorted vector (from high to low) of % missing values per column in flying_clean ----
sort(sapply(
## 1pt: which column has the majority of the missing values? ----
#
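## A short sketch of ranking columns by percent missing, on a made-up data frame
## (toy_df is hypothetical). sort() with decreasing = TRUE puts the worst column first,
## and names() on the result identifies it.

```r
# Hypothetical toy data
toy_df <- data.frame(
  age = c(25, NA, 40, 31),
  gender = c("Male", NA, NA, NA),
  seat = c("Aisle", "Window", "Aisle", "Window")
)

# Percent missing per column, sorted from high to low
pct_missing <- sort(
  sapply(toy_df, function(col) mean(is.na(col)) * 100),
  decreasing = TRUE
)

pct_missing
names(pct_missing)[1]  # column with the largest share of missing values
```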
## Seems like the personal/demographic information is what we are missing the most.
## Given that this data is from a survey, let's clean up the survey question responses first.
## We're going to assume that there is actually some useful information to be had about those who...
##...skip questions. So instead of replacing the NAs with "artificial" data, let's label the NAs...
##...as "No Response" to preserve the data while still handling the NA problem.
## 3pts: Create the clean_qs object where NAs are replaced with "No Response" in every column EXCEPT: ----
##...household_income, education, location, gender, age, and children_under_18
clean_qs<- flying_clean %>%
select(!c(
select(names(.[sapply(., anyNA)])) %>% # this line selects every column that has an NA
mutate_all(as.character) %>% # to add our new level, we first need to convert these columns to text
replace(., is.na(.),
mutate_all(as.
## 2pts: overwrite the matching columns in flying_clean, with our clean_q columns ----
## Check the % missing now with these changes (same code as before) ----
sort(sapply(
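## A base R sketch of the "label NAs as No Response, then write the columns back" idea,
## on a made-up data frame (toy_df, its column names, and keep_na are all hypothetical).
## Factors are converted to text first because a new value cannot be assigned to a factor
## unless it is already one of its levels.

```r
# Hypothetical survey data: two question columns and one demographic column
toy_df <- data.frame(
  respondent_id = 1:4,
  recline = factor(c("Never", NA, "Always", "Never")),
  baby_rude = factor(c("No", "Yes", NA, NA)),
  age = factor(c("18-29", NA, "30-44", "45-60"))
)

# Columns whose NAs should be left alone for later imputation
keep_na <- c("age")

# Columns that contain at least one NA, minus the excluded ones
has_na <- names(toy_df)[sapply(toy_df, anyNA)]
question_cols <- setdiff(has_na, keep_na)

# Convert to text, fill NAs with the new label, convert back to factor
clean_cols <- lapply(toy_df[question_cols], function(col) {
  col <- as.character(col)
  col[is.na(col)] <- "No Response"
  factor(col)
})

# Overwrite only the matching columns in the original data frame
toy_df[question_cols] <- clean_cols

sapply(toy_df, function(col) mean(is.na(col)) * 100)
```

## Assigning by column name leaves every other column, including the excluded demographics, unchanged.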
## For our other columns, let's impute (assign/replace) new values based on the other columns we have.
## We're going to do this because we'd like to assume the values aren't just missing completely at random,
## and we might be able to figure out what the values should be using the other complete cases we have.
## (See MAR example here: https://uvastatlab.github.io/2019/05/01/getting-started-with-multiple-imputation-in-r/)
## To do this, we're going to use the mice package (Multivariate Imputation by Chained Equations)
## 1pt: install and load the mice package ----
install.packages(
library(
## 2pts: use the mice() function to run multivariate imputation by chained equations on the flying_clean object ----
flying_mice <-
## 2pts: use the complete() function on the flying_mice object to extract the completely clean dataframe ----
flying_mice_df <-
## Output a sorted list of % NAs per column of the flying_mice_df object ----
sort(sapply(
# No more missing data!
summary(flying_mice_df)
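## A minimal sketch of the mice() then complete() workflow, run on the small nhanes example
## dataset that ships with the mice package rather than on the flying data. The values chosen
## for m, seed, and printFlag are assumptions for illustration; the defaults may suit the real data just as well.

```r
library(mice)

# nhanes is a small example dataset bundled with mice that contains NAs
colSums(is.na(nhanes))

# Run chained-equation imputation: 5 imputed datasets, fixed seed for reproducibility
imputed <- mice(nhanes, m = 5, seed = 123, printFlag = FALSE)

# Extract the first completed dataset as a regular data frame
completed_df <- mice::complete(imputed, 1)

# Every column should now report zero missing values
colSums(is.na(completed_df))
```

## Calling mice::complete() with the package prefix avoids ambiguity with tidyr::complete(),
## which has the same name and is attached by the tidyverse.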
## From here we could explore our data with more visuals and ultimately build a predictive model!
## Though we'll save that for next time!
## Up to 10 pts extra credit for any exploratory graphs using the clean data -----
