Question: Use RStudio to answer this question (this is a data science class).

1. In this question, we'll be exploring data from Johns Hopkins about COVID. The dataset contains country- and sub-country-level counts of the total number of infections and deaths since the beginning of the pandemic.

a. The focus of our analysis is going to be on counties within the United States. Begin by keeping only data that's from the US. You should also remove any rows in the dataset that are not in one of the 50 states (like DC or Puerto Rico). It may help you to know that R has a (hidden) vector called state.name. Lastly, you should remove any rows where the county (i.e. Admin2) is "Unassigned" or "Out of [State abbrev]". How many rows are left in the dataset after you've done this filtering?
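One way to approach the filtering in part (a) is sketched below in base R. It runs on a small made-up data frame rather than the real Johns Hopkins file, and the country column name Country_Region is an assumption (the question only names FIPS, Admin2, Province_State, Confirmed, and Deaths), so check it against the actual data.

```r
# Toy stand-in for the Johns Hopkins data; values are invented for illustration.
# Country_Region is an assumed column name, not confirmed by the question.
covid <- data.frame(
  FIPS = c(42101, 11001, 72001, NA, NA, NA),
  Admin2 = c("Philadelphia", "District of Columbia", "Adjuntas",
             "Unassigned", "Out of PA", NA),
  Province_State = c("Pennsylvania", "District of Columbia", "Puerto Rico",
                     "Pennsylvania", "Pennsylvania", "Ontario"),
  Country_Region = c("US", "US", "US", "US", "US", "Canada"),
  Confirmed = c(100, 50, 20, 5, 3, 70),
  Deaths = c(10, 2, 1, 0, 0, 4),
  stringsAsFactors = FALSE
)

# Keep US rows, in one of the 50 states (state.name ships with base R),
# whose county is neither "Unassigned" nor "Out of <abbrev>".
keep <- covid$Country_Region == "US" &
  covid$Province_State %in% state.name &
  !(covid$Admin2 %in% "Unassigned") &
  !grepl("^Out of ", covid$Admin2)

us_counties <- covid[keep, ]
nrow(us_counties)
```

Using %in% and grepl() here means a missing Admin2 value is never turned into an NA in the filter; whether such rows should be kept in the real data is a judgment call the question leaves open.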

b. Now we're going to clean up the dataset so it only includes the columns we're interested in. You should clean up the dataframe so that it only contains the FIPS code (FIPS), county name (Admin2), state (Province_State), confirmed cases (Confirmed), and deaths (Deaths). You can also rename these variables so that you're comfortable working with them. For this question, you don't need to include anything in your write-up.
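A minimal base R sketch of the column selection and renaming in part (b) follows. The data frame is a toy example, the extra Lat column only stands in for columns to be dropped, and the new names (fips, county, state, confirmed, deaths) are one arbitrary choice.

```r
# Toy filtered data; Lat stands in for any column we do not need.
us_counties <- data.frame(
  FIPS = c(42101, 42003),
  Admin2 = c("Philadelphia", "Allegheny"),
  Province_State = c("Pennsylvania", "Pennsylvania"),
  Confirmed = c(100, 80),
  Deaths = c(10, 4),
  Lat = c(40.0, 40.5),
  stringsAsFactors = FALSE
)

# Keep only the five columns of interest, in a fixed order.
covid_small <- us_counties[, c("FIPS", "Admin2", "Province_State",
                               "Confirmed", "Deaths")]

# Rename positionally; the order must match the selection above.
names(covid_small) <- c("fips", "county", "state", "confirmed", "deaths")
str(covid_small)
```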

c. How many confirmed COVID cases and deaths have there been in Philadelphia County, PA? How many counties have had at least as many cases as Philadelphia? How many counties have had at least as many deaths as Philadelphia?
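The comparison in part (c) could be sketched as follows, assuming the renamed columns county, state, confirmed, and deaths. The numbers are invented, so the counts shown say nothing about the real data.

```r
# Toy county-level data with invented counts.
covid_small <- data.frame(
  county = c("Philadelphia", "Allegheny", "Los Angeles", "Cook"),
  state = c("Pennsylvania", "Pennsylvania", "California", "Illinois"),
  confirmed = c(300, 200, 900, 300),
  deaths = c(30, 10, 60, 40),
  stringsAsFactors = FALSE
)

# Match on both county and state, since county names repeat across states.
philly <- covid_small[covid_small$county == "Philadelphia" &
                        covid_small$state == "Pennsylvania", ]

# Counties with at least as many cases / deaths as Philadelphia.
# Both counts include Philadelphia itself; subtract 1 to exclude it.
n_cases_ge <- sum(covid_small$confirmed >= philly$confirmed)
n_deaths_ge <- sum(covid_small$deaths >= philly$deaths)
c(cases = n_cases_ge, deaths = n_deaths_ge)
```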

d. How many counties have had at least 10,000 confirmed COVID cases? Among these counties with at least 10,000 cases, which one has had the highest rate of deaths to confirmed cases? What is that rate?
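For part (d), one possible sketch is to subset on the case threshold, compute deaths divided by confirmed cases, and pick the row with the largest ratio. The data are again a toy example with hypothetical county labels.

```r
# Toy data; county labels and counts are invented.
covid_small <- data.frame(
  county = c("A", "B", "C", "D"),
  confirmed = c(15000, 9000, 20000, 12000),
  deaths = c(300, 500, 200, 360),
  stringsAsFactors = FALSE
)

# Counties with at least 10,000 confirmed cases.
big <- covid_small[covid_small$confirmed >= 10000, ]
nrow(big)

# Ratio of deaths to confirmed cases within that subset.
big$death_ratio <- big$deaths / big$confirmed

# which.max() returns the first row with the largest ratio.
top <- big[which.max(big$death_ratio), ]
top
```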

e. Now let's add some demographic data to our analysis. Use the readRDS() function to read the demographic data from the 2019 American Community Survey (acs-2019-demographics.RDS). Merge those data with the COVID dataset we've been working with. Which state seems to have the biggest issue with this merge? Why?
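A way to diagnose the merge in part (e) is to do a left join and count the unmatched rows per state. The sketch below uses toy data in place of readRDS("acs-2019-demographics.RDS"); it assumes both tables share a key column named fips, which is hypothetical since the question does not say what the ACS key is called. The state labels are placeholders and do not hint at the real answer.

```r
# Toy COVID data with placeholder state labels.
covid_small <- data.frame(
  fips = c(1001, 1003, 2001, 2002),
  county = c("A", "B", "C", "D"),
  state = c("State X", "State X", "State Y", "State Y"),
  stringsAsFactors = FALSE
)

# Toy demographic data; with the real file this would be
# acs <- readRDS("acs-2019-demographics.RDS")
# The key name 'fips' is an assumption; 'pop' is named in the question.
acs <- data.frame(fips = c(1001, 1003, 2001), pop = c(100, 200, 300))

# Left join: keep every COVID row, fill NA where no ACS row matches.
merged <- merge(covid_small, acs, by = "fips", all.x = TRUE)

# Count unmatched rows per state, largest first.
unmatched <- sort(table(merged$state[is.na(merged$pop)]), decreasing = TRUE)
unmatched
```

Inspecting the unmatched rows for the top state (for example, comparing their FIPS codes and county names against the ACS table) is usually the quickest route to the "why".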

f. Create some new variables, each one using the population variable (pop) as the denominator:

confirmed cases rate: confirmed cases / population

death rate: deaths / population

over 65 percentage: people over 65 years old / population

percent white, non-Hispanic: people who are white, non-Hispanic / population

Now explore the relationship between at least two of those variables. You can do this with linear regression, by creating a graph, or whatever approach helps you to tell an interesting story about the data.
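The four rates in part (f) and a simple regression between two of them might be sketched as below. Only pop is a column name given in the question; over65 and white_nh are hypothetical names for the ACS counts, and all values are invented.

```r
# Toy merged data; over65 and white_nh are assumed column names.
merged <- data.frame(
  confirmed = c(100, 250, 400, 90, 300),
  deaths = c(2, 10, 12, 3, 6),
  pop = c(1000, 2000, 5000, 600, 2500),
  over65 = c(150, 500, 600, 150, 300),
  white_nh = c(700, 1200, 2000, 500, 1500)
)

# Each new variable uses population as the denominator.
merged$case_rate <- merged$confirmed / merged$pop
merged$death_rate <- merged$deaths / merged$pop
merged$pct_over65 <- merged$over65 / merged$pop
merged$pct_white_nh <- merged$white_nh / merged$pop

# Linear regression of the death rate on the share of residents over 65.
fit <- lm(death_rate ~ pct_over65, data = merged)
summary(fit)
```

To look at the same relationship graphically, plot(death_rate ~ pct_over65, data = merged) followed by abline(fit) draws the scatterplot with the fitted line.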
