Question 1
a) Read the Boston dataset CSV files provided for this assignment into Python (you can use pd.read_csv()). The files are boston_weather_1, boston_weather_2, boston_weather_3, boston_weather_4, and more_weather_variables; assign them to DataFrame variables called boston_1, boston_2, boston_3, boston_4, and more_variables, respectively. Concatenate the DataFrames boston_1, boston_2, boston_3, and boston_4 and assign the result to a DataFrame called combined_boston. These four datasets should be combined vertically, since they have the same variable names: boston_1 is stacked on top of boston_2, that result on top of boston_3, and that result on top of boston_4. Then horizontally merge, join, or concatenate the combined_boston and more_variables DataFrames and assign the result to a DataFrame called boston_data. Print the first five rows and the last five rows of boston_data, and print its shape.
b) Check the combined_boston to verify how many missing data points exist under each column.
c) Drop the rows (instances) that contain any missing data. Assign the resulting DataFrame to a variable called clean_boston_data. Note that this is only one way of dealing with missing data; dropping cases with missing data is usually appropriate only if you have a sufficient sample size. Check for missing data again to ensure there is none in clean_boston_data. Print the shape of clean_boston_data.
d) Format all the column names to lowercase and insert an underscore between the words of column names that consist of more than one word. For example, meanTemp should become mean_temp, Max24hrPrep can become max_24hr_prep, HighTemp becomes high_temp, etc. Reassign the DataFrame with the formatted column names to the same variable, clean_boston_data. Print or output the columns of the clean_boston_data DataFrame.
e) Select or slice all data from the clean_boston_data DataFrame, except the data where the Year is 1930. You can call this subset data excluding_1930. Using the excluding_1930 DataFrame, output the first 20 unique values in the Year column.
f) Select the data from clean_boston_data where the Year is 1995 AND the high_temp is greater than or equal to 90. Output or display the whole selected data. Here, you don't have to assign it to any variable, but you could if you want to.
g) Select the data from clean_boston_data where the Year is 1995 OR the high_temp is greater than 89. Output or print the first 20 rows of the selected data. Here, you don't have to assign it to any variable, but you could if you want to.
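The steps in Question 1 could be sketched roughly as follows. This is a minimal illustration, not a definitive solution: the real CSV files are not available here, so two tiny in-memory CSV strings stand in for the four boston_weather files, the values are made up, and more_variables is given a single hypothetical Max24hrPrep column. The helper to_snake_case is likewise a hypothetical name. Note that after part d) the Year column is named year.

```python
import io
import re

import pandas as pd

# Hypothetical stand-ins for the assignment's files; with the real data each
# frame would come from pd.read_csv(<path to the CSV file>) instead.
parts = [
    "Year,HighTemp,meanTemp\n1930,88,50.1\n1931,91,\n",
    "Year,HighTemp,meanTemp\n1995,93,52.0\n1995,85,51.2\n",
]
boston_1, boston_2 = (pd.read_csv(io.StringIO(text)) for text in parts)
more_variables = pd.DataFrame({"Max24hrPrep": [1.2, 0.8, 2.5, 1.9]})

# a) Stack the frames vertically, then attach the extra columns side by side.
combined_boston = pd.concat([boston_1, boston_2], ignore_index=True)
boston_data = pd.concat([combined_boston, more_variables], axis=1)
print(boston_data.head(), boston_data.tail(), boston_data.shape, sep="\n")

# b) Count missing values under each column.
missing_per_column = combined_boston.isna().sum()
print(missing_per_column)

# c) Drop every row that has at least one missing value, then re-check.
clean_boston_data = boston_data.dropna()
print(clean_boston_data.isna().sum(), clean_boston_data.shape, sep="\n")


# d) Insert "_" at lowercase->uppercase/digit and digit->uppercase boundaries,
#    then lowercase the whole name.
def to_snake_case(name):
    pattern = r"(?<=[a-z])(?=[A-Z0-9])|(?<=[0-9])(?=[A-Z])"
    return re.sub(pattern, "_", name).lower()


clean_boston_data = clean_boston_data.rename(columns=to_snake_case)
print(clean_boston_data.columns)

# e) Everything except 1930, then the first 20 unique years.
excluding_1930 = clean_boston_data[clean_boston_data["year"] != 1930]
first_unique_years = excluding_1930["year"].unique()[:20]
print(first_unique_years)

# f) Both conditions must hold (&); each condition needs its own parentheses.
hot_1995 = clean_boston_data[
    (clean_boston_data["year"] == 1995) & (clean_boston_data["high_temp"] >= 90)
]
print(hot_1995)

# g) Either condition may hold (|); show at most the first 20 rows.
year_or_hot = clean_boston_data[
    (clean_boston_data["year"] == 1995) | (clean_boston_data["high_temp"] > 89)
].head(20)
print(year_or_hot)
```

The horizontal concatenation in part a) aligns rows on the index, which is why the vertical concatenation uses ignore_index=True; if the real files share a key column, a merge on that key may be the safer choice.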
Question 2
a) Read the student_data file provided into Python (take note of the file extension to use the appropriate pandas reader to read the data). Drop the first empty column in Python and assign the DataFrame to a variable student_data.
b) The student_data shows the different midterm scores of students in math, reading and science, and their favorite ice cream flavors. Select the data in the ice_cream_flavor column and convert the flavors to a numpy array, then assign it to a variable called flavor. From the student_data, select the math, reading and science scores all at once and convert the selected data to a numpy array and assign it to a variable called scores. Print the data in the flavor and scores arrays.
c) Use the scores and flavor arrays to slice out the scores where the flavor is chocolate only. The same result can be found using Pandas commands exclusively. Using the student_data data frame, find the scores where the flavor is chocolate only.
d) Use the scores and flavor arrays to slice out the scores where the flavor is chocolate OR vanilla. The same result can be found using Pandas commands exclusively. Using the student_data data frame, find the scores where the flavor is chocolate or vanilla.
e) Use the scores and flavor arrays to slice out the scores where the flavor is not chocolate (you can use the ~ sign). The same result can be found using Pandas commands exclusively. Using the student_data data frame, find the scores where the flavor is not chocolate.
f) Using the student_data data frame and Pandas commands, slice out all math and reading scores where the flavor is chocolate, then compute the mean of math and reading scores for this subset.
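Parts b) to f) of Question 2 could look something like the sketch below. The student_data file itself is not shown in the question, so the frame here is built from a few made-up rows; only the column names math, reading, science and ice_cream_flavor come from the text, and names such as score_columns and is_chocolate are hypothetical helpers.

```python
import pandas as pd

# Hypothetical rows standing in for the real student_data file.
student_data = pd.DataFrame({
    "math": [90, 72, 85, 60],
    "reading": [80, 75, 95, 65],
    "science": [88, 70, 91, 58],
    "ice_cream_flavor": ["chocolate", "vanilla", "strawberry", "chocolate"],
})
score_columns = ["math", "reading", "science"]

# b) Convert the flavor column and the three score columns to NumPy arrays.
flavor = student_data["ice_cream_flavor"].to_numpy()
scores = student_data[score_columns].to_numpy()
print(flavor, scores, sep="\n")

# c) Chocolate only: boolean-mask the rows of scores, then the pandas way.
choc_np = scores[flavor == "chocolate"]
is_chocolate = student_data["ice_cream_flavor"] == "chocolate"
choc_pd = student_data.loc[is_chocolate, score_columns]

# d) Chocolate OR vanilla: combine masks with |, or use isin() in pandas.
choc_or_van_np = scores[(flavor == "chocolate") | (flavor == "vanilla")]
choc_or_van_pd = student_data.loc[
    student_data["ice_cream_flavor"].isin(["chocolate", "vanilla"]), score_columns
]

# e) Not chocolate: negate the mask with ~.
not_choc_np = scores[~(flavor == "chocolate")]
not_choc_pd = student_data.loc[~is_chocolate, score_columns]

# f) Mean math and reading scores for the chocolate subset.
choc_means = student_data.loc[is_chocolate, ["math", "reading"]].mean()
print(choc_means)
```

The NumPy and pandas versions select the same rows; the pandas result simply keeps the column labels and the original index.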
Question 3
Imagine that you wanted to use the student_data in question 2a to make predictions such that the ice_cream_flavor, math and reading columns are input variables and science column is the output variable you want to predict.
a) Use the LabelBinarizer() in the sklearn package to transform the ice_cream_flavor column in the student_data to dummy variables, then join these dummy variables to the student_data and drop the original ice_cream_flavor column. Reassign the resulting DataFrame to a variable called student_data_1. Print out the entire student_data_1 DataFrame.
b) The Pandas get_dum
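For Question 3a, one possible shape of the LabelBinarizer step is sketched below. It assumes scikit-learn is installed and again uses made-up rows in place of the real student_data file; the variable names binarizer and dummies are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# Hypothetical rows standing in for the real student_data file.
student_data = pd.DataFrame({
    "math": [90, 72, 85, 60],
    "reading": [80, 75, 95, 65],
    "science": [88, 70, 91, 58],
    "ice_cream_flavor": ["chocolate", "vanilla", "strawberry", "chocolate"],
})

# Fit the binarizer and wrap the resulting 0/1 matrix in a DataFrame whose
# columns are the learned class labels (sorted alphabetically by sklearn).
binarizer = LabelBinarizer()
dummies = pd.DataFrame(
    binarizer.fit_transform(student_data["ice_cream_flavor"]),
    columns=binarizer.classes_,
    index=student_data.index,
)

# Join the dummy columns and drop the original text column.
student_data_1 = student_data.join(dummies).drop(columns="ice_cream_flavor")
print(student_data_1)
```

One caveat: when the column has exactly two distinct flavors, LabelBinarizer returns a single 0/1 column rather than one column per class, so the columns=binarizer.classes_ line would need adjusting in that case.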
