
here is the dataset
https://www.kaggle.com/rush4ratio/video-game-sales-with-ratings
1.c Data Sanity Checks

1.c.1) It is important to check whether there are any internal inconsistencies within the dataset. One natural question to ask would be: are "Global_Sales" consistent with the regional sales? That is, is the sum of "NA_Sales", "EU_Sales", "JP_Sales", and "Other_Sales" equal to "Global_Sales" for all entries? Examine this problem by:

1. Creating a new column in df named "Total_Sales" which contains the sum of the columns "NA_Sales", "EU_Sales", "JP_Sales", and "Other_Sales".
2. Calculating the absolute difference between "Total_Sales" and "Global_Sales" for each entry, and reporting the largest value of the absolute difference. Store the maximal deviation in a new variable named maxdeviation.

Is maxdeviation 0? If not, what are the possible reasons? Is the dataset still acceptable despite nonzero deviations? (You don't need to write any answers.)

In [14]:
    ## Your code here
    maxdeviation = ...
    print("The max deviation between \"Total_Sales\" and \"Global_Sales\" is", maxdeviation)

In [ ]:
    grader.check("q1c1")

1.c.2) Recall that we have removed all duplicated entries from the dataframe, but we still want to make sure there are no subtle web scraping issues, such as misspellings, that prevent redundant entries from being removed. Does each entry represent one unique game? This question can be divided into two parts.

1. How many entries (rows) are there in the dataframe now? Store the answer (an integer) in a variable named len_total.
2. How many distinct game names (in the column "Name") are there in the dataset? Store the integer result in a variable named len_name_unique.

Each entry represents one unique game if and only if the two numbers are equal.

In [17]:
    ## Your code here
    len_total = ...
    len_name_unique = ...
    print("The number of non-duplicative entries is", len_total)
    print("The number of distinct game names is", len_name_unique)
    print("The two numbers are {0}".format("equal" if len_total == len_name_unique else "not equal"))

In [ ]:
    grader.check("q1c2")

1.c.3) To take a deeper look into the structure of the dataset,

1. Create a subset of the DataFrame containing only entries whose game names appear more than once among all entries.
2. Sort the new DataFrame by Name alphabetically in ascending order.

Hint: pandas.DataFrame.groupby and pandas.core.groupby.DataFrameGroupBy.filter may be useful for the task in 1. For concrete illustrations and usages, see the Data 100 Lecture.

Store the result into df_name_multi_sorted. This practice is intended to address why there are duplicated game names.

In [20]:
    ## Your code here
    df_name_multi_sorted.head(5)

In [ ]:
    grader.check("q1c3")

Important: Before proceeding to the following sections, please make sure you have passed the tests for problems in 1.b. This will ensure df is ready for the following analyses.

In [4]:
    ## Load the required modules
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt

The dataset for this homework is based on https://www.kaggle.com/rush4ratio/video-game-sales-with-ratings. Please read the Kaggle page for the complete description of the dataset. We've replaced the column name "Platform" with "Console" to avoid a conflict due to dummy variable generation (see 3.c).

We start by loading the dataset with pandas.

In [5]:
    ## No need for modification, just run this cell
    df = pd.read_csv("HW1_dataset.csv")
    df.head(5)

Out [5]:
       Name                      Console  Year_of_Release  Genre         Publisher  NA_Sales  EU_Sales  JP_Sales  Other
    0  Wii Sports                Wii      2006.0           Sports        Nintendo   41.36     28.96     3.77      ...
    1  Super Mario Bros.         NES      1985.0           Platform      Nintendo   29.08     3.58      6.81      ...
    2  Mario Kart Wii            Wii      2008.0           Racing        Nintendo   15.68     12.76     3.79      ...
    3  Wii Sports Resort         Wii      2009.0           Sports        Nintendo   15.61     10.93     3.28      ...
    4  Pokemon Red/Pokemon Blue  GB       1996.0           Role-Playing  Nintendo   11.27     8.89      10.22     ...
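One possible way to approach 1.c.1 through 1.c.3 is sketched below. Because HW1_dataset.csv is not available here, the sketch runs on a small made-up DataFrame (the games "Game A" to "Game C" and all their numbers are hypothetical, chosen only for illustration), using the column names given in the question. It is a sketch under those assumptions, not the graded reference solution.

```python
import pandas as pd

# Toy stand-in for the homework's df. These rows are invented for
# illustration; in the notebook, df comes from pd.read_csv("HW1_dataset.csv").
df = pd.DataFrame({
    "Name": ["Game A", "Game A", "Game B", "Game C"],
    "Console": ["PS3", "X360", "Wii", "DS"],
    "NA_Sales": [1.00, 0.50, 2.00, 0.10],
    "EU_Sales": [0.50, 0.25, 1.00, 0.05],
    "JP_Sales": [0.10, 0.00, 0.50, 0.02],
    "Other_Sales": [0.05, 0.05, 0.25, 0.01],
    "Global_Sales": [1.65, 0.80, 3.76, 0.18],
})

# 1.c.1: row-wise sum of the regional columns, then the largest
# absolute gap between that sum and the reported global figure.
region_cols = ["NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"]
df["Total_Sales"] = df[region_cols].sum(axis=1)
maxdeviation = (df["Total_Sales"] - df["Global_Sales"]).abs().max()

# 1.c.2: number of rows versus number of distinct names.
# Series.nunique() ignores missing values by default.
len_total = len(df)
len_name_unique = df["Name"].nunique()

# 1.c.3: keep only the groups (by Name) that have more than one row,
# then sort alphabetically by Name in ascending order.
df_name_multi = df.groupby("Name").filter(lambda g: len(g) > 1)
df_name_multi_sorted = df_name_multi.sort_values("Name")

print("The max deviation is", maxdeviation)
print("Rows:", len_total, "Distinct names:", len_name_unique)
print(df_name_multi_sorted[["Name", "Console"]])
```

In the notebook itself, the same three steps would be applied to the df loaded from HW1_dataset.csv instead of the toy frame. If maxdeviation comes out as a small nonzero number on the real data, one plausible explanation (an assumption, not verified here) is that each sales column was rounded independently, and floating-point arithmetic adds its own tiny error, so comparing against a tolerance is safer than testing for exact equality. Inspecting the "Console" column of df_name_multi_sorted is one way to see whether repeated names correspond to the same title released on several consoles.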
