Question:
Dataset name: "Medical Cost Personal Dataset". Please use this dataset for the questions below:
https://www.kaggle.com/datasets/mirichoi0218/insurance
1. Dataset and task description
Describe the data analytics task for the chosen dataset. How is the dataset used to solve a business problem? Show the dataset characteristics, namely the number of instances, the number of numerical and categorical variables, and the description and domain of each variable.
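For anyone checking the dataset characteristics outside Excel, here is a minimal sketch of counting instances and splitting columns into numerical and categorical. The header row follows the column names of the Kaggle file; the three data rows are made-up placeholders, and `describe` and `is_number` are hypothetical helper names.

```python
import csv
import io

# Header matches the Kaggle file; the data rows are invented toy values.
SAMPLE = """age,sex,bmi,children,smoker,region,charges
25,female,24.0,0,no,southwest,3000.0
40,male,31.5,2,yes,southeast,20000.0
33,male,27.2,1,no,northwest,5000.0
"""

def is_number(text):
    """True if the cell parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False

def describe(csv_text):
    """Return (row count, numerical columns, categorical columns)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = list(rows[0].keys())
    # A column counts as numerical only if every cell parses as a number.
    numeric = [c for c in columns if all(is_number(r[c]) for r in rows)]
    categorical = [c for c in columns if c not in numeric]
    return len(rows), numeric, categorical

n_rows, numeric_cols, categorical_cols = describe(SAMPLE)
print(n_rows, numeric_cols, categorical_cols)
```

To run it on the real data, the contents of the downloaded CSV would be passed to `describe` in place of `SAMPLE`.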
2. Descriptive statistics
a. Categorical variables
2.a.1 Consider one categorical variable and create a frequency table including frequency, relative frequency, cumulative frequency and relative cumulative frequency. [5pts]
- Provide a summative interpretation of the values in the frequency table. What is the most frequent category and the least frequent ones with their percentages?
- Report any invalid values or errors in the data and explain how to correct them.
- Correct the errors in the data and show the frequency table for each categorical variable after the corrections.
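The four columns requested in 2.a.1 can be sketched as follows. This is only an illustration: the `regions` list holds made-up values standing in for one categorical column, and `frequency_table` is a hypothetical helper, not part of the assignment.

```python
from collections import Counter

# Made-up toy values standing in for a categorical column such as `region`.
regions = ["southeast", "southwest", "southeast", "northwest",
           "northeast", "southeast", "southwest", "northwest"]

def frequency_table(values):
    """Return rows of (category, freq, rel_freq, cum_freq, rel_cum_freq),
    ordered from most to least frequent."""
    counts = Counter(values)
    total = sum(counts.values())
    rows, running = [], 0
    for category, freq in counts.most_common():
        running += freq  # cumulative frequency so far
        rows.append((category, freq, freq / total, running, running / total))
    return rows

table = frequency_table(regions)
for category, freq, rel, cum, rel_cum in table:
    print(f"{category:<10} {freq:>3} {rel:>6.1%} {cum:>3} {rel_cum:>6.1%}")
```

The first row gives the most frequent category and the last row the least frequent one; the last cumulative value should equal the number of instances.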
2.a.2 Create a pivot table with one categorical variable in the ROWS, one numerical variable in the COLUMNS (apply grouping), and the average of the target variable in the VALUES. Grouping does not work when a column contains a missing value, which is represented with '?'. You will have to replace those missing values with the mean, save the worksheet and recreate the pivot table. Show the pivot chart and interpret it. [10pts]
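The same three steps (replace '?' with the mean, group the numerical variable, average the target) can be sketched outside Excel. The rows are invented toy values, the 20-year bin width is an arbitrary assumption, and `age_bin` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean

# Invented toy rows: (smoker, age, charges); '?' marks a missing age.
rows = [
    ("yes", 19, 17000),
    ("no", 18, 1700),
    ("no", "?", 4400),
    ("no", 33, 22000),
    ("yes", 62, 28000),
    ("no", 46, 8200),
]

# Step 1: replace '?' with the mean of the known ages.
known_ages = [age for _, age, _ in rows if age != "?"]
mean_age = mean(known_ages)
clean = [(s, mean_age if a == "?" else a, c) for s, a, c in rows]

# Step 2: group ages into fixed-width bins (width is an assumption).
def age_bin(age, width=20):
    low = int(age // width) * width
    return f"{low}-{low + width - 1}"

# Step 3: average the target for each (row category, column bin) cell.
cells = defaultdict(list)
for smoker, age, charges in clean:
    cells[(smoker, age_bin(age))].append(charges)
pivot = {key: mean(vals) for key, vals in cells.items()}

for (smoker, bin_label), avg in sorted(pivot.items()):
    print(f"smoker={smoker:<3} age {bin_label:<6} avg charges {avg:,.0f}")
```

Imputing before binning matters: the row with the missing age falls into the bin that contains the mean, so it shifts that bin's average.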
b. Numerical variables
2.b.1 Provide a table listing all numerical variables with their min, max, average, Q1, Q2, Q3 and Q4. Provide an interpretation of the quartiles.
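A minimal sketch of the summary row for one numerical variable follows. The values are made up, `summary` is a hypothetical helper, and Q4 is assumed here to mean the maximum (the fourth quartile boundary). The `inclusive` method is chosen so that the quartile positions include the minimum and maximum of the data.

```python
from statistics import mean, quantiles

# Made-up toy values standing in for a numerical column such as `bmi`.
bmi = [18.5, 22.0, 24.5, 27.0, 30.0, 31.5, 33.0, 36.0, 41.0]

def summary(values):
    """Return min, max, average and quartiles for one variable.
    Q4 is taken to be the maximum (an assumption about the task wording)."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "max": max(values),
        "average": mean(values),
        "Q1": q1,
        "Q2": q2,  # the median
        "Q3": q3,
        "Q4": max(values),
    }

s = summary(bmi)
for name, value in s.items():
    print(f"{name:<8}{value:.2f}")
```

Read this way, 25% of the values lie at or below Q1, 50% at or below Q2, and 75% at or below Q3.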
2.b.2. Create a histogram for one numerical variable. What is the shape of the distribution? Interpret it. If it is right or left skewed, relate this to a meaningful sentence explaining what range most of the values fall in. If it is a normal distribution, use the empirical rule to report the percentage of data points and the associated range of values.
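One way to check the shape numerically is to count values per bin and compute a skewness coefficient. The sketch below uses made-up values, an assumed bin width of 10,000, and the population moment formula for skewness; `histogram` and `skewness` are hypothetical helpers.

```python
from statistics import mean, pstdev

# Made-up toy values standing in for a numerical column such as `charges`.
charges = [1200, 1800, 2500, 3100, 4000, 5200, 7500, 12000, 24000, 41000]

def histogram(values, width):
    """Count values per bin; keys are the lower edge of each bin."""
    counts = {}
    for v in values:
        low = int(v // width) * width
        counts[low] = counts.get(low, 0) + 1
    return dict(sorted(counts.items()))

def skewness(values):
    """Population skewness: mean of cubed z-scores.
    Positive suggests a right tail, negative a left tail."""
    m, s = mean(values), pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

hist = histogram(charges, 10000)
for low, count in hist.items():
    print(f"{low:>6}-{low + 9999:<6} {'#' * count}")
print("skewness:", round(skewness(charges), 2))
```

With these toy values most observations sit in the lowest bin and a few large ones stretch the right tail, which is the pattern a right-skewed histogram would show.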
2.b.3. Create a box plot for two numerical variables. You may try many numerical variables and keep only the ones that exhibit outliers. Interpret the results: is the variability of the distribution high or low, and are there outliers?
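Box plots commonly flag points beyond 1.5 times the interquartile range from the box as outliers. A minimal sketch of that rule, with made-up values and a hypothetical helper `iqr_outliers` (the exact quartile method used by a given charting tool may differ):

```python
from statistics import quantiles

# Made-up toy values standing in for a numerical column such as `charges`.
charges = [1200, 1800, 2500, 3100, 4000, 5200, 7500, 12000, 24000, 41000]

def iqr_outliers(values):
    """Return (outliers, (lower fence, upper fence)) using the 1.5 * IQR rule."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1  # width of the box; larger means more variability
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high], (low, high)

outliers, fences = iqr_outliers(charges)
print("fences:", fences)
print("outliers:", outliers)
```

A wide box relative to the data range indicates high variability, and any values listed as outliers would appear as individual points beyond the whiskers.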
Please share the Excel file as well, or share a Google link to it.
