Question: Dataset name: "Medical Cost Personal Dataset". Please use this dataset for the questions below.

https://www.kaggle.com/datasets/mirichoi0218/insurance

1. Dataset and task description

Describe the data analytics task for the chosen dataset. How is the dataset used to solve a business problem? Show the dataset characteristics, namely the number of instances, the number of numerical and categorical variables, and the description and the domain for each.

2. Descriptive statistics

a. Categorical variables

2.a.1 Consider one categorical variable and create a frequency table including frequency, relative frequency, cumulative frequency and relative cumulative frequency. [5pts]

  • Provide a summary interpretation of the values in the frequency table. Which categories are the most and least frequent, and what are their percentages?
  • Report any invalid values or errors in the data and explain how to correct them.
  • Correct the errors in the data and show the frequency table for each categorical variable after the corrections.
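The four columns asked for in 2.a.1 can also be computed outside a spreadsheet. The following is a minimal Python sketch, not the required Excel workflow. The function name `frequency_table` and the sample list are hypothetical, and the sample values are made up for illustration rather than taken from the dataset.

```python
from collections import Counter

def frequency_table(values):
    """Return rows of (category, freq, rel_freq, cum_freq, rel_cum_freq)."""
    counts = Counter(values)
    total = sum(counts.values())
    rows = []
    cumulative = 0
    # Sort by descending frequency; ties are broken alphabetically.
    for category, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        cumulative += freq
        rows.append((category, freq, freq / total, cumulative, cumulative / total))
    return rows

# Made-up sample for illustration only (not real dataset rows).
sample_regions = ["southeast", "southwest", "southeast", "northeast",
                  "northwest", "southeast", "southwest", "northeast"]

for row in frequency_table(sample_regions):
    print("{:<10} {:>3} {:>6.1%} {:>3} {:>6.1%}".format(*row))
```

The first row is the most frequent category and the last row is the least frequent one. The last cumulative frequency should equal the number of instances, which is a quick check for dropped or miscounted values.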

2.a.2 Create a pivot table with one categorical variable in the ROWS, one numerical variable in the COLUMNS (apply grouping), and the average of the target variable in the VALUES. Grouping does not work when a column contains missing values, which are represented with '?'. Replace those missing values with the mean, save the worksheet, and recreate the pivot table. Show the pivot chart and interpret it. [10pts]
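The two steps in 2.a.2 (mean imputation of '?' values, then a grouped pivot of averages) can be sketched in Python as follows. This is an illustration under assumptions: the helper names `impute_mean` and `pivot_average`, the bin width of 10, and the sample rows are all hypothetical, and the values are made up rather than taken from the dataset.

```python
from collections import defaultdict
from statistics import mean

def impute_mean(raw_values, missing="?"):
    """Replace the missing marker with the mean of the valid numeric values."""
    valid = [float(v) for v in raw_values if v != missing]
    fill = mean(valid)
    return [fill if v == missing else float(v) for v in raw_values]

def pivot_average(categories, numbers, targets, bin_width):
    """Average target per (category, numeric bin), like a grouped pivot table."""
    cells = defaultdict(list)
    for cat, num, target in zip(categories, numbers, targets):
        lower = int(num // bin_width) * bin_width  # left edge of the bin
        cells[(cat, (lower, lower + bin_width))].append(target)
    return {key: mean(vals) for key, vals in cells.items()}

# Made-up rows for illustration only (not real dataset rows).
smoker = ["yes", "no", "no", "yes", "no"]
age_raw = ["19", "?", "33", "45", "28"]
charges = [1000.0, 2000.0, 3000.0, 4000.0, 5000.0]

age = impute_mean(age_raw)  # the '?' becomes the mean of 19, 33, 45 and 28
table = pivot_average(smoker, age, charges, bin_width=10)
for (cat, (lo, hi)), avg in sorted(table.items()):
    print(f"{cat:<4} {lo}-{hi}: {avg:.2f}")
```

Imputing with the mean keeps the column numeric so that grouping works, but it also pulls the imputed rows toward the center bin, which is worth mentioning when interpreting the pivot chart.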

b. Numerical variables

2.b.1 Provide a table with the list of all numerical variables and their min, max, average, Q1, Q2, Q3 and Q4. Provide an interpretation of the quartiles.
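A summary row for one numerical variable might be computed as in the Python sketch below. It assumes that Q4 means the maximum (the question does not define it) and uses the "inclusive" quartile method, which interpolates between data points and treats the minimum and maximum as the 0th and 100th percentiles. The function name `summarize` and the sample values are hypothetical and made up for illustration.

```python
from statistics import mean, quantiles

def summarize(values):
    """Min, quartiles, max and mean for one numerical variable."""
    # method="inclusive" treats min and max as the 0th and 100th percentiles.
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "Q1": q1,
        "Q2": q2,            # the median
        "Q3": q3,
        "Q4": max(values),   # assumption: Q4 is read as the maximum
        "mean": mean(values),
    }

# Made-up sample for illustration only (not real dataset rows).
sample_bmi = [18.0, 22.0, 25.0, 27.0, 30.0, 33.0, 40.0]

summary = summarize(sample_bmi)
for name, value in summary.items():
    print(f"{name:<4} {value:.2f}")
```

Reading the quartiles: about a quarter of the values lie at or below Q1, half at or below Q2, and three quarters at or below Q3, so the middle half of the data sits between Q1 and Q3.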

2.b.2 Create a histogram for one numerical variable. What is the shape of the distribution? Interpret it. If it is right or left skewed, write a meaningful sentence explaining the range in which most of the values fall. If it is a normal distribution, use the empirical rule to report the percentage of data points and the associated range of values.
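The histogram bins and the empirical-rule ranges can be sketched in Python as follows. This is an illustration only: the helper names, the bin width, and the sample values are hypothetical and made up, and comparing the mean with the median is a rough heuristic for skew rather than a formal test.

```python
from collections import Counter
from statistics import mean, median, stdev

def histogram_counts(values, bin_width):
    """Count values per bin, keyed by the left edge of each bin."""
    counts = Counter(int(v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

def empirical_ranges(values):
    """Ranges mean +/- k sample standard deviations for k = 1, 2, 3."""
    mu, sigma = mean(values), stdev(values)
    return {k: (mu - k * sigma, mu + k * sigma) for k in (1, 2, 3)}

# Made-up sample for illustration only (not real dataset rows).
sample_charges = [1200, 1800, 2500, 3100, 3900, 4700, 9800, 21000, 38000]

bins = histogram_counts(sample_charges, bin_width=10000)
for left_edge, count in bins.items():
    print(f"{left_edge:>6}-{left_edge + 10000:<6} {'#' * count}")

# A mean well above the median hints at a right-skewed distribution.
print("mean > median:", mean(sample_charges) > median(sample_charges))
```

For a roughly normal variable, the empirical rule says about 68%, 95% and 99.7% of the values fall within one, two and three standard deviations of the mean, which are the ranges returned by `empirical_ranges`. For a skewed variable such as the made-up one above, the bin with the most values is the more useful thing to report.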

2.b.3 Create a box plot for two numerical variables. You may try many numerical variables and keep only the ones that exhibit outliers. Interpret the results: is the variability of the distribution high or low, and are there outliers?
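Box plots conventionally flag points beyond 1.5 times the interquartile range (IQR) from the box as outliers. The Python sketch below applies that rule. It is an assumption that the charting tool in use follows the same convention, since tools can differ in how they compute quartiles. The function name `box_plot_stats` and the sample values are hypothetical and made up for illustration.

```python
from statistics import quantiles

def box_plot_stats(values):
    """Quartiles, IQR, 1.5*IQR fences and the points outside the fences."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {"Q1": q1, "median": q2, "Q3": q3, "IQR": iqr,
            "lower_fence": lower_fence, "upper_fence": upper_fence,
            "outliers": outliers}

# Made-up sample for illustration only (not real dataset rows).
sample_bmi = [18.0, 22.0, 25.0, 27.0, 30.0, 33.0, 53.0]

stats = box_plot_stats(sample_bmi)
for name, value in stats.items():
    print(f"{name:<12} {value}")
```

A wide IQR relative to the overall range indicates high variability in the middle half of the data, and any values listed under `outliers` are the points a box plot would draw beyond the whiskers.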

Please share the Excel file as well, or share a link to it via Google.
