Individual Assignment 2
Linear Regression
Weightage: 20% of the final grade
Submission date/time: Sunday, February 25, 2023 @ 11:59 PM (PST)
Instructions for writing the report and submission:
- Do not copy questions in your report.
- Do NOT copy the dataset in the report.
- Follow APA style.
- Submit your report as a PDF file containing all the solutions, explanations and figures
requested. Save your file as FirstName-LastName.pdf
1.0 Introduction
An insurance company wants to revise its yearly premium plans and is keen to leverage its
historical data. Over the years, it has meticulously collected information about its customers,
capturing details such as age, gender, BMI, number of dependents, smoking habits, and yearly
medical charges. With this information, the company aims to accurately predict the yearly
charges for potential customers. Doing so would allow it to offer competitive, tailored
insurance plans and to fine-tune the pricing of its yearly premiums for maximum efficiency.
The company wants to understand the factors affecting the yearly medical charges of its
beneficiaries. In particular, it wants to know:
- Which variables are significant in predicting customers' yearly medical charges?
- How well do those variables describe customers' yearly medical charges?
Based on its beneficiaries' historical data, the company has gathered a large dataset of
different factors contributing to medical charges.
Business Goal
You are required to model yearly medical charges using the available independent variables.
Management will use the model to understand how medical charges vary with the independent
variables, so that it can adjust its plans, pricing strategy, business strategy, etc. to meet
certain benefit levels. The model will also give the company a way to predict yearly medical
charges for potential customers.
2.0 Dataset
The dataset contains information on 1338 customers. The data dictionary below gives a short
description and the data type of each feature.
No.  Feature   Description                                           Data type
1    age       Age of primary beneficiary                            Integer
2    sex       Insurance contractor gender                           Categorical
3    bmi       Body mass index                                       Numerical
4    children  Number of children covered by health insurance        Integer
5    smoker    Smoking status of beneficiary                         Categorical
6    region    The beneficiary's residential area in the US          Categorical
7    charges   Individual medical costs billed by health insurance   Numerical (dependent variable)
3.0 Required Analysis
Using descriptive analytics and regression tools, complete the following steps and report on
them.
1. Fix invalid values: The sex column contains inconsistent and misspelled values:
(Female, femal); (Male, Man). Fix these inconsistencies and misspellings, and state the formula
or approach you used to fix them.
Hint: You may need to use the IF function to automate the misspelling replacement.
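The replacement described in this step can be sketched in Python as a lookup from each known variant to one canonical label. This is only an illustration: the function name `clean_sex` and the sample list are hypothetical, and the only variants handled are the ones named in the task (femal, Man); the real column may contain others.

```python
# Map each known variant (trimmed and lower-cased) to its canonical label.
# Only the variants named in the task are covered; this is an assumption
# about what the real column contains.
SEX_FIXES = {
    "female": "Female",
    "femal": "Female",
    "male": "Male",
    "man": "Male",
}


def clean_sex(value: str) -> str:
    """Return the canonical label, or the trimmed input if it is not recognised."""
    key = value.strip().lower()
    return SEX_FIXES.get(key, value.strip())


# Made-up sample values for illustration only.
raw = ["Female", "femal", "Male", "Man", " male "]
cleaned = [clean_sex(v) for v in raw]
print(cleaned)
```

Unrecognised values are passed through unchanged rather than guessed, so they remain visible for manual review. In Excel, the same idea could be written as a nested IF, for example =IF(B2="femal","Female",IF(B2="Man","Male",B2)), assuming the sex value sits in cell B2.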
2. Using descriptive analytics, complete the following table:
Statistic            Age    BMI    Children    Charges
count
mean
standard deviation
min
25th percentile
50th percentile
75th percentile
max
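As a rough illustration of how the eight statistics in this table can be computed, the sketch below uses only Python's standard library. The helper name `describe` and the sample ages are made up. It assumes the sample (n - 1) standard deviation and linearly interpolated "inclusive" quartiles, which are expected to correspond to Excel's STDEV.S and QUARTILE.INC; a different convention would give slightly different numbers.

```python
import statistics


def describe(values):
    """Summarise one numeric column with the eight statistics in the table."""
    # "inclusive" treats the data min and max as the 0th and 100th percentiles.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "standard deviation": statistics.stdev(values),  # sample (n - 1) form
        "min": min(values),
        "25th percentile": q1,
        "50th percentile": q2,
        "75th percentile": q3,
        "max": max(values),
    }


# Made-up ages for illustration only, not taken from the dataset.
ages = [19, 23, 31, 40, 52, 64]
summary = describe(ages)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Calling `describe` once per column (age, bmi, children, charges) would fill in the table column by column.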
1. Create four new features (state the formulas you used to create them):
a) bmi^2: defined as bmi multiplied by bmi
b) bmi-log: defined as the logarithm of bmi (you can use the LOG function in Excel)
c) BMI-range: defined according to the following table
BMI          BMI-range
<18.5        Underweight
18.5-24.9    Normal Weight
25-29.9      Overweight
>=30         Obese
d) Age-range: defined according to the following table
Age Age-range
18-35 Young
36-59 Middle Age
60+ Senior
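The four features can be sketched in Python as follows. Several choices here are assumptions, not part of the task: the function names are hypothetical; a BMI that falls in a gap between table rows (for example 24.95) is placed in the lower category; ages below 18 are treated as Young; and the logarithm is base 10, because Excel's LOG function defaults to base 10 when no base is given (use `math.log` for a natural logarithm instead).

```python
import math


def bmi_range(bmi: float) -> str:
    """Bucket a BMI value using the thresholds from the BMI-range table."""
    if bmi < 18.5:
        return "Underweight"
    if bmi < 25:  # assumption: 24.9 < bmi < 25 counts as Normal Weight
        return "Normal Weight"
    if bmi < 30:  # assumption: 29.9 < bmi < 30 counts as Overweight
        return "Overweight"
    return "Obese"


def age_range(age: int) -> str:
    """Bucket an age using the thresholds from the Age-range table."""
    if age <= 35:  # assumption: ages below 18 would also land here
        return "Young"
    if age <= 59:
        return "Middle Age"
    return "Senior"


def add_features(row: dict) -> dict:
    """Return a copy of the row with the four new features added."""
    bmi = row["bmi"]
    return {
        **row,
        "bmi^2": bmi * bmi,
        "bmi-log": math.log10(bmi),  # base 10, matching Excel's default LOG
        "BMI-range": bmi_range(bmi),
        "Age-range": age_range(row["age"]),
    }


# Made-up row for illustration only.
example = add_features({"age": 36, "bmi": 24.95})
print(example)
```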
2. Plot the counts of beneficiaries (customers) based on each of the following categories (Also
provide results in tables or label the graphs):
(a) Age-range
(b) BMI-range
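A minimal sketch of the counting behind these plots, using `collections.Counter` from the standard library. The labels below are made up; in practice they would come from the Age-range (or BMI-range) column created earlier. The text bars are only a stand-in for a proper chart.

```python
from collections import Counter

# Made-up labels for illustration only, not taken from the dataset.
age_ranges = ["Young", "Senior", "Young", "Middle Age", "Young", "Middle Age"]

counts = Counter(age_ranges)

# Print a small table in the assignment's category order, with a text bar per row.
for label in ["Young", "Middle Age", "Senior"]:
    print(f"{label:<12}{counts[label]:>4}  {'#' * counts[label]}")
```

The same counts can be fed to any charting tool; listing the categories explicitly keeps them in a meaningful order rather than in order of first appearance.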
3. Plot the average of charges based on (Also provide results in tables or label the graphs):
(a) smoker
(b) sex
(c) BMI-range
(d) Age-range
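The group averages requested here amount to a "sum and count per category" pass over the rows. The sketch below shows one way to do that in plain Python; the helper `mean_by_group`, the sample rows, and the "yes"/"no" smoker labels are all assumptions made for illustration.

```python
from collections import defaultdict


def mean_by_group(rows, group_key, value_key="charges"):
    """Average `value_key` over the rows sharing each value of `group_key`."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[group_key]] += row[value_key]
        counts[row[group_key]] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Made-up rows for illustration only, not taken from the dataset.
rows = [
    {"smoker": "yes", "sex": "Male", "charges": 30000.0},
    {"smoker": "no", "sex": "Female", "charges": 5000.0},
    {"smoker": "no", "sex": "Male", "charges": 7000.0},
    {"smoker": "yes", "sex": "Female", "charges": 26000.0},
]

by_smoker = mean_by_group(rows, "smoker")
by_sex = mean_by_group(rows, "sex")
print(by_smoker)
print(by_sex)
```

Passing "BMI-range" or "Age-range" as `group_key` would give the remaining two averages once those columns exist.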
4. Encode the categorical columns: smoker, sex. Explain the type of encoding you applied for
each of them and why.
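One possible encoding, sketched under the assumption that both columns have exactly two levels after cleaning (labels "yes"/"no" and "Male"/"Female" are assumed here): a single 0/1 indicator per column. For a two-level variable this is equivalent to one-hot encoding with one of the two generated columns dropped. The helper name `encode_binary` is hypothetical.

```python
def encode_binary(values, positive):
    """Return 1 where the value equals `positive`, otherwise 0."""
    return [1 if value == positive else 0 for value in values]


# Made-up values for illustration only.
smoker = ["yes", "no", "no", "yes"]
sex = ["Male", "Female", "Female", "Male"]

smoker_yes = encode_binary(smoker, "yes")  # 1 = smoker, 0 = non-smoker
sex_male = encode_binary(sex, "Male")      # 1 = Male, 0 = Female
print(smoker_yes, sex_male)
```

Which level is coded as 1 is arbitrary; it only flips the sign of the corresponding regression coefficient, so the choice should be stated in the report.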
5. Exclude the BMI-range, Age-range and region columns, then fit a linear regression model to
the rest of the dataset, with charges as the target variable and the remaining columns as
features (independent variables). Discuss the obtained results in detail.
Important: make sure you apply the regression to the dataset that you modified in previous
steps.
Note: If you used one-hot encoding, you need to exclude one of the generated columns in
each category.
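To make the regression step concrete, here is a small ordinary least squares sketch in plain Python that solves the normal equations and reports R-squared and adjusted R-squared. It is an illustration under stated assumptions, not a replacement for a statistics package: the names `solve` and `fit_ols` are hypothetical, the data are made up and noise-free (so the fit is exact), and no standard errors or p-values are computed.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_ols(features, target):
    """Return (coefficients, r2, adjusted_r2); coefficients[0] is the intercept."""
    X = [[1.0] + list(row) for row in features]  # prepend the intercept column
    n, k = len(X), len(X[0])
    xtx = [[sum(r[a] * r[b] for r in X) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * y for r, y in zip(X, target)) for a in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * x for b, x in zip(beta, r)) for r in X]
    mean_y = sum(target) / n
    ss_res = sum((y - f) ** 2 for y, f in zip(target, fitted))
    ss_tot = sum((y - mean_y) ** 2 for y in target)
    r2 = 1 - ss_res / ss_tot
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k)  # k counts the intercept
    return beta, r2, adj_r2


# Made-up, noise-free rows of (age, smoker flag), for illustration only.
features = [(20, 0), (30, 1), (40, 0), (50, 1), (60, 0)]
charges = [5000.0, 27000.0, 9000.0, 31000.0, 13000.0]

beta, r2, adj_r2 = fit_ols(features, charges)
print(f"charges = {beta[0]:.2f} + {beta[1]:.2f}*age + {beta[2]:.2f}*smoker")
print(f"R-squared = {r2:.4f}, adjusted R-squared = {adj_r2:.4f}")
```

The sketch also shows why one one-hot column per category must be excluded: if every generated column is kept alongside the intercept, the columns are perfectly collinear, the normal equations become singular, and `solve` would fail with a division by zero.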
6. What is the regression function obtained from the previous step?
7. If we use the backward elimination technique to drop one feature from the model obtained
in the previous step, which feature is the best choice to drop? Obtain the new model after
dropping that feature and compare its R-squared and adjusted R-squared with those of the
previous model.
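One backward-elimination step can be sketched as follows. Instead of reading p-values from a regression report, this sketch refits the model without each feature in turn and drops the one whose removal lowers R-squared the least. For a single dropped feature this selects the same feature as the "highest p-value" rule, because the partial F statistic for one coefficient equals its squared t statistic. All names (`solve`, `fit_scores`, `backward_step`) and all data values are made up for illustration.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_scores(rows, target):
    """Fit OLS with an intercept and return (r2, adjusted_r2)."""
    X = [[1.0] + list(row) for row in rows]
    n, k = len(X), len(X[0])
    xtx = [[sum(r[a] * r[b] for r in X) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * y for r, y in zip(X, target)) for a in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * x for b, x in zip(beta, r)) for r in X]
    mean_y = sum(target) / n
    ss_res = sum((y - f) ** 2 for y, f in zip(target, fitted))
    ss_tot = sum((y - mean_y) ** 2 for y in target)
    r2 = 1 - ss_res / ss_tot
    return r2, 1 - (1 - r2) * (n - 1) / (n - k)


def backward_step(columns, target):
    """Return (dropped feature, full-model scores, reduced-model scores)."""
    names = list(columns)

    def fit(keep):
        return fit_scores(list(zip(*(columns[name] for name in keep))), target)

    full = fit(names)
    # Refit without each feature; the best reduced model loses the least R-squared.
    reduced = {name: fit([other for other in names if other != name]) for name in names}
    dropped = max(reduced, key=lambda name: reduced[name][0])
    return dropped, full, reduced[dropped]


# Made-up data: charges depend on age and smoker plus small noise; children is irrelevant.
columns = {
    "age": [20, 25, 30, 35, 40, 45, 50, 60],
    "children": [0, 2, 1, 3, 0, 1, 2, 0],
    "smoker": [0, 1, 0, 0, 1, 0, 1, 0],
}
charges = [5040.0, 25970.0, 7010.0, 7950.0, 29020.0, 9990.0, 31030.0, 12990.0]

dropped, (full_r2, full_adj), (red_r2, red_adj) = backward_step(columns, charges)
print(f"drop: {dropped}")
print(f"full model:    R-squared = {full_r2:.6f}, adjusted = {full_adj:.6f}")
print(f"reduced model: R-squared = {red_r2:.6f}, adjusted = {red_adj:.6f}")
```

R-squared can never rise when a feature is removed, so the comparison that matters is adjusted R-squared: if it rises or stays roughly unchanged after the drop, the removed feature was contributing little beyond its cost in degrees of freedom.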
8. The Company wants to predi
