Question: Problem 3: Heart Disease Dataset
In Problem 3, you will be working with the Statlog Heart Disease Dataset. This dataset contains medical information about
individuals, including a column that indicates whether or not the individual has heart disease. You can find more about
this dataset, including descriptions of its columns, here: Statlog Heart Dataset.
Load the data stored in the tab-delimited file heartdisease.txt into a DataFrame named hd. Use head() to display
the first rows of this DataFrame.
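A minimal sketch of the loading step. Because heartdisease.txt is not available here, the snippet first writes a tiny made-up tab-delimited file to a temporary directory; its column names and values are hypothetical stand-ins, not the real data. The number of rows to display is missing from this copy of the problem, so head() is called with its default.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for heartdisease.txt: made-up columns and rows.
sample = (
    "age\tchol\theartdisease\n"
    "63\t233\t1\n"
    "67\t286\t2\n"
    "41\t204\t1\n"
)
path = os.path.join(tempfile.mkdtemp(), "heartdisease.txt")
with open(path, "w") as f:
    f.write(sample)

# sep="\t" tells read_csv that the fields are separated by tabs.
hd = pd.read_csv(path, sep="\t")

# head() shows up to 5 rows by default; pass a count to show a different number.
print(hd.head())
```

With the real file, only the read_csv and head() calls would be needed.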
Our goal in this problem will be to create a logistic regression model to predict the label heartdisease using the other
columns as features. Note that one value in the heartdisease column indicates an absence of heart disease, while the
other value indicates the presence of heart disease.
Perform the following steps in a single code cell:
1. Create a 2D feature array named X containing the relevant features, as well as a 1D label array named y
   containing the labels. Note: These should be arrays, and not DataFrames or Series.
2. Use train_test_split to split the data into training and testing sets using an split. Name the
   resulting arrays Xtrain, Xtest, ytrain, and ytest. Set random_state. Use stratified
   sampling.
3. Print the shapes of Xtrain and Xtest. Include text labeling the two results as shown below. Add
   spacing to ensure that the shape tuples are left-aligned.
Training Features Shape: xxxx
Test Features Shape:     xxxx
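The three steps above might look roughly like the sketch below. The hd DataFrame here is a random stand-in, the 1/2 label coding is an assumption about the file, and the test_size and random_state values are placeholders because the originals are missing from this copy of the problem.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for hd: three random features plus a label column.
rng = np.random.default_rng(0)
hd = pd.DataFrame(rng.normal(size=(100, 3)), columns=["f1", "f2", "f3"])
hd["heartdisease"] = rng.integers(1, 3, size=100)  # labels 1 and 2 (assumed coding)

# .values converts the DataFrame / Series into NumPy arrays.
X = hd.drop(columns="heartdisease").values  # 2D: (n_samples, n_features)
y = hd["heartdisease"].values               # 1D: (n_samples,)

# test_size and random_state are placeholder values;
# stratify=y keeps the class proportions the same in both splits.
Xtrain, Xtest, ytrain, ytest = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

# Extra spaces after the shorter label left-align the two shape tuples.
print("Training Features Shape:", Xtrain.shape)
print("Test Features Shape:    ", Xtest.shape)
```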
We will now create a logistic regression model that can be used to estimate the probability that an individual has heart
disease based on the feature values.
Create a logistic regression model named hdmod with solver='lbfgs' and penalty='none'. Then fit the model
to the training data. If you get a warning message stating that the model failed to converge, then increase the max_iter
parameter until it does converge.
Display the intercepts and coefficients for the final model with text labels as shown below. Note that the coefficients
array will not fit on a single line, so please display it BENEATH the line containing the "Coefficients:" label.
Intercept: xxxx
Coefficients:
xxxx
We will now calculate the accuracy score for the model on both the training set and the test set.
Calculate and print the training and testing accuracy scores for your model, rounded to four decimal places. Include the
text labels explaining which value is which, as shown below. Add spacing to ensure that the scores are left-aligned.
Training Accuracy: xxxx
Testing Accuracy:  xxxx
We will now use the model to generate predictions regarding the presence or absence of heart disease for individuals in the
test set.
Use your model to generate label predictions for observations in the test set. Store the results in a variable named
testpred. Print the first observed labels for the test set, and then the first predicted labels. Include text
labels with your output as shown below. Each label array should be displayed on a single line, and the two arrays should
be left-aligned.
Observed Labels:  xxxx
Predicted Labels: xxxx
As a final step, we will use our model to estimate the probability that each individual in the test set has heart disease.
Use the predict_proba method of your model to estimate probabilities of being in each of the two classes for each
individual in the test set. This function returns a 2D array. Display the predicted probabilities for the first observations
in the test set as a DataFrame with the columns named according to the labels that they represent and
