Question: Logistic Regression

This exercise relates to the diabetes dataset available on Blackboard as diabetes.csv. It contains demographic and medical data for 768 females over the age of 21. The variables are defined below:

Variable Name             Description
Pregnancies               Number of times pregnant
Glucose                   Plasma glucose concentration at 2 hours in an oral glucose tolerance test
BloodPressure             Diastolic blood pressure (mm Hg)
SkinThickness             Triceps skin fold thickness (mm)
Insulin                   2-Hour serum insulin (mu U/ml)
BMI                       Body mass index (weight in kg / (height in m)^2)
DiabetesPedigreeFunction  Diabetes pedigree function
Age                       Age (in years)
Outcome                   Class variable (0 if no diabetes, 1 if individual has diabetes)

Please answer the following questions:

Load the data into R. Print the structure of the dataset and explain the output.

Hint: Use the read.csv and str commands. This can be done in 2 lines of code.
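
As a minimal sketch of the two commands the hint names, the block below writes a tiny stand-in CSV to a temporary file so that it runs without the real diabetes.csv. The three data rows and the `csv_path` variable are illustrative assumptions, not a claim about the contents of the actual file.

```r
# Hypothetical stand-in: a tiny CSV with the same column names as the
# exercise's dataset, so the sketch runs without the real diabetes.csv.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome",
  "6,148,72,35,0,33.6,0.627,50,1",
  "1,85,66,29,0,26.6,0.351,31,0",
  "8,183,64,0,0,23.3,0.672,32,1"
), csv_path)

# With the real file this would be: diabetes <- read.csv("diabetes.csv")
diabetes <- read.csv(csv_path)

# str() prints the number of rows and columns, then one line per column
# giving its name, its type, and its first few values.
str(diabetes)
```

On the real data, `str` should report 768 observations of 9 variables, which is a quick check that the file loaded as the exercise describes.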

Convert the variable Outcome into a factor variable. Print the frequency distribution of the Outcome variable using the table command and explain what it means.

Hint: Use the as.factor and table commands. You only need two lines of code for this.
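
The conversion and frequency table might look like the sketch below. The six-value `Outcome` column is a made-up stand-in for the real data, and the `prop.table` line is an optional extra beyond what the hint asks for.

```r
# Hypothetical stand-in for the Outcome column of the real dataset.
diabetes <- data.frame(Outcome = c(0, 1, 0, 0, 1, 0))

# Treat Outcome as a categorical class label rather than a number.
diabetes$Outcome <- as.factor(diabetes$Outcome)

# Counts of each class: how many rows are 0 (no diabetes) and 1 (diabetes).
table(diabetes$Outcome)

# Optional: the same distribution expressed as proportions.
prop.table(table(diabetes$Outcome))
```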

Create your training set with a random selection of 70% of the rows in the dataset and your testing set with the other 30%. Use seed value 123 for this randomization. Print the frequency distribution of the outcome variable in both train and test data. Are the two datasets similar in terms of the distribution of the outcome variable? Explain.

Hint: You can use the sample command for the split. You will also need the set.seed command.
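
One way to sketch the 70/30 split with `sample` and `set.seed` is shown below. The data frame is simulated purely so the block runs on its own. The names `train_idx`, `train`, and `test` are illustrative, and rounding 70% of the row count to a whole number is an assumption about how to handle a non-integer size.

```r
# Hypothetical stand-in data with 100 rows (the real dataset has 768).
set.seed(1)
diabetes <- data.frame(
  Glucose = rnorm(100, mean = 120, sd = 30),
  Outcome = factor(rbinom(100, size = 1, prob = 0.35), levels = c(0, 1))
)

set.seed(123)  # seed value requested by the exercise
n <- nrow(diabetes)
train_idx <- sample(n, size = round(0.7 * n))  # row numbers for training
train <- diabetes[train_idx, ]
test  <- diabetes[-train_idx, ]                # the remaining 30% of rows

# Class proportions in each set. Similar proportions suggest the random
# split kept the outcome distribution roughly intact.
prop.table(table(train$Outcome))
prop.table(table(test$Outcome))
```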

Train a logistic regression model on the training dataset. How many of the variables are significant?

Hint: Use the glm and summary commands for this part.
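
A rough sketch of the fitting step follows. The training data here are simulated with only two predictors, so the coefficients and p-values it prints say nothing about the real dataset. The 0.05 cutoff for "significant" is a common convention assumed here, not something the exercise states.

```r
set.seed(123)
# Hypothetical stand-in training data: simulated, two predictors only.
n <- 300
train <- data.frame(Glucose = rnorm(n, 120, 30), BMI = rnorm(n, 32, 7))
true_logit <- -6 + 0.035 * train$Glucose + 0.05 * train$BMI
train$Outcome <- factor(rbinom(n, 1, plogis(true_logit)))

# family = binomial makes glm fit a logistic regression.
# With the real data, Outcome ~ . would use all eight predictors.
model <- glm(Outcome ~ ., data = train, family = binomial)

# Coefficient table: estimate, standard error, z value, and p-value.
coefs <- summary(model)$coefficients
coefs

# Names of predictors (intercept excluded) with p-value below 0.05.
rownames(coefs)[-1][coefs[-1, "Pr(>|z|)"] < 0.05]
```

In the printed `summary(model)` output, the same information appears as significance stars next to each coefficient.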

Generate predictions on the testing dataset using the logistic regression model trained in the previous step. Report the confusion matrix of the model on the test set when the threshold is set to 0.25. Compute the accuracy, true positive rate, and false positive rate for the model.

Hint: You can use the predict function for generating testing predictions, an ifelse command to create binary predictions, and table to create a confusion matrix. This should take only 3 lines of code.
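
The three lines the hint describes, plus the metrics, could be sketched as follows on a held-out set. Everything above the `predict` call is simulated scaffolding so the block is self-contained. The variable names (`prob`, `pred`, `cm`, and so on) are illustrative, and the numbers it produces are unrelated to the real data.

```r
set.seed(123)
# Hypothetical stand-in data (simulated), split 70/30 and fitted.
n <- 400
dat <- data.frame(Glucose = rnorm(n, 120, 30))
dat$Outcome <- rbinom(n, 1, plogis(-5 + 0.04 * dat$Glucose))
idx   <- sample(n, size = round(0.7 * n))
train <- dat[idx, ]
test  <- dat[-idx, ]
model <- glm(Outcome ~ Glucose, data = train, family = binomial)

# Predicted probabilities, binary predictions at 0.25, confusion matrix.
prob <- predict(model, newdata = test, type = "response")
pred <- ifelse(prob > 0.25, 1, 0)
cm   <- table(Actual    = factor(test$Outcome, levels = c(0, 1)),
              Predicted = factor(pred, levels = c(0, 1)))
cm

# Metrics read off the matrix (rows = actual, columns = predicted).
TP <- cm["1", "1"]; FN <- cm["1", "0"]
FP <- cm["0", "1"]; TN <- cm["0", "0"]
accuracy <- (TP + TN) / sum(cm)
tpr <- TP / (TP + FN)  # true positive rate (sensitivity)
fpr <- FP / (FP + TN)  # false positive rate (1 - specificity)
c(accuracy = accuracy, tpr = tpr, fpr = fpr)
```

Wrapping both vectors in `factor(..., levels = c(0, 1))` keeps the table 2 x 2 even if one class is never predicted at the chosen threshold.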

Generate ROC plots and precision-recall plots for both the training and the testing datasets. Report the area under the curve and also attach the plots in your final submission. Provide brief explanations of what each curve and their respective AUCs represent.

Hint: Use the ROCR library.
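
A minimal ROCR sketch is given below, assuming the ROCR package is installed. The `probs` and `labels` vectors are simulated stand-ins for the model's predicted probabilities and the true outcomes. In the exercise the same calls would be run once with training-set predictions and once with testing-set predictions.

```r
library(ROCR)

set.seed(123)
# Hypothetical stand-in: simulated true labels and predicted probabilities.
labels <- rbinom(200, 1, 0.35)
probs  <- plogis(rnorm(200, mean = ifelse(labels == 1, 1, -1)))

# ROCR works from a prediction object built from scores and true labels.
pred_obj <- prediction(probs, labels)

# ROC curve: true positive rate against false positive rate.
roc <- performance(pred_obj, measure = "tpr", x.measure = "fpr")
# Precision-recall curve: precision against recall.
pr  <- performance(pred_obj, measure = "prec", x.measure = "rec")
# Area under the ROC curve, stored in the y.values slot.
auc <- performance(pred_obj, measure = "auc")@y.values[[1]]
auc

plot(roc, main = "ROC curve")
abline(0, 1, lty = 2)  # reference line for a random classifier
plot(pr, main = "Precision-recall curve")
```

An ROC AUC of 0.5 corresponds to random ranking and 1 to perfect ranking of positives above negatives.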

An individual displays the following traits: pregnancies = 1, glucose = 130, blood pressure = 80, skin thickness = 22, insulin = 100, BMI = 25, diabetes pedigree function = 0.5, age = 50. According to your final model, what is the probability that the individual has diabetes? Show your working. Note: This is a manual calculation. Do not do this part with R. You can round the coefficient estimates to 2 decimal places for ease of work.
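
For the manual calculation, the general form of the working can be written as below. The coefficient symbols are placeholders assumed here for the rounded estimates from the fitted model, and no numeric values are filled in because they depend on that fit. If the final model drops some variables, the corresponding terms drop out of the sum.

```latex
\[
z = \hat\beta_0
  + \hat\beta_{\text{Preg}}(1)
  + \hat\beta_{\text{Gluc}}(130)
  + \hat\beta_{\text{BP}}(80)
  + \hat\beta_{\text{Skin}}(22)
  + \hat\beta_{\text{Ins}}(100)
  + \hat\beta_{\text{BMI}}(25)
  + \hat\beta_{\text{DPF}}(0.5)
  + \hat\beta_{\text{Age}}(50),
\qquad
\hat{p} = \frac{1}{1 + e^{-z}}
\]
```

Here z is the log-odds of diabetes for this individual and p-hat is the predicted probability.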
