Question: Logistic Regression

This exercise relates to the diabetes dataset available on Blackboard as diabetes.csv. It contains demographic and medical data for 768 females over the age of 21. The variables are defined below:
| Variable Name | Description |
| --- | --- |
| Pregnancies | Number of times pregnant |
| Glucose | Plasma glucose concentration at 2 hours in an oral glucose tolerance test |
| BloodPressure | Diastolic blood pressure (mm Hg) |
| SkinThickness | Triceps skin fold thickness (mm) |
| Insulin | 2-hour serum insulin (mu U/ml) |
| BMI | Body mass index (weight in kg / (height in m)^2) |
| DiabetesPedigreeFunction | Diabetes pedigree function |
| Age | Age (in years) |
| Outcome | Class variable (0 if no diabetes, 1 if individual has diabetes) |
Please answer the following questions:
Load the data into R. Print the structure of the dataset and explain the output.
Hint: Use the read.csv and str commands. This can be done in 2 lines of code.
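A minimal sketch of the loading step in R. Because diabetes.csv itself is not reproduced here, the example writes two made-up rows to a temporary file and reads them back; the name `csv_path` and the values are placeholders, not real records. With the real file, pass its path to read.csv instead.

```r
# Stand-in for diabetes.csv: two made-up rows with the documented column names.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome",
  "2,120,70,25,90,28.5,0.40,33,0",
  "5,165,82,30,150,34.1,0.75,51,1"
), csv_path)

diabetes <- read.csv(csv_path)  # with the real data: read.csv("diabetes.csv")
str(diabetes)                   # one line per column: name, type, first values
```

The str output lists the number of observations and variables, then each column with its storage type (int or num here) and its first few values.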
Convert the variable Outcome into a factor variable. Print the frequency distribution of the Outcome variable using the table command and explain what it means.
Hint: Use the as.factor and table commands. You only need two lines of code for this.
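As an illustration of the conversion, the sketch below applies as.factor and table to a toy Outcome column. The six values are made up for the example and say nothing about the real class balance.

```r
# Toy stand-in for the Outcome column (made-up values, not the real data).
diabetes <- data.frame(Outcome = c(0, 1, 0, 0, 1, 0))

diabetes$Outcome <- as.factor(diabetes$Outcome)  # numeric 0/1 -> factor with levels "0", "1"
counts <- table(diabetes$Outcome)                # number of rows per class
print(counts)
print(prop.table(counts))                        # the same counts as proportions
```

prop.table is an optional extra beyond the two lines the hint asks for; proportions are often easier to interpret than raw counts.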
Create your training set with a random selection of 70% of the rows in the dataset and your testing set with the other 30%. Use seed value 123 for this randomization. Print the frequency distribution of the outcome variable in both train and test data. Are the two datasets similar in terms of the distribution of the outcome variable? Explain.
Hint: You can use the sample command for the split. You will also need the set.seed command.
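One possible way to do the split is sketched below. The data frame is synthetic (768 rows to match the stated dataset size, with made-up values), and the names `train_idx`, `train`, and `test` are choices made for this example. Rounding 70% down with floor is also an assumption; the exercise does not say how to handle the fractional row.

```r
# Synthetic stand-in with the same number of rows as the real data (768).
set.seed(1)  # seed for the made-up data only
diabetes <- data.frame(
  Glucose = rnorm(768, mean = 120, sd = 30),
  Outcome = factor(rbinom(768, size = 1, prob = 0.35))
)

set.seed(123)                                   # seed required by the exercise
n <- nrow(diabetes)
train_idx <- sample(n, size = floor(0.7 * n))   # 70% of row indices, without replacement
train <- diabetes[train_idx, ]
test  <- diabetes[-train_idx, ]                 # the remaining 30%

print(prop.table(table(train$Outcome)))  # class shares in the training set
print(prop.table(table(test$Outcome)))   # class shares in the testing set
```

Comparing the two proportion tables is a direct way to judge whether the random split kept the outcome distribution similar in both sets.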
Train a logistic regression model on the training dataset. How many of the variables are significant?
Hint: Use the glm and summary commands for this part.
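A hedged sketch of the model-fitting step follows. The training data are simulated with only two predictors and a made-up relationship, so the printed coefficients are not the answer to the exercise. The 0.05 cut-off used to flag significance is a common convention assumed here, not something the exercise specifies.

```r
# Simulated training data (made-up relationship, two predictors only).
set.seed(1)
n <- 300
train <- data.frame(
  Glucose = rnorm(n, 120, 30),
  BMI     = rnorm(n, 32, 7)
)
p <- plogis(-6 + 0.04 * train$Glucose + 0.03 * train$BMI)
train$Outcome <- factor(rbinom(n, 1, p))

# family = binomial makes glm fit a logistic regression.
model <- glm(Outcome ~ ., data = train, family = binomial)
coefs <- summary(model)$coefficients   # Estimate, Std. Error, z value, Pr(>|z|)
print(coefs)

# Terms whose p-value falls below the assumed 0.05 level.
significant <- rownames(coefs)[coefs[, "Pr(>|z|)"] < 0.05]
print(significant)
```

With the real data, the formula `Outcome ~ .` would use all eight predictors, and the significance stars in the summary output give the same information as the p-value column.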
Generate predictions on the testing dataset using the logistic regression model trained in the previous step. Report the confusion matrix of your logistic regression model on the test set when the threshold is set to 0.25. Compute the accuracy, true positive rate, and false positive rate for the model.
Hint: You can use the predict function for generating testing predictions, an ifelse command to create binary predictions, and table to create a confusion matrix. This should take only 3 lines of code.
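The three commands in the hint can be combined roughly as follows. This is a sketch on simulated data with a single predictor; the data, the row split, and the variable names (`prob`, `pred`, `cm`) are all assumptions made for the example. The metric formulas assume the confusion matrix has actual classes in rows and predicted classes in columns.

```r
# Simulated data and a simple model, for illustration only.
set.seed(1)
n <- 400
dat <- data.frame(Glucose = rnorm(n, 120, 30))
dat$Outcome <- factor(rbinom(n, 1, plogis(-5 + 0.04 * dat$Glucose)))
train <- dat[1:280, ]
test  <- dat[281:400, ]
model <- glm(Outcome ~ Glucose, data = train, family = binomial)

prob <- predict(model, newdata = test, type = "response")    # predicted P(Outcome = 1)
pred <- factor(ifelse(prob > 0.25, 1, 0), levels = c(0, 1))  # apply the 0.25 threshold
cm   <- table(Actual = test$Outcome, Predicted = pred)       # rows: actual, columns: predicted
print(cm)

accuracy <- sum(diag(cm)) / sum(cm)     # (TP + TN) / all
tpr <- cm["1", "1"] / sum(cm["1", ])    # TP / (TP + FN)
fpr <- cm["0", "1"] / sum(cm["0", ])    # FP / (FP + TN)
```

Fixing the factor levels of `pred` keeps the table 2 x 2 even if the threshold happens to put every row in one class, so the indexing by "0" and "1" does not fail.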
Generate ROC plots and precision-recall plots for both the training and the testing datasets. Report the area under each curve and attach the plots in your final submission. Provide brief explanations of what each curve and its AUC represent.
Hint: Use the ROCR library.
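A sketch of the ROCR workflow is shown below, assuming the ROCR package is installed. It uses simulated data and covers the test set only; the same calls would be repeated with the training predictions. Only the ROC AUC is computed here, and the object names are choices made for this example.

```r
library(ROCR)

# Simulated data and model, for illustration only.
set.seed(1)
n <- 400
dat <- data.frame(Glucose = rnorm(n, 120, 30))
dat$Outcome <- rbinom(n, 1, plogis(-5 + 0.04 * dat$Glucose))
test_idx <- 281:400
model <- glm(Outcome ~ Glucose, data = dat[-test_idx, ], family = binomial)
prob  <- predict(model, newdata = dat[test_idx, ], type = "response")

# prediction() pairs predicted probabilities with the true labels.
pred_obj <- prediction(prob, dat$Outcome[test_idx])
roc <- performance(pred_obj, measure = "tpr", x.measure = "fpr")   # ROC curve
pr  <- performance(pred_obj, measure = "prec", x.measure = "rec")  # precision-recall curve
auc <- performance(pred_obj, measure = "auc")@y.values[[1]]        # area under the ROC curve
print(auc)

plot(roc, main = "ROC curve (test set)")
abline(0, 1, lty = 2)   # diagonal reference line: random guessing
plot(pr, main = "Precision-recall curve (test set)")
```

The ROC curve traces the true positive rate against the false positive rate as the threshold varies, and an AUC of 0.5 corresponds to the dashed diagonal.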
An individual displays the following traits: pregnancies = 1, glucose = 130, blood pressure = 80, skin thickness = 22, insulin = 100, BMI = 25, diabetes pedigree function = 0.5, age = 50. According to your final model, what is the probability that the individual has diabetes? Show your working. Note: This is a manual calculation. Do not do this part with R. You can round the coefficient estimates to 2 decimal places for ease of work.
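The manual calculation follows the standard logistic form. As a sketch, write the fitted intercept as $\hat\beta_0$ and each fitted coefficient as $\hat\beta$ with a subscript naming its variable. These symbols are introduced here for illustration; their values come from your own model summary and are not given in the exercise.

```latex
\begin{aligned}
z &= \hat\beta_0 + \hat\beta_{\text{Preg}}(1) + \hat\beta_{\text{Gluc}}(130)
     + \hat\beta_{\text{BP}}(80) + \hat\beta_{\text{Skin}}(22) \\
  &\quad + \hat\beta_{\text{Ins}}(100) + \hat\beta_{\text{BMI}}(25)
     + \hat\beta_{\text{DPF}}(0.5) + \hat\beta_{\text{Age}}(50), \\
P(\text{Outcome} = 1) &= \frac{1}{1 + e^{-z}}.
\end{aligned}
```

Substituting the rounded coefficient estimates gives the linear predictor $z$, and the second line converts it to a probability between 0 and 1.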
