Question: You are applying for a position at the data science team of USDA and you are given data associated with determining appropriate parasite treatment of

You are applying for a position at the data science team of USDA and you are given data associated with determining appropriate parasite treatment of canines. The suggested treatment options are determined based on a logistic regression model that predicts if the canine is infected with a parasite. The data is given in the site: https://data.world/ehales/grls-parasite-study/workspace/file?filename=CBC_data.csv and more specifically in the CBC_data.csv file. Login using you University Google account to access the data and the description that includes a paper on the study (you dont need to read the paper to solve this problem). Your target variable column is titled parasite_status.

Question 1 - Feature Engineering (5 points) In this step you outline the following as potential features (this is a limited example - we can have many features as in your programming exercise below). Write the posterior probability expressions for logistic regression for the problem you are given to solve. (=1|,)= (=0|,)=

Question 2 - Decision Boundary (5 points) Write the expression for the decision boundary assuming that (=1)=(=0) . The decision boundary is the line that separates the two classes.

Type Markdown and LaTeX: 2

Question 3 - Loss function (5 points) Write the expression of the loss as a function of that makes sense for you to use in this problem. NOTE: The loss will be a function that will include this function: ()=11+ =

Question 4 - Gradient (5 points) Write the expression of the gradient of the loss with respect to the parameters - show all your work. =

Question 5 - Imbalanced dataset (10 points) You are now told that in the dataset (=0)>>(=1) Can you comment if the accuracy of Logistic Regression will be affected by such imbalance?

Type Markdown and LaTeX: 2

Question 6 - SGD (15 points) The interviewer was impressed with your answers and wants to test your programming skills. Use the dataset to train a logistic regressor that will predict the target variable . Report the harmonic mean of precision (p) and recall (r) i.e the metric called 1 score that is calculated as shown below using a test dataset that is 20% of each group. Plot the 1 score vs the iteration number . 1=21+1 Your code includes hyperparameter optimization of the learning rate and mini-batch size.

Step by Step Solution

There are 3 Steps involved in it

1 Expert Approved Answer
Step: 1 Unlock blur-text-image
Question Has Been Solved by an Expert!

Get step-by-step solutions from verified subject matter experts

Step: 2 Unlock
Step: 3 Unlock

Students Have Also Explored These Related Databases Questions!