1. Click-Through Rate Prediction (3 points) In this problem, will build a model to predict the click-through...
Question:
1. Click-Through Rate Prediction (3 points)
In this problem, will build a model to predict the click-through rate of an advertising demand- side platform (DSP). Your job is to predict whether an advertisement will be clicked by a mobile phone customer.
This dataset contains the following 8 variables:
dc = Click-through. 1 means clicked, 0 means not clicked. This is the outcome variable the DSP cares about.
atype = the AdExchange platform where the advertisement slot is traded.
bidf = the lowest bid price of the advertisers on the AdExchange.
instl = full-screen advertisment. 1 means full-screen, 0 means half-screen.
isp = code for the telecommunications company of the customer. 0 means unknown, 1 means China Mobile, 2 means China Unicom, 3 means China Telecome.
nt = the broadband cellular network technology code, 0 means unknown, 1 means Wi-Fi, 2 means 2G, 3 means 3G, 4 means 4G, 5 means 5G
mfr = the cellphone manufacturer.
period = time period of a day.
Use the data set "RTB.csv" to address the following questions. You may use the function rtb_new=pd.get_dummies(rtb) to transform the categorical variables in the data frame rtb into dummy variables (i.e., 0-1 variables). For building regularized logistic regression models, please
1
use the penalty and C parameters in the LogisticRegression module of sklearn. Refer to the online documentation for details: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
(a) (1 point) Build an L2-regularized logistic regression model and use cross-validation to select the best regularization parameter. Please include all the features into the model.
(b) (1 point) Build an L1-regularized logistic regression model and use cross-validation to select the best regularization parameter. Please include all the features into the model.
(c) (1 point) Consider the models you build in parts (a)-(b). Which model you would recommend to predict the click-throughs of this DSP? Report your model's generalization error (0-1 loss) and out-of-sample ROC-AUC. Which factor do you think is most important in predicting the click-throughs of this DSP? Based on your results, what recommendation(s) would you have about the advertising strategies for the DSP?
2. Predicting Titanic Survivals (3 points)
In this problem, you will try to build random forest and gradient boosting tree models to pre- dict whether passengers on Titanic could survive. We use the data set Titanic_Survival.csv, which contains the demographic information of passengers on Titanic and their survival status. This data set has the following variables:
Passengerid: The ID of the passenger Survived: Whether the passenger survived (1=Yes, 0=No) PClass: Ticket class (1=1st, 2=2nd, 3=3rd) Sex: Male or female Age: Age of the passenger SibSp: Number of siblings or spouses aboard the Titanic for the passenger Parch: Number of parents or children aboard the Titanic for the passenger Ticket: Ticket number of the passenger Fare: Passenger fare Cabin: Cabin number Embarked: Port of embarkation (S=Southampton, C=Cherbourg, Q=Queenstown) Name: Name of the passenger
Remark: You can also try your results on this forever-ongoing Kaggle competition: https://www. kaggle.com/c/titanic/overview.
2
(a) (1.5 points) Please use cross-validation to train and validate a random forest model to predict whether a passenger survived in the Titanic tragedy. Report the out-of-sample ROC-AUC and overall accuracy of our selected model.
(b) (1.5 points) Please use cross-validation to train and validate a gradient boosting tree model to predict whether a passenger survived in the Titanic tragedy. Report the out-of-sample ROC-AUC and overall accuracy of our selected model.