How should I set up R code for this question?
The files hotels_train.csv and hotels_test.csv contain data on tens of thousands of hotel stays from a major U.S.-based hotel chain. The goal of this problem is simple: to use linear regression to build a machine-learning model for predicting whether a hotel booking will have children on it. Why would that be important? For an equally simple reason: when booking a hotel stay on a website, parents often enter the reservation exclusively for themselves and forget to include their children on the form. Obviously, the hotel isn't going to turn parents away from their room if they neglected to mention that their children would be staying with them. But not knowing about those children does, at least in the aggregate, prevent the hotel from making accurate forecasts of resource utilization. So if, for example, you could use the other features associated with a booking to forecast that a bunch of kids were going to show up unannounced, you might know to order more chicken nuggets for the restaurant and less tequila for the bar. (Or maybe more tequila, depending on how frazzled the parents who stay at your hotel tend to be.) In any event, as a hotel operator, if you can forecast the arrival of those kids a bit better, you can be just a bit more efficient, operationally speaking. This is an excellent use case for an ML model: a piece of software that can scan the bookings for the week ahead and produce an estimate for how likely each one is to have a "hidden" child on it.
The target variable of interest is children: a dummy variable for whether the booking has children on it. All other variables in the data set can be used to predict the children variable.
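Because children is coded 0/1, a linear regression on it behaves as a linear probability model: each fitted value can be read as an estimated probability that the booking includes children, although nothing forces it to stay between 0 and 1. A minimal sketch on simulated data follows; the data frame toy and its values are made up for illustration and are not taken from the hotel files.

```r
set.seed(1)

# Hypothetical toy data: one feature and a 0/1 outcome (not the real hotel data)
toy <- data.frame(adults = sample(1:4, 200, replace = TRUE))
toy$children <- rbinom(200, size = 1, prob = 0.05 + 0.10 * (toy$adults == 2))

# Regressing a dummy on a single categorical feature
fit <- lm(children ~ factor(adults), data = toy)
p_hat <- predict(fit, newdata = toy)

# Here each fitted value is the share of bookings with children
# among rows that have the same number of adults
group_share <- ave(toy$children, toy$adults)
head(cbind(p_hat, group_share))
```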
Please compare the out-of-sample performance (measured using RMSE) of the following four models:
1. a small model that uses only the market_segment, adults, customer_type, and is_repeated_guest variables as features.
2. a big model that uses all the possible predictors except the arrival_date variable (main effects only).
3. a huge model that uses all the possible predictors except the arrival_date variable, along with all their possible pairwise interactions.
4. the big model (model 2 on this list), with one additional "engineered" feature: the month of the year, based on the arrival_date variable. (Remember our use of the lubridate package in R to do this kind of feature engineering with dates.)
Use the data in hotels_train.csv to fit the models. Use the data in hotels_test.csv to calculate out-of-sample RMSE.
Notes and requirements:
You don't need to report fitted model coefficients in your Results section. Really all your Results section needs to contain is a table with four rows (one for each model) and two columns (one for training-set RMSE, the other for test-set RMSE). Please report the RMSE numbers to four decimal places. Give the table an informative caption that describes what the table shows. Your Conclusions section should also be quite short: essentially just a recommendation about which model to use for predicting "hidden" children on hotel bookings.
It may take a while for the huge model to fit on your machine. Be patient.
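One way to set this up in R is sketched below. It is a starting point under stated assumptions, not a verified solution: it assumes both CSV files sit in the working directory, that children is read in as a numeric 0/1 column, and that arrival_date parses as year-month-day. The helper names rmse, prep, and fit_and_score are made up for this sketch. Dropping arrival_date from the data frame before fitting lets the big and huge models be written as children ~ . and children ~ .^2.

```r
# Root mean squared error between observed and predicted values
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Drop arrival_date; optionally add the engineered month feature first.
# Assumption: arrival_date looks like "2015-07-01" (year-month-day).
prep <- function(d, with_month = FALSE) {
  if (with_month) {
    d$month <- factor(lubridate::month(lubridate::ymd(d$arrival_date)),
                      levels = 1:12)
  }
  d$arrival_date <- NULL
  d
}

# Fit on the training data, then report training-set and test-set RMSE
fit_and_score <- function(formula, train, test) {
  fit <- lm(formula, data = train)
  c(train_rmse = rmse(train$children, predict(fit, newdata = train)),
    test_rmse  = rmse(test$children,  predict(fit, newdata = test)))
}

# Model formulas from the problem statement
f_small <- children ~ market_segment + adults + customer_type + is_repeated_guest
f_big   <- children ~ .     # all remaining predictors, main effects only
f_huge  <- children ~ .^2   # main effects plus all pairwise interactions

if (file.exists("hotels_train.csv") && file.exists("hotels_test.csv")) {
  hotels_train <- read.csv("hotels_train.csv")
  hotels_test  <- read.csv("hotels_test.csv")

  train   <- prep(hotels_train)
  test    <- prep(hotels_test)
  train_m <- prep(hotels_train, with_month = TRUE)
  test_m  <- prep(hotels_test,  with_month = TRUE)

  results <- rbind(
    small         = fit_and_score(f_small, train,   test),
    big           = fit_and_score(f_big,   train,   test),
    huge          = fit_and_score(f_huge,  train,   test),
    `big + month` = fit_and_score(f_big,   train_m, test_m)
  )

  # Four decimal places, as the problem requires
  print(format(round(results, 4), nsmall = 4), quote = FALSE)
}
```

If the test file contains a category that never appears in the training file, predict() will stop with an error, and R may warn about a rank-deficient fit for the huge model; both are worth checking on the real data.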
