Question:
In this problem, we consider splitting when building a regression tree in the CART algorithm. We assume that there is a feature vector $X \in \mathbb{R}^p$ and a dependent variable $Y \in \mathbb{R}$. We have collected a training dataset $(x_1, y_1), \ldots, (x_n, y_n)$, where $x_i \in \mathbb{R}^p$ and $y_i \in \mathbb{R}$ for all $i = 1, \ldots, n$. We also assume, for simplicity, that we are considering the initial split at the top (root node) of the tree. An arbitrary split simply divides the training dataset into a partition of size two. By appropriately reshuffling the data, we can represent this partition (again for simplicity) via two sub-datasets $(x_1, y_1), \ldots, (x_N, y_N)$ and $(x_{N+1}, y_{N+1}), \ldots, (x_n, y_n)$, where $N$ is the index of the last observation included in the first set. Assume throughout that our impurity function is the RSS error, the standard choice for a regression tree.

Please answer the following:

a) (5 points) What is the total impurity value before the split? (This is the total impurity of the "null tree" or the "baseline model".)

b) (5 points) What is the total impurity value after the split? (This is the total impurity of the tree with the split as defined above.)

c) (10 points) Show that the total impurity value after the split is always less than or equal to the total impurity value before the split, i.e., splitting never increases the total impurity cost function. (Hint: you can use the fact that, given a sequence of real numbers $z_1, z_2, \ldots, z_n$, the mean $\bar{z} = \frac{1}{n}\sum_{i=1}^{n} z_i$ is the minimizer of the function $\mathrm{RSS}(z) = \sum_{i=1}^{n}(z_i - z)^2$.)
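The claim in part (c) can be checked numerically. Below is a minimal Python sketch (the response values and the loop over split points are made up for illustration, not taken from the problem) that computes the RSS impurity of the null tree and of every possible split at the root, confirming that the post-split impurity never exceeds the pre-split impurity.

```python
# Numerical sanity check for part (c): splitting on RSS impurity
# never increases the total impurity cost function.
# The response values below are arbitrary, for illustration only.

def rss(ys):
    """RSS impurity of a node: sum of squared deviations from the node mean."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

# Hypothetical responses y_1, ..., y_n after reshuffling.
y = [3.1, -0.5, 2.2, 4.0, 1.7, 0.3, 2.9, -1.2]

# Part (a): total impurity before the split (the "null tree").
impurity_before = rss(y)

# Part (b): for a split after index N, total impurity is
# RSS(left node) + RSS(right node).
for N in range(1, len(y)):
    impurity_after = rss(y[:N]) + rss(y[N:])
    # Part (c): the split never increases impurity
    # (small tolerance guards against floating-point rounding).
    assert impurity_after <= impurity_before + 1e-12
    print(f"N={N}: before={impurity_before:.3f}, after={impurity_after:.3f}")
```

The inequality holds because each child node uses its own mean as the fitted value, and (per the hint) the mean minimizes RSS over each sub-dataset, so the children's combined RSS cannot exceed the RSS obtained by using the overall mean on both halves.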
