Question:
In this problem, we consider splitting when building a regression tree in the CART algorithm. We assume that there is a feature vector $X \in \mathbb{R}^p$ and a dependent variable $Y \in \mathbb{R}$. We have collected a training dataset $(x_1, y_1), \ldots, (x_n, y_n)$, where $x_i \in \mathbb{R}^p$ and $y_i \in \mathbb{R}$ for all $i = 1, \ldots, n$. We also assume, for simplicity, that we are considering the initial split at the top (root node) of the tree. An arbitrary split simply divides the training dataset into a partition of size two. By appropriately reshuffling the data, we can represent this partition (again for simplicity) via two sub-datasets $(x_1, y_1), \ldots, (x_N, y_N)$ and $(x_{N+1}, y_{N+1}), \ldots, (x_n, y_n)$, where $N$ is the index of the last observation included in the first set. Assume throughout that our impurity function is the RSS error, the standard choice for a regression tree.

Please answer the following:

a) (5 points) What is the total impurity value before the split? (This is the total impurity of the "null tree" or the "baseline model".)

b) (5 points) What is the total impurity value after the split? (This is the total impurity of the tree with the split as defined above.)

c) (10 points) Show that the total impurity value after the split is always less than or equal to the total impurity value before the split, i.e., splitting never increases the total impurity cost function. (Hint: you can use the fact that, given a sequence of real numbers $z_1, z_2, \ldots, z_n$, the mean $\bar{z} = \frac{1}{n}\sum_{i=1}^{n} z_i$ is the minimizer of the function $\mathrm{RSS}(z) = \sum_{i=1}^{n}(z_i - z)^2$.)
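The claim in part (c) can be checked numerically. Below is a minimal Python sketch (the response values and the loop over split points are made up for illustration, not taken from the problem) that computes the RSS impurity of the null tree and of every possible split at the root, confirming that the post-split impurity never exceeds the pre-split impurity.

```python
# Numerical sanity check for part (c): splitting on RSS impurity
# never increases the total impurity cost function.
# The response values below are arbitrary, for illustration only.

def rss(ys):
    """RSS impurity of a node: sum of squared deviations from the node mean."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

# Hypothetical responses y_1, ..., y_n after reshuffling.
y = [3.1, -0.5, 2.2, 4.0, 1.7, 0.3, 2.9, -1.2]

# Part (a): total impurity before the split (the "null tree").
impurity_before = rss(y)

# Part (b): for a split after index N, total impurity is
# RSS(left node) + RSS(right node).
for N in range(1, len(y)):
    impurity_after = rss(y[:N]) + rss(y[N:])
    # Part (c): the split never increases impurity
    # (small tolerance guards against floating-point rounding).
    assert impurity_after <= impurity_before + 1e-12
    print(f"N={N}: before={impurity_before:.3f}, after={impurity_after:.3f}")
```

The inequality holds because each child node uses its own mean as the fitted value, and (per the hint) the mean minimizes RSS over each sub-dataset, so the children's combined RSS cannot exceed the RSS obtained by using the overall mean on both halves.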
