Question: Now implement sqsplit, which takes as input a data set of size n d with labels and computes the best feature and the threshold /
Now implement sqsplit, which takes as input a data set of size nd with labels and computes the best feature and the thresholdcut of the optimal split based on the squared loss impurity. The function outputs a feature dimension feature d a cut threshold cut, and the impurity loss bestloss of this best split.
Recall in the CART algorithm that, to find the split with the minimum impurity, you iterate over all features and cut values along each feature. We enforce that the cut value be the average of the two consecutive data points' feature values.
You should calculate the impurity of a node of data S with two branches SL and SR as:
in in in in ISSLSISLSRSISRSxy in SLyySLSxy in SRyySRxy in SLyySLxy in SRyySR
Implementation Notes:
For calculating the impurity of a node, you should just return the sum of left and right impurities instead of the average.
Returned feature must be indexed as is consistent with programming in Python.
If along a feature f two data points xi and xj have the same value, avoid splitting between them; move to the next pair of data points.
For example, with the following xTr of size and yTr for points:
among possible features the best split would be atfeature andcut
If you're stuck, we recommend that you start with the nave algorithm for finding the best split, which involves a double loop over all features f d and all cut values xTr fxTri f xTri f xTrn fwith xTr sorted along feature f This algorithm thus calculates impurities for dn splits.
Step by Step Solution
There are 3 Steps involved in it
1 Expert Approved Answer
Step: 1 Unlock
Question Has Been Solved by an Expert!
Get step-by-step solutions from verified subject matter experts
Step: 2 Unlock
Step: 3 Unlock
