Question: a)Implement an algorithm that creates a CART from performance data as presented in the lecture. The CART must be implemented by yourself, you must not
a)Implement an algorithm that creates a CART from performance data as presented in the lecture. The CART must be implemented by yourself, you must not take a prefabricated Python implementation. The format of the sample data and the representation of the CART are described below. If two split options are equally good, we use the alphabetic ordering of feature names as tie breaker. N.B: Don't use any additional python library except pandas.
The code template given is as follows: import pandas as pd cart = { "name":"X", "mean":456, "split_by_feature": "aes", "error_of_split": 0.0, "successor_left": { "name":"XL", "mean":1234, "split_by_feature": None, "error_of_split":None, "successor_left":None, "successor_right":None }, "successor_right":{ "name":"XR", "mean":258, "split_by_feature": None,"error_of_split":None, "successor_left":None, "successor_right":None } }
features = ["secompress", "encryption", "aes", "blowfish", "algorithm", "rar", "zip", "signature", "timestamp", "segmentation", "onehundredmb", "onegb"]
def get_cart(sample_set_csv): df = pd.read_csv(sample_set_csv) //TODO: Change the return and code logic here return cart And this is my code which needs to be fixed or you can change it: import pandas as pd
cart = { "name":"X", "mean":456, "split_by_feature": None, "error_of_split": 0.0, "successor_left": None, "successor_right": None } features = ["secompress", "encryption", "aes", "blowfish", "algorithm", "rar", "zip", "signature", "timestamp", "segmentation", "onehundredmb", "onegb"] def mean(numbers): return float(sum(numbers)) / max(len(numbers), 1)
def squared_error_loss(numbers, mean_value): return sum((n-mean_value)**2 for n in numbers)
def get_cart(sample_set_csv): df = pd.read_csv(sample_set_csv) return build_cart(df,features)
def build_cart(df,features, current_node_name="X"): current_mean = df["performance"].mean() current_error = ((df["performance"] - current_mean) ** 2).sum() if len(features) == 0: return { "name":current_node_name, "mean":current_mean, "split_by_feature": None, "error_of_split":current_error, "successor_left":None,"successor_right":None} else: best_feature = None best_error = 999.0 best_left_subset = None best_right_subset = None for feature in features: for value in df[feature].unique(): left_subset = df[df[feature] == value] right_subset = df[df[feature] != value] if len(left_subset) == 0 or len(right_subset) == 0: continue error = sum_squared_error_loss(left_subset, right_subset) if best_feature == None or error
def sum_squared_error_loss(left, right): left_mean = left["performance"].mean() right_mean = right["performance"].mean() left_loss = ((left["performance"] - left_mean) ** 2).sum() right_loss = ((right["performance"] - right_mean) ** 2).sum() return left_loss + right_loss
# Unit Test for the task test_cart = {'name': 'X', 'mean': 763.2, 'split_by_feature': 'segmentation', 'error_of_split': 6.0, 'successor_left': {'name': 'XL', 'mean': 772.0, 'split_by_feature': 'onegb', 'error_of_split': 0.0, 'successor_left': {'name': 'XLL', 'mean': 770.0, 'split_by_feature': None, 'error_of_split': None, 'successor_left': None, 'successor_right': None }, 'successor_right': {'name': 'XLR', 'mean': 773.0, 'split_by_feature': None, 'error_of_split': None, 'successor_left': None, 'successor_right': None } }, 'successor_right': {'name': 'XR', 'mean': 750.0, 'split_by_feature': None, 'error_of_split': None, 'successor_left': None, 'successor_right': None} }
if get_cart("Performance_01.csv") == test_cart: print("passed") else: print("failed")
Note that the data in Performance_01.csv is: Id,secompress,encryption,aes,blowfish,algorithm,rar,zip,signature,timestamp,segmentation, onehundredmb,onegb,performance 0,1,0,0,0,1,1,0,0,0,0,0,0,750 1,1,0,0,0,1,1,0,0,0,1,1,0,773 2,1,0,0,0,1,1,0,0,0,1,0,1,770 3,1,0,0,0,1,1,0,0,1,0,0,0,750 4,1,0,0,0,1,1,0,0,1,1,1,0,773 I provided the code with a unit test. Don't submit an answer if the unit test gives you "Failed" or I will downvote the answer and send email to chegg because this is the third time I asked the same question and no code is working. And don't tell me write a comment since there is no place to write comment here in chegg.
CART Data Structure For this assignment sheet we use the following internal datastructure for a CART. We represent a CART as a python dict with exactly the entries as shown in the example below. This example CART has three nodes. The root node "X", and the two child nodes "XL" and "XR". As you can see the child nodes only have a name and a mean but all other fields are set to None. A parent node also has a name and a mean but additionally a feature by which the split is performed, the error of the split and two sucessors. It is important to use exaclty these names for the features: features = ["secompress", "encryption", "aes", "blowfish", "algorithm", "rar", "zip", "signature", "timestamp", "segmentation", "onehundredmb", "onegb"]
Step by Step Solution
There are 3 Steps involved in it
Get step-by-step solutions from verified subject matter experts
