
P1

Make use of the scikit-learn (sklearn) Python package in your function implementations.

Complete train_test_split function

Using the train_test_split function from sklearn, implement a function that, given a dataset, a target column, a test size, a random state, and a True/False value for stratify, returns train_features (DataFrame), test_features (DataFrame), train_targets (Series), and test_targets (Series).

Hint: write your code so it handles both the case where we want to stratify and the case where we don't (don't pass the stratify boolean directly as an input to the sklearn function).
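A minimal sketch of one way to satisfy the hint, assuming stratification (when requested) should be on the target values; the name train_test_split_sketch is illustrative:

import pandas as pd
import sklearn.model_selection

def train_test_split_sketch(dataset, target_col, test_size, stratify, random_state):
    features = dataset.drop(columns=[target_col])
    targets = dataset[target_col]
    # Pass the target Series to stratify only when requested; otherwise pass
    # None, which is sklearn's default (no stratification).
    stratify_arg = targets if stratify else None
    return sklearn.model_selection.train_test_split(
        features,
        targets,
        test_size=test_size,
        random_state=random_state,
        stratify=stratify_arg,
    )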

Complete the PreprocessDataset class in task2.py

one_hot_encode_columns_train

Given training features (DataFrame) and one_hot_encode_cols (a list of column names), use sklearn's OneHotEncoder. Split the data into columns that should be encoded and columns that should be passed through, then fit the encoder and transform the training data. Keep the column names the same as their input column names. (NOTE: make sure this new df uses the row indexes corresponding to the input dataframe.) Finally, join the encoded columns with the pass-through columns from above. Your final result should be a DataFrame with the columns in the one_hot_encode_cols list encoded and all other columns untouched.

one_hot_encode_columns_test

Given test features (DataFrame) and one_hot_encode_cols (a list of column names), use sklearn's OneHotEncoder. Split the data into columns that should be encoded and columns that should be passed through, then, using the encoder fit on the training data, transform the test data. Keep the column names the same as their input column names. (NOTE: make sure this new df uses the row indexes corresponding to the input dataframe.) Finally, join the encoded columns with the pass-through columns from above. Your final result should be a DataFrame with the columns in the one_hot_encode_cols list encoded and all other columns untouched.
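A minimal sketch covering both the train and test variants, assuming scikit-learn >= 1.2 (older versions use sparse=False instead of sparse_output=False) and that the encoded column names come from get_feature_names_out; the helper name one_hot_encode is illustrative:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

def one_hot_encode(train_features, test_features, one_hot_encode_cols):
    # Fit on the training columns only; reuse the same fitted encoder for the
    # test set so both sets share one column layout.
    encoder = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
    encoder.fit(train_features[one_hot_encode_cols])

    def encode(df):
        encoded = pd.DataFrame(
            encoder.transform(df[one_hot_encode_cols]),
            columns=encoder.get_feature_names_out(one_hot_encode_cols),
            index=df.index,  # preserve the input row index, per the NOTE above
        )
        # Join the encoded columns back onto the pass-through columns.
        return df.drop(columns=one_hot_encode_cols).join(encoded)

    return encode(train_features), encode(test_features)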

min_max_scaled_columns_train

Given training features (DataFrame) and min_max_scale_cols (a list of column names), use sklearn's MinMaxScaler. Split the data into columns that should be scaled and columns that should be passed through, then fit the scaler, transform the training data, and create a dataframe with the same column names as the pre-scaled features. (NOTE: make sure this new df uses the row indexes corresponding to the input dataframe.) Finally, join the scaled columns with the pass-through columns from above. Your final result should be a DataFrame with the columns in the min_max_scale_cols list scaled and all other columns untouched.

min_max_scaled_columns_test

Given test features (DataFrame) and min_max_scale_cols (a list of column names), use sklearn's MinMaxScaler. Split the data into columns that should be scaled and columns that should be passed through, then, using the scaler fit on the training data, transform the test data and create a dataframe with the same column names as the pre-scaled features. (NOTE: make sure this new df uses the row indexes corresponding to the input dataframe.) Finally, join the scaled columns with the pass-through columns from above. Your final result should be a DataFrame with the columns in the min_max_scale_cols list scaled and all other columns untouched.
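A minimal sketch covering both variants, assuming the scaler is fit on the training columns only and then reused (never refit) on the test columns; the helper name min_max_scale is illustrative:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def min_max_scale(train_features, test_features, min_max_scale_cols):
    scaler = MinMaxScaler()
    scaler.fit(train_features[min_max_scale_cols])

    def scale(df):
        scaled = pd.DataFrame(
            scaler.transform(df[min_max_scale_cols]),
            columns=min_max_scale_cols,  # same names as the pre-scaled columns
            index=df.index,              # preserve the input row index
        )
        # Join the scaled columns back onto the pass-through columns.
        return df.drop(columns=min_max_scale_cols).join(scaled)

    return scale(train_features), scale(test_features)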

pca_train

Given training features (DataFrame) and n_components (int), use sklearn's PCA. Initialize PCA with a random seed of 0 and n_components, drop any columns that have NA values, and train PCA on the training features. Then transform the training set using PCA and create a DataFrame with column names component_1, component_2, .., component_n for each component you created. (NOTE: this new df (for the autograder this semester) should have an index from 0 to n, which will not match the row indexes corresponding to the input dataframe.)

pca_test

Given test features (DataFrame) and n_components (int), use sklearn's PCA. Using the PCA trained above and dropping any columns that have NA values, transform the test set using PCA and create a DataFrame with column names component_1, component_2, .., component_n for each component you created. (NOTE: this new df (for the autograder this semester) should have an index from 0 to n, which will not match the row indexes corresponding to the input dataframe.)
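A minimal sketch covering both variants, assuming "dropping any columns that have NA values" is decided on the training features and the same surviving columns are then used for the test set; the helper name pca_transform is illustrative:

import pandas as pd
from sklearn.decomposition import PCA

def pca_transform(train_features, test_features, n_components):
    train_clean = train_features.dropna(axis=1)  # drop columns containing NA
    pca = PCA(n_components=n_components, random_state=0)
    pca.fit(train_clean)

    def transform(df):
        components = pca.transform(df[train_clean.columns])
        # Build a fresh 0..n index rather than reusing the input row index,
        # per the autograder NOTE above.
        return pd.DataFrame(
            components,
            columns=[f"component_{i + 1}" for i in range(n_components)],
        )

    return transform(train_clean), transform(test_features)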

feature_engineering_train

Given training features (DataFrame) and feature engineering functions passed in a dict with the format {'feature_name': function, ...}: for each feature_name in the dict, create a column of that name in the training DataFrame by passing the training feature dataframe to the associated function. The returned DataFrame will consist of the input dataframe plus the additional feature-engineered columns from the dict. (NOTE: make sure this new df uses the row indexes corresponding to the input dataframe.)

feature_engineering_test

Given test features (DataFrame) and feature engineering functions passed in a dict with the format {'feature_name': function, ...}: for each feature_name in the dict, create a column of that name in the test DataFrame by passing the test feature dataframe to the associated function. The returned DataFrame will consist of the input dataframe plus the additional feature-engineered columns from the dict. (NOTE: make sure this new df uses the row indexes corresponding to the input dataframe.)
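A minimal sketch serving both the train and test variants, assuming each function in the dict accepts the full feature DataFrame and returns a pd.Series aligned to the same row index; the helper name engineer_features is illustrative:

import pandas as pd

def engineer_features(features, feature_engineering_functions):
    result = features.copy()  # keeps the original columns and row index
    for feature_name, function in feature_engineering_functions.items():
        # Each new column is named after its key in the dict.
        result[feature_name] = function(features)
    return result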

preprocess

Given training features (DataFrame), test features (DataFrame), and the functions you created above, return training and test DataFrames with the one_hot_encode_cols encoded, the min_max_scale_cols scaled, the features described in feature_engineering_functions engineered, and any columns not affected by the above functions passed through to the output unchanged. (NOTE: make sure these new dfs use the row indexes corresponding to the input dataframes.)
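A sketch of what the method body might look like, assuming the class stores the train/test features from __init__ and updates them between steps so each transformation builds on the previous one (PCA is deliberately absent, since the spec above does not include it in preprocess):

def preprocess(self):
    # One-hot encode first; both outputs keep their original row indexes.
    self.train_features = self.one_hot_encode_columns_train()
    self.test_features = self.one_hot_encode_columns_test()
    # Then min-max scale the numeric columns.
    self.train_features = self.min_max_scaled_columns_train()
    self.test_features = self.min_max_scaled_columns_test()
    # Finally add the engineered feature columns.
    train_features = self.feature_engineering_train()
    test_features = self.feature_engineering_test()
    return train_features, test_features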

import numpy as np
import pandas as pd
import sklearn.preprocessing
import sklearn.decomposition
import sklearn.model_selection

def train_test_split(
    dataset: pd.DataFrame,
    target_col: str,
    test_size: float,
    stratify: bool,
    random_state: int,
) -> tuple[pd.DataFrame, pd.DataFrame, pd.Series, pd.Series]:
    # TODO: Write the necessary code to split a dataframe into a Train and Test
    # feature dataframe and a Train and Test target series
    train_features = pd.DataFrame()
    test_features = pd.DataFrame()
    train_targets = pd.Series(dtype=float)  # targets are Series, per the return type
    test_targets = pd.Series(dtype=float)
    return train_features, test_features, train_targets, test_targets

class PreprocessDataset:
    def __init__(
        self,
        train_features: pd.DataFrame,
        test_features: pd.DataFrame,
        one_hot_encode_cols: list[str],
        min_max_scale_cols: list[str],
        n_components: int,
        feature_engineering_functions: dict,
    ):
        # TODO: Add any state variables you may need to make your functions work
        return

    def one_hot_encode_columns_train(self) -> pd.DataFrame:
        # TODO: Write the necessary code to create a dataframe with the categorical
        # column names in the variable one_hot_encode_cols "one hot" encoded
        one_hot_encoded_dataset = pd.DataFrame()
        return one_hot_encoded_dataset

    def one_hot_encode_columns_test(self) -> pd.DataFrame:
        # TODO: Write the necessary code to create a dataframe with the categorical
        # column names in the variable one_hot_encode_cols "one hot" encoded
        one_hot_encoded_dataset = pd.DataFrame()
        return one_hot_encoded_dataset

    def min_max_scaled_columns_train(self) -> pd.DataFrame:
        # TODO: Write the necessary code to create a dataframe with the numerical
        # column names in the variable min_max_scale_cols scaled to the min and max
        # of each column
        min_max_scaled_dataset = pd.DataFrame()  # placeholder so the stub runs
        return min_max_scaled_dataset

    def min_max_scaled_columns_test(self) -> pd.DataFrame:
        # TODO: Write the necessary code to create a dataframe with the numerical
        # column names in the variable min_max_scale_cols scaled to the min and max
        # of each column
        min_max_scaled_dataset = pd.DataFrame()  # placeholder so the stub runs
        return min_max_scaled_dataset

    def pca_train(self) -> pd.DataFrame:
        # TODO: use PCA to reduce the train_df to n_components principal components
        # Name your new columns component_1, component_2 .. component_n
        pca_dataset = pd.DataFrame()  # placeholder so the stub runs
        return pca_dataset

    def pca_test(self) -> pd.DataFrame:
        # TODO: use PCA to reduce the test_df to n_components principal components
        # Name your new columns component_1, component_2 .. component_n
        pca_dataset = pd.DataFrame()  # placeholder so the stub runs
        return pca_dataset

    def feature_engineering_train(self) -> pd.DataFrame:
        # TODO: Write the necessary code to create a dataframe with feature
        # engineering functions applied from the feature_engineering_functions dict
        # (the dict format is {'feature_name':function,}); each feature engineering
        # function will take in type pd.DataFrame and return a pd.Series
        feature_engineered_dataset = pd.DataFrame()
        return feature_engineered_dataset

    def feature_engineering_test(self) -> pd.DataFrame:
        # TODO: Write the necessary code to create a dataframe with feature
        # engineering functions applied from the feature_engineering_functions dict
        # (the dict format is {'feature_name':function,}); each feature engineering
        # function will take in type pd.DataFrame and return a pd.Series
        feature_engineered_dataset = pd.DataFrame()
        return feature_engineered_dataset

    def preprocess(self) -> tuple[pd.DataFrame, pd.DataFrame]:
        # TODO: Use the functions you wrote above to create train/test splits of the
        # features and target with scaled and encoded values for the columns
        # specified in the init function
        train_features = pd.DataFrame()
        test_features = pd.DataFrame()
        return train_features, test_features

P2

Now that you have written functions for the different steps of the model-building process, you will put it all together. You will write code that trains a model with hyperparameters you determine (do any tuning locally or in a notebook, i.e., don't tune your model in Gradescope, since the autograder will likely time out). Your code will take in the CLAMP training data, train a model, then predict on a test set and output values from 0 to 1 for each row. Our autograder will compare your predictions with the correct answers; to get credit you will need a ROC AUC score of 0.9 or higher on the test set (this should not require much hyperparameter tuning for this dataset). This is essentially a simulation of how your model would perform in a production system using batch inference.

Deliverables:

Make use of any of the techniques covered in this project to train a model and return predicted probabilities for each row of the test set as a DataFrame with columns index (same as the index from the input test df) and malware_score (the predicted probabilities).

Complete the train_model_return_scores function in task5.py

import numpy as np
import pandas as pd

def train_model_return_scores(train_df_path, test_df_path) -> pd.DataFrame:
    # TODO: Load and preprocess the train and test dfs
    # Train a sklearn model using training data at train_df_path
    # Use any sklearn model and return the test index and model scores

    # TODO: output dataframe should have 2 columns
    # index : this should be the row index of the test df
    # malware_score : this should be your model's output for the row in the test df
    test_scores = pd.DataFrame()
    return test_scores
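A minimal end-to-end sketch, assuming the target column in the CLAMP training CSV is named "class" (a hypothetical name; check the actual file) and that the features are already numeric — in practice you would reuse the P1 preprocessing first. RandomForest is one model choice that typically clears a 0.9 ROC AUC on this kind of data without much tuning:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def train_model_return_scores_sketch(train_df_path, test_df_path):
    train_df = pd.read_csv(train_df_path)
    test_df = pd.read_csv(test_df_path)

    target_col = "class"  # hypothetical column name, for illustration only
    features = train_df.drop(columns=[target_col])
    targets = train_df[target_col]

    model = RandomForestClassifier(random_state=0)
    model.fit(features, targets)

    # predict_proba returns one column per class; take the positive class.
    scores = model.predict_proba(test_df[features.columns])[:, 1]
    return pd.DataFrame({"index": test_df.index, "malware_score": scores})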
