Question: Task You are to import and clean the same HealthCareData_2024.csv that was used in the previous assignment.

Task
You are to import and clean the same HealthCareData_2024.csv that was used in the
previous assignment. Then run, tune and evaluate two supervised ML algorithms (each
with two types of training data) to identify the most accurate way of classifying
malicious events.
Part 1 General data preparation and cleaning
a) Import the HealthCareData_2024.csv into R Studio. This version is the same as
Assignment 1.
b) Write the appropriate code in R Studio to prepare and clean the
HealthCareData_2024 dataset as follows:
i. Clean the whole dataset based on the feedback received for Assignment 1.
ii. For the feature NetworkInteractionType, merge the Regular and
Unknown categories together to form the category Others. Hint: use the
forcats::fct_collapse(.) function.
iii. Select only the complete cases using the na.omit(.) function, and name the
dataset dat.cleaned.
Briefly outline the preparation and cleaning process in your report and why you
believe the above steps were necessary.
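As a rough illustration of steps ii and iii, the sketch below applies forcats::fct_collapse(.) and na.omit(.) to a small made-up data frame. The level "Elevated", the column ResponseTime and all values are hypothetical stand-ins and are not taken from HealthCareData_2024.csv; only the level names Regular, Unknown and Others come from the task.

```r
library(forcats)  # assumed to be installed; provides fct_collapse()

# Toy stand-in for the real data; every value here is invented for illustration.
toy <- data.frame(
  NetworkInteractionType = factor(c("Regular", "Unknown", "Elevated", "Regular", NA)),
  ResponseTime = c(1.2, NA, 3.4, 2.2, 0.9)  # hypothetical numeric feature
)

# Step ii: merge the Regular and Unknown levels into one level called Others.
toy$NetworkInteractionType <- fct_collapse(
  toy$NetworkInteractionType,
  Others = c("Regular", "Unknown")
)

# Step iii: keep only the rows that have no missing value in any column.
toy.cleaned <- na.omit(toy)

levels(toy.cleaned$NetworkInteractionType)
nrow(toy.cleaned)
```

On the real data the same two calls would be applied to the full data frame after the Assignment 1 cleaning, with the result stored as dat.cleaned.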
c) Use the code below to generate two training datasets (one unbalanced
mydata.ub.train and one balanced mydata.b.train) along with the testing set
(mydata.test). Make sure you enter your student ID into the command
set.seed(.).
library(dplyr) # provides %>% and filter(); assumed to be installed
# Separate samples of normal and malicious events
dat.class0<- dat.cleaned %>% filter(Classification == "Normal") # normal
dat.class1<- dat.cleaned %>% filter(Classification == "Malicious") # malicious
# Randomly select 9600 non-malicious and 400 malicious samples using your student
# ID, then combine them to form a working data set
set.seed(Enter your Student ID)
rows.train0<- sample(1:nrow(dat.class0), size =9600, replace = FALSE)
rows.train1<- sample(1:nrow(dat.class1), size =400, replace = FALSE)
# Your 10000 unbalanced training samples
train.class0<- dat.class0[rows.train0,] # Non-malicious samples
train.class1<- dat.class1[rows.train1,] # Malicious samples
mydata.ub.train <- rbind(train.class0, train.class1)
# Your 19200 balanced training samples, i.e. 9600 normal and malicious samples each.
set.seed(Enter your Student ID)
train.class1_2<- train.class1[sample(1:nrow(train.class1), size =9600,
replace = TRUE),]
mydata.b.train <- rbind(train.class0, train.class1_2)
# Your testing samples
test.class0<- dat.class0[-rows.train0,]
test.class1<- dat.class1[-rows.train1,]
mydata.test <- rbind(test.class0, test.class1)
Note that in the master data set, the percentage of malicious events is
approximately 4%. This distribution is roughly represented by the unbalanced
data. The balanced data is generated based on up-sampling of the minority class
using bootstrapping. The idea here is to ensure the trained model is not biased
towards the majority class, i.e. normal events.
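The up-sampling idea in the note can be seen on a much smaller scale. The following is a minimal base R sketch, assuming a made-up data set of 96 normal and 4 malicious rows; the seed, the id column and the toy.* names are illustrative only and do not come from the assignment.

```r
set.seed(1)  # arbitrary seed for this sketch, not a student ID

# Toy data: 96 normal and 4 malicious rows, i.e. about 4% malicious.
toy <- data.frame(
  id = 1:100,
  Classification = c(rep("Normal", 96), rep("Malicious", 4))
)
toy.class0 <- toy[toy$Classification == "Normal", ]
toy.class1 <- toy[toy$Classification == "Malicious", ]

# Bootstrap the minority class: draw rows with replacement until the
# minority count equals the majority count.
boot.rows <- sample(1:nrow(toy.class1), size = nrow(toy.class0), replace = TRUE)
toy.balanced <- rbind(toy.class0, toy.class1[boot.rows, ])

# Class counts before and after up-sampling.
table(toy$Classification)
table(toy.balanced$Classification)

# The balanced set holds many repeated copies of the same few malicious rows.
n.distinct.malicious <- length(unique(
  toy.balanced$id[toy.balanced$Classification == "Malicious"]
))
n.distinct.malicious
```

The balanced set contains no new malicious information, only repeated copies of the original minority rows, which is worth keeping in mind when comparing models trained on the two sets.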
Part 2 Compare the performances of different ML algorithms
a) Randomly select two supervised learning modelling algorithms to test against
one another by running the following code. Make sure you enter your student ID
into the command set.seed(.). Your 2 ML approaches are given by myModels.
set.seed(Enter your student ID)
models.list1<- c("Logistic Ridge Regression",
"Logistic LASSO Regression",
"Logistic Elastic-Net Regression")
models.list2<- c("Classification Tree",
"Bagging Tree",
"Random Forest")
myModels <- c(sample(models.list1, size =1),
sample(models.list2, size =1))
myModels %>% data.frame
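The reason for fixing the seed can be checked directly: with the same seed, sample(.) returns the same choice each time. This is a small sketch using a placeholder seed of 12345, which is hypothetical and should be replaced by the student ID; the names seed.placeholder, first.draw and second.draw are illustrative.

```r
models.list1 <- c("Logistic Ridge Regression",
                  "Logistic LASSO Regression",
                  "Logistic Elastic-Net Regression")

seed.placeholder <- 12345  # hypothetical value standing in for a student ID

# Draw once with the seed set.
set.seed(seed.placeholder)
first.draw <- sample(models.list1, size = 1)

# Reset the same seed and draw again; the selection is repeated exactly.
set.seed(seed.placeholder)
second.draw <- sample(models.list1, size = 1)

identical(first.draw, second.draw)
```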
