
//using R

STAT 2450 Assignment 7 (32 points)

Problem : Surviving the Titanic (32 points)

Load the libraries

library("rpart")
library("tree")
library(ggplot2)
library(randomForest)

## randomForest 4.7-1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
##     margin

Load the data

mytrain=read.csv("https://mathstat.dal.ca/~fullsack/DATA/titanictrain.csv")
mytest=read.csv("https://mathstat.dal.ca/~fullsack/DATA/titanictest.csv")
mytitanic=rbind(mytest,mytrain)
nrec=nrow(mytitanic)

You will be using the column 'Survived' as the outcome in our models. This should be treated as a factor. All other columns are admissible as predictors of this outcome.

HINT-1: You can use the following template to split the data into folds, e.g. for cross-validation.

Randomly shuffle your data

yourData

Create 10 pre-folds of equal size

myfolds

use these pre-folds for cross-validation

for(i in 1:10){ # loop over each of 10 folds
# recover the indexes of fold i and define the indexes of the test set
testIndexes
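The right-hand sides of the template above are cut off in the text, so the following is only a sketch of one common way to complete it. The toy data frame, the `cut()`-based fold assignment, and the names `testData`, `trainData`, and `fold_sizes` are assumptions for illustration, not the assignment's own code.

```r
# Hypothetical stand-in for 'yourData' so the sketch runs on its own.
set.seed(1)
yourData <- data.frame(x = rnorm(100), y = rbinom(100, 1, 0.5))

# Randomly shuffle the rows.
yourData <- yourData[sample(nrow(yourData)), ]

# Label each row with one of 10 pre-folds of (near) equal size.
myfolds <- cut(seq_len(nrow(yourData)), breaks = 10, labels = FALSE)

# Loop over the folds: fold i is the test set, the rest is the training set.
fold_sizes <- integer(10)
for (i in 1:10) {
  testIndexes <- which(myfolds == i)
  testData  <- yourData[testIndexes, ]
  trainData <- yourData[-testIndexes, ]
  fold_sizes[i] <- nrow(testData)
}
fold_sizes
```

Because the rows are shuffled before the labels are assigned, each contiguous block of labels corresponds to a random subset of the original records.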

HINT-2: Use the following template to split data into a train and a test set of roughly the same size

set.seed(44182) # or use the recommended seed
trainindex=sample(1:nrec,nrec/2,replace=FALSE)
mytrain=mydata[trainindex,] # training set
mytest=mydata[-trainindex,] # testing set = complementary subset of mydata

Define 5 pre-folds of equal size of 'mytitanic' in a variable called 'myfolds'

(2 points)

set.seed(2255)
# shuffle mytitanic
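A minimal sketch of this step, assuming the same shuffle-then-label approach as HINT-1. The small data frame below is a hypothetical stand-in for 'mytitanic' so that the code runs without downloading the files; with the real data, only the first assignment would be dropped.

```r
# Hypothetical stand-in for 'mytitanic' (the real data come from read.csv).
mytitanic <- data.frame(PassengerId = 1:50, Survived = rep(c(0, 1), 25))

set.seed(2255)
# Shuffle mytitanic by permuting its rows.
mytitanic <- mytitanic[sample(nrow(mytitanic)), ]

# 5 pre-folds of equal size: one fold label (1 to 5) per row.
myfolds <- cut(seq_len(nrow(mytitanic)), breaks = 5, labels = FALSE)
table(myfolds)
```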

Use pre-fold number 3 to define a testing and a training set named 'mytest' and 'mytrain'

(2 points)

i=3 # fold number to use
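The split on pre-fold 3 might look like the sketch below. It again uses a hypothetical stand-in for 'mytitanic' and assumes that 'myfolds' holds one fold label per row; rows labelled 3 form 'mytest' and all other rows form 'mytrain'.

```r
# Hypothetical stand-in for 'mytitanic' and its fold labels.
mytitanic <- data.frame(PassengerId = 1:50, Survived = rep(c(0, 1), 25))
set.seed(2255)
mytitanic <- mytitanic[sample(nrow(mytitanic)), ]
myfolds <- cut(seq_len(nrow(mytitanic)), breaks = 5, labels = FALSE)

i <- 3                               # fold number to use
testIndexes <- which(myfolds == i)   # rows belonging to fold i
mytest  <- mytitanic[testIndexes, ]  # testing set
mytrain <- mytitanic[-testIndexes, ] # training set = all other folds
c(test = nrow(mytest), train = nrow(mytrain))
```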

Fit a Random Forest model to the 'mytrain' dataset. Use the column 'Survived' as a factor outcome. Require importance to be true and set the random seed to 523. (This is the 'trained model'). (2 points)

# Fitting Random Forest Classification to the training set 'mytrain'
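As an illustration of the fitting step, the sketch below trains a classification forest with `importance = TRUE` and seed 523. The toy data frame and its made-up survival probabilities are assumptions standing in for 'mytrain'; the name `rf_model` is also hypothetical.

```r
library(randomForest)

# Hypothetical toy data standing in for 'mytrain' (not the Titanic file).
set.seed(1)
n <- 300
mytrain <- data.frame(
  Sex  = factor(sample(c("female", "male"), n, replace = TRUE)),
  Age  = round(runif(n, 1, 70)),
  Fare = round(rexp(n, rate = 1 / 30), 2)
)
# Simulated outcome: the survival probabilities are made up for illustration.
mytrain$Survived <- rbinom(n, 1, ifelse(mytrain$Sex == "female", 0.75, 0.20))

# Treat the outcome as a factor so that randomForest does classification.
mytrain$Survived <- as.factor(mytrain$Survived)

set.seed(523)
rf_model <- randomForest(Survived ~ ., data = mytrain, importance = TRUE)
print(rf_model)
```

Converting 'Survived' to a factor before the call keeps the formula simple and avoids fitting a regression forest on a 0/1 numeric column.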

Plot the trained model results.

Has the OOB error rate roughly equilibrated with 50 trees?
Has the OOB error rate roughly equilibrated with 500 trees?
What is the stationary value of the OOB error rate?
Which of death or survival has the smallest prediction error? (4 points)
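These questions can be read off the error trace of a fitted forest. The sketch below uses hypothetical toy data, not the Titanic records, so its numbers say nothing about the assignment's answers; it only shows where the OOB and per-class error rates are stored.

```r
library(randomForest)

# Hypothetical toy data; outcome probabilities are made up for illustration.
set.seed(1)
n <- 300
toy <- data.frame(Sex = factor(sample(c("female", "male"), n, replace = TRUE)),
                  Age = round(runif(n, 1, 70)))
toy$Survived <- factor(rbinom(n, 1, ifelse(toy$Sex == "female", 0.75, 0.20)))

set.seed(523)
rf_model <- randomForest(Survived ~ ., data = toy, importance = TRUE)

# One curve per column of err.rate: overall OOB error, then one per class.
plot(rf_model)
legend("topright", legend = colnames(rf_model$err.rate), col = 1:3, lty = 1:3)

# Error rates after 50 trees and after all 500 trees.
rf_model$err.rate[c(50, 500), ]
```

A curve that is still drifting at 50 trees but flat near 500 suggests the error has stabilised; the last row gives the stationary OOB value and the error for each class.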

Calculate the predictions on 'mytest', the misclassification error and the prediction accuracy. (2 points)

# Predicting survival on mytest
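A sketch of the prediction step on hypothetical toy data: the forest is fitted on one half and evaluated on the other. The names `conf_mat`, `accuracy`, and `misclass` are illustrative choices, not names required by the assignment.

```r
library(randomForest)

# Hypothetical toy data split into two halves (stand-ins for mytrain/mytest).
set.seed(1)
n <- 300
toy <- data.frame(Sex = factor(sample(c("female", "male"), n, replace = TRUE)),
                  Age = round(runif(n, 1, 70)))
toy$Survived <- factor(rbinom(n, 1, ifelse(toy$Sex == "female", 0.75, 0.20)))
idx     <- sample(n, n / 2)
mytrain <- toy[idx, ]
mytest  <- toy[-idx, ]

set.seed(523)
rf_model <- randomForest(Survived ~ ., data = mytrain, importance = TRUE)

# Predicted classes for the held-out records.
pred <- predict(rf_model, newdata = mytest)

# Confusion matrix, then accuracy and misclassification error.
conf_mat <- table(observed = mytest$Survived, predicted = pred)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
misclass <- 1 - accuracy
c(accuracy = accuracy, misclassification = misclass)
```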

Print and plot the importance of predictors in the trained model. (2 points)
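For a forest fitted with `importance = TRUE`, the randomForest package provides `importance()` to print the measures and `varImpPlot()` to plot them. The sketch below shows both on hypothetical toy data; the predictors and their ranking are not those of the Titanic data.

```r
library(randomForest)

# Hypothetical toy data; only Sex carries signal by construction.
set.seed(1)
n <- 300
toy <- data.frame(Sex  = factor(sample(c("female", "male"), n, replace = TRUE)),
                  Age  = round(runif(n, 1, 70)),
                  Fare = round(rexp(n, rate = 1 / 30), 2))
toy$Survived <- factor(rbinom(n, 1, ifelse(toy$Sex == "female", 0.75, 0.20)))

set.seed(523)
rf_model <- randomForest(Survived ~ ., data = toy, importance = TRUE)

# Print the importance matrix: one row per predictor.
imp <- importance(rf_model)
print(imp)

# Dot charts of mean decrease in accuracy and in Gini impurity.
varImpPlot(rf_model)
```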

Now you are going to have a more direct look at predictors for the records in 'mytest'.

Tabulate the chances of survival by the column 'Title'. What do you conclude? (2 points) Which other predictor would have given you the same information? (1 point)
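One way to tabulate survival chances by a categorical column is a contingency table converted to row proportions. The ten records below are invented for illustration, and the name `chances` is a hypothetical choice.

```r
# Hypothetical records standing in for 'mytest'.
mytest <- data.frame(
  Title    = c("Mr", "Mrs", "Miss", "Mr", "Master", "Mr", "Mrs", "Miss", "Mr", "Miss"),
  Survived = c(0, 1, 1, 0, 1, 0, 1, 0, 0, 1)
)

# Counts of deaths (0) and survivals (1) for each title.
tab <- table(mytest$Title, mytest$Survived)
tab

# Row proportions: each row sums to 1, so column "1" is the survival chance.
chances <- prop.table(tab, margin = 1)
round(chances, 2)
```

Replacing 'Title' with another column in the `table()` call gives the same kind of summary for that predictor, which makes it easy to compare two predictors side by side.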

Are the predictors independent? (1 point)

What is the median fare of passengers? (1 point) Hint: use the column 'Fare'

Tabulate the survival according to the binary variable mytest$Fare

table(mytest$Fare

##        
##          0  1
##   FALSE 43 46
##   TRUE  67 22
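The comparison that defines the binary variable is cut off in the text. The sketch below assumes it is "fare below the median fare"; that assumption, the invented records, and the name `med_fare` are illustrative only.

```r
# Hypothetical records standing in for 'mytest'.
mytest <- data.frame(
  Fare     = c(7.25, 8.05, 13, 26, 30, 71.28, 7.9, 80),
  Survived = c(0, 0, 0, 1, 1, 1, 0, 1)
)

# Median fare of the passengers.
med_fare <- median(mytest$Fare)
med_fare

# Rows: is the fare below the median? Columns: Survived (0 or 1).
tab <- table(mytest$Fare < med_fare, mytest$Survived)
tab
```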

rm(mytrain,mytest) #mytrain

Complete the code of the following function, which returns a vector of classification accuracies for
