Question: Part 5: Creating Training, Validation, and Test Sets

In this section, we will encode our categorical variables and create training, validation, and test sets.

Create a markdown cell that displays a level 2 header that reads: "Part 5: Creating Training, Validation, and Test Sets". Also add some text briefly describing the purpose of your code in this part. Explain that we will start by separating the categorical features, the numerical features, and the labels.

Before moving on to the next step, note that we will be using Cover_Type as the label variable in our models. All other columns will be used as features. Of the feature columns, Wilderness_Area and Soil_Type are categorical, while all other feature columns are numerical.

Perform the following steps in a single code cell:
- Create a 2D array named X_num by selecting the columns of fc that represent numerical features.
- Create a 2D array named X_cat by selecting the columns of fc that represent categorical features.
- Create a 1D array named y by selecting the column of fc corresponding to the labels.
- Print the shapes of all three of these arrays with messages as shown below. Add spacing to ensure that the shape tuples are left-aligned.

    Numerical Feature Array Shape:   xxxx
    Categorical Feature Array Shape: xxxx
    Label Array Shape:               xxxx

Note: The variables created here should be arrays, and not DataFrames or Series. You will need to use .values.

Create a markdown cell explaining that we will now be encoding the categorical variables using one-hot encoding.

Perform the following steps in a single code cell:
1. Create a OneHotEncoder() object setting sparse=False.
2. Fit the encoder to the categorical features.
3. Use the encoder to encode the categorical features, storing the result in a variable named X_enc.
4. Print the shape of X_enc with a message as shown below.

    Encoded Feature Array Shape: xxxx

Create a markdown cell explaining that we will now combine the numerical features with the encoded features.
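
The separation and one-hot encoding steps described so far might look roughly like the sketch below. The real fc DataFrame is not included in the question, so the code builds a small synthetic stand-in with two made-up numerical columns and invented values; only the column names Wilderness_Area, Soil_Type, and Cover_Type come from the question. Note also that scikit-learn 1.2 renamed the sparse argument to sparse_output and later removed the old name, so the sketch tries the new name first and falls back to sparse=False as written in the assignment.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the assignment's `fc` DataFrame (synthetic values).
rng = np.random.default_rng(1)
n = 200
fc = pd.DataFrame({
    'Elevation': rng.integers(1800, 3900, n),        # assumed numerical column
    'Slope': rng.integers(0, 60, n),                 # assumed numerical column
    'Wilderness_Area': rng.choice(['A', 'B', 'C', 'D'], n),
    'Soil_Type': rng.integers(1, 41, n),
    'Cover_Type': np.arange(n) % 7 + 1,
})

cat_cols = ['Wilderness_Area', 'Soil_Type']
label_col = 'Cover_Type'

# .values converts the DataFrame/Series selections to NumPy arrays.
X_num = fc.drop(columns=cat_cols + [label_col]).values
X_cat = fc[cat_cols].values
y = fc[label_col].values

print('Numerical Feature Array Shape:  ', X_num.shape)
print('Categorical Feature Array Shape:', X_cat.shape)
print('Label Array Shape:              ', y.shape)

# Request a dense array; the keyword name depends on the scikit-learn version.
try:
    encoder = OneHotEncoder(sparse_output=False)   # scikit-learn >= 1.2
except TypeError:
    encoder = OneHotEncoder(sparse=False)          # older releases

encoder.fit(X_cat)
X_enc = encoder.transform(X_cat)

print('Encoded Feature Array Shape:', X_enc.shape)
```

With the real data, only the construction of fc would change; the number of columns in X_enc equals the total number of distinct categories across the categorical columns.
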
Perform the following steps in a single code cell:
1. Use np.hstack to combine X_num and X_enc into a single array named X.
2. Print the shape of X with a message as shown below.

    Feature Array Shape: xxxx

Create a markdown cell explaining that we will now split the data into training, validation, and test sets, using a 70/15/15 split.

Perform the following steps in a single code cell:
- Use train_test_split() to split the data into training and holdout sets using a 70/30 split. Name the resulting arrays X_train, X_hold, y_train, and y_hold. Set random_state=1. Use stratified sampling.
- Use train_test_split() to split the holdout data into validation and test sets using a 50/50 split. Name the resulting arrays X_valid, X_test, y_valid, and y_test. Set random_state=1. Use stratified sampling.
- Print the shapes of X_train, X_valid, and X_test with messages as shown below. Add spacing to ensure that the shape tuples are left-aligned.

    Training Features Shape:   xxxx
    Validation Features Shape: xxxx
    Test Features Shape:       xxxx
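
The combining and splitting steps could be sketched as follows. Since the real X_num, X_enc, and y depend on the fc data, this example substitutes small synthetic arrays (their sizes and values are assumptions made purely for illustration). Stratified sampling is requested by passing the labels through the stratify argument of train_test_split().

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the arrays built in the earlier steps.
rng = np.random.default_rng(1)
n = 210
X_num = rng.normal(size=(n, 3))              # 3 synthetic numerical features
X_enc = np.eye(4)[rng.integers(0, 4, n)]     # synthetic one-hot rows, 4 categories
y = np.arange(n) % 7 + 1                     # 7 balanced classes, labelled 1..7

# Place numerical and encoded features side by side.
X = np.hstack([X_num, X_enc])
print('Feature Array Shape:', X.shape)

# 70/30 split into training and holdout sets, stratified on the labels.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

# 50/50 split of the holdout set into validation and test sets.
X_valid, X_test, y_valid, y_test = train_test_split(
    X_hold, y_hold, test_size=0.5, random_state=1, stratify=y_hold
)

print('Training Features Shape:  ', X_train.shape)
print('Validation Features Shape:', X_valid.shape)
print('Test Features Shape:      ', X_test.shape)
```

Splitting the 30% holdout in half yields the 15% validation and 15% test portions, and stratifying both splits keeps the class proportions of Cover_Type roughly equal across all three sets.
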

