Question: Problem 2: Preprocessing
We will now create a pipeline to process our dataset. This processing will involve indexing and encoding the
categorical features and then combining all of the features into vectors.
Create lists named num_features and cat_features to store the names of the columns representing
numerical and categorical features. The numerical features are age, avg_glucose_level, and bmi. All
other features are categorical.
Create lists named ix_features and vec_features to store the names of the integer-encoded categorical
columns and the one-hot encoded categorical columns, respectively.
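As a minimal sketch of the four lists, the snippet below uses plain Python only. The column list is a hypothetical stand-in for the DataFrame's columns, the label column is assumed to be stroke and is excluded from the features, and the _ix and _vec suffixes are illustrative naming choices rather than anything the problem prescribes. The underscored list names are assumed restorations of the names in the problem statement.

```python
# Hypothetical column names standing in for the DataFrame's columns.
columns = [
    'gender', 'age', 'hypertension', 'heart_disease', 'ever_married',
    'work_type', 'residence_type', 'avg_glucose_level', 'bmi',
    'smoking_status', 'stroke',
]

label = 'stroke'  # assumed label column, not treated as a feature

# Numerical features as named in the problem statement.
num_features = ['age', 'avg_glucose_level', 'bmi']

# Every remaining non-label column is treated as categorical.
cat_features = [c for c in columns if c not in num_features and c != label]

# Illustrative suffixes for the integer-encoded and one-hot encoded columns.
ix_features = [c + '_ix' for c in cat_features]
vec_features = [c + '_vec' for c in cat_features]

print(cat_features)
print(ix_features)
print(vec_features)
```

Deriving ix_features and vec_features from cat_features keeps the three lists aligned, so each encoder stage can pair input and output columns by position.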
Create a StringIndexer object that uses the columns named in the list cat_features to create the
columns named in the list ix_features.
Create a OneHotEncoder object that uses the integer-encoded features named in ix_features to create
the one-hot encoded categorical features named in vec_features. Do not drop the last columns.
Create a VectorAssembler object that combines the numerical features and the one-hot encoded vectors
for the categorical features. The combined column should be named features.
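To make the effect of not dropping the last column concrete, here is a plain-Python illustration of the two encoding steps. It is not the Spark API: string_index and one_hot are hypothetical helpers that mimic StringIndexer's default ordering (most frequent label gets index 0) and OneHotEncoder's dropLast behavior, and the sample values are made up.

```python
from collections import Counter

def string_index(values):
    """Map each label to an integer, most frequent first (ties broken alphabetically)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: ix for ix, label in enumerate(ordered)}

def one_hot(ix, n_categories, drop_last=False):
    """Return a dense one-hot list; with drop_last=True the final category is all zeros."""
    size = n_categories - 1 if drop_last else n_categories
    vec = [0.0] * size
    if ix < size:
        vec[ix] = 1.0
    return vec

# Hypothetical values for a single categorical column.
smoking = ['never', 'smokes', 'never', 'formerly', 'never', 'smokes']

mapping = string_index(smoking)
encoded = [one_hot(mapping[v], len(mapping)) for v in smoking]

print(mapping)
print(encoded[3])                                          # 'formerly', last column kept
print(one_hot(mapping['formerly'], len(mapping), drop_last=True))  # last column dropped
```

With three categories, keeping the last column yields vectors of length three, while dropping it yields vectors of length two in which the least frequent category is represented by all zeros.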
We will now create a pipeline from the stages above and apply it to our data.
Create a pipeline consisting of the StringIndexer, OneHotEncoder, and VectorAssembler objects. Fit
this pipeline to the stroke_df DataFrame, and then apply the fitted pipeline to stroke_df. Store the
processed DataFrame in a variable named train.
Persist the train DataFrame. Then display the first rows of the features and stroke columns of
train, setting truncate=False.
