Prepare Text for Machine Learning
If we are to create a classifier for text, we'll first need to think about the format of our data. Take a look at the files girls.train and boys.train, for example with the Unix command:
cat girls.train
Addisyn
Danika
Emilee
Aurora
Julianna
Sophia
Kaylyn
Litzy
Hadassah
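For later processing in Python, the same file can be read into a list of strings. This is a minimal sketch that assumes one name per line, as the listing above suggests; the helper name load_names is hypothetical and not part of the original exercise.

```python
from pathlib import Path


def load_names(path):
    """Return the non-empty lines of a text file as a list of names.

    Assumes one name per line; `load_names` is a hypothetical helper.
    """
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    # Only reads the files if they are present in the working directory.
    for filename in ("girls.train", "boys.train"):
        if Path(filename).exists():
            names = load_names(filename)
            print(f"{filename}: {len(names)} names, first few: {names[:3]}")
```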
This file contains names that are more or less commonly used for girls. The problem with the current data in this file is that the names are in plain text, which is not a format our machine learning algorithm can work with effectively. You need to transform these plain text names into some vector format, where each name becomes a vector that represents a point in some high dimensional input space.
That is exactly what the following Python function hashfeatures does: it extracts prefix and suffix chunks from each baby name and hashes them, turning the string into a numeric feature vector:
import numpy as np

def hashfeatures(baby, d, FIX, debug=False):
    """
    Input:
        baby : a string representing the baby's name to be hashed
        d: the number of dimensions to be in the feature vector
        FIX: the number of chunks to extract and hash from each string
        debug: a bool for printing debug values (default False)

    Output:
        v: a feature vector representing the input string
    """
    v = np.zeros(d)
    for m in range(1, FIX + 1):
        # hash the first m characters to an index in [0, d)
        prefix = baby[:m]
        P = hash(prefix) % d
        v[P] = 1
        # hash the last m characters to an index in [0, d)
        suffix = baby[-m:]
        S = hash(suffix) % d
        v[S] = 1
        if debug:
            print(f"Split {m}/{FIX}:\t({prefix}, {suffix}),\t1s at indices [{P}, {S}]")
    if debug:
        print(f"Feature vector for {baby}:")
        print(v.astype(int))
    return v
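One practical caveat: Python's built-in hash salts string hashes per process unless PYTHONHASHSEED is set, so the indices it produces can differ between runs. The sketch below applies the same prefix/suffix hashing idea with a reproducible hash from hashlib and stacks several names into a feature matrix. The names stable_index, stable_hashfeatures, and names_to_matrix, the choice of SHA-256, and the values d=128 and FIX=3 are all assumptions made for illustration, not part of the original exercise.

```python
import hashlib

import numpy as np


def stable_index(chunk, d):
    """Map a string chunk to an index in [0, d) using SHA-256 (an assumed choice)."""
    digest = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
    return int(digest, 16) % d


def stable_hashfeatures(name, d, FIX):
    """Binary feature vector from the first and last 1..FIX characters of name."""
    v = np.zeros(d)
    for m in range(1, FIX + 1):
        v[stable_index(name[:m], d)] = 1   # prefix chunk
        v[stable_index(name[-m:], d)] = 1  # suffix chunk
    return v


def names_to_matrix(names, d=128, FIX=3):
    """Stack one feature vector per name into an array of shape (len(names), d)."""
    return np.array([stable_hashfeatures(name, d, FIX) for name in names])


if __name__ == "__main__":
    X = names_to_matrix(["Addisyn", "Danika", "Emilee"])
    print(X.shape)        # (3, 128)
    print(X.sum(axis=1))  # number of active features per name
```

Each row of the resulting matrix is one name's feature vector, which is the shape most classifiers expect as input. A larger d reduces the chance that two different chunks collide on the same index, at the cost of sparser vectors.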
