Question: Prepare Text for Machine Learning
If we are to create a classifier for text, we'll first need to think about the format of our data. Take a look at the files girls.train and boys.train, for example with the Unix command:
cat girls.train
...
Addisyn
Danika
Emilee
Aurora
Julianna
Sophia
Kaylyn
Litzy
Hadassah
This file contains names that are more or less commonly used for girls. The problem with the current data in this file is that the names are in plain text, which is not a format our machine learning algorithm can work with effectively. You need to transform these plain text names into some vector format, where each name becomes a vector that represents a point in some high dimensional input space.
That is exactly what the following Python function hashfeatures does: it extracts prefixes and suffixes of increasing length from each baby name and hashes each one to an index, thus transforming the string into a binary feature vector:
import numpy as np

def hashfeatures(baby, d, FIX, debug=False):
    """
    Input:
        baby : a string representing the baby's name to be hashed
        d: the number of dimensions to be in the feature vector
        FIX: the number of prefix and suffix chunks to extract and hash from each string
        debug: a bool for printing debug values (default False)
    Output:
        v: a feature vector representing the input string
    """
    v = np.zeros(d)
    for m in range(1, FIX+1):
        # Hash the first m characters; ">" marks the chunk as a prefix
        prefix = baby[:m] + ">"
        P = hash(prefix) % d
        v[P] = 1
        # Hash the last m characters; "<" marks the chunk as a suffix
        suffix = "<" + baby[-m:]
        S = hash(suffix) % d
        v[S] = 1
        if debug:
            print(f"Split {m}/{FIX}:\t({prefix},{suffix}),\t1s at indices [{P},{S}]")
    if debug:
        print(f"Feature vector for {baby}:\n{v.astype(int)}\n")
    return v
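
One practical caveat: Python's built-in hash() is salted per process for strings unless PYTHONHASHSEED is set, so the indices chosen by hashfeatures can differ from one run to the next. The sketch below is not part of the original assignment. It swaps in an MD5-based hash from hashlib (an assumption, chosen only because it is stable across runs) and shows how a list of names might be stacked into a feature matrix. The names stable_hash, hashfeatures_stable and names_to_matrix, and the defaults d=128 and FIX=3, are hypothetical.

```python
import hashlib

import numpy as np


def stable_hash(text):
    """Hypothetical helper: an integer hash that is the same in every process."""
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()
    return int(digest, 16)


def hashfeatures_stable(baby, d, FIX):
    """Same prefix/suffix scheme as hashfeatures, using stable_hash instead of hash()."""
    v = np.zeros(d)
    for m in range(1, FIX + 1):
        prefix = baby[:m] + ">"   # first m characters, marked as a prefix
        suffix = "<" + baby[-m:]  # last m characters, marked as a suffix
        v[stable_hash(prefix) % d] = 1
        v[stable_hash(suffix) % d] = 1
    return v


def names_to_matrix(names, d=128, FIX=3):
    """Stack one feature vector per name into an array of shape (len(names), d)."""
    return np.array([hashfeatures_stable(name, d, FIX) for name in names])


# A few names taken from the girls.train listing above
names = ["Addisyn", "Danika", "Emilee", "Aurora"]
X = names_to_matrix(names)
print(X.shape)  # (4, 128)
```

Each row has at most 2*FIX ones. It can have fewer, because two chunks may hash to the same index; a larger d makes such collisions less likely at the cost of a sparser vector.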
