Question: Build a spam classifier by two methods, first through unsupervised learning (K-Means Clustering) and then by multinomial Naive Bayes.

Build a spam classifier by two methods, first through unsupervised learning (K-Means Clustering) and then by multinomial Naive Bayes.

Data

The "spam.csv" datafile contains a collection of emails that have been classified as "spam" or "ham" (not spam). Load the "spam.csv" datafile from Canvas. Displaying the top 5 rows (via the head method of the pandas library) should produce the following.

       label  message                                             Unnamed: 2  Unnamed: 3  Unnamed: 4
    0  ham    Go until jurong point, crazy.. Available only...    NaN         NaN         NaN
    1  ham    Ok lar... Joking wif u oni...                       NaN         NaN         NaN
    2  spam   Free entry in 2 a wkly comp to win FA Cup fina...   NaN         NaN         NaN
    3  ham    U dun say so early hor... U c already then say...   NaN         NaN         NaN
    4  ham    Nah I don't think he goes to usf, he lives aro...   NaN         NaN         NaN

The label (target) associated with each message is in the first column and the actual email message is located in the second column (the other columns are not used). Separate the data into a training and test set, where training is composed of 70% and testing is 30% of the data. You can use the train_test_split function in scikit-learn.

Unsupervised Learning

Cluster all the training data into two categories using K-Means clustering. The feature vector to use for the clustering operation is TF-IDF. Make sure to preprocess the data with the lemmatizer, remove proper nouns (names from the NLTK corpus), and keep only alphabetic tokens. Predict into which of the two categories each document in the test set will be placed (i.e. use the predict method of the kmeans object).

Display the top 25 tokens from both clusters by way of the following process:

1. Store the maximum weight of each token in the TF-IDF vector of the documents labeled 'ham'
2. Store the maximum weight of each token in the TF-IDF vector of the documents labeled 'spam'
3. Create a list of pairings of token and weights (ordered by weight in decreasing order)
4. Use a for loop to display the top 25 tokens for each class (doesn't have to be in two columns)
