Question:

This is for my statistical learning class. We are supposed to use R for the analysis.

In practical problems, a data set does not always come clean. Such is the case when identifying whether a message is spam or not, which requires natural language processing (NLP) as part of the work of classifying the data. This assignment is meant to familiarize you with such a practical problem.

a. First, work through all of the steps discussed in https://rpubs.com/pparacch/237109 to perform naive Bayes classification on a data set of SMS messages. The SMS data set is described at http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/ and can be downloaded from http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/smsspamcollection.zip

Warning: the SMS dataset contains offensive words.

You'll note that the data set as given in the zip file (after unzipping) needs to be processed as follows:

- "\t" is to be replaced by ",".

- Double quote (") in the free text needs to be replaced by single quote (').

- Then, the SMS text is to be enclosed in double quotes (").
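The three pre-processing steps listed above can be sketched in base R as follows. This is only an illustration, not the awk script mentioned in the assignment: the function name `preprocess_sms`, the input file name, and the output file name are assumptions, and the sketch assumes each raw line has the form label, tab, message. It splits at the first tab only, so any further tabs inside a message would be left in place.

```r
# Sketch: turn raw "label<TAB>message" lines into quoted CSV lines.
# preprocess_sms is a hypothetical helper, not part of the assignment.
preprocess_sms <- function(lines) {
  # Locate the first tab on each line: label before it, message after it.
  tab_pos <- regexpr("\t", lines, fixed = TRUE)
  label <- substr(lines, 1, tab_pos - 1)
  text  <- substr(lines, tab_pos + 1, nchar(lines))
  # Replace double quotes in the free text with single quotes.
  text <- gsub('"', "'", text, fixed = TRUE)
  # Enclose the message in double quotes and join label and text with a comma.
  paste0(label, ',"', text, '"')
}

# Two made-up lines in the same layout as the raw file.
raw <- c("ham\tHe said \"hi\", then left", "spam\tWIN a prize now")
cat(preprocess_sms(raw), sep = "\n")

# With the real data (both file names are assumptions):
# raw <- readLines("SMSSpamCollection", encoding = "UTF-8")
# writeLines(preprocess_sms(raw), "sms_spam.csv")
```

Quoting the message matters because many messages contain commas, which would otherwise be read as extra CSV columns.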

You will need to do this pre-processing yourself. (If it helps, you may use the awk code process_sms.awk to do this pre-processing.) You'll also note that you'd have to install a number of packages, listed as required at the beginning of the link. Be sure to include the wordcloud figure with your submission. Also report whether the answers you got differ from the ones available at the link, and the possible reason for any disparity.
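For the wordcloud figure, a minimal sketch is shown here on three made-up messages. The tokenization is a deliberately crude base R stand-in, not the text-cleaning pipeline used at the link, and the plot is drawn only if the `wordcloud` package happens to be installed.

```r
# Sketch: count word frequencies and draw a word cloud.
# The messages below are invented for illustration only.
texts <- c("Free prize call now", "call me later", "free entry win prize now")

# Lower-case the text and split on anything that is not a letter.
words <- unlist(strsplit(tolower(texts), "[^a-z]+"))
words <- words[nzchar(words)]

# Frequency table, most frequent words first.
freq <- sort(table(words), decreasing = TRUE)
print(freq)

# Draw the cloud only when the wordcloud package is available.
if (requireNamespace("wordcloud", quietly = TRUE)) {
  wordcloud::wordcloud(names(freq), as.numeric(freq), min.freq = 1)
}
```

Word clouds place words at random by default, so two runs on the same data can look different unless a seed is set before plotting.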

(ADDED REQUIREMENTS:)

For the part 'Evaluate the Model', use either the 80/20 rule with randomization for 100 replications, or k-fold cross-validation.
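The first option, a randomized 80/20 split repeated 100 times, could be sketched as below. The names `repeated_holdout`, `fit_fun`, and `predict_fun` are hypothetical, and a trivial majority-class model on made-up data stands in for the naive Bayes classifier so that the sketch runs in base R. In the actual assignment the two function arguments would wrap whatever model the tutorial fits.

```r
# Sketch: repeated random 80/20 hold-out evaluation.
# fit_fun(x, y) returns a model; predict_fun(model, newx) returns labels.
repeated_holdout <- function(x, y, fit_fun, predict_fun,
                             reps = 100, train_frac = 0.8) {
  n <- length(y)
  acc <- numeric(reps)
  for (r in seq_len(reps)) {
    # Fresh random 80% of the rows for training in every replication.
    train_idx <- sample(n, size = floor(train_frac * n))
    model <- fit_fun(x[train_idx, , drop = FALSE], y[train_idx])
    pred <- predict_fun(model, x[-train_idx, , drop = FALSE])
    acc[r] <- mean(pred == as.character(y[-train_idx]))
  }
  acc
}

# Stand-in model (assumption): always predict the most frequent training class.
fit_majority <- function(x, y) names(which.max(table(y)))
predict_majority <- function(model, newx) rep(model, nrow(newx))

# Made-up data: 40 "ham" and 10 "spam" rows with one dummy feature.
set.seed(1)
x <- data.frame(f = rnorm(50))
y <- factor(c(rep("ham", 40), rep("spam", 10)))

acc <- repeated_holdout(x, y, fit_majority, predict_majority)
summary(acc)
```

Reporting the mean and spread of the 100 accuracies gives a more stable picture than a single split.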

b. Now consider a subset of 500 SMS messages from the original data set, drawn with 'sample' using the last 4 digits of your student ID as the seed (set.seed(nnnn), where nnnn is the last 4 digits of your student ID). On these 500 SMS messages, apply the 80/20 rule to split YOUR data set into training and test sets, and repeat the work performed in a) above. Report how the results for your set vary from those for the original data set (be sure to include the wordcloud figure for your data set alongside that of the original data set for visual comparison).
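Drawing the seeded subset and splitting it could look like the sketch below. The data frame `sms` is a made-up stand-in for the pre-processed data set, and 1234 is a placeholder for the last 4 digits of the student ID.

```r
# Sketch: reproducible subset of 500 messages, then an 80/20 split.
# sms is invented stand-in data with the two columns a label/text file would have.
sms <- data.frame(
  type = rep(c("ham", "ham", "ham", "spam"), times = 500),
  text = paste("message", seq_len(2000)),
  stringsAsFactors = FALSE
)

set.seed(1234)  # placeholder seed; use the last 4 digits of the student ID

# Sample 500 distinct row indices (sample() draws without replacement by default).
subset_idx <- sample(nrow(sms), 500)
sms500 <- sms[subset_idx, ]

# 80/20 split of the 500-message subset: 400 training rows, 100 test rows.
train_idx <- sample(nrow(sms500), size = 0.8 * nrow(sms500))
train <- sms500[train_idx, ]
test  <- sms500[-train_idx, ]

# Class balance of the subset, useful when comparing with the full data set.
prop.table(table(sms500$type))
```

Because the seed is fixed, rerunning the script reproduces the same 500 messages, provided `set.seed` is called immediately before the first `sample` call.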

(ADDED REQUIREMENTS:)

For the part 'Evaluate the Model', use either the 80/20 rule with randomization for 100 replications, or k-fold cross-validation.
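The k-fold option could be sketched as below, again in base R. The names `kfold_accuracy`, `fit_fun`, and `predict_fun` are hypothetical, and a majority-class stand-in model on made-up data replaces the naive Bayes classifier so the example is self-contained.

```r
# Sketch: k-fold cross-validation accuracy.
# fit_fun(x, y) returns a model; predict_fun(model, newx) returns labels.
kfold_accuracy <- function(x, y, fit_fun, predict_fun, k = 10) {
  n <- length(y)
  # Assign every row to one of k folds of (nearly) equal size, in random order.
  fold <- sample(rep(seq_len(k), length.out = n))
  acc <- numeric(k)
  for (i in seq_len(k)) {
    held_out <- fold == i
    # Train on the other k - 1 folds, evaluate on the held-out fold.
    model <- fit_fun(x[!held_out, , drop = FALSE], y[!held_out])
    pred <- predict_fun(model, x[held_out, , drop = FALSE])
    acc[i] <- mean(pred == as.character(y[held_out]))
  }
  acc
}

# Stand-in model (assumption): always predict the most frequent training class.
fit_majority <- function(x, y) names(which.max(table(y)))
predict_majority <- function(model, newx) rep(model, nrow(newx))

# Made-up data: 40 "ham" and 10 "spam" rows with one dummy feature.
set.seed(1)
x <- data.frame(f = rnorm(50))
y <- factor(c(rep("ham", 40), rep("spam", 10)))

cv_acc <- kfold_accuracy(x, y, fit_majority, predict_majority, k = 10)
mean(cv_acc)
```

Unlike repeated random splits, every message is used for testing exactly once, which is helpful with a subset as small as 500 messages.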

(ADDED NOTE-2):

First, include text summarizing your KEY observations and any issues (this can be a page or so, single-spaced). Following this, include the output from R. In the text, you may include pointers to the R output that each observation comes from.
