FOR PYTHON

a. Write a program, hamAndSpam(smsFilename), that analyzes word frequencies in real-world text messages. Text file SMSSpamCollection.txt contains 5574 SMS messages. There is additional information about the contents of the file in the associated "readme" file readmeSMSSpamCollection.txt, written by the creators of the dataset. The data was originally from this no-longer-working link: http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/. Some information about the data set and its initial investigators is now here. Each line of the file represents one SMS/text message. The first item on every line is a label - 'ham' or 'spam' - indicating whether that line's SMS is considered spam or not. The rest of the line contains the text of the SMS/message. For example:

spam Congrats! 1 year special cinema pass for 2 is yours. call 09061209465 now! Call ...
ham Sorry, I'll call later in meeting.
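The label/text layout described above can be pulled apart with a single split on the first run of whitespace. The following is only a minimal sketch: the helper name split_label is hypothetical (it is not part of the assignment), and it assumes the label is always the first whitespace-delimited token on the line. The sample strings are shortened versions of the two example messages.

```python
def split_label(line):
    """Split one line of the file into (label, text).

    Assumption: the label ('ham' or 'spam') is the first
    whitespace-delimited token and the rest is the message text.
    """
    parts = line.strip().split(None, 1)  # split once, on any whitespace
    if not parts:
        return None, ""                  # blank line: no label, no text
    label = parts[0]
    text = parts[1] if len(parts) > 1 else ""
    return label, text


# Shortened versions of the sample messages shown above.
sample_lines = [
    "spam Congrats! 1 year special cinema pass for 2 is yours.",
    "ham Sorry, I'll call later in meeting.",
]
for line in sample_lines:
    print(split_label(line))
```

Splitting with split(None, 1) works whether the label is followed by a space or a tab, so the sketch does not need to know which separator the real file uses.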

At the end, your program must print summary information and information about the most frequent words in spam messages and the most frequent words in non-spam (ham) messages. It should also compute and compare the average lengths of spam and ham messages. I will not specify exactly what your output should be, but I will demonstrate sample output during the next lecture or two. I will also provide organizational hints and help for each of the parts. To accomplish this, your hamAndSpam function should:

read all of the data from the input file

extract individual words from the messages. This should include an effort to get rid of "extras" such as periods, commas, question and exclamation marks, and other characters that aren't part of a word. You should probably also ignore capitalization. Thus in the sample spam message above, you probably want to treat "Congrats!" as "congrats" in your frequency analysis.

build two dictionaries (required for full credit on this assignment), one for frequencies of words appearing in spam messages, one for frequencies of words from ham messages.

print summary information and some word frequency information about the data.
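The word-extraction and two-dictionary steps listed above might be sketched as follows. This is one possible approach, not the required one: the helper names clean_words and count_words are hypothetical, punctuation is only stripped from the two ends of each token (so "I'll" stays "i'll"), and any label other than 'spam' is treated as ham.

```python
import string


def clean_words(text):
    """Lowercase the text and strip punctuation from both ends of each token."""
    words = []
    for token in text.lower().split():
        word = token.strip(string.punctuation)  # "congrats!" -> "congrats"
        if word:                                # skip tokens like "..." that become empty
            words.append(word)
    return words


def count_words(messages):
    """Build the two frequency dictionaries from (label, text) pairs.

    Assumption: every label that is not 'spam' counts as ham.
    """
    spam_counts = {}
    ham_counts = {}
    for label, text in messages:
        counts = spam_counts if label == "spam" else ham_counts
        for word in clean_words(text):
            counts[word] = counts.get(word, 0) + 1
    return spam_counts, ham_counts


spam_counts, ham_counts = count_words([
    ("spam", "Congrats! Call now! Call ..."),
    ("ham", "Sorry, I'll call later in meeting."),
])
print(spam_counts)
print(ham_counts)
```

Stripping only the ends of each token keeps contractions and phone numbers intact; whether digits-only tokens should count as words is left as a design decision.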

Again, it is up to you to decide exactly what to print, though it must include some word frequency and message length information as mentioned above. Summary information might include the number of spam/non-spam messages, the total number of different words in spam and non-spam messages, the total number of words in each, and anything else that might be interesting (does spam or non-spam have longer average word length? longer message length?). Frequency information might be in the form of the top ten most used words in spam and in non-spam, along with a measure of their frequency (is an absolute number of occurrences a good measure? Or might it be better to use a fraction/percentage of all occurrences in that type of message?). Possibly also consider printing information about most frequent words with more than, say, one or two or three letters - the results might be more enlightening. (You could also, but it is not required, compare the results with the list of 5000 most common English words of the file words5000.txt - most common word first - from HW4.)

b. Write a couple of sentences/short paragraph saying something about the results. Can you conclude something about spam vs. non-spam? Did you learn something? Put this answer as a comment at the top of your .py file. Thus, your file should look like:

# 1b. ... your answer here ...
# ....
#
#
def hamAndSpam(smsFilename):
    ...
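For the reporting suggestions in part a (top ten words, fraction of all occurrences, a minimum word length, average message length), a small sketch is given here. The names top_words and average_length are hypothetical, message length is assumed to mean the number of characters, and ties between equally frequent words are broken alphabetically; none of these choices is prescribed by the assignment.

```python
def top_words(counts, n=10, min_length=1):
    """Return up to n (word, count, fraction) tuples, most frequent first.

    The fraction is relative to ALL word occurrences in the dictionary,
    including words shorter than min_length.
    """
    total = sum(counts.values())
    eligible = [(w, c) for w, c in counts.items() if len(w) >= min_length]
    # Sort by descending count, then alphabetically to break ties.
    eligible.sort(key=lambda pair: (-pair[1], pair[0]))
    return [(w, c, c / total) for w, c in eligible[:n]]


def average_length(texts):
    """Average number of characters per message (0.0 for no messages)."""
    texts = list(texts)
    if not texts:
        return 0.0
    return sum(len(t) for t in texts) / len(texts)


# Made-up counts, for illustration only (not results from the data set).
example_counts = {"call": 3, "now": 1, "a": 4}
for word, count, fraction in top_words(example_counts, n=2, min_length=2):
    print(f"{word:10s} {count:5d} {fraction:7.2%}")
```

Reporting a fraction alongside the raw count makes the spam and ham lists comparable even though the two classes contain different numbers of messages.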
