
FOR PYTHON

a. Write a program, hamAndSpam(smsFilename), that analyzes word frequencies in real-world text messages. Text file SMSSpamCollection.txt contains 5574 SMS messages. There is additional information about the contents of the file in the associated "readme" file readmeSMSSpamCollection.txt, written by the creators of the dataset. The data was originally from this no-longer-working link: http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/. Some information about the data set and its initial investigators is now here. Each line of the file represents one SMS/text message. The first item on every line is a label - 'ham' or 'spam' - indicating whether that line's SMS is considered spam or not. The rest of the line contains the text of the SMS/message. For example:

spam Congrats! 1 year special cinema pass for 2 is yours. call 09061209465 now! Call ...
ham Sorry, I'll call later in meeting.
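Since each line starts with its label, a first step is to separate the label from the message text. The following is a minimal sketch, not a required approach; the helper name split_label is hypothetical, and splitting on the first run of whitespace is an assumption chosen so that the code works whether the label is followed by a space or a tab.

```python
def split_label(line):
    """Split one line of the file into (label, message_text).

    split_label is a hypothetical helper name, not part of the assignment.
    """
    # split(None, 1) splits once, on the first run of whitespace,
    # so it handles both a space and a tab after the label.
    parts = line.strip().split(None, 1)
    if not parts:
        return None, ""          # blank line: no label, no text
    label = parts[0]
    text = parts[1] if len(parts) > 1 else ""
    return label, text


if __name__ == "__main__":
    print(split_label("ham Sorry, I'll call later in meeting."))
```

Lines whose label comes back as None (or as anything other than 'ham' or 'spam') can simply be skipped by the caller.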

At the end, your program must print summary information and information about the most frequent words in spam messages and the most frequent words in non-spam (ham) messages. It should also compute and compare the average lengths of spam and ham messages. I will not specify exactly what your output should be (but I will demonstrate sample output during the next lecture or two, and I will also provide organizational hints and help for each of the parts). To accomplish this, your hamAndSpam function should:

read all of the data from the input file

extract individual words from the messages. This should include an effort to get rid of "extras" such as periods, commas, question and exclamation marks, and other characters that aren't part of a word. You should probably also ignore capitalization. Thus in the sample spam message above, you probably want to treat "Congrats!" as "congrats" in your frequency analysis.

build two dictionaries (required for full credit on this assignment), one for frequencies of words appearing in spam messages, one for frequencies of words from ham messages.

print summary information and some word frequency information about the data.
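The word-extraction and dictionary-building steps above could be sketched as follows. This is one possible approach under stated assumptions, not the expected solution: the helper names extract_words and count_words are hypothetical, punctuation is only stripped from the ends of each token (so "I'll" stays "i'll"), and lines with a label other than 'ham' or 'spam' are skipped.

```python
import string


def extract_words(text):
    """Return the lowercase words of a message, with punctuation
    stripped from both ends of each whitespace-separated token."""
    words = []
    for token in text.lower().split():
        word = token.strip(string.punctuation)   # "congrats!" -> "congrats"
        if word:                                 # drop tokens like "..."
            words.append(word)
    return words


def count_words(lines):
    """Build two frequency dictionaries (spam, ham) from labelled lines.

    `lines` can be any iterable of strings, including an open file.
    """
    spam_counts = {}
    ham_counts = {}
    for line in lines:
        parts = line.strip().split(None, 1)      # label, then message text
        if len(parts) < 2:
            continue                             # blank or label-only line
        label, text = parts
        if label == "spam":
            counts = spam_counts
        elif label == "ham":
            counts = ham_counts
        else:
            continue                             # unknown label: skip
        for word in extract_words(text):
            counts[word] = counts.get(word, 0) + 1
    return spam_counts, ham_counts


if __name__ == "__main__":
    sample = [
        "spam Congrats! 1 year special cinema pass for 2 is yours. "
        "call 09061209465 now! Call ...",
        "ham Sorry, I'll call later in meeting.",
    ]
    spam, ham = count_words(sample)
    print(spam["call"], ham["call"])
```

To run this on the real data, the file object can be passed directly, for example with open(smsFilename, encoding="utf-8") as f: followed by count_words(f); the utf-8 encoding is an assumption about the file.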

Again, it is up to you to decide exactly what to print, though it must include some word frequency and message length information as mentioned above. Summary information might include the number of spam/non-spam messages, the total number of different words in spam and non-spam messages, the total number of words in each, and anything else that might be interesting (does spam or non-spam have longer average word length? longer message length?).

Frequency information might be in the form of the top ten most used words in spam and in non-spam, along with a measure of their frequency (is an absolute number of occurrences a good measure, or might it be better to use a fraction/percentage of all occurrences in that type of message?). Possibly also consider printing information about the most frequent words with more than, say, one or two or three letters - the results might be more enlightening. (You could also, but it is not required, compare the results with the list of 5000 most common English words in the file words5000.txt - most common word first - from HW4.)

b. Write a couple of sentences/short paragraph saying something about the results. Can you conclude something about spam vs. non-spam? Did you learn something? Put this answer as a comment at the top of your .py file. Thus, your file should look like:

# 1b. ... your answer here ...
# ....
#
#
def hamAndSpam(smsFilename):
    ...
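For the reporting part of the question (most frequent words, relative frequency, minimum word length, average message length), a hedged sketch follows. The function names top_words and average_length and their parameters are hypothetical; the fraction is computed against all word occurrences in that message type, which is one of the two measures the question suggests, and message length is measured in characters, which is an assumption.

```python
def top_words(counts, n=10, min_length=1):
    """Return up to n (word, count, fraction) tuples, most frequent first.

    fraction = count / total occurrences of all words in `counts`.
    Only words with at least min_length letters are listed.
    """
    total = sum(counts.values())
    eligible = [(w, c) for w, c in counts.items() if len(w) >= min_length]
    # Sort by descending count; break ties alphabetically for stable output.
    eligible.sort(key=lambda pair: (-pair[1], pair[0]))
    return [(w, c, c / total) for w, c in eligible[:n]]


def average_length(messages):
    """Average number of characters per message (0.0 for an empty list)."""
    if not messages:
        return 0.0
    return sum(len(m) for m in messages) / len(messages)


if __name__ == "__main__":
    demo_counts = {"call": 3, "free": 2, "a": 5}   # made-up counts for illustration
    for word, count, fraction in top_words(demo_counts, n=2, min_length=2):
        print(f"{word:10s} {count:5d} {fraction:7.2%}")
    print(average_length(["abcd", "ab"]))
```

Printing both the absolute count and the fraction makes it easier to compare spam and ham, since the two groups contain very different numbers of messages.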

