Question:
This assignment will allow you to practice manipulating dictionaries and files in your Python scripts. The goal is to detect the presence of foul language and keep track of trendy topics in a sample of Twitter data. You will need two files posted on the Piazza course website: twitter_data.txt and swear_words.txt. WARNING: Some of the tweets in the sample file actually do contain swear words.
1. Detecting Foul Language in Twitter
Microblogging sites such as Twitter and Ask.fm are sometimes misused to abuse people. In this part of the assignment, your task is to screen each tweet for the presence of swear words. We provide an initial list of bad words in the file named swear_words.txt. The file twitter_data.txt contains real tweets collected to study cyberbullying. Each line is a different tweet. Write a function that reads each tweet in the file, looks for swear words, and writes all tweets containing foul language to a new file named potentially_offensive_tweets.txt. Note that the sample may have repeated tweets as well as tweets in a foreign language. You may find the need to update your swear_words.txt file. That's expected, as the list is not comprehensive.
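One possible way to approach part 1 is sketched below. It is only an illustration under stated assumptions, not the expected solution: the function names (load_swear_words, contains_swear, write_offensive_tweets) are hypothetical, and the code assumes swear_words.txt holds one word per line, which the assignment does not state. Matching is done on whole lowercase tokens, so multi-word phrases and obfuscated spellings would need extra handling.

```python
import re


def load_swear_words(path):
    """Return a set of lowercase swear words.

    Assumption: the file lists one word per line; blank lines are skipped.
    """
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}


def contains_swear(tweet, swear_words):
    """Return True if any word-like token of the tweet is in swear_words."""
    # \w+ also matches non-ASCII letters, so foreign-language tweets tokenize too.
    tokens = re.findall(r"\w+", tweet.lower())
    return any(token in swear_words for token in tokens)


def write_offensive_tweets(tweets_path, swear_path,
                           out_path="potentially_offensive_tweets.txt"):
    """Copy every tweet containing a swear word to out_path; return the count."""
    swear_words = load_swear_words(swear_path)
    count = 0
    # errors="replace" keeps the scan going if a tweet has undecodable bytes.
    with open(tweets_path, encoding="utf-8", errors="replace") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:  # each line is a different tweet
            tweet = line.rstrip("\n")
            if contains_swear(tweet, swear_words):
                dst.write(tweet + "\n")
                count += 1
    return count
```

A call such as write_offensive_tweets("twitter_data.txt", "swear_words.txt") would then produce the output file. Repeated tweets are written as often as they appear; if duplicates should be dropped, a set of already-written tweets could be added.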
2. Detecting Topic Trends in Twitter
One of the services Twitter provides its users is the ability to track the most popular topics. For this part of the assignment you will do something similar. Your task is to keep track of the topics identified by users with the hashtag symbol #. You will also need to count the frequency of the hashtags you find and provide a ranking of hashtags based on their frequency. The output of your script should be one file, named top_hashtags.txt, with the N most popular hashtags, where N is a parameter to your function. For example, assume this is the content of your twitter_data.txt file:
#lebron best athlete of our generation
ML 5 Demos! Lots of great stuff to come! Yes, I'm excited. :) http://htmlfive.appspot.com #io2009 #googleio
At GWT fireside chat #googleio
@khalid0456 No, Lebron is the best #lebron
If N is set to 2, then your script should generate a file top_hashtags.txt with the following content (note that in case of ties the order doesn't matter):
#googleio 2
#lebron 2
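The hashtag ranking in part 2 could be sketched roughly as follows. This is a minimal illustration, not the required solution: the function names (count_hashtags, top_hashtags, write_top_hashtags) are hypothetical, a hashtag is assumed to be # followed by letters, digits, or underscores, and hashtags are lowercased before counting, which is an assumption the assignment does not make explicit. Ties are broken alphabetically, which is allowed since tie order does not matter.

```python
import re


def count_hashtags(tweets):
    """Return a dictionary mapping each hashtag to its frequency."""
    counts = {}
    for tweet in tweets:
        for tag in re.findall(r"#\w+", tweet):
            tag = tag.lower()  # assumption: #LeBron and #lebron are one topic
            counts[tag] = counts.get(tag, 0) + 1
    return counts


def top_hashtags(counts, n):
    """Return the n most frequent (hashtag, frequency) pairs."""
    # Sort by descending frequency, then alphabetically to break ties.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]


def write_top_hashtags(tweets_path, n, out_path="top_hashtags.txt"):
    """Write the n most popular hashtags in tweets_path to out_path."""
    with open(tweets_path, encoding="utf-8", errors="replace") as src:
        counts = count_hashtags(src)  # iterating the file yields one tweet per line
    with open(out_path, "w", encoding="utf-8") as dst:
        for tag, freq in top_hashtags(counts, n):
            dst.write(f"{tag} {freq}\n")
```

Run on the four sample tweets above with N set to 2, this sketch writes "#googleio 2" and "#lebron 2". Note that the simple pattern would also pick up a # inside a URL fragment, so a stricter pattern may be needed for real data.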
