Using Python:
Since I can't attach the data set, I will explain what it is:
- News articles for each day (Jan 1, 2012 to Dec 31, 2012), 366 files in total
- Every file:
  - All the articles published on that day
  - Each article is on a new line
  - Each article is in JSON format
- JSON fields for each article:
  - 'city', 'code', 'title', 'text', 'source', 'date'
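To make the record layout concrete, here is a minimal sketch of parsing one line of a file and pulling out the lowercase words of its 'text' field. The sample values are invented for illustration, and the tokenization rule (runs of the letters a-z) is an assumption, since the question does not say how words should be split.

```python
import json
import re

# Hypothetical article line; only the field names come from the question.
sample_line = json.dumps({
    "city": "London", "code": "GB", "title": "Example headline",
    "text": "Football fans gathered in London. Football!",
    "source": "example-source", "date": "2012-09-01",
})

article = json.loads(sample_line)            # one line is one JSON object
text = article["text"].lower()               # keep only the text, lowercased
words = re.findall(r"[a-z]+", text)          # assumed rule: a word is a run of letters

print(words)
# ['football', 'fans', 'gathered', 'in', 'london', 'football']
```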
Instructions using Python:
- Read the data (all the files in the data directory) using the function textFile
- Take only the text part of each article and count the frequency of all the words (convert the text into lowercase)
- Remove (Filter) any word whose frequency is less than 10
- Report the following:
  - Total size of the output data (after the filtering)
  - Frequency of the following words: congress, london, washington, football
  - The word with maximum frequency for each month (hint: to read only one month's articles you can use *. E.g., for February, 2012-02* matches all files starting with 2012-02, i.e. the files belonging to Feb)
  - List of words that appeared on 2012-09-01 but not on 2012-08-01
  - Frequency of the word monsoon for all months
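One possible way to approach these steps with PySpark's RDD API is sketched below. It is not a verified solution, and it rests on several assumptions: the files sit in a directory called `data` (a placeholder), file names start with the date as the hint suggests, "total size" means the number of distinct words left after filtering, words are runs of the letters a-z, and the frequency filter applies only to the whole-year counts. The function names are made up for this sketch.

```python
import json
import re


def article_words(line):
    """Lowercase words from the 'text' field of one JSON article line."""
    try:
        text = json.loads(line).get("text") or ""
        return re.findall(r"[a-z]+", text.lower())
    except (ValueError, AttributeError):
        return []  # blank, malformed, or non-object lines contribute no words


def word_counts(sc, pattern, min_count=1):
    """Pair RDD of (word, count) over all files matching `pattern`."""
    return (
        sc.textFile(pattern)
        .flatMap(article_words)
        .map(lambda word: (word, 1))
        .reduceByKey(lambda a, b: a + b)
        .filter(lambda kv: kv[1] >= min_count)
    )


def report(sc, data_dir="data"):  # "data" is a placeholder path
    # Whole-year counts, dropping words seen fewer than 10 times.
    counts = word_counts(sc, f"{data_dir}/*", min_count=10).cache()
    print("distinct words after filtering:", counts.count())

    freq = counts.collectAsMap()
    for word in ["congress", "london", "washington", "football"]:
        print(word, freq.get(word, 0))  # 0 also covers words that were filtered out

    # Per-month top word and 'monsoon' count (monthly counts left unfiltered).
    for month in range(1, 13):
        monthly = word_counts(sc, f"{data_dir}/2012-{month:02d}*").cache()
        top_word, top_count = monthly.max(key=lambda kv: kv[1])
        monsoon = sum(monthly.lookup("monsoon"))
        print(f"2012-{month:02d}: top={top_word} ({top_count}), monsoon={monsoon}")
        monthly.unpersist()

    # Words seen on 2012-09-01 but not on 2012-08-01.
    sep = word_counts(sc, f"{data_dir}/2012-09-01*").keys()
    aug = word_counts(sc, f"{data_dir}/2012-08-01*").keys()
    print(sorted(sep.subtract(aug).collect()))


def main():
    # Imported here so the helpers above can be used without Spark installed.
    from pyspark import SparkContext

    sc = SparkContext(appName="NewsWordCount")
    try:
        report(sc)
    finally:
        sc.stop()
```

Calling `main()` (for example from a script run with `spark-submit`) would print each report in turn. Without stop-word removal, the top word for every month will most likely be a common word such as "the", so a stop-word filter may be worth adding if the assignment expects something more informative.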
