Question:

Data description

News articles for each day (Jan 1, 2012 to Dec 31, 2012), 366 files in total.

Every file:

Contains all the articles published on that day

Each article is on a new line

Each article is in JSON format

JSON fields for each article: 'city', 'code', 'title', 'text', 'source', 'date'
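As a small illustration of this layout, one line of a daily file could be parsed as in the sketch below. The sample record is made up for the example (including the date format); only the six field names come from the description above.

```python
import json

# A made-up example line; only the field names come from the data description.
sample_line = json.dumps({
    "city": "London",
    "code": "GB",
    "title": "Example headline",
    "text": "Example body text of the article.",
    "source": "Example Source",
    "date": "2012-02-14",
})

# One line of a file is one JSON object, so json.loads is enough per line.
article = json.loads(sample_line)

# The word-count tasks only need the 'text' field.
text = article["text"]

print(sorted(article.keys()))
print(text.lower().split())
```

Splitting on whitespace keeps punctuation attached to words, so a real solution would probably want a stricter tokeniser.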

Problem:

Complete the following using Spark:

1. Read the data (all the files in the data directory) using the function textFile.

2. Take only the 'text' part of each article and count the frequency of all the words (convert the text into lowercase). [2 points]

3. Remove (filter) any word whose frequency is less than 10. [2 points]

4. Report the following:

   a. The total size of the output data (after the filtering). [2 points]

   b. The frequency of the following words: congress, london, washington, football. [2 points]

   c. The word with maximum frequency for each month (hint: to read only a month's articles, you can use *. E.g., for February, 2012-02* represents all files starting with 2012-02, i.e., files belonging to Feb). [4 points]

   d. The list of words that appeared on 2012-09-01 but not on 2012-08-01. [4 points]

   e. The frequency of the word 'monsoon' for all months. [4 points]
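One possible shape for these steps in PySpark is sketched below. It is not a verified solution, and it rests on several assumptions the question leaves open: a "word" is taken to be a run of lowercase letters, digits or apostrophes; the "total size" is read as the number of distinct words left after filtering; the per-month figures are computed without the frequency filter; and the files are assumed to sit in a directory called `data` with names that start with the date, as the hint suggests. The function names are hypothetical, and `sc` is an existing `SparkContext` supplied by the caller.

```python
import json
import re

# Assumed tokenisation: the assignment does not define what counts as a word.
WORD_RE = re.compile(r"[a-z0-9']+")


def extract_words(line):
    """Return the lowercase words of one article's 'text' field ([] for bad lines)."""
    try:
        text = json.loads(line)["text"]
    except (ValueError, KeyError, TypeError):
        return []
    if not isinstance(text, str):
        return []
    return WORD_RE.findall(text.lower())


def word_counts(sc, path):
    """RDD of (word, count) for every file matching `path` (globs are allowed)."""
    return (sc.textFile(path)
              .flatMap(extract_words)
              .map(lambda w: (w, 1))
              .reduceByKey(lambda a, b: a + b))


def frequent_words(counts, min_count=10):
    """Keep only words whose frequency is at least `min_count`."""
    return counts.filter(lambda kv: kv[1] >= min_count)


def report(sc, data_dir="data"):
    """Print the requested figures; `data_dir` is an assumed location."""
    # Steps 1-3: all files, word counts, drop words seen fewer than 10 times.
    filtered = frequent_words(word_counts(sc, data_dir)).cache()

    # "Total size" is read here as the number of distinct words left.
    print("words after filtering:", filtered.count())

    wanted = {"congress", "london", "washington", "football"}
    print(dict(filtered.filter(lambda kv: kv[0] in wanted).collect()))

    # Per-month figures, using the 2012-MM* file-name pattern from the hint.
    for month in range(1, 13):
        monthly = word_counts(sc, "%s/2012-%02d*" % (data_dir, month)).cache()
        top_word, top_count = monthly.max(key=lambda kv: kv[1])
        monsoon = sum(monthly.lookup("monsoon"))  # 0 if the word never occurs
        print(month, top_word, top_count, "monsoon:", monsoon)
        monthly.unpersist()

    # Words present on 2012-09-01 but absent on 2012-08-01.
    sep = word_counts(sc, data_dir + "/2012-09-01*").keys()
    aug = word_counts(sc, data_dir + "/2012-08-01*").keys()
    print(sorted(sep.subtract(aug).collect()))
```

With PySpark installed, this could be driven by creating a context with `SparkContext(appName="news-word-count")` from the `pyspark` package and calling `report(sc)`. Without stop-word removal, the most frequent word in each month is likely to be a common function word such as "the".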
