
Task Spark Streaming
Develop a Spark Streaming program in Scala to monitor a folder on HDFS in real time, such that any new
file in the folder will be processed (the batch interval is 3 seconds).
The following three tasks are implemented in the same Scala object:
A. For each RDD of the DStream, count the word frequency and save the output on HDFS.
Each line of text consists of multiple words separated by a single space. A word is retained
only if it consists of letters; remove the word if it includes
numbers, punctuation, special characters, etc., and
filter out (i.e., delete) short words (fewer than 3 characters).
For example, "I*** like pig latin I like hive too2 I dont like hive too." should
be parsed as "like pig latin like hive dont like hive".
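The cleaning rule above can be sketched in plain Scala (the regex `[a-zA-Z]+` is an assumed reading of "consists of letters only", and the helper names are made up; in the actual program the same logic would sit inside `flatMap`/`reduceByKey` transformations over each RDD of the DStream):

```scala
object WordCleaner {
  // Split a line on single spaces; keep only all-letter words of length >= 3.
  def cleanWords(line: String): Seq[String] =
    line.split(" ").toSeq.filter(w => w.matches("[a-zA-Z]+") && w.length >= 3)

  // Count word frequency, as task A would do per RDD via map + reduceByKey.
  def wordFreq(lines: Seq[String]): Map[String, Int] =
    lines.flatMap(cleanWords).groupBy(identity).map { case (w, ws) => (w, ws.size) }
}
```

Note that under the stated rules "dont" survives the filter (letters only, 4 characters), while "too2" and "too." are dropped.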
B. For each RDD of the DStream, process each word in the same way, and then count the co-occurrence
frequency of words (refer to week 4 for the explanation of co-occurrence frequency). Two words are
considered co-occurring if they appear in the same line. If a word appears in a line more than once, each
occurrence is simply treated as an independent word (do not deduplicate). Save the output on HDFS.
For example, given the input "like pig like hive", the co-occurrence frequency is (like pig, 2),
(like like, 2), (like hive, 2), (pig like, 2), (pig hive, 1), (hive like, 2), (hive pig, 1).
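The pair-generation step can be sketched in plain Scala (the helper names are assumptions; in task B this logic would run inside `flatMap` and `reduceByKey` over each RDD). Each position in the line is paired with every other position, so repeated words are not deduplicated:

```scala
object CoOccurrence {
  // Emit every ordered pair of distinct positions in a line (no deduplication).
  def pairs(words: Seq[String]): Seq[(String, String)] =
    for {
      (w1, i) <- words.zipWithIndex
      (w2, j) <- words.zipWithIndex
      if i != j
    } yield (w1, w2)

  // Count co-occurrence frequency across all lines of one batch.
  def coFreq(lines: Seq[Seq[String]]): Map[(String, String), Int] =
    lines.flatMap(pairs).groupBy(identity).map { case (p, ps) => (p, ps.size) }
}
```

Running `coFreq(Seq(Seq("like", "pig", "like", "hive")))` reproduces the counts in the example above.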
C. For the DStream, process each word in the same way, and then count the co-occurrence frequency of
words (two words are considered co-occurring if they appear in the same line); save the output on HDFS.
Note that you are required to use the updateStateByKey operation to continuously update the
co-occurrence frequency of words as new data arrives.
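The state-update function that `updateStateByKey` expects can be written as a plain function of shape `(Seq[V], Option[S]) => Option[S]`; a minimal sketch, with the DStream wiring shown only as comments since it needs a running Spark context:

```scala
object StateUpdate {
  // Sum this batch's counts for a word pair with the running total so far.
  def updateFunc(newCounts: Seq[Int], state: Option[Int]): Option[Int] =
    Some(newCounts.sum + state.getOrElse(0))

  // In task C this would be used roughly as:
  //   ssc.checkpoint(".")  // requirement (d): checkpoint in the working directory
  //   val running = pairCounts.updateStateByKey[Int](updateFunc _)
}
```

`updateStateByKey` requires a checkpoint directory, which is why requirement (d) below applies to this task.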
Functional Requirements:
(a) For each task, the output on HDFS should be named with a unique sequence number as a suffix, for
example, taskA-001, taskA-002, taskB-001, taskB-002, taskC-001, taskC-002 (do not use other values,
such as the system time when the task is running, as the suffix of the output names).
(b) For each task, if an RDD is empty, do not output.
(c) For each task, the batch interval is 3 seconds.
(d) If a checkpoint directory is needed, you must set it to the current working directory.
(e) Paths of input and output should be passed as arguments of the spark-submit command.
(f) You need to create a single Scala project including all three tasks so that they work on the same
stream data.
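Requirements (a) and (b) can be met with a simple per-task counter rather than timestamps; a sketch (the class name and the three-digit zero padding are assumptions inferred from the taskA-001 example):

```scala
import java.util.concurrent.atomic.AtomicInteger

// One namer per task (taskA, taskB, taskC) produces taskX-001, taskX-002, ...
class OutputNamer(prefix: String) {
  private val seq = new AtomicInteger(0)
  def next(): String = f"$prefix-${seq.incrementAndGet()}%03d"
}
// Inside foreachRDD, skip empty batches (requirement (b)) before saving:
//   if (!rdd.isEmpty()) rdd.saveAsTextFile(s"$outputPath/${namer.next()}")
```

The counter lives in the driver, where `foreachRDD` bodies execute, so a plain `AtomicInteger` is sufficient to keep the sequence monotonic across batches.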
