Question: Spark Streaming Task
Develop a Spark Streaming program in Scala to monitor a folder on HDFS in real time, such that any new file in the folder will be processed. The batch interval is a fixed number of seconds.
The following three tasks are implemented in the same Scala object:
A. For each RDD of the DStream, count the word frequency and save the output on HDFS. Each line of text is multiple words separated by a single space. A word is retained only if it consists of alphabetic characters, i.e., remove the word if it includes numbers, punctuation, special characters, etc., and filter out, i.e., delete, short words (those with fewer characters than the required minimum). For example, "I like pig latin I like hive too I dont like hive too" should be parsed as "like pig latin like hive like hive".
B. For each RDD of the DStream, process each word in the same way, and then count the co-occurrence frequency of words (see the lecture material for the explanation of co-occurrence frequency). The words are considered co-occurring if they are in the same line. If a word appears in a line more than once, each occurrence is simply treated as an independent word (do not deduplicate). Save the output on HDFS. For example, given the input "like pig like hive", the co-occurring pairs are (like, pig), (like, like), (like, hive), (pig, like), (pig, hive), (hive, like), (hive, pig), each with its corresponding count.
C. For the DStream, process each word in the same way, and then count the co-occurrence frequency of words accumulated across all batches (the words are considered co-occurring if they are in the same line); save the output on HDFS. Note that you are required to use the updateStateByKey operation to continuously update the co-occurrence frequency of words with new information.
Functional Requirements:
a. For each task, the output on HDFS should be named with a unique sequence number as a suffix, for example taskA1, taskA2, taskB1, taskB2, taskC1, taskC2. Do not use other values, such as the system time when the task is running, as the suffix of the output names.
b. For each task, if an RDD is empty, do not produce output.
c. For each task, the batch interval is the same fixed number of seconds.
d. If a checkpoint directory is needed, you must set it to the current working directory.
e. Paths of input and output should be passed as arguments of the spark-submit command.
f. You need to create a single Scala project including all three tasks, so that they work on the same stream data.
Step by Step Solution
There are 3 steps involved, one for each task.
Step 1: Driver setup and Task A (per-batch word frequency)
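Below is a minimal sketch of the shared driver and Task A, assuming the classic DStream API (the spark-streaming module). The object name StreamTasks, the 5-second batch interval, the alphabetic-only regex, and the 4-character minimum word length are placeholder assumptions, not values from the assignment; substitute the values you were given. Input and output paths come from the spark-submit arguments (requirement e), outputs are suffixed with an incrementing sequence number (requirement a), and empty RDDs are skipped (requirement b).

import java.util.concurrent.atomic.AtomicInteger
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamTasks {
  // Placeholder filtering rules: keep purely alphabetic words of a minimum length.
  val MinLen = 4 // assumption; use the threshold given in the assignment

  def keep(w: String): Boolean = w.matches("[a-zA-Z]+") && w.length >= MinLen

  def main(args: Array[String]): Unit = {
    // Paths are passed on the command line (requirement e), e.g.:
    //   spark-submit --class StreamTasks tasks.jar hdfs:///input hdfs:///output
    val Array(inputDir, outputDir) = args
    val conf = new SparkConf().setAppName("StreamTasks")
    val ssc = new StreamingContext(conf, Seconds(5)) // assumed batch interval

    val lines = ssc.textFileStream(inputDir) // picks up new files in the folder

    // Task A: per-batch word frequency.
    val seqA = new AtomicInteger(0)
    lines.flatMap(_.split(" "))
      .filter(keep)
      .map(w => (w, 1))
      .reduceByKey(_ + _)
      .foreachRDD { rdd =>
        if (!rdd.isEmpty()) // requirement b: no output for empty RDDs
          rdd.saveAsTextFile(s"$outputDir/taskA${seqA.incrementAndGet()}")
      }

    // Tasks B and C (Steps 2 and 3) are added here, then:
    ssc.start()
    ssc.awaitTermination()
  }
}

Since foreachRDD executes on the driver, the AtomicInteger reliably produces the sequence suffixes taskA1, taskA2, and so on, independent of the system time.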
Step 2: Task B (per-batch co-occurrence frequency)
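Task B can be added to the same main method, reusing lines, keep, and outputDir from Step 1; this is again a sketch under the Step 1 assumptions. For each line, every ordered pair of distinct word positions is emitted, so a word that appears more than once in a line is counted independently each time, matching the "do not deduplicate" rule.

    // Task B: co-occurrence frequency within each batch.
    val seqB = new AtomicInteger(0)
    val pairs = lines.flatMap { line =>
      val ws = line.split(" ").filter(keep)
      // every ordered pair of distinct positions in the line
      for (i <- ws.indices; j <- ws.indices if i != j) yield ((ws(i), ws(j)), 1)
    }
    pairs.reduceByKey(_ + _).foreachRDD { rdd =>
      if (!rdd.isEmpty())
        rdd.saveAsTextFile(s"$outputDir/taskB${seqB.incrementAndGet()}")
    }

On the example line "like pig like hive" this yields (like, like) once for each ordering of the two occurrences of "like", as the task description requires.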
Step 3: Task C (stateful co-occurrence with updateStateByKey)
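Task C reuses the pairs stream from Step 2 but folds each batch into a running total with updateStateByKey; this is a sketch under the same assumptions as the previous steps. updateStateByKey requires checkpointing, and per requirement d the checkpoint directory is set to the current working directory.

    // Task C: co-occurrence frequency accumulated across all batches.
    ssc.checkpoint(".") // requirement d: current working directory

    val seqC = new AtomicInteger(0)
    val update: (Seq[Int], Option[Int]) => Option[Int] =
      (fresh, state) => Some(fresh.sum + state.getOrElse(0))

    pairs.updateStateByKey(update).foreachRDD { rdd =>
      if (!rdd.isEmpty())
        rdd.saveAsTextFile(s"$outputDir/taskC${seqC.incrementAndGet()}")
    }

Note that updateStateByKey emits the full accumulated state on every batch, so once any pair has been seen, each subsequent batch writes a taskC output; check the assignment's exact wording if only batches with new data should be written.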
