
Task Spark Streaming
Develop a Spark Streaming program in Scala to monitor a folder on HDFS in real time, such that any new
file in the folder will be processed (the batch interval is 3 seconds).
The following three tasks are implemented in the same Scala object:
A. For each RDD of the DStream, count the word frequency and save the output on HDFS.
Each line of text consists of multiple words separated by a single space. A word is retained
only if it consists of letters; remove the word if it includes
numbers, punctuation, special characters, etc., and
filter out (i.e., delete) short words (fewer than 3 characters).
For example, "I*** like pig latin I like hive too2 I dont like hive too." should
be parsed as "like pig latin like hive dont like hive".
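The cleaning rule above can be sketched in plain Scala (the regex `[a-zA-Z]+` is an assumed reading of "consists of letters only", and the helper names are made up; in the actual program the same logic would sit inside `flatMap`/`reduceByKey` transformations over each RDD of the DStream):

```scala
object WordCleaner {
  // Split a line on single spaces; keep only all-letter words of length >= 3.
  def cleanWords(line: String): Seq[String] =
    line.split(" ").toSeq.filter(w => w.matches("[a-zA-Z]+") && w.length >= 3)

  // Count word frequency, as task A would do per RDD via map + reduceByKey.
  def wordFreq(lines: Seq[String]): Map[String, Int] =
    lines.flatMap(cleanWords).groupBy(identity).map { case (w, ws) => (w, ws.size) }
}
```

Note that under the stated rules "dont" survives the filter (letters only, 4 characters), while "too2" and "too." are dropped.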
B. For each RDD of the DStream, process each word in the same way, and then count the co-occurrence
frequency of words (refer to week 4 for the explanation of co-occurrence frequency). Two words are
considered co-occurring if they appear in the same line. If a word appears in a line more than once, each
occurrence is simply treated as an independent word (do not deduplicate). Save the output on HDFS.
For example, given the input "like pig like hive", the co-occurrence frequency is (like pig, 2),
(like like, 2), (like hive, 2), (pig like, 2), (pig hive, 1), (hive like, 2), (hive pig, 1).
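The pair-generation step can be sketched in plain Scala (the helper names are assumptions; in task B this logic would run inside `flatMap` and `reduceByKey` over each RDD). Each position in the line is paired with every other position, so repeated words are not deduplicated:

```scala
object CoOccurrence {
  // Emit every ordered pair of distinct positions in a line (no deduplication).
  def pairs(words: Seq[String]): Seq[(String, String)] =
    for {
      (w1, i) <- words.zipWithIndex
      (w2, j) <- words.zipWithIndex
      if i != j
    } yield (w1, w2)

  // Count co-occurrence frequency across all lines of one batch.
  def coFreq(lines: Seq[Seq[String]]): Map[(String, String), Int] =
    lines.flatMap(pairs).groupBy(identity).map { case (p, ps) => (p, ps.size) }
}
```

Running `coFreq(Seq(Seq("like", "pig", "like", "hive")))` reproduces the counts in the example above.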
C. For the DStream, process each word in the same way, and then count the co-occurrence frequency of
words (two words are considered co-occurring if they appear in the same line); save the output on HDFS.
Note that you are required to use the updateStateByKey operation to continuously update the
co-occurrence frequency of words as new data arrives.
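The state-update function that `updateStateByKey` expects can be written as a plain function of shape `(Seq[V], Option[S]) => Option[S]`; a minimal sketch, with the DStream wiring shown only as comments since it needs a running Spark context:

```scala
object StateUpdate {
  // Sum this batch's counts for a word pair with the running total so far.
  def updateFunc(newCounts: Seq[Int], state: Option[Int]): Option[Int] =
    Some(newCounts.sum + state.getOrElse(0))

  // In task C this would be used roughly as:
  //   ssc.checkpoint(".")  // requirement (d): checkpoint in the working directory
  //   val running = pairCounts.updateStateByKey[Int](updateFunc _)
}
```

`updateStateByKey` requires a checkpoint directory, which is why requirement (d) below applies to this task.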
Functional Requirements:
(a) For each task, the output on HDFS should be named with a unique sequence number as a suffix, for
example, taskA-001, taskA-002, taskB-001, taskB-002, taskC-001, taskC-002 (do not use other values,
such as the system time when the task is running, as the suffix of the output names).
(b) For each task, if an RDD is empty, do not output.
(c) For each task, the batch interval is 3 seconds.
(d) If a checkpoint directory is needed, you must set it to the current working directory.
(e) Paths of input and output should be passed as arguments of the spark-submit command.
(f) You need to create a single Scala project including all three tasks so that they work on the same
stream data.
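Requirements (a) and (b) can be met with a simple per-task counter rather than timestamps; a sketch (the class name and the three-digit zero padding are assumptions inferred from the taskA-001 example):

```scala
import java.util.concurrent.atomic.AtomicInteger

// One namer per task (taskA, taskB, taskC) produces taskX-001, taskX-002, ...
class OutputNamer(prefix: String) {
  private val seq = new AtomicInteger(0)
  def next(): String = f"$prefix-${seq.incrementAndGet()}%03d"
}
// Inside foreachRDD, skip empty batches (requirement (b)) before saving:
//   if (!rdd.isEmpty()) rdd.saveAsTextFile(s"$outputPath/${namer.next()}")
```

The counter lives in the driver, where `foreachRDD` bodies execute, so a plain `AtomicInteger` is sufficient to keep the sequence monotonic across batches.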
