I need help with this Scala / Spark homework:
Please build an RDD using sc.textFile() to read the words in. Given the text file basketball_words_only.txt, complete the following tasks:
1. Write a MapReduce program in Scala to find 1) the words that account for at least 3% of the document basketball_words_only.txt, 2) the 4 most frequent words in the document.
Remember that Apache Spark evaluates RDDs lazily. This has many advantages, but one disadvantage is that the same RDD may be recomputed each time an action uses it. Please avoid this kind of recomputation in your program.
The correct output is shown below:
Words that account for at least 3% are "the","is","basketball","and",
the appears 10 times
basketball appears 8 times
is appears 6 times
and appears 6 times
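For task 1, one possible approach is sketched below. It is a minimal sketch rather than a reference solution: the object name WordShare and the helpers wordCounts, wordsWithShare, and topN are hypothetical names chosen for illustration, and local[*] is assumed as the master. The word counts are built once and marked with cache(), so the three actions that follow (the total, the 3% filter, and the top 4) should reuse the stored RDD instead of re-reading and re-counting the file.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical object and helper names, used only for this sketch.
object WordShare {
  // Map phase: split lines into words; reduce phase: sum the 1s per word.
  def wordCounts(lines: RDD[String]): RDD[(String, Int)] =
    lines
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)

  // Words whose count is at least `share` (e.g. 0.03) of all words.
  def wordsWithShare(counts: RDD[(String, Int)], share: Double): Array[String] = {
    val total = counts.values.map(_.toLong).fold(0L)(_ + _) // action 1
    counts.filter { case (_, c) => c >= share * total }.keys.collect() // action 2
  }

  // The n most frequent words; the order among equal counts is arbitrary.
  def topN(counts: RDD[(String, Int)], n: Int): Array[(String, Int)] =
    counts.top(n)(Ordering.by[(String, Int), Int](_._2)) // action 3

  def main(args: Array[String]): Unit = {
    // Assumption: a local master; adjust for a real cluster.
    val conf = new SparkConf().setAppName("WordShare").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // cache() keeps the counts after the first action, so the later
    // actions do not recompute the textFile -> reduceByKey lineage.
    val counts = wordCounts(sc.textFile("basketball_words_only.txt")).cache()

    val frequent = wordsWithShare(counts, 0.03)
    println("Words that account for at least 3% are " +
      frequent.map(w => "\"" + w + "\"").mkString(","))

    topN(counts, 4).foreach { case (w, c) => println(w + " appears " + c + " times") }

    sc.stop()
  }
}
```

The order of the words in the first line and the order of words with equal counts (here "is" and "and") may differ from the expected output, since neither collect() nor top() promises a particular order among ties.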
2. Still using basketball_words_only.txt as input, write a MapReduce program in Scala to find, for each word in the file, which word follows it most often.
For example, in basketball_words_only.txt, the word "basketball" is followed by
• is five times
• has two times
• court once
So "is" is the word that follows "basketball" the most.
As another example, the word "the" is followed by
• ball three times
• court twice
• most once
• basket once
• end once
• game once
• team once
If several followers tie for the most occurrences, pick any one of them arbitrarily.
Finally, display the most frequent follower for "basketball", "the", and "competitive" as follows:
"basketball" is followed by "is" 5 times.
"the" is followed by "ball" 3 times.
"competitive" is followed by "basketball" 2 times.
The text file (basketball_words_only.txt):
basketball is a team competitive sport in which two teams of five active players each try to score points against one another by throwing a ball through a 10 foot high hoop under organized rules basketball is one of the most popular and widely viewed sports in the court points are scored by passing the ball through the basket from above the team with more points at the end of the game wins the ball can be advanced on the court by bouncing it dribbling or passing it between teammates disruptive physical contact fouls is not permitted and there are restrictions on how the ball result be handled violations through time basketball has developed to involve common techniques of shooting passing and dribbling as well as players positions and offensive and defensive structures while competitive basketball is carefully regulated numerous variations of basketball has developed for casual play in some countries basketball is also a popular spectator sport while competitive basketball is primarily an indoor sport played on a basketball court less regulated variations have become exceedingly popular as an outdoor sport among both inner city and rural groups
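For task 2, a sketch along the following lines may help. The names TopFollower and topFollowers are hypothetical, local[*] is an assumed master, and the code assumes that word pairs never span two lines (the text above appears to be a single line, so sc.textFile() should yield one record). It first counts every (word, follower) pair, then keeps the follower with the highest count for each word. Only one action is run on the result, so nothing is recomputed and no cache() is needed here.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical object and helper names, used only for this sketch.
object TopFollower {
  // For each word: its most frequent follower and how often it follows.
  def topFollowers(lines: RDD[String]): RDD[(String, (String, Int))] =
    lines
      .flatMap { line =>
        val words = line.split("\\s+").filter(_.nonEmpty)
        // Adjacent pairs within one line; a lone word yields no pair.
        words.sliding(2).collect { case Array(w, next) => ((w, next), 1) }
      }
      .reduceByKey(_ + _) // ((word, follower), count)
      .map { case ((w, next), n) => (w, (next, n)) }
      // Keep the follower with the larger count; ties are broken arbitrarily.
      .reduceByKey((a, b) => if (a._2 >= b._2) a else b)

  def main(args: Array[String]): Unit = {
    // Assumption: a local master; adjust for a real cluster.
    val conf = new SparkConf().setAppName("TopFollower").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val wanted = Seq("basketball", "the", "competitive")

    // A single action (collectAsMap) brings back just the three requested words.
    val best = topFollowers(sc.textFile("basketball_words_only.txt"))
      .filter { case (w, _) => wanted.contains(w) }
      .collectAsMap()

    wanted.foreach { w =>
      best.get(w).foreach { case (next, n) =>
        println("\"" + w + "\" is followed by \"" + next + "\" " + n + " times.")
      }
    }

    sc.stop()
  }
}
```

If the input were split across several lines, the last word of one line and the first word of the next would not be paired by this sketch; handling that would need a different pairing step, for example one based on zipWithIndex() and a join.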