Question: You will extend the MapReduce WordCount example from the slides of Scala Software Development to find the top 500 words in a news dataset. This is similar in idea to a word cloud: we want to know which keywords are mentioned most in a corpus. In general these words are more important than the others, but not always. Why? You will know the answer after you complete this homework.
Input: news-2016-2017.txt
Output: the top 500 words in the news articles.
Ideas: Follow the WordCount example, enter the code into the Spark shell line by line, and observe the results. Note that you don't have to create any class or object in this homework. You may simply exercise the following code from the WordCount program.
// Read each line of the news file into an RDD
val input = sc.textFile("news-2016-2017.txt")

// Split each line into words separated by a space character;
// flatMap is a one-to-many transformation
val words = input.flatMap(x => x.split(" "))

// Count up the occurrences of each word
val wordCounts = words.countByValue()

// Print the results
wordCounts.foreach(println)
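One caveat worth noting: countByValue() returns a plain Scala Map on the driver, not an RDD, so the RDD sortBy() method described below cannot be called on it directly. A common alternative, sketched here and not part of the original assignment code, is to keep the counts in an RDD of (word, count) pairs:

// Equivalent word counting that stays in an RDD, so RDD
// transformations such as sortBy() can still be applied
val wordCountsRDD = words.map(word => (word, 1)).reduceByKey(_ + _)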
To find the top 500 words, you will have to sort the words by their occurrences. They are organized as (key, value) pairs, so we need to sort by value. Here is the prototype of the sortBy() function defined in https://spark.apache.org/docs/latest/api/scala/org/apache/spark/rdd/RDD.html
def sortBy[K](f: (T) => K, ascending: Boolean = true, numPartitions: Int = this.partitions.length)(implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T]
Return this RDD sorted by the given key function.
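Putting the pieces together, here is a minimal sketch of how the top 500 words could be extracted. It assumes the wordCountsRDD of (word, count) pairs from the sketch above; the variable names are illustrative, not prescribed by the assignment. The key step is sorting the pairs by their count with ascending = false:

// Sort the (word, count) pairs by count, largest first
val sorted = wordCountsRDD.sortBy(pair => pair._2, ascending = false)

// Take the first 500 pairs and print them
val top500 = sorted.take(500)
top500.foreach(println)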
