Question: You will extend the MapReduce WordCount example from the slides of Scala Software Development to find the top 500 words in a news dataset. This is similar to the idea of a word cloud: we want to know which keywords are mentioned most in a corpus. In general these words are more important than others, but not always. Why? You will know the answer after you complete this homework.

Input: news-2016-2017.txt

Output: top 500 words in news articles.

Ideas: Follow the WordCount example, enter the code in the Spark shell line by line, and observe the results. Note that you don't have to create any class or object in this homework. You may simply exercise the following code from the WordCount program.

// Read each line of the input file into an RDD
val input = sc.textFile("news-2016-2017.txt")

// Split into words separated by a space character; flatMap is a one-to-many transformation
val words = input.flatMap(x => x.split(" "))

// Count up the occurrences of each word
val wordCounts = words.countByValue()

// Print the results
wordCounts.foreach(println)
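Before running this on the cluster, the counting logic itself can be sketched on a plain Scala collection (the lines and words below are made-up stand-ins for the news file, and groupBy/map here play the role that countByValue() plays on the RDD):

```scala
// Toy stand-in for the RDD of lines read from news-2016-2017.txt
val lines = Seq("spark makes word counting easy", "word counting with spark")

// flatMap: split each line into words (one-to-many transformation)
val words = lines.flatMap(_.split(" "))

// countByValue equivalent on a local collection:
// group identical words together, then count each group
val wordCounts = words.groupBy(identity).map { case (w, ws) => (w, ws.size) }

wordCounts.foreach(println)
```

This prints each (word, count) pair, e.g. "word" and "counting" each appear twice in the toy input.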

To find the top 500 words, you will have to sort the words by their occurrences. They will be organized in (key, value) pairs, so we need to sort by value. Note that countByValue() returns a local Scala Map rather than an RDD; to use the RDD sorting API you can instead build the pairs with map(x => (x, 1)).reduceByKey(_ + _). Here is the prototype of the sortBy() function defined in https://spark.apache.org/docs/latest/api/scala/org/apache/spark/rdd/RDD.html

def sortBy[K](f: (T) => K, ascending: Boolean = true, numPartitions: Int = this.partitions.length)(implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T]

Return this RDD sorted by the given key function.
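The sort-by-value-then-take-top-N step can be illustrated on a plain Scala Seq with made-up (word, count) pairs (Seq.sortBy has no ascending parameter, so descending order is expressed by negating the count; on the real pair RDD the analogous call would be sortBy(_._2, ascending = false) followed by take(500)):

```scala
// Hypothetical (word, count) pairs standing in for the RDD of counts
val counts = Seq(("spark", 7), ("word", 12), ("the", 40), ("news", 3))

// Sort by the count (second element of each pair), descending,
// then keep only the top N entries (N = 3 here; 500 for the homework)
val topN = counts.sortBy(-_._2).take(3)

topN.foreach(println)
```

This prints the three most frequent words first, ("the", 40) at the top.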
