Question: You will extend the MapReduce WordCount example from the slides of Scala Software Development to find the top 500 words in a news dataset. This is similar in idea to a word cloud: we want to know which keywords are mentioned most in a corpus. In general these words are more important than the others, but not always. Why? You will know the answer after you complete this homework.
Input: news-2016-2017.txt
Output: the top 500 words in the news articles.
Ideas: Follow the WordCount example, enter the code into the Spark shell line by line, and observe the results. Note that you don't have to create any class or object in this homework. You may simply exercise the following code from the WordCount program.
// Read each line of the news file into an RDD
val input = sc.textFile("news-2016-2017.txt")

// Split each line into words separated by a space character;
// flatMap is a one-to-many transformation
val words = input.flatMap(x => x.split(" "))

// Count up the occurrences of each word
val wordCounts = words.countByValue()

// Print the results
wordCounts.foreach(println)
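One caveat worth noting: countByValue() returns a plain Scala Map on the driver, not an RDD, so the RDD sortBy() method described below cannot be called on it directly. A common alternative, sketched here and not part of the original assignment code, is to keep the counts in an RDD of (word, count) pairs:

// Equivalent word counting that stays in an RDD, so RDD
// transformations such as sortBy() can still be applied
val wordCountsRDD = words.map(word => (word, 1)).reduceByKey(_ + _)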
To find the top 500 words, you will have to sort the words by their occurrences. They are organized as (key, value) pairs, so we need to sort by value. Here is the prototype of the sortBy() function defined in https://spark.apache.org/docs/latest/api/scala/org/apache/spark/rdd/RDD.html
def sortBy[K](f: (T) => K, ascending: Boolean = true, numPartitions: Int = this.partitions.length)(implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T]
Return this RDD sorted by the given key function.
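Putting the pieces together, here is a minimal sketch of how the top 500 words could be extracted. It assumes the wordCountsRDD of (word, count) pairs from the sketch above; the variable names are illustrative, not prescribed by the assignment. The key step is sorting the pairs by their count with ascending = false:

// Sort the (word, count) pairs by count, largest first
val sorted = wordCountsRDD.sortBy(pair => pair._2, ascending = false)

// Take the first 500 pairs and print them
val top500 = sorted.take(500)
top500.foreach(println)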
