Question:
For the following problem, describe how you would solve it using MapReduce. The input is a list of documents (ID, text). The output should be the count of each word over all documents. You are given only two machines.

1) We don't count stop words. List the words you want to count. What would be the final word frequency result?

2) Explain how the input is mapped into (key, value) pairs by the map stage, i.e., specify the key and the associated value in each pair and, if needed, how the key(s) and value(s) are computed. Then explain how the (key, value) pairs produced by the map stage are processed by the reduce stage to get the final answer(s).

3) At the beginning, if the first machine stores the first three documents and the second machine stores the last document, what should be stored on the two machines after the shuffling stage to make sure the computation time of the remaining process is minimal?

4) Now there are 5 documents: At the beginning, how would you distribute them across the two machines to minimize the computation time of the whole MapReduce process? Explain in detail.
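The map, shuffle, and reduce stages asked about in part 2 can be sketched in a few lines of plain Python. This is only an illustrative single-process simulation, not a real MapReduce job: the stop-word list, the three sample documents, and the function names (`map_document`, `shuffle`, `reduce_partition`, `word_count`) are all assumptions made for the example, since the question's own documents are not shown here.

```python
from collections import defaultdict

# Hypothetical stop-word list; the question leaves this choice to the solver.
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "to", "in"}


def map_document(doc_id, text):
    """Map stage: emit one (word, 1) pair per non-stop word in a document."""
    pairs = []
    for token in text.lower().split():
        word = token.strip(".,;:!?\"'()")
        if word and word not in STOP_WORDS:
            pairs.append((word, 1))  # key = word, value = 1
    return pairs


def shuffle(mapped_pairs, num_reducers=2):
    """Shuffle stage: send all pairs that share a key to the same reducer."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for word, count in mapped_pairs:
        # Simple deterministic partitioner (an assumption for this sketch).
        index = sum(word.encode("utf-8")) % num_reducers
        partitions[index][word].append(count)
    return partitions


def reduce_partition(partition):
    """Reduce stage: sum the list of values collected for each key."""
    return {word: sum(counts) for word, counts in partition.items()}


def word_count(documents, num_reducers=2):
    mapped = []
    for doc_id, text in documents:
        mapped.extend(map_document(doc_id, text))
    result = {}
    for partition in shuffle(mapped, num_reducers):
        result.update(reduce_partition(partition))
    return result


# Hypothetical documents, used only to exercise the sketch.
documents = [
    (1, "The cat sat in the hat."),
    (2, "A cat and a dog."),
    (3, "The dog sat."),
]
counts = word_count(documents)
print(counts)
```

The partitioner above only guarantees that each word lands on exactly one reducer. It makes no attempt to balance the number of pairs per machine, which is the concern raised in part 3, so a real answer would likely assign keys so that both machines receive roughly equal work.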
