Question:

MapReduce enables large amounts of parallelism by having data-independent tasks run on multiple nodes, often using commodity hardware; however, there are limits to the level of parallelism. For example, for redundancy MapReduce will write data blocks to multiple nodes, consuming disk and, potentially, network bandwidth. Assume a total dataset size of 300 GB, a network bandwidth of 1 Gb/s, a 10 s/GB map rate, and a 20 s/GB reduce rate. Also assume that 30% of the data must be read from remote nodes, and each output file is written to two other nodes for redundancy. Use Figure 6.6 for all other parameters.

a. Assume that all nodes are in the same rack. What is the expected runtime with 5 nodes? 10 nodes? 100 nodes? 1000 nodes? Discuss the bottlenecks at each node size.

b. Assume that there are 40 nodes per rack and that any remote read/write has an equal chance of going to any node. What is the expected runtime at 100 nodes? 1000 nodes?

c. An important consideration is minimizing data movement as much as possible. Given the significant slowdown of going from local to rack to array accesses, software must be strongly optimized to maximize locality. Assume that there are 40 nodes per rack and that 1000 nodes are used in the MapReduce job. What is the runtime if remote accesses are within the same rack 20% of the time? 50% of the time? 80% of the time?

d. Given the simple MapReduce program in Section 6.2, discuss some possible optimizations to maximize the locality of the workload.


Month  | MapReduce jobs | Avg. completion time (s) | Avg. servers per job | Avg. cores per server | CPU core-years | Input data (PB) | Intermediate data (PB) | Output data (PB)
Sep-16 | 95,775,891     | 331 | 130 | 2.4 | 311,691 | 11,553 | 4,095 | 6,982
Sep-15 | 115,375,750    | 231 | 120 | 2.7 | 272,322 | 8,307  | 3,980 | 5,801
Sep-14 | 55,913,646     | 412 | 142 | 1.9 | 200,778 | 5,989  | 2,530 | 3,951
Sep-13 | 28,328,775     | 469 | 137 | 1.4 | 81,992  | 2,579  | 1,193 | 1,684
Sep-12 | 15,662,118     | 480 | 142 | 1.8 | 60,987  | 2,171  | 818   | 874
Sep-11 | 7,961,481      | 499 | 147 | 2.2 | 40,993  | 1,162  | 276   | 333
Sep-10 | 5,207,069      | 714 | 164 | 1.6 | 30,262  | 573    | 139   | 37
Sep-09 | 4,114,919      | 515 | 156 | 3.2 | 33,582  | 548    | 118   | 99
Sep-07 | 2,217,000      | 395 | 394 | 1.0 | 11,081  | 394    | 34    | 14
Mar-06 | 171,000        | 874 | 268 | 1.6 | 2,002   | 51     | 7     | 3
Aug-04 | 29,000         | 634 | 157 | 1.9 | 217     | 3.2    | 0.7   | 0.2

Figure 6.2 Monthly MapReduce usage at Google from 2004 to 2016. Over 12 years the number of MapReduce jobs increased by a factor of 3300. Figure 6.17 on page 461 estimates that running the September 2016 workload on Amazon's cloud computing service EC2 would cost $114 million. Updated from Dean, J., 2009. Designs, lessons and advice from building large distributed systems [keynote address]. In: Proceedings of the 3rd ACM SIGOPS International Workshop on Large-Scale Distributed Systems and Middleware, co-located with the 22nd ACM Symposium on Operating Systems Principles, October 11-14, 2009, Big Sky, Montana.

Step-by-Step Solution


Based on the question and the information provided, we can estimate the expected runtime of a MapReduce job for the different node counts and scenarios. Note, however, that Figure 6.6, which supplies the remaining parameters (local, rack, and array access bandwidths and latencies), is not reproduced on this page, so any values taken from it below are assumptions.
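Since the unlocked portion of the answer is cut off above, here is a minimal back-of-envelope sketch of the part (a) model, not the expert's full answer. Beyond the problem statement it assumes that the output size equals the input size, that compute and network transfer do not overlap, and that each node's 1 Gb/s link (0.125 GB/s) is the only network constraint; Figure 6.6's disk parameters are deliberately left out because the figure is not reproduced here.

```python
# Back-of-envelope model for part (a): all nodes in one rack.
# Assumptions (not from the problem statement): output size equals
# input size, and compute and network phases do not overlap.

DATA_GB = 300.0          # total dataset size
MAP_S_PER_GB = 10.0      # map rate
REDUCE_S_PER_GB = 20.0   # reduce rate
NET_GBPS = 1.0 / 8.0     # 1 Gb/s link = 0.125 GB/s per node
REMOTE_READ_FRAC = 0.3   # 30% of the input is read from remote nodes
REPLICAS = 2             # each output file is written to two other nodes

def runtime_single_rack(nodes: int) -> float:
    """Expected runtime in seconds with all nodes in one rack."""
    data_per_node = DATA_GB / nodes
    compute_s = data_per_node * (MAP_S_PER_GB + REDUCE_S_PER_GB)
    # Network GB per node: remote input reads plus replicated output
    # writes (output size assumed equal to input size).
    net_gb = data_per_node * (REMOTE_READ_FRAC + REPLICAS)
    network_s = net_gb / NET_GBPS
    return compute_s + network_s

for n in (5, 10, 100, 1000):
    print(f"{n:5d} nodes: {runtime_single_rack(n):8.1f} s")
```

Under these assumptions compute scales as 9000/N s and network as 5520/N s, so the map/reduce rates are the bottleneck at every node count; at 1000 nodes each node handles only 0.3 GB, and fixed per-task startup and scheduling overheads (not modeled here) would likely dominate in practice.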

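Parts (b) and (c) additionally need the rack and array bandwidths from Figure 6.6. Because the figure is not reproduced on this page, the sketch below parameterizes them: RACK_GBPS and ARRAY_GBPS are placeholder values to be replaced with the figure's numbers, and the structure of the calculation is the point. For part (b), a remote access from one node has 39 rack-mates among the other N - 1 nodes, so with uniformly random placement the expected in-rack fraction is 39/(N - 1).

```python
# Sketch for parts (b) and (c): split remote traffic between in-rack
# and cross-rack links. RACK_GBPS and ARRAY_GBPS are placeholders;
# substitute the actual bandwidths from Figure 6.6.

DATA_GB = 300.0
MAP_S_PER_GB = 10.0
REDUCE_S_PER_GB = 20.0
REMOTE_READ_FRAC = 0.3
REPLICAS = 2
NODES_PER_RACK = 40

RACK_GBPS = 0.125        # assumed: 1 Gb/s within a rack
ARRAY_GBPS = 0.0125      # assumed: 10x slower across the array switch

def runtime(nodes: int, in_rack_frac: float) -> float:
    """Runtime (s) when in_rack_frac of remote accesses stay in-rack."""
    data_per_node = DATA_GB / nodes
    compute_s = data_per_node * (MAP_S_PER_GB + REDUCE_S_PER_GB)
    net_gb = data_per_node * (REMOTE_READ_FRAC + REPLICAS)
    network_s = (net_gb * in_rack_frac / RACK_GBPS
                 + net_gb * (1.0 - in_rack_frac) / ARRAY_GBPS)
    return compute_s + network_s

# Part (b): remote accesses go to a uniformly random other node, so the
# in-rack fraction is (NODES_PER_RACK - 1) / (nodes - 1).
for n in (100, 1000):
    f = (NODES_PER_RACK - 1) / (n - 1)
    print(f"(b) {n:4d} nodes: in-rack {f:.2%}, {runtime(n, f):7.1f} s")

# Part (c): 1000 nodes with fixed in-rack fractions.
for f in (0.2, 0.5, 0.8):
    print(f"(c) 1000 nodes, {f:.0%} in-rack: {runtime(1000, f):7.1f} s")
```

The cross-rack term dominates as soon as locality drops, which is the quantitative motivation for part (d): schedule map tasks on (or at least in the same rack as) the nodes holding their input blocks, and use combiners or compression so that less intermediate data crosses the rack switch.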

