WSC programmers often use data replication to overcome failures in the software. Hadoop HDFS, for example, employs three-way replication (one local copy, one remote copy in the rack, and one remote copy in a separate rack), but it’s worth examining when such replication is needed.

a. Let us assume that Hadoop clusters are relatively small, with 10 nodes or less, and with dataset sizes of 10 TB or less. Using the failure frequency data in Figure 6.1, what kind of availability does a 10-node Hadoop cluster have with one-, two-, and three-way replications?

b. Assuming the failure data in Figure 6.1 and a 1000-node Hadoop cluster, what kind of availability does it have with one-, two-, and three-way replications? What can you reason about the benefits of replication, at scale?

Figure 6.1

Approx. number of events in 1st year | Cause | Consequence
-------------------------------------|-------|------------
1 or 2 | Power utility failures | Lose power to whole WSC; doesn't bring down WSC if UPS and generators work (generators work about 99% of time).
4 | Cluster upgrades | Planned outage to upgrade infrastructure, many times for evolving networking needs such as recabling, to switch firmware upgrades, and so on. There are about nine planned cluster outages for every unplanned outage.
1000s | Hard-drive failures | 2%-10% annual disk failure rate (Pinheiro et al., 2007)
 | Slow disks | Still operate, but run 10x to 20x more slowly
 | Bad memories | One uncorrectable DRAM error per year (Schroeder et al., 2009)
 | Misconfigured machines | Configuration led to ~30% of service disruptions (Barroso and Hölzle, 2009)
 | Flaky machines | 1% of servers reboot more than once a week (Barroso and Hölzle, 2009)
5000 | Individual server crashes | Machine reboot; typically takes about 5 min (caused by problems in software or hardware).

Step by Step Solution


To answer this question, we calculate the availability of the Hadoop cluster by considering the impact of the failure events on the nodes and how replication can mitigate them. Availabili...
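One way to make this approach concrete is the rough sketch below. It is not the expert's hidden solution, and it rests on several assumptions that are not stated on this page: that Figure 6.1 describes a cluster of about 2400 servers (the constant `FIGURE_CLUSTER_NODES`), that only the "individual server crashes" row matters (5000 reboots per year, about 5 minutes each), that nodes fail independently, and that the dataset counts as available only when no replica group is entirely down, with one hypothetical replica group per node. All names in the code are illustrative.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60

# Assumptions, not stated on this page: Figure 6.1 is taken to describe a
# cluster of about 2400 servers, and only individual server crashes count.
FIGURE_CLUSTER_NODES = 2400
CRASHES_PER_YEAR = 5000      # "Individual server crashes" row of Figure 6.1
MINUTES_PER_CRASH = 5        # "typically takes about 5 min"


def node_unavailability():
    """Fraction of the year a single node spends rebooting."""
    downtime_minutes = CRASHES_PER_YEAR / FIGURE_CLUSTER_NODES * MINUTES_PER_CRASH
    return downtime_minutes / MINUTES_PER_YEAR


def cluster_unavailability(nodes, replicas, u=None):
    """Probability that at least one replica group is entirely down.

    Simplified model: one replica group per node, each group down with
    probability u**replicas, groups treated as independent.
    """
    if u is None:
        u = node_unavailability()
    group_down = u ** replicas
    # 1 - (1 - group_down)**nodes, computed stably for very small values.
    return -math.expm1(nodes * math.log1p(-group_down))


if __name__ == "__main__":
    for nodes in (10, 1000):
        for replicas in (1, 2, 3):
            unavail = cluster_unavailability(nodes, replicas)
            print(f"{nodes:>5} nodes, {replicas}-way replication: "
                  f"unavailability ~ {unavail:.3e}")
```

Under this model, unavailability with a single copy grows roughly in proportion to the node count, while each extra replica multiplies it by the per-node unavailability, which is why replication pays off more as the cluster grows. The sketch ignores disk failures, planned outages, and correlated failures such as losing a whole rack, which is the case the separate-rack copy in HDFS is meant to cover.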
