Question: Improving Emergency Failovers

Improving Emergency Failovers. Case study adapted from: [T. A. Limoncelli, C. J. Hogan, S. R. Chalup, The Practice of System and Network Administration, Volume 1: DevOps and Other Best Practices for Enterprise IT, 3rd ed., Addison-Wesley, 2017.]

Stack Overflow's main web site infrastructure is in a datacentre in New York City. If the datacentre fails or needs to be taken down for maintenance, duplicate equipment and software are running in Colorado. The duplicate in Colorado is a running and functional copy, except that it is in standby mode, waiting to be activated. Database updates in NYC are replicated to Colorado. A planned switch to Colorado will result in no lost data. In the event of an unplanned failover, for example as the result of a power outage, the system will lose an acceptably small quantity of updates.
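To make the planned-versus-unplanned distinction concrete, the sketch below shows the decision logic in miniature. It is an illustration only: the case study does not describe Stack Overflow's actual tooling, and the helpers replication_lag_seconds, promote_standby, and redirect_traffic are hypothetical stubs for a generic primary/standby pair.

    # Minimal sketch of the planned-vs-unplanned failover decision described above.
    # All helper functions are hypothetical stubs, not any real site's tooling.
    import time

    def replication_lag_seconds() -> float:
        """Hypothetical stub: how far the Colorado standby trails the NYC primary."""
        return 0.0

    def promote_standby() -> None:
        """Hypothetical stub: make the Colorado copy the writable primary."""
        print("standby promoted")

    def redirect_traffic() -> None:
        """Hypothetical stub: point load balancers/DNS at the new primary."""
        print("traffic redirected")

    def failover(planned: bool) -> None:
        if planned:
            # A planned switch loses no data: stop writes on the primary and wait
            # until the standby has received every update before promoting it.
            while replication_lag_seconds() > 0:
                time.sleep(1)
        else:
            # An unplanned failover (e.g. a power outage) cannot wait for the dead
            # primary; updates still inside the replication-lag window are lost,
            # which the case study treats as an acceptably small amount.
            print(f"estimated data-loss window: ~{replication_lag_seconds():.0f}s")
        promote_standby()
        redirect_traffic()

    if __name__ == "__main__":
        failover(planned=True)

The point is simply that a planned switch can wait for replication to catch up and lose nothing, while an unplanned one can only bound its loss by how far the standby lags.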

The failover process is complex. Database masters need to be transitioned. Services need to be reconfigured. It takes a long time and requires skills from four different teams. Every time the process happens, it fails in new and exciting ways, requiring ad hoc solutions invented by whoever is doing the procedure. In other words, the failover process is risky. When Tom was hired at Stack Overflow, his first thought was, "I hope I'm not on call when we have that kind of emergency." Drunk driving is risky, so we avoid doing it. Failovers are risky, so we should avoid them, too. Right?

Wrong. There is a difference between behaviour and process. Risky behaviours are inherently risky; they cannot be made less risky. Drunk driving is a risky behaviour. It cannot be done safely, only avoided. A failover is a risky process. A risky process can be made less risky by doing it more often. The next time a failover was attempted at Stack Overflow, it took ten hours. The infrastructure in New York had diverged from Colorado significantly. Code that was supposed to seamlessly fail over had been tested only in isolation and failed when used in a real environment. Unexpected dependencies were discovered, in some cases creating Catch-22 situations that had to be resolved in the heat of the moment.

This ten-hour ordeal was the result of big batches. Because failovers happened rarely, there was an accumulation of infrastructure skew, dependencies, and stale code. There was also an accumulation of ignorance: new hires had never experienced the process; others had fallen out of practice. To fix this problem the team decided to do more failovers. The batch size here was the number of accumulated changes and other factors that led to problems during a failover. Rather than let the batch size grow and grow, the team decided to keep it small.

Rather than waiting for the next real disaster to exercise the failover process, they would introduce simulated disasters. The concept of activating the failover procedure on a system that was working perfectly might seem odd, but it is better to discover bugs and other problems in a controlled situation than during an emergency. Discovering a bug during an emergency at 4 AM is troublesome because those who can fix it may be unavailable, and if they are available, they're certainly unhappy to be awakened. In other words, it is better to discover a problem on Saturday at 10 AM when everyone is awake, available, and presumably sober. If schoolchildren can do fire drills once a month, certainly system administrators can practice failovers a few times a year.

The team began doing failover drills every two months until the process was perfected. Each drill surfaced problems with code, documentation, and procedures. Each issue was filed as a bug and was fixed before the next drill. The next failover took five hours, then two hours, then eventually the drills could be done in an hour with no user-visible downtime.
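One way to make the drill discipline concrete is to script it, as in the hedged sketch below: each step is run in order, the whole drill is timed, and every failure is recorded so it can be filed as a bug before the next drill. The step names and the record_issue helper are hypothetical placeholders, not Stack Overflow's actual process.

    # Illustrative failover-drill harness: run each step, time the whole drill,
    # and record every failure so it can be fixed before the next drill.
    import time
    from typing import Callable, List, Tuple

    def record_issue(step_name: str, error: Exception) -> None:
        """Hypothetical stub: file a bug for a step that failed during the drill."""
        print(f"ISSUE to fix before next drill: {step_name}: {error}")

    def run_drill(steps: List[Tuple[str, Callable[[], None]]]) -> None:
        """Run every step, time the whole drill, and count what needs fixing."""
        start = time.monotonic()
        failures = 0
        for name, step in steps:
            try:
                step()
                print(f"ok    {name}")
            except Exception as exc:   # a real harness would catch more selectively
                failures += 1
                record_issue(name, exc)
        minutes = (time.monotonic() - start) / 60
        print(f"drill finished in {minutes:.1f} min with {failures} issue(s)")

    if __name__ == "__main__":
        # Placeholder steps mirroring the case study: transition database masters,
        # reconfigure services, verify the standby site serves traffic.
        run_drill([
            ("transition database masters",   lambda: None),
            ("reconfigure services",          lambda: None),
            ("verify standby serves traffic", lambda: None),
        ])

Timing every run is what makes the trajectory the case study describes, ten hours, then five, then two, then one, measurable rather than anecdotal.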

The drills found infrastructure changes that had not been replicated in Colorado and code that didn't fail over properly. They identified new services that hadn't been engineered for smooth failover. They discovered a process that could be done only by one particular engineer. If he was on vacation or unavailable, the company would be in trouble. He was a single point of failure.

Answer the following questions:

1) Considering the above scenario, explain whether it is better to fail over or take down a perfectly running system than to wait until it fails on its own. Justify your answer. [6 marks]

2) In your opinion, what is the difference between behaviour and process? [6 marks]

3) Discuss why big batches are riskier than small batches. [7 marks]

4) Considering the above scenario, discuss why it is better to have a small system improvement now than a large improvement a year from now. [6 marks]
