Question: Improving Emergency Failovers

Improving Emergency Failovers. Case study adapted from: [T. A. Limoncelli, C. J. Hogan, S. R. Chalup, The Practice of System and Network Administration, Volume 1: DevOps and Other Best Practices for Enterprise IT, 3rd ed., Addison-Wesley, 2017.]

Stack Overflow's main web site infrastructure is in a datacentre in New York City. If the datacentre fails or needs to be taken down for maintenance, duplicate equipment and software are running in Colorado. The duplicate in Colorado is a running and functional copy, except that it is in standby mode, waiting to be activated. Database updates in NYC are replicated to Colorado. A planned switch to Colorado will result in no lost data. In the event of an unplanned failover, for example as the result of a power outage, the system will lose an acceptably small quantity of updates.
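To make the planned-versus-unplanned distinction concrete, the sketch below shows the decision logic in miniature. It is an illustration only: the case study does not describe Stack Overflow's actual tooling, and the helpers replication_lag_seconds, promote_standby, and redirect_traffic are hypothetical stubs for a generic primary/standby pair.

    # Minimal sketch of the planned-vs-unplanned failover decision described above.
    # All helper functions are hypothetical stubs, not any real site's tooling.
    import time

    def replication_lag_seconds() -> float:
        """Hypothetical stub: how far the Colorado standby trails the NYC primary."""
        return 0.0

    def promote_standby() -> None:
        """Hypothetical stub: make the Colorado copy the writable primary."""
        print("standby promoted")

    def redirect_traffic() -> None:
        """Hypothetical stub: point load balancers/DNS at the new primary."""
        print("traffic redirected")

    def failover(planned: bool) -> None:
        if planned:
            # A planned switch loses no data: stop writes on the primary and wait
            # until the standby has received every update before promoting it.
            while replication_lag_seconds() > 0:
                time.sleep(1)
        else:
            # An unplanned failover (e.g. a power outage) cannot wait for the dead
            # primary; updates still inside the replication-lag window are lost,
            # which the case study treats as an acceptably small amount.
            print(f"estimated data-loss window: ~{replication_lag_seconds():.0f}s")
        promote_standby()
        redirect_traffic()

    if __name__ == "__main__":
        failover(planned=True)

The point is simply that a planned switch can wait for replication to catch up and lose nothing, while an unplanned one can only bound its loss by how far the standby lags.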

The failover process is complex. Database masters need to be transitioned. Services need to be reconfigured. It takes a long time and requires skills from four different teams. Every time the process happens, it fails in new and exciting ways, requiring ad hoc solutions invented by whoever is doing the procedure. In other words, the failover process is risky. When Tom was hired at Stack Overflow, his first thought was, "I hope I'm not on call when we have that kind of emergency." Drunk driving is risky, so we avoid doing it. Failovers are risky, so we should avoid them, too. Right?

Wrong. There is a difference between behaviour and process. Risky behaviours are inherently risky; they cannot be made less risky. Drunk driving is a risky behaviour. It cannot be done safely, only avoided. A failover is a risky process. A risky process can be made less risky by doing it more often. The next time a failover was attempted at Stack Overflow, it took ten hours. The infrastructure in New York had diverged from Colorado significantly. Code that was supposed to seamlessly fail over had been tested only in isolation and failed when used in a real environment. Unexpected dependencies were discovered, in some cases creating Catch-22 situations that had to be resolved in the heat of the moment.

This ten-hour ordeal was the result of big batches. Because failovers happened rarely, there was an accumulation of infrastructure skew, dependencies, and stale code. There was also an accumulation of ignorance: new hires had never experienced the process; others had fallen out of practice. To fix this problem the team decided to do more failovers. The batch size here was the number of accumulated changes and other factors that led to problems during a failover. Rather than let the batch size grow and grow, the team decided to keep it small.

Rather than waiting for the next real disaster to exercise the failover process, they would introduce simulated disasters. The concept of activating the failover procedure on a system that was working perfectly might seem odd, but it is better to discover bugs and other problems in a controlled situation than during an emergency. Discovering a bug during an emergency at 4 AM is troublesome because those who can fix it may be unavailable, and if they are available, they're certainly unhappy to be awakened. In other words, it is better to discover a problem on Saturday at 10 AM when everyone is awake, available, and presumably sober. If schoolchildren can do fire drills once a month, certainly system administrators can practice failovers a few times a year.

The team began doing failover drills every two months until the process was perfected. Each drill surfaced problems with code, documentation, and procedures. Each issue was filed as a bug and was fixed before the next drill. The next failover took five hours, then two hours, then eventually the drills could be done in an hour with no user-visible downtime.
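One way to make the drill discipline concrete is to script it, as in the hedged sketch below: each step is run in order, the whole drill is timed, and every failure is recorded so it can be filed as a bug before the next drill. The step names and the record_issue helper are hypothetical placeholders, not Stack Overflow's actual process.

    # Illustrative failover-drill harness: run each step, time the whole drill,
    # and record every failure so it can be fixed before the next drill.
    import time
    from typing import Callable, List, Tuple

    def record_issue(step_name: str, error: Exception) -> None:
        """Hypothetical stub: file a bug for a step that failed during the drill."""
        print(f"ISSUE to fix before next drill: {step_name}: {error}")

    def run_drill(steps: List[Tuple[str, Callable[[], None]]]) -> None:
        """Run every step, time the whole drill, and count what needs fixing."""
        start = time.monotonic()
        failures = 0
        for name, step in steps:
            try:
                step()
                print(f"ok    {name}")
            except Exception as exc:   # a real harness would catch more selectively
                failures += 1
                record_issue(name, exc)
        minutes = (time.monotonic() - start) / 60
        print(f"drill finished in {minutes:.1f} min with {failures} issue(s)")

    if __name__ == "__main__":
        # Placeholder steps mirroring the case study: transition database masters,
        # reconfigure services, verify the standby site serves traffic.
        run_drill([
            ("transition database masters",   lambda: None),
            ("reconfigure services",          lambda: None),
            ("verify standby serves traffic", lambda: None),
        ])

Timing every run is what makes the trajectory the case study describes, ten hours, then five, then two, then one, measurable rather than anecdotal.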

The drills found infrastructure changes that had not been replicated in Colorado and code that didn't fail over properly. They identified new services that hadn't been engineered for smooth failover. They discovered a process that could be done only by one particular engineer. If he was on vacation or unavailable, the company would be in trouble. He was a single point of failure.

Answer the following questions:

1) Considering the above scenario, explain whether it is better to fail over or take down a perfectly running system than to wait until it fails on its own. Justify your answer. [6 marks]

2) In your opinion, what is the difference between behaviour and process? [6 marks]

3) Discuss why big batches are riskier than small batches. [7 marks]

4) Considering the above scenario, discuss why it is better to have a small system improvement now than a large improvement a year from now. [6 marks]
