The Big Idea: What Took Down Amazon Web Services Was No Leprechaun, but Human Error.



Accuracy matters. It really does.


In blogging, and sometimes in the analog world, mistakes are given a pass. Maybe it comes as a relief after years of being in school and losing points for the odd grammatical hiccup. Even professional writers fall back on a “Mistakes happen” excuse, but I return to those two all-important words that kicked off today’s Big Idea.

But how seriously can you take a single mistake? You could always ask Amazon Web Services, who suffered a major outage across their system, all on account of one mistyped command.

First, you should know a bit about Amazon Web Services and what they do, if you don’t already. Amazon Web Services (or AWS, for short) is a secure cloud services platform, offering computing power, database storage, content delivery and other services that aid businesses in their everyday operations. Companies around the world rely on Amazon Web Services to form the backbone of their presence on the Internet, using AWS to build sophisticated applications that are flexible, scalable and reliable.

And at the end of February, the Amazon Simple Storage Service (S3), which is the cloud storage part of AWS, was disrupted.

So when AWS suffers a problem, everybody notices.

Websites and online services reliant on AWS started disappearing, and then started spewing out error codes. It took Amazon several hours to get a handle on the problem, but we now know what happened. An S3 engineer was looking into an issue causing the S3 billing system to run slowly, hoping to improve delivery time to AWS customers. To fix the problem, the S3 team decided to take a small number of servers offline for testing. Just a few servers, so that Amazon could find the problem, come up with a solution, and then replicate the solution on all the other servers. Sounds like a plan, right?

So they ran a command that was supposed to take a few servers offline. The problem is that the S3 engineer mistyped the command! The typo did not take out a few servers. It was more like a boatload of servers.

What makes this confusing is that a single typo shouldn’t have caused a major outage. But the typo revealed that some of the servers taken offline were key to a couple of S3 subsystems. One of those was the indexing subsystem, which handles metadata and location information for all S3 objects in the affected East Coast region (US-EAST-1, in Northern Virginia). Lose this indexing subsystem, and websites can’t find where their images are stored, so they can’t load them.
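A toy sketch may help show why losing an index is so damaging even when the stored data itself is untouched. This is only an illustration in Python: the dictionaries, the `fetch` function, and the key and location names are all made up for the example, and none of it reflects how S3 is actually built.

```python
# Toy model: the stored bytes and the index that says where they live
# are separate pieces. Every name here is hypothetical.
storage = {"disk-7/block-42": b"<logo image bytes>"}
index = {"images/logo.png": "disk-7/block-42"}


def fetch(key, index_available=True):
    """Look up an object by key, the way a website would request an image."""
    if not index_available:
        # The data is still sitting in storage, but nothing can say where.
        raise LookupError(f"index unavailable, cannot locate {key!r}")
    location = index[key]
    return storage[location]


# Normal operation: the index points at the data, so the request succeeds.
print(fetch("images/logo.png"))

# Index down: the same request fails even though no data was lost.
try:
    fetch("images/logo.png", index_available=False)
except LookupError as err:
    print("request failed:", err)
```

The point of the sketch is that the second request fails without a single byte being deleted, which is roughly the position websites found themselves in.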

But that wasn’t the only subsystem affected. The other was the placement subsystem, which handles storage allocation for new S3 objects. It shut down as well.

With these systems gone, things were in bad shape.

Both of those subsystems had to be restarted, but by then the system was falling like an elaborate domino display. Other parts of AWS started to fail, including the Amazon S3 console, Elastic Compute Cloud (EC2), Elastic Block Store (EBS), and AWS Lambda, and the S3 APIs couldn’t be accessed. All this culminated in a complete meltdown of the system, which took several hours to fix because Amazon had not fully restarted these two subsystems in its larger regions for many years.

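In the wake of this, Amazon has gone on record to say it is changing its procedures: the tool involved will now remove capacity more slowly, and it will refuse to take a subsystem below its minimum required capacity. I can see why they would do that.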
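As a rough illustration of what a procedural change like that might look like, here is a minimal Python sketch of a guardrail that caps how many servers one command may remove and refuses to leave fewer than a minimum number running. Every name and threshold in it (`remove_servers`, `max_per_run`, `min_required`, the server names) is hypothetical, and the per-run cap is only a stand-in for the idea of removing capacity slowly. It is not Amazon's tooling.

```python
class CapacityError(Exception):
    """Raised when a removal request would take out too much capacity."""


def remove_servers(active, requested, min_required, max_per_run=2):
    """Return the servers left after removing `requested` from `active`.

    All names and thresholds are hypothetical, chosen for illustration.
    """
    to_remove = [s for s in requested if s in active]

    # Guardrail 1: cap how much capacity a single command may remove.
    if len(to_remove) > max_per_run:
        raise CapacityError(
            f"refusing to remove {len(to_remove)} servers in one run "
            f"(limit is {max_per_run})"
        )

    # Guardrail 2: never drop below the minimum the subsystem needs.
    remaining = [s for s in active if s not in to_remove]
    if len(remaining) < min_required:
        raise CapacityError(
            f"only {len(remaining)} servers would remain "
            f"(minimum is {min_required})"
        )
    return remaining


fleet = [f"server-{i}" for i in range(10)]

# Intended command: take two servers offline for testing.
print(len(remove_servers(fleet, ["server-0", "server-1"], min_required=6)))

# Mistyped command: the input accidentally matches most of the fleet.
try:
    remove_servers(fleet, fleet[:8], min_required=6)
except CapacityError as err:
    print("blocked:", err)
```

With checks like these, the typo still happens, but the command stops with an error instead of taking a boatload of servers with it.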

So the takeaway from this tragic tale worthy of an Irish ballad is that regardless of how big and robust a service becomes, it only takes one human making a mistake to bring the whole system down. Leprechauns need not apply.





A research physicist who has become an entrepreneur and educational leader, and an expert on competency-based education, critical thinking in the classroom, curriculum development, and education management, Dr. Richard Shurtz is the president and chief executive officer of Stratford University. He has published over 30 technical publications, holds 15 patents, and is host of the weekly radio show, Tech Talk. A noted expert on competency-based education, Dr. Shurtz has conducted numerous workshops and seminars for educators in Jamaica, Egypt, India, and China, and has established academic partnerships in China, India, Sri Lanka, Kurdistan, Malaysia, and Canada.