Roughly 48 hours after its major service outage, Amazon is admitting what caused the problem. Apparently, some poor engineer at Amazon Web Services (AWS) did an oopsie and brought the internet to its knees. Oopsies are the worst!
In all seriousness, it’s a sobering story. Here’s how Amazon described it in a recent blog post:
At 9:37AM PST, an authorised S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.
We’ve all been there. You push the wrong button and end up getting Sprite instead of Coke. But this poor guy probably made an errant keystroke that crippled AWS’s S3 storage service for at least four hours. Since about a third of all internet traffic reportedly flows through AWS servers, removing a whole bunch of those servers screwed up a few people’s days.
In theory, a series of failsafes should keep the fallout from such errors localised, but Amazon says that some of the key systems involved hadn’t been fully restarted in many years and “took longer than expected” to come back online.
The company now claims it’s “making several changes as a result of this operational event.” One of these changes will involve modifying the tool so that a large number of servers can’t be removed at once. Which makes total sense, but still doesn’t solve the problem of unknown unknowns (like, say, a slower than expected restart) on an internet that relies so heavily on a single service.
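For the curious, the kind of guardrail Amazon is describing is simple to picture. Here’s a minimal sketch, with entirely made-up names and thresholds (this is not AWS’s actual tooling): a removal command that caps any single request at a small fraction of the fleet and refuses to drop below a minimum capacity floor.

```python
# Hypothetical sketch of a "can't remove too many servers at once" guardrail.
# MIN_CAPACITY and MAX_REMOVAL_FRACTION are illustrative numbers, not AWS's.

MIN_CAPACITY = 100            # never leave fewer than this many servers running
MAX_REMOVAL_FRACTION = 0.05   # cap any single removal at 5% of the fleet

def safe_remove(fleet_size: int, requested: int) -> int:
    """Return how many servers the tool will actually remove."""
    # Cap the request at a small slice of the fleet.
    cap = int(fleet_size * MAX_REMOVAL_FRACTION)
    allowed = min(requested, cap)
    # Never dip below the minimum capacity floor.
    if fleet_size - allowed < MIN_CAPACITY:
        allowed = max(fleet_size - MIN_CAPACITY, 0)
    return allowed

# A fat-fingered "remove 4,000" against a 5,000-server fleet removes only 250.
print(safe_remove(5000, 4000))
```

With a check like this in place, a mistyped input becomes a small, recoverable mistake instead of a four-hour outage.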
In the meantime, let this serve as a shoutout to that poor AWS engineer who made a tiny mistake that led to major consequences. We’re having a rough year, too.
We’ve reached out to Amazon to find out more details about the incident, specifically the fate of the poor engineer who caused the problem. We’ll update this post when we hear back. [Amazon]