Amazon reveals the cause of massive AWS outage, and it’s really embarrassing

Amazon has finally revealed the cause of the lengthy outage that knocked dozens of internet services offline for hours, and it’s pretty embarrassing.

The cause, according to the company, which posted a very wordy explanation on its website Thursday, was “human error.” That sounds bad enough until you find out exactly what the “human error” was: a typo.

 

According to Amazon, the trouble began Tuesday morning when an employee conducting routine maintenance mistakenly entered the wrong command while trying to take “a small number of servers” offline. Instead, that command took down “a larger set of servers,” including those that support two S3 subsystems. (S3, Amazon’s Simple Storage Service, is a data storage service used by a number of web-based services.)

Losing these two much larger systems is what knocked so many services offline, including the ones Amazon needs to update its own status page. Making matters worse, the systems had not been completely restarted “for many years,” according to Amazon, so the restart process “took longer than expected.”

To recap: an employee’s typo caused a devastating chain of events that knocked a number of critical systems offline, and Amazon wasn’t prepared for how long it would take to bring them back online.

That certainly looks bad, and Amazon seems to acknowledge as much, pledging that it has taken new steps to ensure such an event can’t happen in the future. The company says it has added “safeguards” to prevent its systems from being taken completely offline and is “reprioritizing” work to improve the recovery time of offline systems.
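Amazon has not published the code behind these “safeguards,” so the following is only a rough sketch of the general idea: a removal tool that refuses any request that would leave too few servers running. The function name `plan_removal`, the `min_active` floor, and the server names are all hypothetical and are not taken from Amazon's tooling.

```python
class CapacityError(Exception):
    """Raised when a removal request would leave too few servers running."""


def plan_removal(active, requested, min_active):
    """Return the servers to take offline, or raise if the request is unsafe.

    active:     set of server names currently in service
    requested:  iterable of server names the operator asked to remove
    min_active: hypothetical floor on how many servers must stay online
    """
    # Ignore requested names that are not actually in service.
    to_remove = set(requested) & set(active)
    remaining = len(active) - len(to_remove)

    # The safeguard: reject the whole request rather than run it partially.
    if remaining < min_active:
        raise CapacityError(
            f"refusing to remove {len(to_remove)} servers: "
            f"{remaining} would remain, minimum is {min_active}"
        )
    return sorted(to_remove)


if __name__ == "__main__":
    fleet = {f"srv-{i}" for i in range(10)}

    # A small, intended removal passes the check.
    print(plan_removal(fleet, ["srv-1", "srv-2"], min_active=6))

    # A mistyped command that selects the whole fleet is rejected.
    try:
        plan_removal(fleet, fleet, min_active=6)
    except CapacityError as err:
        print("blocked:", err)
```

With a check like this, a mistyped argument that selects far more servers than intended fails loudly instead of taking the service down.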

“We want to apologize for the impact this event caused for our customers,” the company wrote. “We will do everything we can to learn from this event and use it to improve our availability even further.”

Source: mashable.com

My name is Shashank Shekhar. I am a Software Engineer, currently working at one of the leading web-hosting companies in India. I have 2 years of experience in Linux Server Administration.

I love working in Linux environments and learning new things.
