Amazon has finally revealed the cause of the lengthy outage that knocked dozens of internet services offline for hours, and it’s pretty embarrassing.
The cause, according to the company, which posted a very wordy explanation on its website Thursday, was “human error.” That sounds bad enough until you find out exactly what the “human error” was: a typo.
According to Amazon, the trouble began Tuesday morning when an employee conducting routine maintenance mistakenly entered the wrong command while trying to take “a small number of servers” offline. Instead, that command took down “a larger set of servers,” including those that support two S3 subsystems. (S3 is a data storage service used by a number of web-based services.)
The removal of these two much larger systems is what knocked so many services offline, including the ones Amazon needs to update its own status page. Making matters worse, the servers had not been completely restarted “for many years,” according to Amazon, so the restart process “took longer than expected.”
To recap: an employee’s typo caused a devastating chain of events that knocked a number of critical systems offline, and Amazon wasn’t prepared for how long it would take to bring them back online.
That certainly looks bad, and Amazon seems to acknowledge this, pledging that it has taken new steps to ensure such an event can’t happen in the future. The company says it has added “safeguards” to prevent its systems from being taken completely offline and is “reprioritizing” work to improve the recovery time of offline systems.
“We want to apologize for the impact this event caused for our customers,” the company wrote. “We will do everything we can to learn from this event and use it to improve our availability even further.”