Amazon’s AWS S3 service had a bad day on Feb. 28, caused by simple human error. The problems with the web-based storage service led to trouble across the internet, including partially or fully broken websites, apps and devices that rely on it. Affected sites and services included Nest, Business Insider, Giphy, Quora, Trello, Medium and Slack’s file sharing, as well as Amazon’s own music app and dashboard. The outage appears to have begun at 12:45 p.m. EST.
Amazon’s service health dashboard originally noted that the S3 outage was due to “high error rates with S3 in US-EAST-1. AWS services and customer applications depending on S3 will continue to experience high error rates as we are actively working to remediate the errors.”
At 3:44 p.m. EST, Amazon announced it had identified the root of the problem. At 4:49 p.m. EST, it announced that “We are fully recovered for operations for adding new objects in S3, which was our last operation showing a high error rate. The Amazon S3 service is operating normally.” The next morning all was still well. The only region affected by the outage was Amazon’s northern Virginia site (US-EAST-1).
Amazon announced that the cause of the outage was an incorrectly entered command during a debugging session, which caused a large number of servers to be removed. Those servers also supported two other subsystems. One of those, the index subsystem, manages the metadata and location information of all S3 objects in that region, and is necessary to serve all GET, LIST, PUT and DELETE requests. The other, the placement subsystem, requires the index subsystem to be in working order before it can PUT objects. To fix the problem, Amazon says both subsystems required a full restart. The restart took longer than expected because of the growth at that location over the last several years.
Amazon says it is “making several changes as a result of this operational event,” and it apologized to its customers for the impact of the outage. The company says it will speed up recovery time and create new safeguards so that team members can’t inadvertently do this again. It will also make changes to its service health dashboard, which was itself affected by the outage.
Historically, Amazon outdoes its promised 99.999999 percent stability, and offers refunds when outages do happen. In 2015, the service experienced a similar problem for several hours. Amazon is 10 times larger than its competitors, including Microsoft, Google and IBM, combined. It’s also worth noting that Microsoft Azure had an outage on Feb. 19 that lasted more than 5 hours.
Businesses looking to avoid the problems created by the Amazon outage can 1) build a system that doesn’t rely solely on S3, or 2) store their data in multiple regions. There are no fool-proof approaches, however, when humans are involved.
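As a rough illustration of the second option, here is a minimal sketch of reading an object from whichever region answers first. It assumes the data has already been copied to more than one region. The names `read_with_fallback`, `fake_fetch` and `REPLICAS`, the `us-west-2` replica and the `logo.png` key are all hypothetical stand-ins for illustration, not part of any AWS API.

```python
from typing import Callable, Dict, Sequence, Tuple


class AllRegionsFailed(Exception):
    """Raised when no region could serve the requested object."""


def read_with_fallback(
    key: str,
    regions: Sequence[str],
    fetch: Callable[[str, str], bytes],
) -> Tuple[str, bytes]:
    """Try each region in order; return (region, data) from the first that works."""
    errors: Dict[str, Exception] = {}
    for region in regions:
        try:
            return region, fetch(region, key)
        except Exception as exc:  # a real client would catch narrower errors
            errors[region] = exc
    raise AllRegionsFailed(f"could not read {key!r}: {errors}")


# Hypothetical replica store: only the secondary region holds a reachable copy.
REPLICAS = {"us-west-2": {"logo.png": b"image-bytes"}}


def fake_fetch(region: str, key: str) -> bytes:
    """Stand-in for a storage client; simulates the primary region being down."""
    if region == "us-east-1":
        raise ConnectionError("high error rates in us-east-1")
    return REPLICAS[region][key]


served_from, data = read_with_fallback(
    "logo.png", ["us-east-1", "us-west-2"], fake_fetch
)
print(served_from)  # us-west-2
```

In a real application the `fetch` argument would wrap an actual storage client call for the given region. The fallback only helps if something keeps the copies in the secondary region up to date, and if writes have somewhere to go while the primary is down.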