Amazon Web Services on Friday published an explanation for an hours-long outage earlier this week that disrupted its retail business and third-party online services. The company also said it plans to revamp its status page.
The problems in Amazon’s large US-East-1 region of data centers in Virginia began at 10:30 a.m. ET on Tuesday, the company said.
“An automated activity to scale capacity of one of the AWS services hosted in the main AWS network triggered an unexpected behavior from a large number of clients inside the internal network,” the company wrote in a post on its website. As a result, the devices connecting an internal Amazon network to the main AWS network became overloaded.
Several AWS tools suffered, including the widely used EC2 service that provides virtual server capacity. AWS engineers worked to resolve the issues and bring back services over the next several hours. The EventBridge service, which can help software developers build applications that take action in response to certain activities, didn’t bounce back fully until 9:40 p.m. ET.
Downtime can hurt the perception that cloud infrastructure is reliable and ready to handle migrations of applications from physical data centers. It can also have major implications for businesses. AWS has millions of customers and is the leading provider in the market.
AWS apologized for the impact the outage had on its customers.
Popular websites and heavily used services were knocked offline, including Disney+, Netflix and Ticketmaster. Roomba vacuums, Amazon’s Ring security cameras and other internet-connected devices like smart cat litter boxes and app-connected ceiling fans were also taken down by the outage.
Amazon’s own retail operations were brought to a standstill in some pockets of the U.S. Internal apps used by Amazon’s warehouse and delivery workforce rely on AWS, so for most of Tuesday employees were unable to scan packages or access delivery routes. Third-party sellers also couldn’t access a site used to manage customer orders.
During the outage, AWS tried to keep customers aware of what was happening, but the cloud provider ran into trouble updating its status page, known as the Service Health Dashboard.
“As the impact to services during this event all stemmed from a single root cause, we opted to provide updates via a global banner on the Service Health Dashboard, which we have since learned makes it difficult for some customers to find information about this issue,” AWS said.
In addition, customers couldn’t create support cases for seven hours during the disruption.
AWS said it’s now taking action to address both of those issues.
“We expect to release a new version of our Service Health Dashboard early next year that will make it easier to understand service impact and a new support system architecture that actively runs across multiple AWS regions to ensure we do not have delays in communicating with customers,” AWS said.
It’s not the first time AWS has changed the way it reports issues.
In 2017, an outage that hit the popular AWS S3 storage service prevented engineers from showing the right color to indicate uptime on the Service Health Dashboard. Amazon posted banners and turned to Twitter to release new information.
“We have changed the SHD administration console to run across multiple AWS regions,” Amazon said in a message about that episode.