On February 28, 2017, starting around 9:37 AM Pacific Time, the internet broke. Not all of it — but enough that millions of people noticed. Slack slowed to a crawl. Trello went down. GitHub had issues. Medium failed. Business websites stopped loading. The S3 status dashboard itself would not load.
The cause: a typo in a single command. Entered by one AWS engineer. During a routine debugging session.
This is not ancient history. The same architecture is running today. And the lessons from it are more relevant now than ever.
What actually happened
AWS was debugging a slowdown in the S3 billing system. Following an established playbook, an engineer ran a command meant to remove a small number of servers from one of the subsystems used by the billing process. One parameter in that command was entered incorrectly, and instead of taking a handful of servers offline, the command removed a far larger set than intended.
The removed servers supported two critical subsystems: the index subsystem, which tracks the metadata and location of every object in the region, and the placement subsystem, which allocates storage for new objects. Without them, S3 could not serve requests. New objects could not be written, and existing objects could not be retrieved, listed, or deleted. S3 in US-EAST-1, one of the largest data storage systems in human history, was effectively offline.
It took AWS about four hours to fully restore service. The primary reason it took so long: the affected subsystems had not been fully restarted in years, and bringing them back required safety checks to validate the integrity of enormous amounts of metadata, which took longer than expected.
Why one region going down caused global problems
Most people assume cloud outages in one region would only affect services hosted in that region. The 2017 outage proved this assumption wrong in a painful way.
US-EAST-1 is not just a region — it is where many of AWS's own internal control plane services run. The services that manage other regions, handle authentication, and serve status dashboards all ran in US-EAST-1. When it went down, AWS's own monitoring and management tools broke.
This is why the AWS Service Health Dashboard itself went dark: it depended on S3 in US-EAST-1 to publish status updates. The system AWS used to tell the world it was having an outage was itself a victim of the outage.
Many companies also keep backups and logging infrastructure in US-EAST-1 even when their production workloads run elsewhere. When S3 in US-EAST-1 failed, log ingestion failed, which triggered errors in applications that were otherwise healthy.
The cascading failure pattern
The 2017 S3 outage is a textbook example of cascading failure — where one component's failure causes others to fail in sequence, often in unexpected ways.
It plays out like this: application servers lose the ability to load configuration files stored in S3. They fall back to startup defaults, which may be misconfigured for production. Some crash; others serve stale data. CDN nodes can no longer fetch asset updates, so they keep serving cached versions, but those caches eventually expire and requests begin failing.
The failure then spreads laterally. Service A depends on Service B. Service B stores data in S3. Service B degrades. Service A's error rate rises. Service A triggers circuit breakers and starts returning errors to its clients. Service C, a client of Service A, sees elevated error rates and triggers its own circuit breakers. A failure in storage propagates into failures in compute, networking, and user-facing APIs — all in different parts of the internet.
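To make that lateral spread concrete, here is a minimal sketch of the circuit-breaker behavior described above. The class name, thresholds, and timing are illustrative, not taken from any particular service:

```python
import time

class CircuitBreaker:
    """Fail fast once a dependency looks unhealthy, instead of letting
    every request wait on it."""

    def __init__(self, failure_threshold=5, reset_after_seconds=30):
        self.failure_threshold = failure_threshold
        self.reset_after_seconds = reset_after_seconds
        self.consecutive_failures = 0
        self.opened_at = None  # None means the circuit is closed (healthy)

    def call(self, fn, *args, **kwargs):
        # While open, reject calls immediately rather than waiting on a
        # dependency that is known to be failing.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_seconds:
                raise RuntimeError("circuit open: dependency unavailable")
            self.opened_at = None  # cool-down elapsed: allow a trial call

        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        else:
            self.consecutive_failures = 0  # any success resets the count
            return result
```

When Service B's S3 calls start failing, the breaker Service A wraps around B trips, A starts returning errors to its own callers, and their breakers trip in turn. Each breaker is protecting its own service; the sum of those local decisions is the region-wide cascade.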
What AWS changed afterward
AWS published an unusually detailed post-mortem (they almost always do — it is one of the best things about AWS). The key changes:
**Command validation:** The tool used by the engineer now requires explicit confirmation and validates the scope of changes before execution. "Are you sure you want to remove 1,432 servers? Type CONFIRM to proceed."
**Minimum capacity safeguards:** Systems now enforce minimum fleet sizes. A command that would reduce capacity below a safe threshold is rejected automatically (a sketch of both safeguards follows this list).
**Faster restart capability:** AWS invested in making the S3 subsystems restart and recover more quickly. The four-hour recovery time was largely due to integrity checks on a cold restart, and subsequent work reduced that time significantly.
**Status page infrastructure:** The status dashboard was moved off S3 US-EAST-1. It now runs on a separate, deliberately diversified infrastructure that is not subject to the same failure modes as the service it monitors.
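AWS has not published the internals of that tooling, so the following is only a toy sketch of how the first two safeguards could look, assuming a hypothetical remove_servers helper and an invented minimum fleet size:

```python
MINIMUM_FLEET_SIZE = 100  # invented threshold, for illustration only

def remove_servers(fleet, to_remove, confirm):
    """Reject removals that are unconfirmed or would cut capacity too deep."""
    remaining = len(fleet) - len(to_remove)

    # Minimum capacity safeguard: never let the fleet shrink below the
    # level the subsystem needs to keep serving requests.
    if remaining < MINIMUM_FLEET_SIZE:
        raise ValueError(
            f"refusing: removal would leave {remaining} servers, "
            f"below the minimum of {MINIMUM_FLEET_SIZE}"
        )

    # Scope validation: the operator must acknowledge exactly how many
    # servers are about to be removed, not just that the command will run.
    expected = f"CONFIRM {len(to_remove)}"
    if confirm != expected:
        raise ValueError(f"refusing: pass confirm={expected!r} to proceed")

    return [server for server in fleet if server not in to_remove]
```

The point of both checks is that a fat-fingered parameter now fails loudly before anything is removed, instead of being executed exactly as typed.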
The lessons that still apply
The 2017 outage happened because of a combination of factors that still exist today: human operators running commands, cascading dependencies, and critical infrastructure that had not been tested under failure conditions in years.
No system is immune. Netflix, Google, Cloudflare, and AWS itself have all had major outages since 2017. The lesson is not "avoid AWS" — it is "design for failure."
For website owners: run across multiple regions. Store critical configuration locally, not just in S3. Test your disaster recovery plan before a disaster. Know your dependencies and their dependencies.
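"Store critical configuration locally" can be as simple as treating S3 as the source of truth while keeping a last-known-good copy on disk. A sketch assuming boto3 and made-up bucket, key, and file paths:

```python
import json

import boto3
from botocore.exceptions import BotoCoreError, ClientError

def load_config(bucket="my-app-config", key="production.json",
                local_path="/etc/myapp/config.json"):
    """Prefer the latest config from S3, but never depend on S3 being up."""
    try:
        body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"]
        config = json.loads(body.read())
    except (BotoCoreError, ClientError):
        # S3 is unreachable: fall back to the copy shipped with the deploy.
        with open(local_path) as f:
            return json.load(f)

    # S3 worked: refresh the local fallback for the next outage.
    with open(local_path, "w") as f:
        json.dump(config, f)
    return config
```

The application keeps working through an S3 outage with slightly stale configuration, which is almost always better than not working at all.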
For users: when a bunch of unrelated services all go down at once, check AWS's status. It is usually the common thread. WebsiteDown tracks AWS along with the services that depend on it.