The internet feels permanent and robust, but it is built on layers of infrastructure that are each capable of failing catastrophically. These historical outages show just how fragile digital services can be — and how deeply we depend on them.
Meta, October 2021 — 7 hours
On October 4, 2021, Facebook, Instagram, and WhatsApp went offline simultaneously for approximately six to seven hours. With three billion combined users, this was the largest social media outage in history by affected user count.
The cause was a BGP (Border Gateway Protocol) configuration error during routine maintenance. BGP is the routing protocol that tells the rest of the internet how to reach Meta's servers. When the misconfiguration withdrew Meta's BGP routes, the routes to Meta's authoritative DNS servers disappeared along with everything else, so its domain names stopped resolving and its servers became unreachable, even for Meta's own engineers trying to fix the problem remotely. Teams had to physically access data centres to restore the configuration.
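To make the two failure layers concrete, here is a minimal probe sketch (not Meta's tooling; the hostname, port, and timeout are arbitrary illustration values) that reports whether a host fails at the DNS step or at the connection step:

```python
import socket

def probe(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Report which layer fails when reaching a host: DNS resolution or TCP connection."""
    try:
        # Step 1: resolve the name. During the Meta outage this step failed because
        # the authoritative name servers themselves were no longer reachable.
        infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror as exc:
        return f"DNS resolution failed for {host}: {exc}"

    addr = infos[0][4]
    try:
        # Step 2: open a TCP connection to the resolved address. Even with a cached
        # DNS answer, this fails if no route to the address is being announced.
        with socket.create_connection(addr[:2], timeout=timeout):
            return f"{host} reachable at {addr[0]}:{addr[1]}"
    except OSError as exc:
        return f"TCP connection to {addr[0]} failed: {exc}"

if __name__ == "__main__":
    print(probe("example.com"))
```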
AWS us-east-1, December 2021 — 8 hours
On December 7, 2021, Amazon Web Services experienced a major outage in its us-east-1 region (Northern Virginia) that lasted approximately eight hours. The incident degraded or broke hundreds of services, including Ring (Amazon's home security system), Alexa, Roomba, Disney+, and Netflix.
The root cause was an automated scaling activity on AWS's internal network that triggered a surge of unexpected connection activity, overwhelming the devices linking that internal network to the main AWS network and causing cascading failures across multiple services.
Fastly, June 2021 — 1 hour
On June 8, 2021, the CDN provider Fastly experienced a global outage lasting about one hour. Affected sites included Reddit, Twitch, the UK government website (gov.uk), The Guardian, The New York Times, and Spotify.
The cause was a valid configuration change by a single customer that triggered a latent bug in software Fastly had deployed several weeks earlier. One configuration change by one user caused a global CDN failure affecting major sites on six continents. Fastly's post-incident report is considered a model of transparency in the industry.
AWS S3, February 2017 — 5 hours
On February 28, 2017, an AWS S3 outage in us-east-1 lasted approximately five hours and broke or degraded a large fraction of the web; even AWS's own status dashboard struggled to report the problem because its health icons were hosted on S3. S3 (Simple Storage Service) is used by a huge proportion of internet services to store static assets, backups, and media.
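As a sketch of what that dependency looks like in application code (the bucket and object names below are placeholders, not real resources), a typical asset fetch goes straight to the regional S3 API, so an outage there surfaces as errors in every page build, backup job, or media request that relies on it:

```python
import boto3  # AWS SDK for Python

# Illustrative only: bucket and key are placeholders, and real calls need credentials.
s3 = boto3.client("s3", region_name="us-east-1")

def fetch_asset(bucket: str, key: str) -> bytes:
    """Download a static asset (image, script, backup) stored in S3."""
    response = s3.get_object(Bucket=bucket, Key=key)
    return response["Body"].read()

# If the regional S3 API is down, this raises an error, and everything
# downstream that expected the asset breaks with it.
data = fetch_asset("example-assets-bucket", "img/logo.png")
```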
The cause was a typo in a command run during routine maintenance. An engineer following an established playbook meant to remove a small number of servers from an S3 billing subsystem, but a mistyped input removed a much larger set, including servers supporting the subsystems that index and place S3 data, and restarting those subsystems took hours. The incident prompted AWS to add safeguards to its tooling that slow down capacity removal and block commands that would take a subsystem below its minimum required capacity.
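AWS has not published its exact tooling changes, but a hedged sketch of that kind of safeguard, with hypothetical names and thresholds, might validate removal requests like this:

```python
class CapacityRemovalError(Exception):
    pass

def remove_servers(requested: int, total_active: int,
                   max_fraction: float = 0.05, min_remaining: int = 100) -> int:
    """Validate a capacity-removal request before executing it.

    Hypothetical safeguard: reject any single command that would remove more
    than a small fraction of the fleet, or leave fewer servers than the
    subsystem needs to keep serving requests.
    """
    if requested <= 0:
        raise CapacityRemovalError("nothing to remove")
    if requested > total_active * max_fraction:
        raise CapacityRemovalError(
            f"refusing to remove {requested} of {total_active} servers in one step; "
            f"limit is {max_fraction:.0%} per command"
        )
    if total_active - requested < min_remaining:
        raise CapacityRemovalError(
            f"removal would leave {total_active - requested} servers, "
            f"below the minimum of {min_remaining}"
        )
    return requested  # safe to hand off to the actual decommission step

print(remove_servers(requested=8, total_active=1000))    # passes the checks
try:
    remove_servers(requested=800, total_active=1000)     # a 2017-style fat-finger
except CapacityRemovalError as exc:
    print("blocked:", exc)
```

The point of the sketch is that a single mistyped argument now fails loudly at validation time instead of silently draining a subsystem.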
What these outages have in common
Looking across major outages, a clear pattern emerges. Most are caused by human error during configuration changes — not hardware failure. The systems themselves are typically resilient, but human mistakes in deployment, routing, or scaling operations bypass that resilience.
The other pattern: concentration of infrastructure. When a single provider (AWS, Meta, Cloudflare, Fastly) serves a huge proportion of the internet, their failures have outsized impact. This is the fundamental trade-off of the cloud era: convenience and cost efficiency in exchange for systemic risk.