A massive internet outage that took down thousands of popular websites was caused by a technical glitch on Amazon’s servers.
But it wasn’t the company’s retail front that crippled the web for so many – it was Amazon Web Services (AWS), which provides a crucial back-end for thousands of other websites. In this case, an AWS glitch lasted almost four hours, during which an estimated 100,000 sites were impacted or offline.
Sites known to be affected on Tuesday included Slack, Quora, Imgur, Apple, Yahoo, and Medium, before AWS was able to restore services at about 5pm ET.
Even isitdownrightnow.com – one of the sites you can check to see if other websites are up and running – was inaccessible thanks to the hiccup, which is pretty meta if you think about it.
So what happened? According to Amazon, the glitch was due to a problem with the company’s S3 storage service at one of its datacentres in northern Virginia.
This facility, part of a network of operations called US-EAST-1, experienced high error rates sending and receiving clients’ hosted data, which meant thousands of sites and services couldn’t serve their pages properly, or load essential features.
The nature of the bug meant that some cloud services like Slack couldn’t run at all in some regions of the US. For news sites that rely on AWS – including The Huffington Post, The Verge, and Business Insider – it meant the website might go down, or that stories might display, but without pictures.
“Imagine your business not being able to run for a day,” head of research for Loup Ventures, Gene Munster, told Reuters. “That’s a big problem.”
While the experience of seeing so many sites impacted or offline is similar to last October’s massive internet outage, which took down Twitter, Reddit, and Amazon itself, the causes are quite different.
In that case, the problems stemmed from a huge cyber attack – technically a distributed denial-of-service (DDoS) attack – where hackers directed millions of compromised devices to swarm the technological infrastructure of a major web company called Dyn.
The effect of that attack meant that hundreds of popular websites couldn’t be accessed.
But with this new S3 error, while the end result might look the same from the outside, the bug doesn’t look to be the result of a hack; it’s just the consequence of a suspected software malfunction in Amazon’s datacentre technology.
These kinds of errors in servers and datacentres happen all the time, but the real problem here is just how big AWS has gotten – and how concentrated the internet’s reliance on it has become.
A Gartner study from last year found that AWS controls 31 percent of the market in global cloud infrastructure, which means that when major outages like this happen – as they did in 2011 and 2015 – whole chunks of the internet can go down.
But even though these crippling outages can basically break parts of the internet for several hours at a time, once services get restored, people reconnect to their sites and apps and… tend to forget about it – until the next big crash comes along.
“Every time there’s a major cloud outage, you occasionally get customers who thought that everything would be magical and forever working,” Gartner analyst Lydia Leong told Ángel González at The Seattle Times.
“And then they’re disabused of that notion and everybody gets on with their lives.”
But even though it’s easy to forget the inconvenience of being cut off for an afternoon or evening, it all just goes to show one thing: between the ongoing threat of large-scale cyber attacks and our dependence on tech giants who control the web’s infrastructure, the internet is actually pretty precarious.