A simple technical error caused a global blackout on Monday that prevented more than 2.9 billion Internet users from accessing Facebook, Instagram, WhatsApp and other tools.
The approximately six-hour disruption, the largest in company history based on the number of users affected, occurred when Facebook Inc. was trying to perform routine maintenance related to how Internet data travels through its network systems, according to a company blog post on Tuesday.
While trying to assess the capacity of Facebook’s network, engineers issued a command that inadvertently disconnected all of Facebook’s data centers from the company’s network. This set off a cascade of failures that took all of Facebook’s properties off the internet.
In the end, Facebook’s engineers, the people who built one of the most sophisticated networks in the world, had to resort to a pre-internet fix: they had to go to the data centers in person and restart systems there, the company said.
The outage was “not caused by malicious activity, but by error on our part,” Santosh Janardhan, Facebook vice president of infrastructure, wrote in the blog post.
The blackout had widespread ripple effects around the world. It cut essential communications in some parts of the world, disrupted e-commerce in some countries, hampered some small businesses, and led others to see a marketing opportunity. In some quarters, it prompted reflection on how deeply Facebook and its platforms are woven into global connectivity.
Internet giants like Facebook have invested billions of dollars in their sprawling global data centers over the past few decades, designing their own networking equipment and the software that powers it.
This has allowed these businesses to operate with unmatched speed and efficiency, but it also creates vulnerability. The scale and complexity required to operate and maintain such a network, and the extent to which its infrastructure is managed and controlled by a single company, can lead to situations in which small errors have a disproportionate impact, networking experts say.
“It’s a team with endless resources and some of the most talented people,” said Doug Madory, director of internet analytics at network monitoring company Kentik. He said Facebook may not have applied enough scrutiny to its own backup solutions and processes.
A key question that Facebook has yet to answer is why the company’s backup network, known as the out-of-band network, did not work on Monday. This network is designed to be separate from the rest of Facebook and was supposed to provide engineers with a way to remotely fix systems in minutes when they go down.
In his blog post, Facebook’s Janardhan said the out-of-band network was not working on Monday, but did not explain why.
Instead, with engineers unable to reset their misconfigured equipment, a cascading set of failures ensued.
With the data centers going offline, the servers that used the Domain Name System, or DNS, to direct Internet traffic withdrew from the Internet. DNS is what browsers and mobile phones use to find Facebook services on the Internet, and without this connection it was “impossible for the rest of the Internet to find our servers,” Janardhan said.
The DNS changes also disabled internal tools that would have allowed Facebook engineers to restore service remotely, forcing engineering staff to go to data centers and restart systems there.
That took time. “They are difficult to access, and once inside, the hardware and routers are designed to be difficult to modify even when you have physical access to them,” Mr. Janardhan said. “So it took longer to enable the secure access protocols needed for people to be on site and able to work on the servers.”
Write to Robert McMillan at [email protected]
Copyright © 2021 Dow Jones & Company, Inc. All rights reserved.