Wednesday, October 06, 2021

The Aging Protocol That Keeps Breaking the Internet

Monday's Facebook outage was just the latest in a series of major outages going back years. Although Facebook hasn't given out any technical details, other than saying that the outage was caused by an internal network configuration error, it's likely that the Border Gateway Protocol (BGP) was involved. BGP is the protocol that networks use to tell each other which blocks of IP addresses they can reach, so that routers know where to send data going into or coming out of a network. It's been around since 1989 and has been the source of many outages, including an infamous one in 2008 that took YouTube off the internet for a few hours.

This article goes into the history of BGP, which is sometimes called the "three napkin protocol" because of the way it was developed, and explains some of the reasons why it is so fragile. 

At its most basic level, BGP helps routers decide how to send giant flows of data across the vast mesh of connections that make up the internet. With infinite numbers of possible paths - some slow and meandering, others quick and direct - BGP gives routers the information they need to pick one, even though there is no overall map of the internet and no authority charged with directing its traffic.

The creation of BGP, which relies on individual networks continuously sharing information about available data links, helped the internet continue its growth into a worldwide network. But BGP also allows huge swaths of data to be "hijacked" by almost anyone with the necessary skills and access.

The main reason is that BGP, like many key systems on the internet, is built to automatically trust users - something that may work on smaller networks but leaves a global one ripe for attack.
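To make the trust problem a bit more concrete, here's a toy sketch in Python. It is not real BGP: actual routers weigh many more attributes (local preference, origin, and so on), and every name and number here is made up for illustration. The prefixes and AS numbers come from the ranges reserved for documentation, and `best_route` is a hypothetical helper. It models just two ideas: routers prefer the most specific matching prefix, and they believe whatever announcements they receive.

```python
import ipaddress

def best_route(routes, destination):
    """Pick a route for `destination` from (prefix, as_path) announcements.

    Simplified rule: the most specific (longest) matching prefix wins,
    and ties are broken by the shortest AS path. Nothing checks whether
    the announcer is actually entitled to the prefix.
    """
    addr = ipaddress.ip_address(destination)
    candidates = []
    for prefix, as_path in routes:
        network = ipaddress.ip_network(prefix)
        if addr in network:
            candidates.append((network, as_path))
    if not candidates:
        return None
    return max(candidates, key=lambda r: (r[0].prefixlen, -len(r[1])))

# Two honest announcements for the same /24, learned via different paths.
# The last AS number in each path is the network originating the prefix.
routes = [
    ("198.51.100.0/24", [64500, 64496]),
    ("198.51.100.0/24", [64501, 64502, 64496]),
]

# A bogus, more specific announcement from an unrelated network (AS 64511).
hijacked = routes + [("198.51.100.0/25", [64501, 64502, 64511])]

if __name__ == "__main__":
    print("before:", best_route(routes, "198.51.100.10"))
    print("after: ", best_route(hijacked, "198.51.100.10"))
```

In the first case the shorter path to the legitimate origin wins. Once the bogus /25 shows up, it wins simply because it's more specific, even though its path is longer and it comes from the wrong network. That's roughly the shape of a prefix hijack, and of an accidental leak too.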

I had to learn a little bit about BGP while I was working at the TSX, and I'm very glad that it's not something that I have to worry about in my home network.  

Update: Here's an article from Gizmodo that goes into a little more detail about what happened at Facebook and how a BGP misconfiguration caused the problem. 
