In January 2020, a radical public-health intervention was introduced to the world: the lockdown. Before China wheeled it out, it had never been used at any great scale outside Hollywood. No one believed a real-life populace would accept being shut in their homes for months on end, even if their lives depended on it.
Against all expectations, lockdowns are controlling a global pandemic. They are painful, but have bought time to come up with treatments and vaccines. So, why didn’t work stop and everyone revolt? The internet. As John Graham-Cumming, chief technology officer of the content delivery network (CDN) firm Cloudflare, observed, “It's hard to imagine another utility (say electricity, water or gas) coping with a sudden and continuous increase in demand of 50%.”
The internet has survived and so has society. Hurray! But what can we learn from that?
Scaling a network without scaling the network
How did the web handle the massive surge in demand as hundreds of millions of people worldwide switched to seemingly perpetual virtual conferencing? Not by adding extra network infrastructure. That would have been way too slow. We needed action in hours, not months.
In fact, lots of people did lots of things that combined to keep the internet, and thus the world, running. Including:
- Leveraging the cloud using clever architecture
- Graceful service downgrades
- Efficiency upgrades
Together, that turned out to be more effective than any of us could have imagined.
Leveraging the cloud
When there’s a lack of resources in a system, it’s good if extra capacity can be added easily. The worst thing to need in a hurry is a new cable under the Atlantic. The best thing is a commodity server.
Telcos have been working for years on how to scale up their networks using cloud services. The holy grail is to take load off the backbone by using processing and storage to ease or smooth demand on physical wires, fibres, and radio waves. A great deal of the growth of the internet in recent years has happened inside data centres.
Smart architecture, CDNs, and buffers
Smart software architecture has played a vital role in increasing internet scale. CDNs are a good example of that. For decades, the web has been speeded up by content delivery networks that are built on servers and disks outside the internet service provider networks.
The earliest CDNs were marketed as ways for your customers to see the images or videos on your website faster. CDNs store copies of those assets all over the world, so a copy is always close to a requesting user. It’s a magic trick. The asset seems to be transmitted end-to-end quickly, but really it’s already nearby.
CDNs exploit one of the most powerful concepts in everything from computers to global logistics: buffers. Buffers are places to temporarily store stuff while it’s being moved from one location to another. Humanity has used them since we gave up being hunter-gatherers—the main downside of a nomadic existence is not much cupboard space.
- Your fridge is a buffer. It means you don’t need to go to the store every day.
- The store is a buffer. It means you don’t have to drive miles to the nearest warehouse.
- The warehouse is a buffer; otherwise you’d be touring farms across the world to do your shopping.
When it’s data in the cupboard rather than tinned beans, things get very cool because you can then copy the contents almost for free. In that case, buffers aren’t just a store, they’re a new source.
Buffers decouple the supply of an asset from its consumption and that can make networks more efficient by smoothing out or reducing traffic. But, they’re only useful if you have some idea what users are going to want, when, and where. CDNs use “push” and “pull” models for that. In the push model, CDNs move assets in advance, based on predictions about where they’ll be needed. In pull, the CDN observes what folk are asking for and downloads the same asset for others on the generally good assumption that we all want the same things.
Increased access speed for images and videos is great for end-user experience, but when a pandemic hits it’s the traffic-lowering superpower of CDNs that we need. The whole idea of a CDN is that one instance of an asset moves the long distance from its supplier to the CDN, which holds it in a buffer. Copies are then sent the shorter distances to their final destinations. The result is the overall load on the network is reduced compared to if all of those assets had been sent end-to-end.
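To make the push and pull models concrete, here is a minimal sketch in Python. It is not how any real CDN is implemented; the `Origin` and `EdgeCache` classes and the asset names are invented for illustration. It simply counts how often an asset has to make the long trip from the supplier.

```python
class Origin:
    """Stands in for the distant supplier; counts long-haul transfers."""

    def __init__(self, assets):
        self.assets = dict(assets)
        self.long_haul_transfers = 0

    def fetch(self, name):
        self.long_haul_transfers += 1
        return self.assets[name]


class EdgeCache:
    """A hypothetical CDN node near the users, holding copies in a buffer."""

    def __init__(self, origin):
        self.origin = origin
        self.buffer = {}

    def push(self, name):
        # Push model: copy the asset ahead of demand (e.g. when it's quiet).
        self.buffer[name] = self.origin.fetch(name)

    def get(self, name):
        # Pull model: go to the origin only on the first request,
        # then serve every later request from the local buffer.
        if name not in self.buffer:
            self.buffer[name] = self.origin.fetch(name)
        return self.buffer[name]


origin = Origin({"trailer.mp4": b"video bytes", "logo.png": b"image bytes"})
edge = EdgeCache(origin)

edge.push("logo.png")              # predicted demand, moved in advance
for _ in range(1000):              # a thousand nearby viewers
    edge.get("trailer.mp4")
    edge.get("logo.png")

print(origin.long_haul_transfers)  # 2 long-haul transfers, not 2000
```

Two thousand requests were served, but only two copies crossed the long-distance link: one pushed, one pulled.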
That’s not all. If the supplier pushes the asset in advance, it can be moved when the network is quiet. It might even get to the CDN by non-network means, e.g. on a hard disk in a van (aka the sneakernet). Another way that “data is the new oil” is that sometimes it’s delivered by truck.
Not just a cupboard
By decoupling suppliers from users, buffers are not only stores. They’re interfaces.
The same assets can be served in different, context-specific, ways to different users. For example:
- Payloads can be encrypted at a CDN on demand.
- The encoding of videos can be changed depending on the user’s connection.
- Images can be manipulated and optimised for the device they’ll be displayed on.
All this makes networks smarter.
According to Rob Bushell, senior product manager at the CDN company Fastly, “On any given day, internet transit providers throughout the world collectively experience anywhere from a few to hundreds of temporary, short-lived, connectivity or performance degradations, often referred to as ‘internet weather’.”
All in all, this gives CDNs a lot of levers to pull when it comes to improving and stabilising the internet. By reducing traffic and choosing when and how data flows, load on the network can be evened out or removed entirely, leading to higher utilisation, less stress, and lower likelihood of outages. CDNs can also maintain service to end users when parts of the internet are overloaded or failing. In summary, they add resilience to the web.
Buffers are a deceptively powerful addition to the internet and they’re something data centres are excellent at providing in an intelligent way.
OK, we’ve established CDNs are the heroes of this story.
Or are they? Hold your horses. The above techniques are used highly effectively by CDNs, but they’re not unique to them. Many cloud services do the same stuff, with an equally stabilising effect. CDNs are just one excellent example of using a server- and storage-based architecture to enhance the backbone.
Did anything else help us survive the Covid crisis?
Brownouts, graceful downgrades, and other tricks
One way to preserve a service is to plan how you’re not going to preserve that service.
In March 2020, streaming providers like Netflix temporarily switched their European video offerings from high-definition encodings to lower-quality ones. They significantly cut network traffic by downgrading their service in a way that was acceptable to their customers. (It was a pandemic, for goodness’ sake. HD was the least of their users’ worries.)
Throughout 2020, video conferencing companies like Zoom used QoS (Quality of Service)-based downgrades where they had to. Luckily, video conferencing (VC) is a textbook example of where you can do that. According to industry expert Chris Liljenstolpe, “Current mass-market VC systems are very bandwidth forgiving. That prevented the meltdown.”
Low-latency/high-quality audio is vital to a call, but relatively cheap in terms of network usage. Video is far more resource intensive, but its quality can be reduced a lot without killing the call. So, that’s what video conferencing companies did. As well as putting on weight during the pandemic, we all started looking a bit blockier. Hopefully you sounded fine.
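The logic of "protect the audio, sacrifice the video" can be sketched in a few lines of Python. This is an illustration only, not any vendor's actual algorithm: the bitrates, the quality ladder, and the `plan_call` function are all assumptions made up for the example.

```python
AUDIO_KBPS = 64  # assumed cost of a clear voice stream

# Assumed video "ladder", best quality first: (label, kbps).
VIDEO_LADDER = [("720p", 1500), ("360p", 600), ("180p", 150)]


def plan_call(available_kbps):
    """Pick the best call quality that fits the available bandwidth."""
    if available_kbps < AUDIO_KBPS:
        return {"audio": False, "video": None}   # the call cannot proceed
    remaining = available_kbps - AUDIO_KBPS      # audio is paid for first
    for label, kbps in VIDEO_LADDER:
        if kbps <= remaining:
            return {"audio": True, "video": label}
    return {"audio": True, "video": None}        # audio-only fallback


for kbps in (3000, 800, 250, 100):
    print(kbps, plan_call(kbps))
```

As the available bandwidth shrinks, the picture steps down the ladder and eventually disappears, but the voice stream is the last thing to go.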
Video conferencing also uses cunning wheezes like filters and backgrounds, which cut how much data has to be sent end-to-end in exchange for more CPU processing somewhere along the line—usually on those scalable cloud servers. The famous kitten lawyer was a great example of that. Look how much more animated his image seems to be than the humans, whilst requiring less bandwidth. (The feline attorney required only a few standard, probably cached, kitten images, then all that was needed was to send short instructions to move eyes, mouth, and head.)
And it’s not just video-conferencing services that handle bandwidth variability. All well-architected applications do—because even in the good times, the internet is flaky.
Optimising what you have
It’s almost a cliché that you should never prematurely optimise, but that approach usually means you have underutilised capacity in your systems.
To keep the show on the road during 2020, Microsoft Teams did a lot of the kind of stuff I described previously, and also took the opportunity to wring more performance out of their existing architecture. For example, they switched from text-based to binary-encoded data formats in their caches.
The pandemic forced Microsoft to tap unused potential. They’re now more efficient, which will help them meet their green hosting targets.
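To see why a switch from text to binary encoding can free up capacity, here is a small Python comparison. It says nothing about the formats Microsoft actually used; the record fields and the byte layout are hypothetical, chosen only to show the size difference.

```python
import json
import struct

# Hypothetical cache record: (user_id, unread_count, last_seen_unix_time).
record = (123456789, 42, 1585699200)

# Text encoding: human-readable, but every digit and key name costs a byte.
as_text = json.dumps(
    {"user_id": record[0], "unread_count": record[1], "last_seen": record[2]}
).encode("utf-8")

# Binary encoding: a fixed little-endian layout of an unsigned 64-bit,
# an unsigned 16-bit, and another unsigned 64-bit integer (18 bytes).
LAYOUT = struct.Struct("<QHQ")
as_binary = LAYOUT.pack(*record)

print(len(as_text), len(as_binary))

# The binary form round-trips to exactly the same values.
assert LAYOUT.unpack(as_binary) == record
```

The trade-off is that the binary record is unreadable without knowing the layout, which is part of the added complexity discussed next.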
What's the catch?
Unfortunately, nothing is free.
Efficiency and performance usually add complexity to your systems, particularly if you want to maintain consistency of service to all your users. Microsoft Teams addressed this by implementing more testing and monitoring, including chaos testing.
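As a rough idea of what chaos testing means in practice, the Python sketch below deliberately injects failures into a dependency and checks that the calling code degrades rather than crashes. It is a toy, not Microsoft's tooling; `FlakyDependency`, `get_presence`, and `real_lookup` are hypothetical names invented for the example.

```python
import random


class FlakyDependency:
    """Wraps a callable and makes it fail at a configurable rate."""

    def __init__(self, func, failure_rate, seed=0):
        self.func = func
        self.failure_rate = failure_rate
        self.rng = random.Random(seed)   # seeded so runs are repeatable

    def __call__(self, *args):
        if self.rng.random() < self.failure_rate:
            raise ConnectionError("injected fault")
        return self.func(*args)


def get_presence(lookup, user):
    """Code under test: degrade to 'unknown' rather than crash."""
    try:
        return lookup(user)
    except ConnectionError:
        return "unknown"


def real_lookup(user):
    return "online"


# Inject faults at several rates and check the caller always copes.
for rate in (0.0, 0.5, 1.0):
    flaky = FlakyDependency(real_lookup, failure_rate=rate)
    results = [get_presence(flaky, "alice") for _ in range(100)]
    assert set(results) <= {"online", "unknown"}
```

Real chaos testing injects faults into live or production-like systems, but the principle is the same: break things on purpose and confirm the service bends instead of snapping.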
What have we learned?
In 2020, the internet handled an unprecedented increase in demand and did exactly what it was designed to do: it stayed up.
This was the result of many things:
- Quickly-available overspill capacity (servers in data centres, aka the cloud)
- Intelligent server- and storage-based architectures like CDNs, which were already bolted around the backbone network, that could use that additional capacity
- Smart downgrade schemes to handle demand overloads without total loss of service (HD->SD, QoS network prioritisation, kitten filters, and a million others)
- Unearthed spare capacity (hidden inside under-optimised systems)
- Innate stability (the internet’s built-in ability to route around issues and keep going)
- Planned room for manoeuvre (spare capacity was already on hand because the internet is constantly expanding in advance of need)
Did CDNs save the world? Yes. But they didn’t do it alone. We used all the things—the ones mentioned previously, and countless others.
The internet is an example of something incredibly robust. To handle the pandemic, it was upgraded in real time in thousands of ways by engineers, product managers, marketeers, testers, and the users who accepted and worked around issues or degradations. In the face of a crisis, they all strove independently towards one goal: to keep things running.
The irony is that this didn’t work because the backbone network was a solid, reliable platform. Arguably, it worked because it wasn’t and hasn’t been for a long time. The internet is full of flaws from years of under-investment. The result of this built-in chaos engineering is that every decent application that runs on it has to be capable of handling major issues.
It is easier for an application to switch from running on a 90% reliable platform to a 50% reliable one than it is to move from 100% reliability to 95%. None of the “keep going!” levers pulled by everyone during the crisis were new. It’s amazing what humanity can achieve when we have to, but things would have been very different if we’d started from an assumption of 100% internet reliability.
The overall lesson? The internet survived because both:
- Millions of individuals could help make that happen in thousands of ways.
- The change in flakiness wasn’t zero to one: it got much worse than usual, but it was an analogue increase, not a binary one (this feels like a similar principle to chaos engineering).
There is no central authority directing the ‘net. Anyone can consume from it, add to it, and extend it. Citizens can enlarge their own slice by tethering their mobile phones or plugging in wireless routers or even their home wiring. Companies can upgrade it by building CDNs, or SD-WANs, or clouds like AWS or Azure, or services like video conferencing with kitten-based graceful service downgrades, or by sticking data on trucks and driving them across the country.
The internet survived because it’s decentralised and democratic, not because it’s completely solid. That’s good news, because survival like that is not magic. It’s achievable.
Now, how do we do the same thing for zero-carbon energy grids?