
Internet Resilience, Part 2: What It Takes to “Just Work”

Written by James Kretchmar

November 08, 2021

One of the greatest signs of the success of the internet as a technology is how little the average person thinks about it. I’m not talking about the content itself. The streaming videos, online shopping sites, news and educational content, workplace productivity tools, and many other pieces of content we view and interact with online garner a great deal of attention. I’m instead talking about what’s involved in getting that content delivered reliably, quickly, and securely to a device. What seems to “just work” is in fact the result of an incredible amount of very careful engineering design.

I’ve called out reliability, speed, and security because these are the principal challenges in delivering content on the internet. Throughout the rest of this blog series, Akamai’s experts will dive into many different parts of the internet technology stack and will discuss what it takes to make content delivery work reliably at scale. I’d like to take a brief moment at the outset to lay out the problem space and discuss at a high level why reliability, speed, and security are challenges to begin with, and why they cannot be taken for granted on the internet.

In any significant engineering project, some key design decisions will have an important downstream impact on other aspects of the design. In the design of the internet, one of those key decisions was to divide content into small “packets” for transmission, and to assume packets could get lost on the way from A to B. A packet might be lost due to a technical glitch like a malfunctioning piece of hardware, or intentionally discarded when a piece of equipment is driven past its capacity. While this may seem like a somewhat arbitrary choice, it was arguably a key component of the internet’s success. It may also seem like a small detail, but its downstream impact is that many aspects of internet technology are “best effort” by design and offer no guarantees, starting with packet delivery and extending up to bandwidth, latency, and other key performance characteristics. Providing a highly reliable, performant experience on top of this best-effort system is one of the interesting challenges we’ll explore in depth.
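The standard answer to best-effort delivery is to build reliability on top of it, the way TCP does: the receiver acknowledges what arrives, and the sender retransmits what doesn’t. A minimal sketch of that idea (a toy model, not real networking code — the loss rate and packet contents are invented for illustration):

```python
import random

def send_over_lossy_channel(packets, loss_rate=0.3, seed=42):
    """Toy model of reliable delivery over an unreliable channel:
    each transmission attempt may drop the packet, and the sender
    retransmits until every packet is acknowledged."""
    rng = random.Random(seed)
    delivered = []
    total_attempts = 0
    for pkt in packets:
        while True:
            total_attempts += 1
            if rng.random() >= loss_rate:  # this attempt survived
                delivered.append(pkt)
                break  # receiver's acknowledgment stops retransmission
    return delivered, total_attempts

data = list(range(10))
received, attempts = send_over_lossy_channel(data)
assert received == data       # everything arrives, in order
assert attempts >= len(data)  # at the cost of extra transmissions
```

The guarantee is recovered, but only by spending extra round trips and bandwidth — which is why loss still shows up as latency even when it no longer shows up as missing data.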

Another important aspect of the design of the internet is the way in which some of its fundamental systems are decentralized. Take, for example, Border Gateway Protocol (BGP), which is the system that manages routing across the internet. A packet on its way from A to B generally has many alternate paths available. Just like planning a road trip, some routes may be longer and some shorter, and the equipment that routes packets on the internet needs to learn the best paths among them. BGP is decentralized in the sense that there is no single, central system that decides on the best paths. Instead, each network sends information into a planetary-scale distributed algorithm that collectively calculates best paths. Note, however, that “best” here is a relative term. Each network may adjust the information sent into the algorithm to optimize for its own business needs, and the resulting paths are often not the best paths for performance. What’s more, the algorithm only considers a kind of distance metric of hops through networks, not performance characteristics like packet loss or throughput. There are great benefits to decentralized systems, and just like with the best-effort nature of packet delivery, one could argue the decentralized nature of BGP is a key element in the success of the internet. But decentralized systems do bring challenges to reliability, security, and performance that must be addressed.
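The first two steps of BGP’s route decision process capture both points above: a locally configured business-policy value (local preference) is compared first, and only then does the shortest AS path break ties. A simplified sketch (the AS numbers and attribute values are invented for illustration; real BGP has many further tie-breakers):

```python
def best_path(routes):
    """Simplified BGP route selection: highest local preference wins,
    then shortest AS path breaks ties."""
    return min(routes, key=lambda r: (-r["local_pref"], len(r["as_path"])))

routes_to_prefix = [
    {"as_path": [64500, 64501],        "local_pref": 100},  # fewer hops
    {"as_path": [64502, 64503, 64504], "local_pref": 200},  # business-preferred
]

# Local preference (a policy knob) overrides path length, so the
# longer path is chosen -- "best" is not necessarily fastest, and
# neither attribute says anything about loss or throughput.
assert best_path(routes_to_prefix)["as_path"][0] == 64502
```

This is why two networks looking at the same destination can legitimately disagree about the “best” route: each is optimizing its own inputs to the shared algorithm.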

One final aspect of the internet’s design that’s especially relevant to reliability is how its very highly distributed set of end users interacts with more centralized resources. Most of the underlying transmission mechanisms of the internet are one-to-one, point-to-point. In some limited cases, a hierarchical one-to-many tiering is used — for example, in the DNS system or multicast. But the vast majority of content is not delivered this way, and this presents a number of challenges. First and foremost is scale. If tens of millions of users are requesting content from a single, centralized service, both that service and the networks around it will become overwhelmed with the demand. Additionally, as we now know all too well, this imbalance also presents a security challenge. Millions of devices infected with malware can be commandeered to send attack traffic to a centralized resource, where it becomes overwhelmed with the accumulated concentration of traffic.
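The scale problem is easy to see with a little arithmetic: a single centralized service absorbs every request, while spreading the same users across many serving locations divides the peak load. A toy model (the user and server counts are invented for illustration, and real traffic mapping is far more sophisticated than a modulo):

```python
from collections import Counter

def peak_server_load(num_users, num_servers):
    """Toy model: map each user to one serving location and return
    the busiest location's request count."""
    assignments = Counter(uid % num_servers for uid in range(num_users))
    return max(assignments.values())

users = 10_000_000

assert peak_server_load(users, 1) == users            # central: one hot spot
assert peak_server_load(users, 1000) == users // 1000  # edges share the load
```

The same arithmetic applies to attack traffic: what overwhelms one concentration point can be absorbed when it arrives at a thousand places at a thousandth of the intensity.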

Akamai’s highly distributed edge platform addresses these problems, allowing users around the planet to request and receive content locally, avoiding bottlenecks. It also serves to absorb and block attack traffic at the edges of the internet, before it can become too concentrated to manage. But this brings us to one of the most interesting challenges of all: how to reliably coordinate a network of 350,000 servers that spans the far corners of the globe. This challenge begins with the fact that anything can, and will, go wrong with the elements of the system itself. Hardware faults, software bugs, and operator errors are inevitable, but all must be accounted for in creating a reliable system overall. On top of that, there are a number of very specialized challenges to operating in such a highly distributed manner. When things go wrong and some parts of the system are cut off from other parts of the system, how are decisions made, who’s in charge, and what’s the best way to continue providing service? Throughout the blog series we’ll visit this topic often, as it’s among the most interesting technologically and where we have a unique perspective and expertise.
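One classic answer to the “who’s in charge during a partition?” question is majority quorum: only the side of a split that can still reach a strict majority of the group keeps making authoritative decisions. This is a common pattern in distributed systems generally, not a claim about Akamai’s specific mechanism:

```python
def has_quorum(reachable, cluster_size):
    """Majority quorum: a partition may act authoritatively only if it
    can reach a strict majority of the cluster's nodes. At most one
    side of any split can satisfy this, preventing conflicting
    decisions ("split brain")."""
    return reachable > cluster_size // 2

# A 5-node cluster split 3 / 2 by a network failure:
assert has_quorum(3, 5)       # majority side continues serving
assert not has_quorum(2, 5)   # minority side must defer or degrade
```

The interesting engineering lies in what the minority side does next — keep serving cached content, serve read-only, or fail over — which is exactly the kind of trade-off later posts in this series explore.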

Through two decades of engineering by some of the world’s top experts in distributed systems, we’ve developed a platform that brings a high level of reliability and stability to an underlying system that is, by design, best effort. The result: people don’t have to think about how their videos, apps, and web experiences are delivered. They just work. With that background out of the way, I’ll turn it over to our experts to discuss the many interesting technologies that make it happen!


