Executive summary
Cloud reliability is increasingly critical as recent outages have caused major disruptions across industries.
Outages are often caused by latent software defects, not just operator error or cyberattacks.
The inherent complexity and layered nature of cloud software make perfect, bug-free systems unattainable.
Ensuring high-quality software is essential to providing reliable cloud services, but organizations must also design systems that remain highly available despite the inevitability of software defects.
Future posts in this series will address strategies for managing complexity and building resilient cloud systems.
Welcome to our reliability series
Over the last few years, the world has witnessed a number of notable cloud and IT service outages. Although the public has come to expect occasional technology disruptions, these outages were different in their breadth and depth of impact. They led to canceled flights, disrupted workplace productivity, malfunctioning connected devices in the home, and more.
As CTO for Cloud Technology at Akamai, I recently spoke with journalists, colleagues, and industry leaders who are eager to understand why the industry is seeing more outages and to learn what the future may hold.
At Akamai, we place a significant focus on reliability. We run the world’s most distributed platform for cloud computing, security, and content delivery, and our customers depend on us to keep their businesses online and running smoothly.
This blog series explores the subject of reliability in depth, covering a wide range of topics from software and system complexity to organizational reliability culture, the so-called Swiss Cheese model of accident causation, human factors considerations, and more.
The content is based on our 25+ years of experience running a network of hundreds of thousands of servers in more than 4,400 points of presence around the world that power many of the world’s most popular and business-critical online experiences.
Reliability and the software dilemma
During one of the recent, widely reported cloud outages, I turned on the TV news in an attempt to understand the breadth of impact around the world. I was surprised to hear an invited expert commenting that it was too early to tell whether the outage was caused by operator error or a cyberattack.
Were those the only two options? Although operator error is a common trigger for technical incidents and cyberattacks do occur, there are other causes for outages. A frequent culprit is a latent software defect — a mistake coded and deployed months or years in advance that’s been waiting for just the right conditions to reveal itself. I found myself yelling at the TV about this possibility and, as it turns out, a latent software defect was indeed the cause.
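To make the idea concrete, here’s a minimal, hypothetical sketch (not drawn from any specific incident) of what a latent defect can look like: date-handling code that behaves correctly for years, until the calendar finally supplies the triggering conditions.

```python
from datetime import date

def renewal_date(issued: date) -> date:
    # Latent defect: bumping the year directly works for every issue
    # date except February 29. Deployed in March, this code runs
    # cleanly for years -- until a certificate issued on a leap day
    # comes up for renewal and replace() raises ValueError.
    return issued.replace(year=issued.year + 1)

print(renewal_date(date(2024, 3, 1)))        # 2025-03-01: works fine
try:
    print(renewal_date(date(2024, 2, 29)))   # the "right conditions" arrive
except ValueError as err:
    print("latent defect surfaces:", err)    # day is out of range for month
```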
The role of software in the reliability of cloud services is often overlooked or misunderstood for three principal reasons:
The software that powers cloud services usually extends far beyond the code for an application in front of an end user. Layers upon layers of infrastructure support the application, each powered by its own software stack.
Software plays a critical role not just in delivering the features of an application or infrastructure service, but in managing a dynamic system in which multiple independent components communicate with one another and shape each other’s behavior in complex, sometimes unpredictable ways (see the sketch after this list).
It’s incredibly hard to build flawless software. In all but very specialized circumstances, doing so is neither technically nor economically feasible.
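As a deliberately simplified, hypothetical sketch of the second point (the numbers and policies are invented): a client that retries on timeout and a server whose latency grows with load are each locally reasonable, yet together they form a feedback loop that neither component can see on its own.

```python
# Two "independent" components, each following a locally reasonable
# policy, combine into an emergent feedback loop: retries add load,
# load raises latency, higher latency causes more timeouts.
def latency_ms(load_rps: float) -> float:
    return 20 + 2 * load_rps           # server slows as load grows

TIMEOUT_MS = 250
load_rps = 120.0                       # offered load, requests/second
for step in range(5):
    lat = latency_ms(load_rps)
    print(f"step {step}: load={load_rps:.0f} rps, latency={lat:.0f} ms")
    if lat > TIMEOUT_MS:               # client times out...
        load_rps *= 1.5                # ...and retries, amplifying load
```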
On top of all that, some of the other triggers for outages, such as operator error or hardware failures, are themselves exacerbated by software problems. Some operator errors are impactful primarily because they trigger a latent defect in software. Or the software systems designed to route around bad hardware may themselves fail.
Future blog posts in this series will dive into the first two topics above — infrastructure and systems complexity. This post will focus on why it’s so difficult to build perfect software in the first place.
You’ll note, by the way, that I have so far avoided using the word “bug” and instead have talked about software defects. This is to distinguish between different types of software problems:
If a piece of software performs exactly the steps its author intended, but those steps are a fundamentally flawed approach to solving the problem at hand, the mistake is more accurately described as a design flaw.
If the author has a fundamentally sound approach but a small glitch in the instructions, such as an off-by-one math error in the code, that can be clearly described as a bug (see the sketch just below).
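As a minimal sketch of the distinction (the function is hypothetical): the approach below is sound — sum the integers from 1 to n — but a one-character boundary mistake makes it a textbook bug.

```python
def sum_first_n_buggy(n: int) -> int:
    # Intended: 1 + 2 + ... + n. But range(1, n) stops at n - 1,
    # so the result is short by exactly n -- a classic off-by-one.
    return sum(range(1, n))

def sum_first_n(n: int) -> int:
    # The approach was sound all along; the fix is one character.
    return sum(range(1, n + 1))

print(sum_first_n_buggy(5))  # 10
print(sum_first_n(5))        # 15
```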
While the difference is not always quite so clear-cut, an upcoming blog post on systems complexity will focus on design flaws, while the remainder of this post will focus on bugs.
What is software and why is it so buggy?
Anyone who’s old enough to remember floppy disks and CD-ROMs knows the world became accustomed to software bugs long before the cloud existed. Given the incredible advances in so many areas of technology, it would be fair to ask why software still has so many bugs all these years later. The multifaceted answer is based in part on what software is and in part on the ecosystem around it.
Advantages of the reprogrammable machine
The concept of software is rooted in the difference between a single-purpose machine and a multipurpose machine. For much of human history, tools and other devices were built to accomplish a single task or a small number of very specific tasks.
Over time, inventors found ways to build machines that could be flexibly reconfigured to perform many tasks. An early example was the Jacquard loom, which used chains of cards with holes punched out to control the pattern woven into textiles. This began an evolution that culminated in the modern CPU, a marvel not only for its size and speed of computation, but also for being so readily reprogrammed to be any other machine.
The advantages of this flexibility are hard to overstate. Software took us to the moon, developed life-saving medicines, sequenced the human genome, and gave us the ability to communicate and share information across the globe with unprecedented scale and speed.
This flexibility has also dramatically reduced the cost to make improvements and repair defects. Instead of an expensive site visit or the costly shipment of a physical device, we transitioned to sending CD-ROMs, then delivering software updates over the internet, followed by updating the software behind cloud services without any end-user interaction at all.
The impact on prevention and repair
But the reduced cost to repair defects carries a hidden consequence. When a machine is difficult or expensive to repair, there is a significant incentive to invest in up-front prevention to reduce or avoid the need for repairs.
When the cost to repair becomes sufficiently low, the incentive structure changes: it becomes not just advantageous, but practically inevitable, to shift some of the pre-deployment investment to post-deployment repairs. In other words: Sacrifice some amount of up-front quality and fix problems later.
This is in part reflected in the oft-cited trope that being first to market wins, but it goes far deeper than that. Today, we ask software to perform tasks that would be of mind-boggling complexity if they were built into physical, mechanical systems (if they were possible to construct at all).
In software, however, we don’t generally have a way to drive an analogous level of up-front perfection as we might with a physical system. It’s easier to assure the proper functioning of a mechanical wristwatch than a piece of software with millions of lines of code. Even if we wanted to, we would have to backtrack on many important areas of technological progress to do so.
The abstraction barrier and constant change
One of the principal ways software engineers manage this complexity is through the use of an abstraction barrier; that is, carving out a chunk of a problem into its own little “black box” with a clear interface in and out, and all the complexity needed to perform that chunk contained inside the box, in one place and hidden from everyone else.
The use of an abstraction barrier is an extremely powerful tool for breaking down what would otherwise be overwhelmingly complex problems into manageable pieces — and virtually any project of appreciable complexity employs this technique. It’s used up and down the stack: from within a piece of software to external libraries (relied on for common functionality, like date handling or string manipulation) to the operating system itself (which serves as an abstracted-away environment for the software to run on). In fact, there’s an old saying that every hard problem in computer engineering is solved by adding an additional layer of abstraction.
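As a minimal, hypothetical sketch of the idea: the class below exposes a single method, allow(), as its interface. The token-bucket bookkeeping inside the box can be reworked freely without callers ever noticing, as long as the interface’s contract holds.

```python
import time

class RateLimiter:
    """The "black box": callers see only allow(); the token-bucket
    state and refill math stay hidden behind the abstraction barrier."""

    def __init__(self, rate_per_sec: float, burst: int):
        self._rate = rate_per_sec
        self._capacity = float(burst)
        self._tokens = float(burst)
        self._last = time.monotonic()

    def allow(self) -> bool:
        # Refill tokens for the time elapsed, then spend one if available.
        now = time.monotonic()
        self._tokens = min(self._capacity,
                           self._tokens + (now - self._last) * self._rate)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False

limiter = RateLimiter(rate_per_sec=5, burst=10)
print(limiter.allow())  # True: callers never touch the internals
```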
But this too carries a drawback, rooted in two problems:
Every once in a while, the abstraction barrier is not as clean as expected. An extreme example was the Spectre vulnerability, in which there were subtle ways that information that was supposed to be contained “inside” the black box could leak out, with problematic security implications. Another example was the Shellshock vulnerability, in which the interface between two sides of the abstraction barrier was not as tightly controlled as believed, which led to a serious vulnerability in web servers around the world.
The software on both sides of the abstraction barrier needs to be updated over time to repair defects, for reasons discussed above.
These two problems combine to create an ever-changing environment around any piece of software. This constant evolution is one of the primary reasons why software is never “done” and always carries some maintenance cost.
Even if you feature-freeze a piece of software by dictating that the task of the software and its approach must remain the same, the system around it can change in ways that lead to unanticipated behaviors. And that’s an idealized case. Quite often the task evolves over time as use cases expand. (We’ll address this topic in depth in an upcoming post on technical debt.)
The path to reliability
One of my colleagues at Akamai would note at this point that we’ve spent an awful lot of time “admiring the problem.” The question is what to do about it. Do we throw up our hands and say it’s just too hard? Absolutely not. Half the discipline of engineering is achieving the desired level of quality under the conditions you have, not the conditions you want.
Obviously, techniques for improving the quality of the software are the starting point. Countless books and courses attack this topic in depth, covering safe coding techniques, correctness verification tools, testing strategies, review and teaming approaches, language selection, and more.
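As one small, hypothetical illustration of the testing strategies mentioned above: a property-style check compares an implementation against a known closed form across many random inputs — the kind of net that catches boundary mistakes a single hand-picked test case can miss.

```python
import random

def sum_first_n(n: int) -> int:
    return sum(range(1, n + 1))

# Property-style check against the closed form n(n + 1)/2. An
# off-by-one version of sum_first_n fails this immediately for any
# n >= 1, even though it might slip past one hand-written test.
for _ in range(1000):
    n = random.randint(0, 10_000)
    assert sum_first_n(n) == n * (n + 1) // 2
print("all checks passed")
```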
But to create a highly reliable cloud service, these techniques are only a foundation. Ensuring high-quality software is absolutely necessary, but not at all sufficient. If we assume, and rightly so, that even very high-quality software will have bugs, the question becomes: How do we ensure that those bugs don’t lead to large-scale outages?
Answering that question is our mission and will be the focus of the rest of this blog series. Stay tuned.