Key takeaways
Cloud reliability depends on designing systems for resilience rather than assuming flawless code, because software bugs are inevitable.
Systems architecture brings added complexity but also significant opportunities for advanced resilience.
Resilient architecture prevents minor bugs from escalating into major outages through varied techniques, such as redundancy, fault isolation, graceful degradation, automated incremental rollouts, and fail-safe modes.
Combining multiple resilience techniques is an effective and efficient way to create high availability services, although it remains vital to maintain high baseline reliability for each individual layer.
Welcome back to our reliability series
A few months ago, we set out to explore the subject of reliability in depth — from software and system complexity to organizational reliability culture, human factors considerations, and more — in a series of blog posts.
The first post in this blog series discussed in detail the ways in which cloud services depend on software, and how that software — even very high-quality software — will inevitably have bugs. How then can we build highly reliable cloud services if the software they’re composed of is imperfect? The first part of the answer lies in one of the most fundamentally important elements of an engineer’s job: designing for resilience.
In this blog post, the second in our series, we’ll dive into why resilient systems design is critical for cloud reliability.
Designing for resilience
Designing for resilience is a key part of every engineering discipline, not just software engineering. Civil engineers, for example, construct bridges to support more than their target load to account for unexpected conditions that could lead to failure. Similarly, mechanical engineers build elevators with special speed brakes to prevent them from falling even if the main cable breaks. And aerospace engineers run multiple flight control computers in concert on a spacecraft in case one malfunctions.
The theme across these examples is that the engineer is responsible for understanding where an unexpected but relatively likely problem could lead to a catastrophic failure, and for designing resilience against that circumstance.
In the world of software engineering, bugs are one of the most important problematic conditions to account for. This is partly because they are so commonplace. As an industry, we’ve come to accept that we can’t eliminate bugs entirely. That doesn’t mean we don’t try. In fact, we absolutely must strive for software to be free of bugs, for reasons we’ll address later in this post. But we know that we will fall short of that impossible goal, and our systems must continue to operate well when bugs do occur.
Standalone software resilience
Before we get to the sophisticated resilience techniques that cloud services employ, it’s important to first note that stand-alone, non-cloud software also employs techniques to protect against bugs.
For a simple example, imagine a program containing a very complicated numerical calculation that’s expected to produce a number with certain properties, such as being greater than zero. A common defensive technique would be to validate that the result is positive before using it in further calculations. If a negative value is returned, the calling routine can take an action to safely stop and alert the user.
Although this particular example is exceedingly simple, the approach behind it is a powerful one: Validate that the conditions that must be true actually are true, because if they’re not, a bug must be present. This and many other resilience techniques are the fundamentals of software engineering.
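To make the validation idea concrete, here is a minimal sketch. The names InvariantViolation, checked_positive, and safe_run are hypothetical, chosen only for illustration, and the lambda in the usage note stands in for the complicated calculation.

```python
from typing import Callable, Optional


class InvariantViolation(Exception):
    """Raised when a result breaks a condition that must always hold."""


def checked_positive(compute: Callable[[float], float], x: float) -> float:
    """Run a calculation and confirm the result is greater than zero."""
    result = compute(x)
    # "not result > 0" also rejects NaN, which fails every comparison.
    if not result > 0:
        raise InvariantViolation(
            f"expected a positive result, got {result!r} for input {x!r}"
        )
    return result


def safe_run(compute: Callable[[float], float], x: float) -> Optional[float]:
    """Calling routine: stop safely and alert the user instead of continuing."""
    try:
        return checked_positive(compute, x)
    except InvariantViolation as err:
        print(f"Stopping safely: {err}")
        return None
```

For example, safe_run(lambda v: v - 5.0, 1.0) prints an alert and returns None rather than passing a negative value on to later calculations.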
From singletons to systems
Of course, much of modern software, especially in the cloud, is not composed of a single, monolithic, stand-alone program, but instead many separate pieces of software working together in real time. This can be true on a single machine, where several processes may be cooperating, or it can be software components working together across many machines within a rack, in a data center, or even distributed across the globe.
When we’re dealing with multiple pieces of software interacting with one another, the engineering vocabulary shifts from talking about software to talking about systems. In fact, when I studied Computer Science at the Massachusetts Institute of Technology in the late 1990s, these topics were the focus of two separate classes — one, which we called 6.170, was focused on software engineering; another, called 6.033, addressed systems engineering.
Systems engineering was its own class because when there are multiple autonomous pieces of software interacting with one another, it introduces a new set of considerations for engineers, including:
How do you synchronize data across instances when network delays among them are significant?
How do you ensure agreement among a set of instances on a decision they independently contribute to?
What happens when communications are disrupted among a subset of instances?
Fortunately, systems don’t only present new problems. They also present new opportunities for resilience.
Consider another simple example: a piece of software running in the cloud that loads a large dataset at startup, processes user inputs, performs calculations, and returns output to the user. Now, let’s say that this program has a bug that very occasionally causes it to crash, leading to downtime. A mechanism to restart the program automatically after a crash is generally a good resilience technique, but because this program takes a long time to reload the dataset before it can provide service, users still face disruptive downtime during the restart.
Systems to the rescue! Instead of running one instance of the software, let’s run two. When the primary instance crashes, the secondary instance, which already has the necessary data loaded and ready to go, takes over. We need to add some logic to direct users quickly to the secondary instance, but this is an effective strategy.
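The "logic to direct users quickly to the secondary instance" could look something like the sketch below. It is a simplification that assumes a crashed instance surfaces as a ConnectionError; the Instance and FailoverRouter classes are hypothetical names, not part of any real platform.

```python
class Instance:
    """A running copy of the program with its dataset already loaded."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.healthy = True

    def handle(self, request: str) -> str:
        if not self.healthy:
            # Assumption: a crashed instance shows up as a connection failure.
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


class FailoverRouter:
    """Tries the primary first, then falls back to the warm secondary."""

    def __init__(self, primary: Instance, secondary: Instance) -> None:
        self.instances = [primary, secondary]

    def handle(self, request: str) -> str:
        for instance in self.instances:
            try:
                return instance.handle(request)
            except ConnectionError:
                continue  # try the next instance in priority order
        raise RuntimeError("no healthy instance available")
```

Because the secondary already holds the data, the user sees a redirect rather than a long reload. A production router would also need health checks and timeouts, which this sketch leaves out.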
Now consider an additional problem: What if the bug that crashes the program is based on the input of a particular user who unknowingly crashes the primary software instance, but then retries the request, which goes on to also crash the secondary instance?
Systems architecture offers interesting ways to isolate the risk. For example, if you run 20 software instances, and route each user to only two of them, a single user’s bad input can damage at most 10% of the service — leaving the other 90% unaffected. More sophisticated techniques can optimize this even further.
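One way to picture the 20-instance example is the sketch below, which deterministically assigns each user to two instances based on a hash of the user ID. The constants mirror the numbers in the text; the function names and the hashing scheme are assumptions made for illustration. This style of assignment is often called shuffle sharding.

```python
import hashlib
import random

NUM_INSTANCES = 20  # total instances, as in the example above
SHARD_SIZE = 2      # instances each user is allowed to reach


def shard_for(user_id: str) -> tuple[int, ...]:
    """Return the instance indexes that serve this user.

    The same user always maps to the same instances, so a request that
    crashes one instance can only be retried against the user's other one.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return tuple(sorted(rng.sample(range(NUM_INSTANCES), SHARD_SIZE)))


def blast_radius() -> float:
    """Largest fraction of instances a single user's bad input can reach."""
    return SHARD_SIZE / NUM_INSTANCES
```

With these numbers, blast_radius() is 0.1, which is the 10% figure above. Note that other users who share one of the affected instances may still lose part of their capacity, which is one reason more sophisticated assignment schemes exist.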
Keeping small problems from becoming big problems
Fault isolation in the previous example is one technique that illustrates a broader goal of resiliency: Ensuring that a small problem doesn’t become a big one.
Electrical engineers use circuit breakers to keep a short circuit from becoming a fire. And hardware engineers build extra parity bits into the memory used in critical applications so that a stray cosmic ray doesn’t corrupt memory and lead to failure of a control system.
Software and systems engineers employ techniques such as redundancy, fault isolation, graceful degradation, automated incremental rollouts, and fail-safe modes to prevent minor bugs from escalating into major outages. These techniques are backed by foundational approaches such as providing telemetry for rapid problem detection and remediation. The details of these techniques fill courses like MIT’s 6.033. Later in this blog series, we’ll take a deeper dive into a few of the most critical ones.
Modeling systems reliability
Individual techniques aside, it’s important to think about how they come together in the aggregate so we can understand what kind of resilience these techniques provide to a system as a whole.
To start, consider a system with a primary component (system A) and a backup version of the component (system B). For example, if system A is a database that fails, a system B database can take over to provide continuity of service. The existence of system B makes the overall system more resilient.
We can extend this system B concept more generally. If we're concerned about a component failing when it receives a bad configuration update, the mitigation may not be an additional component, but rather an automated safety mechanism within system A to test and/or automatically reject a faulty configuration.
In this way, systems resilience can be modeled through generalized mitigations that address various failure modes.
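As an illustration of a mitigation that lives inside system A, the sketch below rejects a faulty configuration update and keeps the last known good one. The Config fields and their limits are invented for the example and are not taken from any real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    # Hypothetical settings, used only for illustration.
    max_connections: int
    timeout_seconds: float


def validate(candidate: Config) -> list[str]:
    """Return a list of problems; an empty list means the config looks safe."""
    problems = []
    if candidate.max_connections <= 0:
        problems.append("max_connections must be positive")
    if not 0 < candidate.timeout_seconds <= 300:
        problems.append("timeout_seconds must be in (0, 300]")
    return problems


class Component:
    """System A with a built-in safety mechanism for configuration updates."""

    def __init__(self, initial: Config) -> None:
        self.active = initial

    def apply_update(self, candidate: Config) -> bool:
        if validate(candidate):
            # Reject the update and keep running on the last known good config.
            return False
        self.active = candidate
        return True
```

Here the validator plays the role of "system B": it is not a second copy of the component, but it mitigates one specific failure mode.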
There are a number of mathematical techniques to describe the reliability of this type of “A and B” system as a whole. One simplistic approach is to:
Consider the likelihood of component A failing over a given period
Consider the likelihood of mitigator B failing over a given period
Model the potential of double failure during that period as the product of the two (at least for the simple case in which failures of A and B are fully independent of each other)
If we’ve made the likelihood of failure a very small percentage for A, and a very small percentage for B, we get an exceedingly small likelihood of both failing as the product of the two.
This fits our intuition, that a double failure is less likely than a single failure. But even this simple model reveals two critical points:
It’s often far less expensive to combine two relatively reliable systems into a configuration that’s highly reliable than it is to drive a single component to the same level of overall reliability.
It’s vitally important that systems A and B are both very reliable on their own. Because the overall likelihood of failure is the product of the likelihood of failure of each one, a minor degradation in the reliability of each component results in an overall impact that goes up with the square of the degradation. For example, if A and B each become 10 times more likely to fail, a double failure becomes 100 times more likely. This is why even though bugs are inevitable, and we build resiliency to account for them, we still must strive to make software bug-free. If we become lax with the reliability of individual components, the double failure becomes much more likely. We’ll explore how organizational dynamics can lead to this circumstance in a future blog post.
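The arithmetic behind both points can be checked with a few lines of code. This sketch assumes fully independent failures, as the simple model does, and the probabilities in the usage note are made-up figures rather than measurements. Fractions are used so the results are exact.

```python
from fractions import Fraction


def double_failure(p_a: Fraction, p_b: Fraction) -> Fraction:
    """Probability that A and B both fail, assuming independence."""
    return p_a * p_b


def degradation_impact(p_a: Fraction, p_b: Fraction, factor: int) -> Fraction:
    """How much more likely a double failure becomes when the failure
    probability of each component grows by the same factor."""
    before = double_failure(p_a, p_b)
    after = double_failure(p_a * factor, p_b * factor)
    return after / before
```

If A and B each fail with probability 1 in 1,000 over the period, double_failure gives 1 in 1,000,000. If both become 10 times less reliable, degradation_impact returns 100: the combined system is 100 times more likely to fail, not 10.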
The “A and B” system model above is, of course, overly simplistic for a number of reasons, including that detecting and mitigating some failures requires more than two systems. But, more than that, this model is also now understood to be problematic in other ways.
For example, there are very often factors that do make the likelihood of failures of A or B dependent on each other in ways that are difficult to identify. Furthermore, a probabilistic computation very often does not properly account for low-probability, high-impact risks.
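To see why hidden dependence matters so much, consider a sketch in which A and B share a single common cause, such as a shared dependency, that takes both down at once. The common-cause term and all of the numbers are assumptions for illustration only.

```python
from fractions import Fraction


def both_fail_independent(p_a: Fraction, p_b: Fraction) -> Fraction:
    """The simple model: failures of A and B are unrelated."""
    return p_a * p_b


def both_fail_with_common_cause(
    p_a: Fraction, p_b: Fraction, p_common: Fraction
) -> Fraction:
    """Either the shared cause strikes and takes out both components,
    or it does not and the two components fail independently."""
    return p_common + (1 - p_common) * p_a * p_b
```

With p_a and p_b at 1 in 1,000 and a shared cause at only 1 in 10,000, the chance of a double failure is roughly 100 times higher than the independent model predicts. The rare shared cause dominates the result.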
So, although the model above is not in and of itself a recommended practical method for determining systems reliability, it is useful for illustrating some of the foundational principles.
The Swiss cheese model of accident causation
Resilient cloud software relies on a strategy of defense in depth, employing multiple layers of resilience technology. A software component is tested thoroughly, designed with resilient software engineering techniques, deployed in an environment using systems architecture principles that protect it from problematic conditions around it, instantiated within a broader fault-tolerant architecture, and so on.
This strategy is often called the “Swiss cheese” model. Any single slice of Swiss cheese has holes — which, in this case, represent the paths through which failures can occur. But a stack of slices, each with holes in different places, will hopefully leave no opening that goes all the way through the stack. It’s a neat visual representation of how layers of protection provide defense in depth: not just an A and B system, but an A, B, C, D, and E (etc.) system.
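The stack of slices can be sketched with sets, where each layer is described by the failure modes it fails to catch (its holes). The failure mode names in the usage note are hypothetical.

```python
def escapes_all_layers(failure_mode: str, layers: list[set[str]]) -> bool:
    """True if the failure mode lines up with a hole in every layer."""
    return all(failure_mode in holes for holes in layers)


def uncovered_failure_modes(layers: list[set[str]]) -> set[str]:
    """Failure modes that pass through the whole stack."""
    if not layers:
        return set()
    return set.intersection(*layers)
```

Suppose testing misses {"race condition", "bad config", "memory leak"}, runtime validation misses {"race condition", "memory leak"}, and incremental rollout misses {"memory leak", "bad input"}. Only "memory leak" passes through all three layers. Adding a fourth layer whose holes do not include it, such as an automatic restart, leaves no opening through the stack.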
However, this model is also overly simplistic for real-world use, which is another topic we’ll explore in a future post. Still, it does illustrate the foundational concept of layered defenses for resilience.
Highly reliable cloud services
We began this post by asking how we can build highly reliable cloud services using imperfect software that will inevitably have bugs. Part of the answer is now clear: We must design for resilience at both the software and systems level.
Akamai has prioritized this for more than 25 years, knowing that our customers depend on our platform to operate with the highest levels of reliability to keep their businesses online and running smoothly.
Yet, resilient systems design remains only one piece of the puzzle to providing highly reliable cloud services. Focusing on this piece alone isn’t enough. What else is missing, and what happens without it? I’ll discuss the next key piece of the puzzle in my next blog post.
Stay tuned.