Why Resilient Systems Design Is Critical for Cloud Reliability

Jun 22, 2026

Written by

James Kretchmar

James Kretchmar is Senior Vice President and Chief Technology Officer for Cloud Technology at Akamai.

Key takeaways

Cloud reliability depends on designing systems for resilience rather than assuming flawless code, because software bugs are inevitable.
Systems architecture brings added complexity but also significant opportunities for advanced resilience.
Resilient architecture prevents minor bugs from escalating into major outages through varied techniques, such as redundancy, fault isolation, graceful degradation, automated incremental rollouts, and fail-safe modes.
Combining multiple resilience techniques is an effective and efficient way to create high availability services, although it remains vital to maintain high baseline reliability for each individual layer.

Welcome back to our reliability series

A few months ago, we set out to explore the subject of reliability in depth — from software and system complexity to organizational reliability culture, human factors considerations, and more — in a series of blog posts.

The first post in this blog series discussed in detail the ways in which cloud services depend on software, and how that software — even very high-quality software — will inevitably have bugs. How then can we build highly reliable cloud services if the software they’re composed of is imperfect? The first part of the answer lies in one of the most fundamentally important elements of an engineer’s job: designing for resilience.

In this blog post, the second in our series, we’ll dive into why resilient systems design is critical for cloud reliability.

Designing for resilience

Designing for resilience is a key part of every engineering discipline, not just software engineering. Civil engineers, for example, construct bridges to support more than their target load to account for unexpected conditions that could lead to failure. Similarly, mechanical engineers build elevators with special speed brakes to prevent them from falling even if the main cable breaks. And aerospace engineers use multiple flight control computers in concert in case one malfunctions in a spacecraft.

The theme across these examples is that the engineer is responsible for understanding where an unexpected but relatively likely problem could lead to a catastrophic failure, and for designing resilience against that circumstance.

In the world of software engineering, bugs are one of the most important problematic conditions to account for. This is partly because they are so commonplace. As an industry, we’ve come to accept that we can’t eliminate bugs entirely. That doesn’t mean we don’t try. In fact, we absolutely must strive for software to be free of bugs, for reasons we’ll address later in this post. But we know that we will fall short of that impossible goal, and our systems must continue to operate well when bugs occur.

Standalone software resilience

Before we get to the sophisticated resilience techniques that cloud services employ, it’s important to first note that stand-alone, non-cloud software also employs techniques to protect against bugs.

For a simple example, imagine a program containing a very complicated numerical calculation that’s expected to produce a number with certain properties, such as being greater than zero. A common defensive technique would be to validate that the result is positive before using it in further calculations. If a negative value is returned, the calling routine can take an action to safely stop and alert the user.

Although this particular example is exceedingly simple, the approach behind it is a powerful one: validating that conditions that must be true are in fact true, because if they’re not, a bug must be present. This and many other resilience techniques are the fundamentals of software engineering.
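The check described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: `complicated_calculation`, `checked_calculation`, and `InvariantViolation` are hypothetical names, and the formula inside the calculation is only a stand-in.

```python
import math


class InvariantViolation(Exception):
    """Raised when a result breaks a condition that must always hold."""


def complicated_calculation(x: float) -> float:
    # Hypothetical stand-in for the "very complicated numerical calculation".
    return math.exp(x) - 0.5


def checked_calculation(x: float) -> float:
    result = complicated_calculation(x)
    # The result must be greater than zero. Anything else (including NaN,
    # which fails every comparison) means a bug is present, so stop here
    # instead of feeding the value into further calculations.
    if not (result > 0):
        raise InvariantViolation(f"expected a positive result, got {result!r}")
    return result
```

A caller can catch `InvariantViolation` at a safe boundary, stop the operation, and alert the user rather than continuing with a value that is known to be wrong.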

From singletons to systems

Of course, much of modern software, especially in the cloud, is not composed of a single, monolithic, stand-alone program, but instead many separate pieces of software working together in real time. This can be true on a single machine, where several processes may be cooperating, or it can be software components working together across many machines within a rack, in a data center, or even distributed across the globe.

When we’re dealing with multiple pieces of software interacting with one another, the engineering vocabulary shifts from talking about software to talking about systems. In fact, when I studied Computer Science at the Massachusetts Institute of Technology in the late 1990s, these topics were the focus of two separate classes — one, which we called 6.170, was focused on software engineering; another, called 6.033, addressed systems engineering.

Systems engineering was its own class because when there are multiple autonomous pieces of software interacting with one another, it introduces a new set of considerations for engineers, including:

How do you synchronize data across instances when network delays among them are significant?
How do you ensure agreement among a set of instances on a decision they independently contribute to?
What happens when communications are disrupted among a subset of instances?

Fortunately, systems don’t only present new problems. They also present new opportunities for resilience.

Consider another simple example: a piece of software running in the cloud that loads a large dataset at startup, processes user inputs, performs calculations, and returns output to the user. Now, let’s say that this program has a bug that very occasionally causes it to crash, leading to downtime. A mechanism to restart the program automatically after a crash is generally a good resilience technique, but because this program takes a long time to reload the dataset before it can provide service, users still face disruptive downtime during the restart.

Systems to the rescue! Instead of running one instance of the software, let’s run two. When the primary instance crashes, the secondary instance, which already has the necessary data loaded and ready to go, takes over. We need to add some logic to direct users quickly to the secondary instance, but this is an effective strategy.
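As a rough sketch of that "logic to direct users quickly to the secondary instance," the Python below tries instances in order and falls through to the next one when an instance is down. The names (`serve_with_failover`, `InstanceDown`, and the two example instances) are assumptions made for illustration; a real deployment would typically handle this in a load balancer or service discovery layer rather than in application code.

```python
from typing import Callable, Sequence


class InstanceDown(Exception):
    """Raised by an instance that has crashed or cannot serve requests."""


def serve_with_failover(request: str, instances: Sequence[Callable[[str], str]]) -> str:
    # Try the primary first, then each standby in the order given.
    last_error = None
    for instance in instances:
        try:
            return instance(request)
        except InstanceDown as error:
            last_error = error  # this instance is unavailable; try the next one
    raise InstanceDown("no instance could serve the request") from last_error


def crashed_primary(request: str) -> str:
    # Simulates the primary instance after the bug has crashed it.
    raise InstanceDown("primary crashed")


def warm_secondary(request: str) -> str:
    # The secondary already has the dataset loaded, so it can answer at once.
    return f"secondary handled {request}"


print(serve_with_failover("query-1", [crashed_primary, warm_secondary]))
```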

Now consider an additional problem: What if the bug that crashes the program is based on the input of a particular user who unknowingly crashes the primary software instance, but then retries the request, which goes on to also crash the secondary instance?

Systems architecture offers interesting ways to isolate the risk. For example, if you run 20 software instances, and route each user to only two of them, a single user’s bad input can damage at most 10% of the service — leaving the other 90% unaffected. More sophisticated techniques can optimize this even further.
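One way to picture the 20-instance example is the sketch below, which is similar in spirit to what is sometimes called shuffle sharding. It is an illustration under stated assumptions: the hashing scheme and the function names `instances_for_user` and `blast_radius` are hypothetical, and only the figures 20 and 2 come from the example above.

```python
import hashlib
import random

INSTANCE_COUNT = 20      # figures from the example above
INSTANCES_PER_USER = 2


def instances_for_user(user_id: str) -> list[int]:
    # Seed a private random generator from a hash of the user ID so the same
    # user always maps to the same pair of instances.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")
    chosen = random.Random(seed).sample(range(INSTANCE_COUNT), INSTANCES_PER_USER)
    return sorted(chosen)


def blast_radius(user_id: str) -> float:
    # Worst case: the user's bad input crashes every instance they can reach.
    return len(instances_for_user(user_id)) / INSTANCE_COUNT


print(instances_for_user("user-42"))
print(blast_radius("user-42"))
```

Because each user can reach only two of the 20 instances, the worst-case fraction of the service one user can take down is 2/20, or 10%, regardless of how often they retry.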

Keeping small problems from becoming big problems

Fault isolation in the previous example is one technique that illustrates a broader goal of resiliency: Ensuring that a small problem doesn’t become a big one.

Electrical engineers use circuit breakers to keep a short circuit from becoming a fire. And hardware engineers build extra parity bits into the memory used in critical applications so that a stray cosmic ray doesn’t corrupt memory and lead to failure of a control system.

Software and systems engineers employ techniques such as redundancy, fault isolation, graceful degradation, automated incremental rollouts, and fail-safe modes to prevent minor bugs from escalating into major outages. These techniques are backed by foundational approaches such as providing telemetry for rapid problem detection and remediation. The details of these techniques fill courses like MIT’s 6.033. Later in this blog series, we’ll take a deeper dive into a few of the most critical ones.

Modeling systems reliability

Individual techniques aside, it’s important to think about how they come together in the aggregate so we can understand what kind of resilience these techniques provide to a system as a whole.

To start, consider a system with a primary component (system A) and a backup version of the component (system B). For example, if system A is a database that fails, a system B database can take over to provide continuity of service. The existence of system B makes the overall system more resilient.

We can extend this system B concept more generally. If we're concerned about a component failing when it receives a bad configuration update, the mitigation may not be an additional component, but rather an automated safety mechanism within system A to test and/or automatically reject a faulty configuration.
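A safety mechanism of this kind might look roughly like the following sketch, in which a component checks each configuration update and keeps running on its last known-good configuration if the update fails validation. The `Config` fields, the validation rules, and the `Component` class are all hypothetical, chosen only to make the idea concrete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    # Hypothetical settings, chosen only for illustration.
    timeout_seconds: float
    max_connections: int


def validate(config: Config) -> list[str]:
    # Return a list of problems; an empty list means the config looks safe.
    problems = []
    if config.timeout_seconds <= 0:
        problems.append("timeout_seconds must be positive")
    if config.max_connections < 1:
        problems.append("max_connections must be at least 1")
    return problems


class Component:
    def __init__(self, initial: Config) -> None:
        # Assumed to be a known-good starting configuration.
        self.active = initial

    def apply_update(self, candidate: Config) -> bool:
        # Reject a faulty update and keep running on the last good config.
        if validate(candidate):
            return False
        self.active = candidate
        return True
```

Here the "system B" is not a second component at all: it is the validation step inside system A that stops a bad update from taking effect.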

In this way, systems resilience can be modeled through generalized mitigations that address various failure modes.

There are a number of mathematical techniques to describe the reliability of this type of “A and B” system as a whole. One simplistic approach is to:

Consider the likelihood of component A failing over a given period
Consider the likelihood of mitigator B failing over a given period
Model the potential of double failure during that period as the product of the two (at least for the simple case in which the reliability of A and B are fully independent of each other)

If we’ve made the likelihood of failure a very small percentage for A, and a very small percentage for B, we get an exceedingly small likelihood of both failing as the product of the two.
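To put hypothetical numbers on this, suppose A and B each have a 0.1% chance of failing in a given period. The figures and the function name below are assumptions for illustration, and the calculation holds only under the independence condition noted above.

```python
def double_failure_probability(p_a: float, p_b: float) -> float:
    # Valid only when A and B fail independently of each other.
    return p_a * p_b


p_a = 0.001  # hypothetical: A fails in 0.1% of periods
p_b = 0.001  # hypothetical: B fails in 0.1% of periods
print(double_failure_probability(p_a, p_b))
```

Two components that each fail about one period in a thousand combine into a system that suffers a double failure about one period in a million.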

This fits our intuition, that a double failure is less likely than a single failure. But even this simple model reveals two critical points:

It’s often far less expensive to combine two relatively reliable systems into a configuration that’s highly reliable than it is to drive a single component to the same level of overall reliability.
It’s vitally important that systems A and B are both very reliable on their own. Because the likelihood of a double failure is the product of each component’s likelihood of failure, a minor degradation in the reliability of each component results in an overall impact that goes up with the square of the degradation. This is why, even though bugs are inevitable and we build resiliency to account for them, we still must strive to make software bug-free. If we become lax with the reliability of individual components, the double failure becomes much more likely. We’ll explore how organizational dynamics can lead to this circumstance in a future blog post.
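The "square of the degradation" point can be checked with a short derivation. The symbols are introduced here for illustration: p_A and p_B are the failure probabilities of A and B over the period, k is a hypothetical factor by which each one worsens, and the failures are assumed to be independent.

```latex
\[
P_{\text{both}} = p_A \, p_B
\qquad\Longrightarrow\qquad
P'_{\text{both}} = (k \, p_A)(k \, p_B) = k^2 \, p_A \, p_B = k^2 \, P_{\text{both}}
\]
```

If each component becomes 10 times more likely to fail (k = 10), the double failure becomes 100 times more likely.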

The “A and B” system model above is, of course, overly simplistic for a number of reasons, including that detecting and mitigating some failures requires more than two systems. But, more than that, this model is also now understood to be problematic in other ways.

For example, there are very often factors that do make the likelihood of failures of A or B dependent on each other in ways that are difficult to identify. Furthermore, a probabilistic computation very often does not properly account for low-probability, high-impact risks.
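One simple way to see how much a hidden dependency matters is a common-cause model, sketched below. This is an assumption made for illustration, not a model taken from this post: with probability `p_shared`, a single shared event (for example, a dependency both components rely on) takes out A and B together, and otherwise they fail independently.

```python
def double_failure_with_common_cause(p_a: float, p_b: float, p_shared: float) -> float:
    # With probability p_shared a common cause takes out both A and B;
    # otherwise they fail independently with probabilities p_a and p_b.
    return p_shared + (1.0 - p_shared) * p_a * p_b


# Hypothetical figures: each component fails in 0.1% of periods.
independent = double_failure_with_common_cause(0.001, 0.001, 0.0)
coupled = double_failure_with_common_cause(0.001, 0.001, 0.0001)
print(independent, coupled, coupled / independent)
```

With these numbers, a shared cause that strikes in only 0.01% of periods makes the double failure roughly a hundred times more likely than the independent model predicts, which is why undetected dependencies undermine the simple product calculation.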

So, although the model above is not in and of itself a recommended practical method for determining systems reliability, it is useful for illustrating some of the foundational principles.

The Swiss cheese model of accident causation

Resilient cloud software relies on a strategy of defense in depth, employing multiple layers of resilience technology. A software component is tested thoroughly, designed with resilient software engineering techniques, deployed in an environment using systems architecture principles that protect it from problematic conditions around it, instantiated within a broader fault-tolerant architecture, and so on.

This strategy is often called the “Swiss cheese” model. Any single slice of Swiss cheese has holes — which, in this case, represent the paths through which failures can occur. But a stack of slices, each with holes in different places, will hopefully leave no opening that goes all the way through the stack. It’s a neat visual representation of how layers of protection provide defense in depth: not just an A and B system, but an A, B, C, D, and E (etc.) system.
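The picture of stacked slices can be expressed as a small sketch in which each layer is described by the set of failure modes it does not catch (its holes). A failure gets all the way through only if it appears in every layer's set. The layer names and failure modes below are hypothetical.

```python
# Each layer maps to the set of failure modes it does NOT catch (its "holes").
# All names here are hypothetical.
layers = {
    "testing": {"rare race condition", "bad config", "malformed input"},
    "input validation": {"rare race condition", "bad config"},
    "config checks": {"rare race condition", "malformed input"},
    "redundancy": {"bad config", "malformed input"},
}


def uncaught_failure_modes(holes_by_layer: dict[str, set[str]]) -> set[str]:
    # A failure reaches users only if it passes through a hole in every layer.
    # Assumes at least one layer is given.
    return set.intersection(*holes_by_layer.values())


print(uncaught_failure_modes(layers))
```

In this example every failure mode is caught by at least one layer, so the result is empty. Removing the "redundancy" layer would leave "rare race condition" as a hole that lines up through the whole stack.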

However, this model is also overly simplistic for real-world use, which is another topic we’ll explore in a future post. Still, it does illustrate the foundational concept of layered defenses for resilience.

Highly reliable cloud services

We began this post by asking how we can build highly reliable cloud services using imperfect software that will inevitably have bugs. Part of the answer is now clear: We must design for resilience at both the software and systems level.

Akamai has prioritized this for more than 25 years, knowing that our customers depend on our platform to operate with the highest levels of reliability to keep their businesses online and running smoothly.

Yet, resilient systems design remains only one piece of the puzzle to providing highly reliable cloud services. Focusing on this piece alone isn’t enough. What else is missing, and what happens without it? I’ll discuss the next key piece of the puzzle in my next blog post.

Stay tuned.
