Distributed AI Inference: Why Placement Is the New Bottleneck

May 27, 2026

Written by Alex Leung

Alex Leung is a Senior Enterprise Architect at Akamai Technologies. With over nine years at Akamai, Alex has been instrumental in advancing the capabilities of streaming high-quality media content through Akamai.

Executive summary

In today's AI infrastructure, the bottleneck is no longer raw compute but inference placement.
As models scale, a unified, three-layer architecture (hyperscale cloud, regional data centers, and edge nodes) is replacing the traditional “cloud vs. edge” debate.
Because preprocessing and embedding are now primary bottlenecks, compute must live near the data source to reduce bandwidth costs.
Distributed architectures ease power, cooling, and water-use limits by spreading thermal load across smaller facilities.
Success depends on “placement flexibility”: the ability to route workloads based on payload size, hardware needs, and traffic spikes.
Ultimately, keeping an AI system viable requires a flexible control plane that can adapt as bottlenecks migrate across the infrastructure stack.

In April 2026, I spoke on a panel at the Taiwan Cloud & Datacenter Convention about edge AI, why intelligence is moving closer to data, and what that shift means for infrastructure. The conversation covered a lot of ground, but one idea kept surfacing, and it's the one I want to spend some time on here:

In real AI systems, bottlenecks don't disappear. They move.

That sounds like a minor observation. It isn't. It's the thing that determines whether your inference architecture remains viable at scale or quietly collapses under its own weight six months after you ship.

What actually slows down an AI system?

When most teams start building an AI-powered application, they expect the model itself to be the expensive part. They believe that because training is expensive, inference must be too. And since GPUs are expensive, GPU-bound work must be the bottleneck. That intuition is reasonable, and for a while it was correct.

It's no longer correct in most of the systems I've looked at recently.

Case study: Image search

For example, we built an image search application — the kind of thing in which a user uploads a photo and the system returns visually similar results. It's a standard architecture: Generate an embedding for the query, run a vector search against an index, and return the top matches.

When we profiled it, the costs broke down roughly like this:

Vector search latency: Negligible
Embedding generation: Dominant cost
Data movement and preprocessing: Nontrivial

Retrieval was fast. The GPUs were powerful. The model was fine, but the bottleneck had moved to embedding, the step where raw input gets converted into the vector representation the rest of the system operates on.
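A breakdown like the one above comes from timing each stage separately rather than the request as a whole. The following is a minimal sketch of that kind of per-stage profiling, not the tooling we used. The `StageProfiler` class and the `preprocess`, `embed`, and `vector_search` functions are hypothetical stand-ins; in a real system they would wrap the actual image decoding, model call, and index lookup.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageProfiler:
    """Accumulates wall-clock time per named pipeline stage."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record elapsed time even if the stage raises.
            self.totals[name] += time.perf_counter() - start

    def shares(self):
        """Return each stage's fraction of the total measured time."""
        total = sum(self.totals.values())
        if total == 0:
            return {}
        return {name: t / total for name, t in self.totals.items()}


# Hypothetical placeholder stages; replace with real implementations.
def preprocess(image_bytes):
    return image_bytes[:1024]


def embed(pixels):
    return [b / 255 for b in pixels]


def vector_search(query, k=3):
    return list(range(k))


def handle_query(image_bytes, profiler):
    """Run one query through the pipeline, timing each stage."""
    with profiler.stage("preprocess"):
        pixels = preprocess(image_bytes)
    with profiler.stage("embed"):
        query = embed(pixels)
    with profiler.stage("vector_search"):
        return vector_search(query)
```

Calling `profiler.shares()` after a representative batch of queries shows which stage dominates, which is the number that should drive placement decisions.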

This pattern isn't unique to image search. Across retrieval-augmented generation (RAG), recommendation systems, multimodal pipelines, and agentic workflows, the story is similar: The parts of the stack we spent a decade optimizing are no longer the slow parts. The slow parts are the ones closest to the data.

Why this matters for where you put compute

Once you accept that bottlenecks move, the follow-up question is the interesting one: Where should the expensive work actually happen?

If embedding is the dominant cost, and embedding operates on raw input data, then the compute wants to be near the data. Moving raw data to a distant region to embed it, then moving the embeddings back, wastes bandwidth, adds latency, and compounds at every step.
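To make the bandwidth argument concrete, here is a back-of-the-envelope sketch. The numbers are assumptions chosen for illustration, not measurements: a 4 MB raw image, a 768-dimension float32 embedding, and a design in which only the embedding has to cross the long-haul link when embedding happens near the source.

```python
def embedding_bytes(dimensions, bytes_per_value=4):
    """Size of one embedding vector (float32 by default)."""
    return dimensions * bytes_per_value


def bytes_over_wan(raw_bytes, dimensions, embed_near_source):
    """Bytes per request that cross the long-haul link.

    Near the source: only the embedding travels.
    In a distant region: the raw input travels out and the
    embedding travels back.
    """
    vec = embedding_bytes(dimensions)
    return vec if embed_near_source else raw_bytes + vec


# Illustrative values (assumptions, not measurements).
RAW_IMAGE_BYTES = 4_000_000
DIMENSIONS = 768

near = bytes_over_wan(RAW_IMAGE_BYTES, DIMENSIONS, embed_near_source=True)
far = bytes_over_wan(RAW_IMAGE_BYTES, DIMENSIONS, embed_near_source=False)
print(near)  # 3072
print(far)   # 4003072
```

Under these assumptions, embedding in a distant region moves roughly three orders of magnitude more data per request, and that cost repeats on every request.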

The traditional drivers for edge computing still apply:

Latency: Users notice
Bandwidth efficiency: Moving data costs money
Data locality: Some data can't or shouldn't leave a region

But the newer driver — and, in my view, the more decisive one for AI — is computational placement. You want the expensive step to happen where it's cheapest to run. And for embedding-heavy workloads, that's close to the source.

This architectural pattern is one of the reasons Akamai Cloud is built the way it is. We have a globally distributed footprint with compute available near where data is generated. This is exactly the shape the workload wants.

Edge and cloud are not competing

One of the more outdated framings in this space is “edge vs. cloud.” Today, that’s the wrong frame.

According to the panelists, the cleaner way to view the market is through a unified, three-layer architecture model:

Hyperscale cloud for model training: You need massive parallelism and the ability to tolerate long-running jobs.
Regional data centers for large-scale inference: You need serious compute but latency to the user is only moderately sensitive.
Edge nodes for real-time, latency-sensitive decisions: The round trip to a region is the limiting factor.

Each layer optimizes for a different constraint.

Training wants density.
Regional inference wants throughput.
Edge inference wants proximity.

You don't pick one; you decide which workloads belong at which layer, and you architect for all three.

The teams that are getting this right aren't asking “cloud or edge?” They're asking “Which parts of this pipeline belong where?” That's a harder question to answer, but it's the correct one.

The infrastructure reality nobody wants to talk about

Here's the part of the panel discussion that got the most nods from the infrastructure people in the room — and the least attention from the AI-framing crowd: Power and cooling are becoming the binding constraint.

Edge sites in traditional edge computing deployments were historically CPU-dominated. GPUs change that. A rack of modern accelerators draws more power and generates more heat than anything most edge facilities were designed for. Water use, which almost nobody was thinking about five years ago, is now a real consideration in certain regions.

The implication is uncomfortable for anyone planning an AI infrastructure strategy around “We'll just build bigger data centers.” You can't build them fast enough, and in some places you can't build them at all. Power connections can take years. Cooling capacity is finite. Permitting is slow.

Distributed architecture to the rescue

A distributed architecture helps here in a way that's easy to miss. When you spread workloads across many smaller facilities rather than concentrating them in a few megasites, you're not only distributing compute but also distributing thermal load, power draw, and water use. Each individual site stays within its envelope. The aggregate capacity scales without any single facility having to triple in size.
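The envelope idea can be sketched as a simple allocation problem. This is an illustration under stated assumptions, not a capacity-planning tool: the `Site` class, the `place_racks` function, and all kilowatt figures are hypothetical, and real planning would also account for cooling, water, and network limits.

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    power_envelope_kw: float  # maximum the facility can draw
    allocated_kw: float = 0.0

    @property
    def headroom_kw(self):
        return self.power_envelope_kw - self.allocated_kw


def place_racks(sites, rack_count, rack_kw):
    """Assign racks one at a time to the site with the most headroom.

    Mutates each site's allocated_kw. Returns {site name: rack count}.
    Raises RuntimeError once no site can take another rack.
    """
    placement = {s.name: 0 for s in sites}
    for _ in range(rack_count):
        best = max(sites, key=lambda s: s.headroom_kw)
        if best.headroom_kw < rack_kw:
            raise RuntimeError("no site has headroom for another rack")
        best.allocated_kw += rack_kw
        placement[best.name] += 1
    return placement
```

With three hypothetical sites of 100, 100, and 60 kW and racks drawing 40 kW each, five racks fit (two, two, and one) and no site exceeds its envelope. A sixth rack raises an error, which is the signal to add a site rather than enlarge an existing one.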

This is a genuinely different answer to the capacity problem than building another hyperscale region, and it's one of the few arguments for distributed infrastructure that remains even if you don't care about latency.

Framework for workload placement decisions

The abstract version of this is easy: Put the work where it’s cheapest to run. The operational version is harder, because “cheapest” depends on what you're optimizing for on any given day.

In practice, placement for inference workloads comes down to a handful of signals that you have to weigh against one another.

How latency-sensitive is the request? Does a 200 ms round trip break the user experience, or is nobody going to notice?
How large is the payload? Are you shipping a short text prompt or a 4K video frame?
How specialized is the hardware you need? Will any modern NVIDIA GPU do, or do you need a specific accelerator profile that only exists in a few regions?
How stable is the traffic pattern? Is this a steady load you can provision for, or a spiky load that needs burst capacity somewhere elastic?

No single answer covers all of those questions. An agentic workflow making a dozen sequential model calls has a completely different placement profile than a batch embedding job that’s running overnight. A real-time recommendation query wants an edge node; a quarterly model retraining job wants a hyperscale region with deep accelerator pools.
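One way to weigh the four signals is a small rule-based function. The sketch below is deliberately simplistic and every threshold in it is an assumption: the 200 ms cutoff echoes the question above, the 1 MB payload cutoff is arbitrary, and the rule that bursty large-payload work goes to a regional data center assumes that is where elastic capacity lives. `Tier`, `Workload`, and `choose_tier` are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    EDGE = "edge"
    REGIONAL = "regional"
    HYPERSCALE = "hyperscale"


@dataclass
class Workload:
    latency_budget_ms: float
    payload_bytes: int
    needs_specialized_accelerator: bool
    bursty: bool


# Illustrative thresholds (assumptions, tune per system).
EDGE_LATENCY_MS = 200
LARGE_PAYLOAD_BYTES = 1_000_000


def choose_tier(w):
    """Pick a tier from the four placement signals, in priority order."""
    if w.needs_specialized_accelerator:
        # Assumes scarce accelerator profiles exist only in hyperscale regions.
        return Tier.HYPERSCALE
    if w.latency_budget_ms < EDGE_LATENCY_MS:
        return Tier.EDGE
    if w.payload_bytes >= LARGE_PAYLOAD_BYTES and not w.bursty:
        # Steady, heavy payloads: keep compute near the data.
        return Tier.EDGE
    return Tier.REGIONAL


recommendation = Workload(50, 2_000, False, True)
retraining = Workload(3_600_000, 10**12, True, False)
print(choose_tier(recommendation))  # Tier.EDGE
print(choose_tier(retraining))      # Tier.HYPERSCALE
```

A production version would more likely score candidates against cost and capacity than apply fixed rules, but even this form makes the trade-offs explicit and testable.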

Most production AI workloads are actually a pipeline of steps, and the right answer is often that different steps in the same pipeline belong in different places.

Placement decisions must change quickly when something shifts

This is where load balancing and orchestration stop being afterthoughts. If you're running inference across a distributed footprint, something has to decide which request goes to which node, route around failures, and shift traffic as regional capacity fluctuates.
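At its simplest, that routing decision is a filter followed by a ranking. The sketch below assumes health, round-trip time, and in-flight counts are already collected by some monitoring layer; the `Node` class and `pick_node` function are hypothetical and not part of any particular load balancer.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    healthy: bool
    rtt_ms: float    # measured round trip from the requester
    capacity: int    # concurrent requests the node can serve
    in_flight: int = 0

    @property
    def free_slots(self):
        return self.capacity - self.in_flight


def pick_node(nodes):
    """Return the lowest-RTT node that is healthy and has a free slot.

    Returns None when no node qualifies, so the caller can queue or shed load.
    """
    candidates = [n for n in nodes if n.healthy and n.free_slots > 0]
    if not candidates:
        return None
    return min(candidates, key=lambda n: n.rtt_ms)
```

If the nearest edge node is full and the next nearest is unhealthy, the request falls through to a regional node with higher latency. Real routers add hardware matching, cost, and hysteresis on top of this, which is where most of the custom logic ends up.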

The ecosystem of tools for this is still immature. Most teams end up writing a lot of the routing logic themselves, because the off-the-shelf options assume a single region or a single class of hardware. That gap, more than raw compute availability, is what I'd bet determines which distributed AI systems hold up over the next two years.

The providers that do this well won't necessarily be the ones with the most GPUs. They'll be the ones whose scheduling, routing, and observability across sites are good enough that placement decisions can change quickly when something shifts, such as model size, traffic pattern, or a regional capacity constraint. Flexibility at the control plane is what makes the underlying hardware useful.

Planning a distributed AI strategy

If you're designing AI infrastructure in 2026, three decisions are going to shape how well your system holds up:

Inference placement
Model lifecycle management
Accelerator utilization

Inference placement

Not every workload belongs at the edge. Some need the throughput of a regional data center. Some need the specialized accelerators of a hyperscale region. The work is figuring out where each piece lives, workload by workload. Based on what I’ve seen, I’d resist the temptation to pick one layer and force everything into it.

Model lifecycle management

Distributed inference makes lifecycle management harder. You need versioning, rollout, rollback, and observability across sites that may have different hardware, different network conditions, and different failure modes. If you're pushing models to hundreds of locations, the orchestration layer is as important as the inference layer.
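A staged rollout with automatic rollback is one common way to handle this. The sketch below shows the control flow only, under stated assumptions: `staged_rollout` is a hypothetical function, and the `deploy` and `is_healthy` callables stand in for whatever deployment and health-check mechanisms a real platform provides.

```python
def staged_rollout(sites, new_version, deploy, is_healthy, wave_size=2):
    """Deploy a model version in waves, rolling back on a failed health check.

    sites:      dict of site name -> current version (mutated in place)
    deploy:     callable(site_name, version) that pushes a version
    is_healthy: callable(site_name) -> bool, checked after each wave
    Returns True if every site ends on new_version, False if rolled back.
    """
    previous = {}
    names = list(sites)
    for i in range(0, len(names), wave_size):
        wave = names[i:i + wave_size]
        for name in wave:
            previous[name] = sites[name]
            deploy(name, new_version)
            sites[name] = new_version
        if not all(is_healthy(name) for name in wave):
            # Restore every site touched so far, including earlier waves.
            for name, old_version in previous.items():
                deploy(name, old_version)
                sites[name] = old_version
            return False
    return True
```

Rolling back earlier waves as well as the failing one keeps the fleet on a single version, which simplifies debugging. Whether that is the right policy depends on how much version skew the system can tolerate.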

Accelerator utilization

Once a model is loaded and warm, inference is fast. The hard problem stops being “Can we run this?” and becomes “Are we actually using the hardware we're paying for?” Utilization is a scheduling problem, a routing problem, and a workload-shaping problem, and it is where a lot of the cost optimization in the next few years is going to come from.
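The cost effect of utilization is easy to quantify. The sketch below uses made-up figures (a $2.00 hourly rate and 10,000 requests per hour at full load) purely for illustration; `utilization` and `cost_per_request` are hypothetical helper names.

```python
def utilization(busy_seconds, provisioned_seconds):
    """Fraction of paid-for accelerator time spent doing useful work."""
    if provisioned_seconds <= 0:
        raise ValueError("provisioned_seconds must be positive")
    return busy_seconds / provisioned_seconds


def cost_per_request(hourly_rate, requests_per_hour_at_full_load, util):
    """Effective cost per request when the accelerator is busy
    only `util` of the time it is provisioned."""
    if not 0 < util <= 1:
        raise ValueError("util must be in (0, 1]")
    return hourly_rate / (requests_per_hour_at_full_load * util)


# Illustrative figures (assumptions, not real pricing).
util = utilization(busy_seconds=900, provisioned_seconds=3600)
print(util)  # 0.25
```

At 25% utilization, each request effectively costs four times what it would on a fully used accelerator. That is why routing and batching decisions often matter more to the bill than the hourly rate.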

None of these decisions are solved yet. All three are active areas of engineering, for us and for everyone else in this space.

One last thought

If you take away nothing else from this blog post, take this: The single most useful question to ask about your AI system is not “Is it fast enough?” It's “Where is the bottleneck, and am I handling it in the right place?”

Bottlenecks are going to move. They always do. The teams that stay ahead are the ones who keep asking where the bottleneck is now, and have the architectural flexibility to move the work when the answer changes.

That's harder than picking a cloud region and hoping for the best. It's also, as far as I can tell, the only approach that holds up.

The practical work in this space is less glamorous than the AI inference headlines suggest.

It's routing decisions about which nodes should run inference for which requests.
It's figuring out when parallelism across regional data centers beats a single larger deployment, and when it doesn't.
It's deciding which workloads genuinely need real-time response at the edge and which are fine with a regional round trip.

Conclusion

Distributed inference, in practice, is a lot of small decisions about placement, scheduling, and failure modes, made over and over as traffic patterns and model sizes shift. The teams that treat those decisions as a first-class engineering problem, rather than something the cloud provider will figure out, are the ones whose systems still work at scale a year from now.

Learn more

If you're working on distributed inference architectures and want to dig into specifics such as GPU placement options, edge-native serverless patterns, and reference architectures for embedding-heavy pipelines, I’d suggest exploring Akamai Inference Cloud.
