Executive summary
- The shifting landscape of AI infrastructure reveals that the bottleneck is no longer raw compute; it is inference placement.
- As models scale, a unified, three-layer architecture (including hyperscale cloud, regional data centers, and edge nodes) is replacing the traditional “cloud vs. edge” debate.
- Because preprocessing and embedding are now primary bottlenecks, compute must live near the data source to reduce bandwidth costs.
- Distributed architectures mitigate power, cooling, and water use limits by spreading thermal loads across smaller facilities.
- Success depends on “placement flexibility” — the ability to route workloads based on payload size, hardware needs, and traffic spikes.
- Ultimately, maintaining a viable AI system requires a flexible control plane that can adapt as bottlenecks inevitably migrate across the infrastructure stack.
In April 2026, I spoke on a panel at the Taiwan Cloud & Datacenter Convention about edge AI, why intelligence is moving closer to data, and what that shift means for infrastructure. The conversation covered a lot of ground, but one idea kept surfacing, and it's the one I want to spend some time on here:
In real AI systems, bottlenecks don't disappear. They move.
That sounds like a minor observation. It isn't. It's the thing that determines whether your inference architecture remains viable at scale or quietly collapses under its own weight six months after you ship.
What actually slows down an AI system?
When most teams start building an AI-powered application, they expect the model itself to be the expensive part. They believe that because training is expensive, inference must be too. And since GPUs are expensive, GPU-bound work must be the bottleneck. That intuition is reasonable, and for a while it was correct.
It's no longer correct in most of the systems I've looked at recently.
Case study: Image search
For example, we built an image search application — the kind of thing in which a user uploads a photo and the system returns visually similar results. It's a standard architecture: Generate an embedding for the query, run a vector search against an index, and return the top matches.
When we profiled it, the costs broke down roughly like this:
- Vector search latency: Negligible
- Embedding generation: Dominant cost
- Data movement and preprocessing: Nontrivial
Retrieval was fast. The GPUs were powerful. The model was fine, but the bottleneck had moved to embedding, the step where raw input gets converted into the vector representation the rest of the system operates on.
This pattern isn't unique to image search. Across retrieval-augmented generation (RAG), recommendation systems, multimodal pipelines, and agentic workflows, the story is similar: The parts of the stack we spent a decade optimizing are no longer the slow parts. The slow parts are the ones closest to the data.
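The profiling step in the image search case can be sketched in a few lines of Python. This is a minimal illustration, not our production code: `preprocess`, `embed`, and `search` are hypothetical toy stand-ins, and the only thing worth copying is the pattern of timing each stage separately before deciding where the bottleneck is.

```python
import time
from typing import Callable, Dict, List, Tuple


def profile_stages(stages: List[Tuple[str, Callable]], payload):
    """Run stages in order; return the final output and seconds spent per stage."""
    timings: Dict[str, float] = {}
    for name, fn in stages:
        start = time.perf_counter()
        payload = fn(payload)
        timings[name] = time.perf_counter() - start
    return payload, timings


def dominant_stage(timings: Dict[str, float]) -> str:
    """Name of the stage that consumed the most wall-clock time."""
    return max(timings, key=lambda name: timings[name])


# Hypothetical stand-ins for the real pipeline steps.
def preprocess(raw: bytes) -> List[float]:
    return [b / 255.0 for b in raw]


def embed(pixels: List[float]) -> List[float]:
    # Toy "model": 8 bucketed averages. A real embedding model goes here.
    dim = 8
    step = max(1, len(pixels) // dim)
    return [sum(pixels[i * step:(i + 1) * step]) / step for i in range(dim)]


def search(vec: List[float]) -> List[str]:
    # Brute-force nearest neighbor over a tiny made-up index.
    index = {"a": [0.1] * 8, "b": [0.9] * 8}

    def dist(u: List[float], v: List[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(u, v))

    return sorted(index, key=lambda key: dist(index[key], vec))


pipeline = [("preprocess", preprocess), ("embed", embed), ("search", search)]
matches, timings = profile_stages(pipeline, bytes([230] * 64))
print(matches, dominant_stage(timings))
```

With toy functions like these the timings mean nothing; swap in the real stages and the same harness shows which one dominates.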
Why this matters for where you put compute
Once you accept that bottlenecks move, the follow-up question is the interesting one: Where should the expensive work actually happen?
If embedding is the dominant cost, and embedding operates on raw input data, then the compute wants to be near the data. Moving raw data to a distant region to embed it, then moving the embeddings back, wastes bandwidth, adds latency, and compounds at every step.
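A rough back-of-the-envelope calculation shows the size of the gap. The figures below are illustrative assumptions (a 2 MB uploaded image and a 768-dimension float32 embedding), not measurements from our system.

```python
def embedding_bytes(dim: int, bytes_per_value: int = 4) -> int:
    """Size of one dense embedding vector (float32 by default)."""
    return dim * bytes_per_value


def transfer_ratio(raw_bytes: int, dim: int, bytes_per_value: int = 4) -> float:
    """How many times larger the raw input is than its embedding."""
    return raw_bytes / embedding_bytes(dim, bytes_per_value)


RAW_IMAGE_BYTES = 2_000_000  # assumed: 2 MB upload
EMBEDDING_DIM = 768          # assumed: 768-dimension float32 vector

print(embedding_bytes(EMBEDDING_DIM))                         # 3072
print(round(transfer_ratio(RAW_IMAGE_BYTES, EMBEDDING_DIM)))  # 651
```

Under these assumptions, embedding near the source puts roughly 650 times less data on the long-haul link than shipping the raw image. The exact ratio depends entirely on your payloads and model.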
The traditional drivers for edge computing still apply:
- Latency: Users notice
- Bandwidth efficiency: Moving data costs money
- Data locality: Some data can't or shouldn't leave a region
But the newer driver — and, in my view, the more decisive one for AI — is computational placement. You want the expensive step to happen where it's cheapest to run. And for embedding-heavy workloads, that's close to the source.
This architectural pattern is one of the reasons Akamai Cloud is built the way it is. We have a globally distributed footprint with compute available near where data is generated. This is exactly the shape the workload wants.
Edge and cloud are not competing
One of the more outdated framings in this space is “edge vs. cloud.” Today, that’s the wrong frame.
According to the panelists, the cleaner way to view the market is through a unified, three-layer architecture model:
- Hyperscale cloud for model training: You need massive parallelism and the ability to tolerate long-running jobs.
- Regional data centers for large-scale inference: You need serious compute but latency to the user is only moderately sensitive.
- Edge nodes for real-time, latency-sensitive decisions: The round trip to a region is the limiting factor.
Each layer optimizes for a different constraint.
- Training wants density.
- Regional inference wants throughput.
- Edge inference wants proximity.
You don't pick one; you decide which workloads belong at which layer, and you architect for all three.
The teams that are getting this right aren't asking “cloud or edge?” They're asking “Which parts of this pipeline belong where?” That's a harder question to answer, but it's the correct one.
The infrastructure reality nobody wants to talk about
Here's the part of the panel discussion that got the most nods from the infrastructure people in the room — and the least attention from the AI-framing crowd: Power and cooling are becoming the binding constraint.
Edge sites in traditional edge computing deployments were historically CPU-dominated. GPUs change that. A rack of modern accelerators draws more power and generates more heat than anything most edge facilities were designed for. Water use, which almost nobody was thinking about five years ago, is now a real consideration in certain regions.
The implication is uncomfortable for anyone planning an AI infrastructure strategy around “We'll just build bigger data centers.” You can't build them fast enough, and in some places you can't build them at all. Power connections can take years. Cooling capacity is finite. Permitting is slow.
Distributed architecture to the rescue
A distributed architecture helps here in a way that's easy to miss. When you spread workloads across many smaller facilities rather than concentrating them in a few megasites, you're not only distributing compute but also distributing thermal load, power draw, and water use. Each individual site stays within its envelope. The aggregate capacity scales without any single facility having to triple in size.
This is a genuinely different answer to the capacity problem than building another hyperscale region, and it's one of the few arguments for distributed infrastructure that remains even if you don't care about latency.
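The envelope argument is simple arithmetic, sketched here with made-up numbers. The 3,000 kW total load and 250 kW per-site envelope are assumptions for illustration only.

```python
import math


def sites_needed(total_kw: float, site_envelope_kw: float) -> int:
    """Minimum number of sites so that no site exceeds its power envelope."""
    return math.ceil(total_kw / site_envelope_kw)


def fits(total_kw: float, n_sites: int, site_envelope_kw: float) -> bool:
    """True if an even split of the load keeps every site within its envelope."""
    return total_kw / n_sites <= site_envelope_kw


TOTAL_KW = 3000.0        # assumed aggregate accelerator load
SITE_ENVELOPE_KW = 250.0 # assumed per-site power and cooling envelope

print(sites_needed(TOTAL_KW, SITE_ENVELOPE_KW))  # 12
print(fits(TOTAL_KW, 10, SITE_ENVELOPE_KW))      # False
print(fits(TOTAL_KW, 12, SITE_ENVELOPE_KW))      # True
```

Real sites have different envelopes and loads are never split evenly, so treat this as a way to frame the question, not a capacity plan.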
Framework for workload placement decisions
The abstract version of this is easy: Put the work where it’s cheapest to run. The operational version is harder, because “cheapest” depends on what you're optimizing for on any given day.
In practice, placement for inference workloads comes down to a handful of signals that you have to weigh against one another.
- How latency-sensitive is the request? Does a 200 ms round trip break the user experience, or is nobody going to notice?
- How large is the payload? Are you shipping a short text prompt or a 4K video frame?
- How specialized is the hardware you need? Will any modern NVIDIA GPU do, or do you need a specific accelerator profile that only exists in a few regions?
- How stable is the traffic pattern? Is this a steady load you can provision for, or a spiky load that needs burst capacity somewhere elastic?
No single answer covers all of those questions. An agentic workflow making a dozen sequential model calls has a completely different placement profile than a batch embedding job that’s running overnight. A real-time recommendation query wants an edge node; a quarterly model retraining job wants a hyperscale region with deep accelerator pools.
Most production AI workloads are actually a pipeline of steps, and the right answer is often that different steps in the same pipeline belong in different places.
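One way to make these signals concrete is a small placement function applied to each pipeline step. This is a sketch under stated assumptions: the `Workload` fields, the 200 ms and 1 MB thresholds, and the rule ordering are all hypothetical choices of mine, and traffic stability is left out to keep the example short.

```python
from dataclasses import dataclass
from enum import Enum


class Layer(Enum):
    EDGE = "edge"
    REGIONAL = "regional"
    HYPERSCALE = "hyperscale"


@dataclass(frozen=True)
class Workload:
    name: str
    latency_budget_ms: float
    payload_bytes: int
    needs_specialized_accelerator: bool


# Illustrative thresholds, not recommendations.
TIGHT_LATENCY_MS = 200
LARGE_PAYLOAD_BYTES = 1_000_000


def place(w: Workload) -> Layer:
    """Pick a layer for one pipeline step from three of the signals above."""
    if w.needs_specialized_accelerator:
        # Assumption: scarce hardware overrides the other signals.
        return Layer.HYPERSCALE
    if w.latency_budget_ms < TIGHT_LATENCY_MS or w.payload_bytes > LARGE_PAYLOAD_BYTES:
        # Tight budgets and heavy payloads both favor running near the data.
        return Layer.EDGE
    return Layer.REGIONAL


# Different steps of the same (hypothetical) pipeline land in different places.
pipeline = [
    Workload("embed-query-image", 100, 2_000_000, False),
    Workload("rerank-results", 800, 4_000, False),
    Workload("quarterly-retrain", 3_600_000, 10**12, True),
]
for step in pipeline:
    print(step.name, "->", place(step).value)
```

A real policy would weigh these signals against cost and current capacity instead of applying fixed rules in order, but even a crude version forces the per-step question.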
Placement decisions must change quickly when something shifts
This is where load balancing and orchestration stop being afterthoughts. If you're running inference across a distributed footprint, something has to decide which request goes to which node, route around failures, and shift traffic as regional capacity fluctuates.
The ecosystem of tools for this is still immature. Most teams end up writing a lot of the routing logic themselves, because the off-the-shelf options assume a single region or a single class of hardware. That gap, more than raw compute availability, is what I'd bet determines which distributed AI systems hold up over the next two years.
The providers that do this well won't necessarily be the ones with the most GPUs. They'll be the ones whose scheduling, routing, and observability across sites are good enough that placement decisions can be changed quickly when something shifts, whether that is model size, traffic pattern, or a regional capacity constraint. Flexibility at the control plane is what makes the underlying hardware useful.
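The core routing decision described above (send each request somewhere healthy, and prefer sites with spare capacity) can be sketched in a few lines. The `Node` fields and node names are hypothetical, and a real router would also account for latency to the caller, model availability, and stale health data.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Node:
    name: str
    healthy: bool
    capacity: int   # max concurrent requests the node can serve
    in_flight: int  # requests currently being served

    @property
    def headroom(self) -> float:
        """Fraction of capacity still free (0.0 when full or zero-capacity)."""
        if self.capacity <= 0:
            return 0.0
        return max(0.0, (self.capacity - self.in_flight) / self.capacity)


def pick_node(nodes: Iterable[Node]) -> Optional[Node]:
    """Route around unhealthy or saturated nodes; prefer the most headroom."""
    candidates = [n for n in nodes if n.healthy and n.headroom > 0]
    if not candidates:
        return None  # caller decides: queue, shed load, or fall back to another layer
    return max(candidates, key=lambda n: n.headroom)


fleet = [
    Node("edge-a", healthy=False, capacity=10, in_flight=0),  # failed site
    Node("edge-b", healthy=True, capacity=10, in_flight=8),   # nearly full
    Node("regional-a", healthy=True, capacity=10, in_flight=3),
]
chosen = pick_node(fleet)
print(chosen.name if chosen else "no capacity")  # regional-a
```

Because the choice is recomputed per request from current state, traffic shifts on its own as nodes fail or fill up. That property is the point, more than the specific scoring rule.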
Planning a distributed AI strategy
If you're designing AI infrastructure in 2026, three decisions are going to shape how well your system holds up:
- Inference placement
- Model lifecycle management
- Accelerator utilization
Inference placement
Not every workload belongs at the edge. Some need the throughput of a regional data center. Some need the specialized accelerators of a hyperscale region. The work is figuring out where each piece lives, workload by workload. Based on what I’ve seen, I’d resist the temptation to pick one layer and force everything into it.
Model lifecycle management
Distributed inference makes lifecycle management harder. You need versioning, rollout, rollback, and observability across sites that may have different hardware, different network conditions, and different failure modes. If you're pushing models to hundreds of locations, the orchestration layer is as important as the inference layer.
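As a minimal sketch of what rollout and rollback across sites involve, the function below deploys a model version wave by wave and restores every touched site if a health check fails. The site names, the version strings, and the `healthy` callback are assumptions for illustration; a real system would add canary metrics, timeouts, and partial-failure handling.

```python
from typing import Callable, Dict, List


def staged_rollout(
    sites: Dict[str, str],
    new_version: str,
    waves: List[List[str]],
    healthy: Callable[[str, str], bool],
) -> bool:
    """Deploy wave by wave. On a failed health check, roll back all touched sites."""
    previous: Dict[str, str] = {}
    for wave in waves:
        for site in wave:
            previous[site] = sites[site]  # remember what to restore
            sites[site] = new_version
        if not all(healthy(site, new_version) for site in wave):
            for site, old_version in previous.items():
                sites[site] = old_version
            return False
    return True


# Hypothetical fleet: one canary site first, then the rest.
fleet = {"site-a": "v1", "site-b": "v1", "site-c": "v1"}
waves = [["site-a"], ["site-b", "site-c"]]

# Pretend site-c has hardware that the new version fails on.
ok = staged_rollout(fleet, "v2", waves, healthy=lambda site, version: site != "site-c")
print(ok, fleet)  # False {'site-a': 'v1', 'site-b': 'v1', 'site-c': 'v1'}
```

The failure surfaces only in the second wave, which is the distributed-fleet problem in miniature: a version that passes on one site's hardware can still fail on another's.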
Accelerator utilization
Once a model is loaded and warm, inference is fast. The hard problem stops being “Can we run this?” and becomes “Are we actually using the hardware we're paying for?” Utilization is a scheduling problem, a routing problem, and a workload-shaping problem, and it's where a lot of the cost optimization in the next few years is going to come from.
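To see why utilization dominates the cost picture, it helps to divide the hourly price by the fraction of time the accelerator is doing useful work. The price and busy time below are invented for illustration.

```python
def utilization(busy_seconds: float, window_seconds: float) -> float:
    """Fraction of the window the accelerator spent serving requests."""
    return busy_seconds / window_seconds


def effective_hourly_cost(hourly_price: float, util: float) -> float:
    """What one hour of actual useful work costs at a given utilization."""
    if util <= 0:
        raise ValueError("utilization must be positive")
    return hourly_price / util


HOURLY_PRICE = 2.00  # assumed price per accelerator-hour
BUSY_SECONDS = 900   # assumed: busy 15 minutes out of each hour

util = utilization(BUSY_SECONDS, 3600)
print(util)                                       # 0.25
print(effective_hourly_cost(HOURLY_PRICE, util))  # 8.0
```

In this made-up case, an accelerator that is busy a quarter of the time costs four times its list price per useful hour.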
None of these decisions are solved yet. All three are active areas of engineering, for us and for everyone else in this space.
One last thought
If you take away nothing else from this blog post, take this: The single most useful question to ask about your AI system is not “Is it fast enough?” It's “Where is the bottleneck, and am I handling it in the right place?”
Bottlenecks are going to move. They always do. The teams that stay ahead are the ones who keep asking where the bottleneck is now, and have the architectural flexibility to move the work when the answer changes.
That's harder than picking a cloud region and hoping for the best. It's also, as far as I can tell, the only approach that holds up.
The practical work in this space is less glamorous than the AI inference headlines suggest.
- It's routing decisions about which nodes should run inference for which requests.
- It's figuring out when parallelism across regional data centers beats a single larger deployment, and when it doesn't.
- It's deciding which workloads genuinely need real-time response at the edge and which are fine with a regional round trip.
Conclusion
Distributed inference, in practice, is a lot of small decisions about placement, scheduling, and failure modes, made over and over as traffic patterns and model sizes shift. The teams that treat those decisions as a first-class engineering problem, rather than something the cloud provider will figure out, are the ones whose systems still work at scale a year from now.
Learn more
If you're working on distributed inference architectures and want to dig into specifics such as GPU placement options, edge-native serverless patterns, and reference architectures for embedding-heavy pipelines, I’d suggest exploring Akamai Inference Cloud.