
Agentic Disconnect: The Latency Crisis Facing Modern AI Architecture

Jon Alexander

Jun 24, 2026

Written by

Jon Alexander

Jon Alexander is Senior Vice President of Product for the Cloud Technology Group at Akamai. He is responsible for the strategy, roadmap, and success of the cloud computing and delivery products. Jon joined Akamai in 2017 and led various product teams inside Akamai, starting within the media division. Previously, he worked in several roles focused on building large-scale internet infrastructure. Jon spent 10 years running the media business at one of the world’s largest telecommunications carriers and has led product teams at start-ups as they defined, launched, and grew new solutions. He is passionate about fostering innovation and building customer-centric product teams. He holds a Master of Arts degree and a Master of Engineering degree from Cambridge University.


The technology sector has reached a critical architectural impasse. While many stakeholders envision a future of autonomous AI agents that anticipate needs in real time, the underlying infrastructure remains tethered to a centralized topology that makes this vision physically impossible.

The industry markets instantaneous machine intelligence while delivering it over a framework built for human patience — a fundamental mismatch that threatens the viability of the next generation of enterprise AI.

For three decades, the cloud was optimized for the human threshold, comfortably absorbing the 100-ms delays inherent in cross-continental data transit. But an AI agent does not operate on a human clock.

How multi-agent AI frameworks compound latency 

Discussions around AI latency often focus on model optimization — the time required for a GPU to generate a token. Existing large language model (LLM)–serving engines optimize individual calls in isolation, while multi-agent frameworks focus on orchestration without system-level performance planning. 

CPU overhead and GPU idle time 

Because neither layer plans for system-level performance, repeated prompts, overlapping contexts, and fragmented CPU-GPU execution create substantial redundancy and poor hardware utilization. Research has shown that CPU-side processing accounts for up to 90.6% of total latency and 44% of total dynamic energy in agentic workloads. The more dangerous bottleneck, in other words, lies not in token generation but within the agentic layer itself.

Modern agents combine multistep reasoning, heterogeneous tool use, and collaboration across multiple specialized agents. GPUs sit idle while long-running tool calls on CPUs dominate execution latency, often causing sharp spikes in time to first token (TTFT) once the key-value (KV) cache has been evicted. When each machine-to-machine call requires a round trip to a distant data center, transit time stacks aggressively.

According to Akamai’s The State of AI Inference 2026 report, a single workflow requiring 50 sequential calls quickly incurs seconds of transport latency. That is enough to render AI applications unusable in production for the 82% of organizations whose critical use cases require end-to-end response times of 500 ms or less.

The enterprise market is largely unprepared for this reality, and even tighter constraints are emerging: 64% of operators are now targeting 250 ms or less.
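The arithmetic behind these figures can be sketched in a few lines. The 50-call count and the 500 ms and 250 ms budgets come from the text above, and the 100 ms round trip reuses the cross-continental figure cited earlier. The 5 ms edge round trip and the function names are illustrative assumptions, and the model counts transport time only, ignoring inference and tool execution.

```python
# Back-of-the-envelope transport latency for a sequential agent workflow.
# Only transit time is modeled; compute time is deliberately ignored.


def transport_latency_ms(sequential_calls: int, round_trip_ms: float) -> float:
    """Total transit time when each call must wait for the previous one."""
    return sequential_calls * round_trip_ms


def max_round_trip_ms(budget_ms: float, sequential_calls: int) -> float:
    """Largest per-call round trip that still fits the end-to-end budget."""
    return budget_ms / sequential_calls


if __name__ == "__main__":
    calls = 50  # sequential calls in one workflow, per the text

    # Round-trip times: 100 ms from the text, 5 ms is an assumed edge value.
    for label, rtt in [("distant data center, 100 ms", 100.0),
                       ("nearby edge node, assumed 5 ms", 5.0)]:
        total = transport_latency_ms(calls, rtt)
        print(f"{label}: {total:.0f} ms of transport latency")

    # How little transit time each call may spend to stay inside the budget.
    for budget in (500.0, 250.0):
        per_call = max_round_trip_ms(budget, calls)
        print(f"{budget:.0f} ms budget: at most {per_call:.0f} ms per call")
```

Under these assumptions, 50 round trips at 100 ms come to 5,000 ms, ten times the 500 ms budget, while staying inside the budget leaves at most 10 ms of transit per call (5 ms for the 250 ms target).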

The structural bottleneck

The persistence of centralized compute fortresses is not a technical necessity, but a legacy business incentive. Hyperscaler models prioritize centralized storage to protect established revenue streams, which creates a bottleneck that hybrid “edge extensions” fail to solve.

These centralized systems work well for monolithic applications where users and data are colocated. However, for production agents where users are distributed, data is localized, and external APIs are regional, the centralized model fails. Centralization does not guarantee security; it guarantees delay. True Zero Trust security requires enforcement at the point of execution, not at a distant cluster.

By defending an infrastructure physically too slow for the applications they claim to enable, incumbents force enterprises into a cycle of overprovisioning and unpredictable performance. This is no longer sustainable.

The path forward requires a compute continuum. We must build a seamless spectrum of intelligence that spans the distance from the massive centralized core to the exact millimeter of the network closest to the user. Distributed execution is the only way to bridge the agentic disconnect.

Decoupling brains from hands

The industry’s current fixation on the GPU overlooks a critical distinction: GPUs provide raw intelligence, but they do not provide execution. While deep reasoning belongs in the core, the operational orchestration of an agent belongs at the edge.

Calling tools, reading local files, and executing code securely do not require a hypercluster of GPUs. They require high-performance CPUs connected by a high-speed private fiber backbone.

An optimized blueprint for the agentic era relies on a clear division of labor:

  • Centralized core: Reserves massive resources for heavy reasoning and foundation model training

  • Regional edge: Deploys distributed GPU clusters for localized, low-latency inference

  • Distributed edge: Uses high-performance CPUs to execute tools and handle real-time data orchestration

In this topology, the GPU brings the intelligence, while the edge infrastructure coordinates the action. The edge turns raw model outputs into functional agentic behavior.
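The division of labor above can be expressed as a simple placement table. This is a minimal sketch rather than a real scheduler: the step kinds (`training`, `heavy_reasoning`, `inference`, `tool_call`, `data_orchestration`) and the `place` function are hypothetical names chosen to mirror the three bullets.

```python
from enum import Enum


class Tier(Enum):
    CENTRALIZED_CORE = "centralized core"
    REGIONAL_EDGE = "regional edge"
    DISTRIBUTED_EDGE = "distributed edge"


# Hypothetical step kinds mapped to the tier the blueprint assigns them to.
PLACEMENT = {
    "training": Tier.CENTRALIZED_CORE,
    "heavy_reasoning": Tier.CENTRALIZED_CORE,
    "inference": Tier.REGIONAL_EDGE,
    "tool_call": Tier.DISTRIBUTED_EDGE,
    "data_orchestration": Tier.DISTRIBUTED_EDGE,
}


def place(step_kind: str) -> Tier:
    """Return the tier where a workflow step of this kind should run."""
    try:
        return PLACEMENT[step_kind]
    except KeyError:
        raise ValueError(f"unknown step kind: {step_kind!r}") from None


if __name__ == "__main__":
    # An illustrative agent workflow: plan, act, answer, act again.
    workflow = ["heavy_reasoning", "tool_call", "inference", "tool_call"]
    for step in workflow:
        print(f"{step} -> {place(step).value}")
```

A production system would presumably weigh load, data locality, and cost as well. The static table only makes the point that tool calls and orchestration never need to leave the distributed edge.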

Solving the problem of the “World Wide Wait”

We have seen this play out before. In the late 1990s, the early internet suffered from the “World Wide Wait” because it relied solely on centralized web servers. The industry didn't solve the problem by making centralized servers bigger; it solved it by creating content delivery networks (CDNs) like Akamai to push content to the edge. 

We face the exact same inflection point today, but we are distributing compute rather than content.

Three architectural requirements for modern AI agents

This shift toward distributed compute will manifest first in high-value enterprise applications: personalization, real-time financial systems, and localized fraud detection. These are the exact use cases where latency directly impacts revenue per visitor and customer satisfaction.

When these applications fail under production loads, the conversation will shift. The important metric will no longer be which model has the highest benchmark score, but whose architecture functions in the physical world.

Three architectural recommendations drive success for modern agents:

  1. Dynamic tiered memory: Use distributed memory tiers to offload the KV cache gracefully from HBM to DRAM to NVMe and to resume seamlessly

  2. Adaptive co-scheduling: Orchestrate fragmented handoffs between distributed CPU tool workers and regional GPU inference nodes to shorten end-to-end critical paths

  3. Distributed execution: Spread LLM execution shards and tool calls across diverse, decentralized compute nodes to maximize concurrent throughput and minimize latency 
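The first recommendation, tiered KV cache offload, can be illustrated with a toy cache. This is a sketch under stated assumptions, not a description of any particular serving engine: the `TieredKVCache` class, its per-tier entry counts, and the least-recently-used demotion policy are all hypothetical. Each tier is a plain in-memory dictionary standing in for HBM, DRAM, or NVMe.

```python
from collections import OrderedDict

TIERS = ("HBM", "DRAM", "NVMe")  # ordered fastest to slowest


class TieredKVCache:
    """Toy KV cache that demotes least-recently-used entries tier by tier."""

    def __init__(self, capacities):
        # capacities: max entries per tier, e.g. {"HBM": 2, "DRAM": 4, "NVMe": 8}
        self.capacities = capacities
        self.tiers = {name: OrderedDict() for name in TIERS}

    def put(self, session_id, kv_blob):
        """Store a session's KV state in the fastest tier."""
        for name in TIERS:
            self.tiers[name].pop(session_id, None)
        self._insert(0, session_id, kv_blob)

    def get(self, session_id):
        """Fetch KV state and promote it back to the fastest tier."""
        for name in TIERS:
            if session_id in self.tiers[name]:
                kv_blob = self.tiers[name].pop(session_id)
                self._insert(0, session_id, kv_blob)
                return kv_blob
        return None  # fully evicted: the caller must recompute the prefill

    def tier_of(self, session_id):
        """Name of the tier currently holding the session, or None."""
        for name in TIERS:
            if session_id in self.tiers[name]:
                return name
        return None

    def _insert(self, level, session_id, kv_blob):
        if level >= len(TIERS):
            return  # fell off the slowest tier, so the entry is dropped
        tier = self.tiers[TIERS[level]]
        tier[session_id] = kv_blob
        if len(tier) > self.capacities[TIERS[level]]:
            victim, blob = tier.popitem(last=False)  # least recently used
            self._insert(level + 1, victim, blob)


if __name__ == "__main__":
    cache = TieredKVCache({"HBM": 1, "DRAM": 1, "NVMe": 1})
    for session in ("a", "b", "c"):
        cache.put(session, f"kv-state-{session}")
    for session in ("a", "b", "c"):
        print(session, "is in", cache.tier_of(session))
```

The point of the design is that a session paused during a long tool call is pushed down a tier instead of being discarded, so resuming it costs a copy back into HBM rather than a full recomputation and the TTFT spike that comes with it.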

Architect for distribution

Although the industry focuses heavily on token pricing, it is the smart utilization of physically distributed GPU and CPU assets that dictates viability. As AI workloads transition to interconnected systems, success belongs to the leaders who refuse to let the limits of centralized infrastructure dictate the speed of their business. The time to architect for distribution is now.
