The agentic web is a shift from static “click, fetch, render” experiences to applications where intelligent agents retrieve information, plan multi-step workflows, execute actions, and collaborate with other agents to deliver outcomes.
Key takeaways:
- Agentic applications are latency-bound. When results depend on dozens or hundreds of chained micro-inferences, small delays compound into brittle user experiences.
- Latency has measurable business impact. Akamai analytics indicate that as little as 10–15 ms of added delay can increase abandonment in critical retail workflows.
- Inference, not training, is becoming the dominant AI workload. Inference is continuous and tied to user interactions, while training is bursty and cyclical.
- Akamai Cloud for Inference uses a three-layer architecture: centralized AI factories for training and heavyweight inference, distributed GPUs for real-time inference near users, and an edge routing and security layer to evaluate, secure, and route requests.
- Distributed GPUs reduce long-haul network travel and improve concurrency. Proximity compute enables millisecond-level responsiveness for latency-sensitive inference.
- The edge routing and security layer protects expensive GPU capacity. It validates and classifies requests, filters threats and bots, and routes to the optimal GPU location based on latency, cost, and availability.
- Real-world media workflows show why proximity matters. Examples include 8K VR broadcasting and near-real-time video decisions (~35 ms) enabled by distributed inference.
Architecting the Agentic Web
Frequently Asked Questions (FAQ)
Why are agentic applications so latency-sensitive?
Agentic experiences often depend on dozens or hundreds of chained micro-inferences per session. Even small delays stack up, making experiences slow and brittle.
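The compounding effect is simple to quantify. The sketch below is illustrative only (not Akamai code, and the hop counts are assumptions): when each micro-inference must wait on the previous one, per-hop delay multiplies across the chain.

```python
# Illustrative sketch: how per-hop latency compounds across a chain of
# dependent micro-inferences in a single agentic session.

def session_latency_ms(num_inferences: int, per_hop_ms: float) -> float:
    """Total added latency when each inference waits on the previous one."""
    return num_inferences * per_hop_ms

# 50 chained micro-inferences at an extra 15 ms each add 750 ms to the
# session; trimming each hop to 5 ms keeps the same chain under 250 ms.
print(session_latency_ms(50, 15))  # 750.0
print(session_latency_ms(50, 5))   # 250.0
```

The point of the toy model is that a delay too small to notice on one request becomes user-visible once dozens of dependent calls sit on the critical path.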
How does inference differ from training?
Training is computationally heavy and bursty, typically run in discrete cycles. Inference is continuous, driven by user interactions, and can involve multiple dependent calls per engagement.
What is Akamai Cloud for Inference?
It is an infrastructure approach designed for real-time, distributed, latency-sensitive inference at global scale, using highly distributed GPUs combined with edge-native decisioning.
What are the three layers of the architecture?
- Centralized AI factories for training, fine-tuning, and heavyweight or “one-shot” inference.
- A distributed GPU layer near users for real-time, latency-sensitive inference.
- An edge routing and security layer to evaluate, secure, and route requests before they reach GPUs.
What does the edge routing and security layer do?
It validates and classifies incoming requests, filters threats and bots, handles token security and privacy-sensitive traffic, and routes requests to the best GPU location based on latency, cost, and availability.
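The decision logic described above can be sketched as a small routing function. Everything here is a hypothetical illustration: the class names, scoring weights, and bot flag are assumptions, not Akamai's implementation.

```python
# Hypothetical sketch of an edge routing decision: reject invalid or bot
# traffic before it consumes GPU capacity, then pick the best available
# location by a weighted latency/cost score. Names and weights are invented.
from dataclasses import dataclass
from typing import Optional, List

@dataclass
class GpuLocation:
    name: str
    latency_ms: float     # estimated round trip from the requesting edge
    cost_per_call: float  # relative cost of serving one inference here
    available: bool

def route(request_valid: bool, is_bot: bool, locations: List[GpuLocation],
          latency_weight: float = 1.0, cost_weight: float = 100.0) -> Optional[GpuLocation]:
    """Filter at the edge, then score remaining locations (lower is better)."""
    if not request_valid or is_bot:
        return None  # filtered at the edge; never reaches a GPU
    candidates = [loc for loc in locations if loc.available]
    if not candidates:
        return None
    return min(candidates,
               key=lambda loc: latency_weight * loc.latency_ms
                               + cost_weight * loc.cost_per_call)

locations = [
    GpuLocation("central-factory", latency_ms=80.0, cost_per_call=0.01, available=True),
    GpuLocation("metro-edge", latency_ms=12.0, cost_per_call=0.02, available=True),
    GpuLocation("cheapest", latency_ms=10.0, cost_per_call=0.001, available=False),
]
print(route(True, False, locations).name)  # metro-edge
print(route(True, True, locations))        # None (bot traffic filtered)
```

The design point the sketch captures is ordering: cheap edge checks (validation, bot filtering) run first, so expensive GPU capacity only ever sees traffic worth serving.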
Why does GPU proximity matter?
Placing GPUs near population centers reduces latency, increases concurrency, and minimizes long-haul network travel, which is critical for real-time inference and agentic orchestration.
Which workloads are a good fit?
Workloads that need real-time responsiveness and run close to users or data, including agentic workflows, multimodal applications, and demanding media/video intelligence scenarios.
How much does latency affect business outcomes?
Akamai platform analytics suggest that 10–15 ms of added delay can increase abandonment during critical retail workflows, an effect that becomes more pronounced when micro-inferences are chained.
What does the adoption roadmap look like?
It outlines phases: distributed inference enablement first, then real-time multimodal intelligence, then fully agentic applications that can retrieve data, plan tasks, and collaborate with other agents.