The agentic web is a shift from static “click, fetch, render” experiences to applications where intelligent agents retrieve information, plan multi-step workflows, execute actions, and collaborate with other agents to deliver outcomes.
Key takeaways:
- Agentic applications are latency-bound. When results depend on dozens or hundreds of chained micro-inferences, small delays compound into brittle user experiences.
- Latency has measurable business impact. Akamai analytics indicate that as little as 10–15 ms of added delay can increase abandonment in critical retail workflows.
- Inference, not training, is becoming the dominant AI workload. Inference is continuous and tied to user interactions, while training is bursty and cyclical.
- Akamai Cloud for Inference uses a three-layer architecture: centralized AI factories for training and heavyweight inference, distributed GPUs for real-time inference near users, and an edge routing and security layer to evaluate, secure, and route requests.
- Distributed GPUs reduce long-haul network travel and improve concurrency. Proximity compute enables millisecond-level responsiveness for latency-sensitive inference.
- The edge routing and security layer protects expensive GPU capacity. It validates and classifies requests, filters threats and bots, and routes to the optimal GPU location based on latency, cost, and availability.
- Real-world media workflows show why proximity matters. Examples include 8K VR broadcasting and near-real-time video decisions (~35 ms) enabled by distributed inference.
Architecting the Agentic Web
Frequently Asked Questions (FAQ)
Why are agentic applications so latency-sensitive?
Agentic experiences often depend on dozens or hundreds of chained micro-inferences per session. Even small delays stack up, making experiences slow and brittle.
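The compounding effect is simple to quantify. The sketch below is illustrative only (not Akamai code, and the hop counts are assumptions): when each micro-inference must wait on the previous one, per-hop delay multiplies across the chain.

```python
# Illustrative sketch: how per-hop latency compounds across a chain of
# dependent micro-inferences in a single agentic session.

def session_latency_ms(num_inferences: int, per_hop_ms: float) -> float:
    """Total added latency when each inference waits on the previous one."""
    return num_inferences * per_hop_ms

# 50 chained micro-inferences at an extra 15 ms each add 750 ms to the
# session; trimming each hop to 5 ms keeps the same chain under 250 ms.
print(session_latency_ms(50, 15))  # 750.0
print(session_latency_ms(50, 5))   # 250.0
```

The point of the toy model is that a delay too small to notice on one request becomes user-visible once dozens of dependent calls sit on the critical path.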
How does inference differ from training?
Training is computationally heavy and bursty, typically run in discrete cycles. Inference is continuous, driven by user interactions, and can involve multiple dependent calls per engagement.
What is Akamai Cloud for Inference?
It is an infrastructure approach designed for real-time, distributed, latency-sensitive inference at global scale, using highly distributed GPUs combined with edge-native decisioning.
What are the three layers of the architecture?
- Centralized AI factories for training, fine-tuning, and heavyweight or “one-shot” inference.
- A distributed GPU layer near users for real-time, latency-sensitive inference.
- An edge routing and security layer to evaluate, secure, and route requests before they reach GPUs.
What does the edge routing and security layer do?
It validates and classifies incoming requests, filters threats and bots, handles token security and privacy-sensitive traffic, and routes requests to the best GPU location based on latency, cost, and availability.
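The decision logic described above can be sketched as a small routing function. Everything here is a hypothetical illustration: the class names, scoring weights, and bot flag are assumptions, not Akamai's implementation.

```python
# Hypothetical sketch of an edge routing decision: reject invalid or bot
# traffic before it consumes GPU capacity, then pick the best available
# location by a weighted latency/cost score. Names and weights are invented.
from dataclasses import dataclass
from typing import Optional, List

@dataclass
class GpuLocation:
    name: str
    latency_ms: float     # estimated round trip from the requesting edge
    cost_per_call: float  # relative cost of serving one inference here
    available: bool

def route(request_valid: bool, is_bot: bool, locations: List[GpuLocation],
          latency_weight: float = 1.0, cost_weight: float = 100.0) -> Optional[GpuLocation]:
    """Filter at the edge, then score remaining locations (lower is better)."""
    if not request_valid or is_bot:
        return None  # filtered at the edge; never reaches a GPU
    candidates = [loc for loc in locations if loc.available]
    if not candidates:
        return None
    return min(candidates,
               key=lambda loc: latency_weight * loc.latency_ms
                               + cost_weight * loc.cost_per_call)

locations = [
    GpuLocation("central-factory", latency_ms=80.0, cost_per_call=0.01, available=True),
    GpuLocation("metro-edge", latency_ms=12.0, cost_per_call=0.02, available=True),
    GpuLocation("cheapest", latency_ms=10.0, cost_per_call=0.001, available=False),
]
print(route(True, False, locations).name)  # metro-edge
print(route(True, True, locations))        # None (bot traffic filtered)
```

The design point the sketch captures is ordering: cheap edge checks (validation, bot filtering) run first, so expensive GPU capacity only ever sees traffic worth serving.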
Why does GPU proximity matter?
Placing GPUs near population centers reduces latency, increases concurrency, and minimizes long-haul network travel, which is critical for real-time inference and agentic orchestration.
Which workloads are a good fit?
Workloads that need real-time responsiveness and run close to users or data, including agentic workflows, multimodal applications, and demanding media/video intelligence scenarios.
How much does latency affect business outcomes?
Akamai platform analytics suggest that 10–15 ms of added delay can increase abandonment during critical retail workflows, an effect that becomes more pronounced when micro-inferences are chained.
What does the adoption roadmap look like?
It outlines phases: distributed inference enablement first, then real-time multimodal intelligence, then fully agentic applications that can retrieve data, plan tasks, and collaborate with other agents.