This summary distills the Akamai Inference Cloud launch into the technical and operational details enterprise IT and MLOps leaders typically evaluate: architecture, performance, integrations, security, availability, and how to start.
Akamai launched Akamai Inference Cloud — a distributed platform that extends AI inference from core data centers to the edge. It combines NVIDIA’s latest accelerated compute stack with Akamai’s globally distributed platform to deliver low-latency, real-time inference for agentic systems, streaming decisioning, and physical AI. Initial GPU regions are rolling out across 20 locations, with broader expansion planned. The platform operates alongside Akamai’s global edge network of more than 4,200 locations worldwide.
Read the launch details in the official press release and platform overview:
- See the press release announcement on this page
- Explore the Akamai Inference Cloud product page
Built for agentic, real-time AI
- Distributed inference: Routes requests to the closest suitable GPU region to reduce latency and improve time-to-first-token (TTFT) and tokens-per-second (TPS) for interactive workloads (see the measurement sketch after this list). See the architectural rationale in AI: Edge Is All You Need.
- Agentic patterns: Designed for multi-step, tool-using agents that need memory, orchestration, and predictable low latency across sequences of calls.
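TTFT and TPS are the numbers most teams benchmark first when deciding whether a distributed inference setup meets their interactivity targets. As a minimal sketch, assuming the endpoint exposes an OpenAI-compatible streaming chat API (the base URL, API key, and model name below are placeholders, not confirmed product values), you could measure both from the client side:

```python
import time
from openai import OpenAI  # pip install openai

# Placeholders: substitute your actual regional endpoint, token, and deployed model.
client = OpenAI(base_url="https://<your-gpu-region>.example.com/v1", api_key="<token>")

def measure_ttft_tps(prompt: str, model: str = "<deployed-model>") -> tuple[float, float]:
    """Return (time-to-first-token in seconds, tokens-per-second) for one streamed completion."""
    start = time.perf_counter()
    first_token_at = None
    chunks = 0

    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for event in stream:
        delta = event.choices[0].delta.content if event.choices else None
        if delta:
            if first_token_at is None:
                first_token_at = time.perf_counter()
            chunks += 1  # rough proxy: one streamed chunk is approximately one token

    end = time.perf_counter()
    ttft = (first_token_at or end) - start
    tps = chunks / (end - first_token_at) if first_token_at and end > first_token_at else 0.0
    return ttft, tps

if __name__ == "__main__":
    ttft, tps = measure_ttft_tps("Summarize edge inference in one sentence.")
    print(f"TTFT: {ttft * 1000:.0f} ms, ~{tps:.1f} tokens/s")
```

Counting chunks only approximates token throughput; for exact figures, use the server's usage report or run the output through the model's tokenizer.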
NVIDIA accelerated infrastructure
- GPUs: NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs, available as single-GPU instances or scaled inference clusters.
- DPUs: NVIDIA BlueField-3 DPUs for accelerated, secure data movement; roadmap includes BlueField-4.
- Software: Platform is designed for NVIDIA AI Enterprise, including NIM inference microservices and NeMo microservices.
Compute, storage, and data
- Elastic profiles: From hourly single-GPU rentals to high-performance clusters (up to 8 GPUs, 128 vCPUs, 1,472 GB DRAM, 8,192 GB NVMe) for high-concurrency inference.
- Storage: High-performance block and object storage, plus managed vector databases to support RAG and agent memory (see the retrieval sketch after this list).
- Networking: VPC and private networking with global routing and traffic control at the edge.
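The managed vector database item underpins RAG and agent memory. The product's client API isn't detailed here, so the snippet below is a purely conceptual sketch of the retrieval step such databases accelerate: store embeddings, then return the nearest chunks for a query. The embed() stub is a placeholder for whatever embedding model you deploy; a managed service replaces the brute-force search with approximate nearest-neighbor indexes.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: in practice, call your deployed embedding model here."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=384)
    return v / np.linalg.norm(v)

# "Agent memory": chunks of prior context or documents, pre-embedded.
corpus = ["refund policy for EU customers", "GPU region rollout schedule", "API rate limits"]
index = np.stack([embed(c) for c in corpus])  # shape: (n_chunks, dim)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Cosine-similarity search; vectors are unit-norm, so a dot product is the cosine."""
    q = embed(query)
    scores = index @ q
    top = np.argsort(scores)[::-1][:k]
    return [corpus[i] for i in top]

print(retrieve("when do new GPU locations open?"))
```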
Developer and platform operations
- Managed Kubernetes: Akamai’s LKE (Linode Kubernetes Engine) for portable, conformant K8s.
- App Platform: A pre-engineered, cloud-native stack to deploy LLMs, agents, and knowledge bases, with pre-integrated components such as KServe, Kubeflow Pipelines, vLLM, and OpenAI-compatible APIs (see the client sketch after this list). See the App Platform overview.
- Edge/serverless: Functions at the edge for traffic steering, policy enforcement, and lightweight pre/post-processing.
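The App Platform bullet lists vLLM and OpenAI-compatible APIs among the pre-integrated components. A sketch of what that compatibility buys you: any plain HTTP client can talk to the served model without a vendor SDK. The endpoint URL, token, and model name below are placeholders; the exact route depends on how your cluster exposes the service.

```python
import requests

# Placeholders: substitute the route your model service exposes and your credentials.
ENDPOINT = "https://llm.example-cluster.internal/v1/chat/completions"
HEADERS = {"Authorization": "Bearer <token>", "Content-Type": "application/json"}

payload = {
    "model": "<deployed-model>",  # as registered with the vLLM/KServe deployment
    "messages": [{"role": "user", "content": "Classify this ticket: 'card declined twice'"}],
    "max_tokens": 64,
    "temperature": 0.0,  # keep output stable for classification-style calls
}

resp = requests.post(ENDPOINT, json=payload, headers=HEADERS, timeout=30)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```

Because the request and response shapes match the OpenAI wire format, the same client code works whether the model is hosted here or behind another compatible gateway, which is the portability point raised later in the data and integrations questions.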
Security and governance
- Model- and API-aware controls: Firewall for AI, App & API Protector, API Security, and bot & abuse controls to mitigate scraping, prompt injection, model abuse, and DDoS.
- Zero Trust segmentation: Akamai Guardicore Segmentation for granular east-west controls around model-serving infrastructure.
If you’re optimizing for consistent sub-100 ms end-to-end experiences, proximity routing plus GPU/DPU acceleration and edge traffic management are the primary levers available in this platform.
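Proximity routing is handled by the platform, but it is easy to see why it matters for a 100 ms budget: round-trip time alone can consume most of it. The sketch below probes a few hypothetical regional endpoints (the URLs are illustrative, not real product endpoints) and reports the lowest observed RTT, which is roughly the floor the network imposes before any GPU work starts.

```python
import time
import requests

# Illustrative regional endpoints; in practice the platform's routing picks these for you.
REGIONS = {
    "us-east": "https://us-east.inference.example.com/healthz",
    "eu-central": "https://eu-central.inference.example.com/healthz",
    "ap-south": "https://ap-south.inference.example.com/healthz",
}

def probe(url: str, attempts: int = 3) -> float:
    """Return the best observed round-trip time in milliseconds across a few attempts."""
    best = float("inf")
    for _ in range(attempts):
        start = time.perf_counter()
        try:
            requests.get(url, timeout=2)
        except requests.RequestException:
            continue
        best = min(best, (time.perf_counter() - start) * 1000)
    return best

latencies = {region: probe(url) for region, url in REGIONS.items()}
closest = min(latencies, key=latencies.get)
print(f"Lowest RTT: {closest} at {latencies[closest]:.0f} ms of the ~100 ms budget")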
For background on how inference differs from training and why proximity matters, see the glossary guide: What is AI inferencing?
Architecture and performance
- What are your TTFT/TPS targets per use case? Can proximity routing meet them across your user geos?
- Do you need single-tenant clusters for predictable concurrency or shared pools for burst?
- Will you use NIM/NeMo for model serving, or KServe/vLLM on K8s?
Security and governance
- How will you mitigate prompt injection, scraping, and model abuse? Which edge controls (WAF/WAAP, Firewall for AI, bot management) are required on day one?
- What’s your segmentation strategy for inference services and data paths (e.g., Guardicore microsegmentation)?
Data and integrations
- Which vector DB and storage tiers do your RAG/agent workloads require? What are the data locality constraints?
- Do you need OpenAI-compatible APIs for portability while migrating to your own models?
Operations and cost
- How will you instrument observability for latency, quality, and cost per token/call? (A minimal cost-accounting sketch follows this list.)
- What does your GPU mix look like across dev/test/prod and regions? How will you manage quotas, rate limits, and backpressure?
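Cost per call is simple arithmetic once you log token usage alongside latency. The rates below are made-up placeholders, not Akamai pricing; substitute your own contract terms.

```python
from dataclasses import dataclass

# Placeholder rates for illustration only; substitute your negotiated pricing.
PRICE_PER_1K_PROMPT_TOKENS = 0.0005
PRICE_PER_1K_COMPLETION_TOKENS = 0.0015

@dataclass
class CallRecord:
    prompt_tokens: int
    completion_tokens: int
    latency_ms: float

    @property
    def cost_usd(self) -> float:
        return (self.prompt_tokens / 1000) * PRICE_PER_1K_PROMPT_TOKENS + \
               (self.completion_tokens / 1000) * PRICE_PER_1K_COMPLETION_TOKENS

# Example records as they might be emitted by your serving layer's logs.
calls = [CallRecord(812, 164, 93.0), CallRecord(1430, 402, 188.0)]
total_cost = sum(c.cost_usd for c in calls)
budget_misses = sum(c.latency_ms > 100 for c in calls)
print(f"{len(calls)} calls, ${total_cost:.4f} total, {budget_misses} over the 100 ms budget")
```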
If you’re evaluating how to move inference from core to edge without adding complexity, Akamai’s combination of NVIDIA Blackwell, distributed routing, Kubernetes-native tooling, and AI-aware security gives you the main primitives to hit latency, scale, and governance targets for modern agentic applications.