This summary distills the Akamai Inference Cloud launch into the technical and operational details enterprise IT and MLOps leaders typically evaluate: architecture, performance, integrations, security, availability, and how to start.
Akamai launched Akamai Inference Cloud — a distributed platform that extends AI inference from core data centers to the edge. It combines NVIDIA’s latest accelerated compute stack with Akamai’s globally distributed platform to deliver low-latency, real-time inference for agentic systems, streaming decisioning, and physical AI. Initial GPU regions are rolling out across 20 locations, with broader expansion planned. The platform operates alongside Akamai’s global edge network of more than 4,200 locations worldwide.
Read the launch details in the official press release and platform overview:
- See the press release announcement on this page
- Explore the Akamai Inference Cloud product page
Built for agentic, real-time AI
- Distributed inference: Routes requests to the closest suitable GPU region to reduce latency and improve time-to-first-token (TTFT) and tokens-per-second (TPS) for interactive workloads (see the measurement sketch after this list). See the architectural rationale in AI: Edge Is All You Need.
- Agentic patterns: Designed for multi-step, tool-using agents that need memory, orchestration, and predictable low latency across sequences of calls.
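TTFT and TPS are the numbers most teams benchmark first when deciding whether a distributed inference setup meets their interactivity targets. As a minimal sketch, assuming the endpoint exposes an OpenAI-compatible streaming chat API (the base URL, API key, and model name below are placeholders, not confirmed product values), you could measure both from the client side:

```python
import time
from openai import OpenAI  # pip install openai

# Placeholders: substitute your actual regional endpoint, token, and deployed model.
client = OpenAI(base_url="https://<your-gpu-region>.example.com/v1", api_key="<token>")

def measure_ttft_tps(prompt: str, model: str = "<deployed-model>") -> tuple[float, float]:
    """Return (time-to-first-token in seconds, tokens-per-second) for one streamed completion."""
    start = time.perf_counter()
    first_token_at = None
    chunks = 0

    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for event in stream:
        delta = event.choices[0].delta.content if event.choices else None
        if delta:
            if first_token_at is None:
                first_token_at = time.perf_counter()
            chunks += 1  # rough proxy: one streamed chunk is approximately one token

    end = time.perf_counter()
    ttft = (first_token_at or end) - start
    tps = chunks / (end - first_token_at) if first_token_at and end > first_token_at else 0.0
    return ttft, tps

if __name__ == "__main__":
    ttft, tps = measure_ttft_tps("Summarize edge inference in one sentence.")
    print(f"TTFT: {ttft * 1000:.0f} ms, ~{tps:.1f} tokens/s")
```

Counting chunks only approximates token throughput; for exact figures, use the server's usage report or run the output through the model's tokenizer.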
NVIDIA accelerated infrastructure
- GPUs: NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs, available as single-GPU instances or scaled inference clusters.
- DPUs: NVIDIA BlueField-3 DPUs for accelerated, secure data movement; roadmap includes BlueField-4.
- Software: Platform is designed for NVIDIA AI Enterprise, including NIM inference microservices and NeMo microservices.
Compute, storage, and data
- Elastic profiles: From hourly single-GPU rentals to high-performance clusters (up to 8 GPUs, 128 vCPUs, 1,472 GB DRAM, 8,192 GB NVMe) for high-concurrency inference.
- Storage: High-performance block and object storage, plus managed vector databases to support RAG and agent memory (see the retrieval sketch after this list).
- Networking: VPC and private networking with global routing and traffic control at the edge.
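The managed vector database item underpins RAG and agent memory. The product's client API isn't detailed here, so the snippet below is a purely conceptual sketch of the retrieval step such databases accelerate: store embeddings, then return the nearest chunks for a query. The embed() stub is a placeholder for whatever embedding model you deploy; a managed service replaces the brute-force search with approximate nearest-neighbor indexes.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: in practice, call your deployed embedding model here."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=384)
    return v / np.linalg.norm(v)

# "Agent memory": chunks of prior context or documents, pre-embedded.
corpus = ["refund policy for EU customers", "GPU region rollout schedule", "API rate limits"]
index = np.stack([embed(c) for c in corpus])  # shape: (n_chunks, dim)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Cosine-similarity search; vectors are unit-norm, so a dot product is the cosine."""
    q = embed(query)
    scores = index @ q
    top = np.argsort(scores)[::-1][:k]
    return [corpus[i] for i in top]

print(retrieve("when do new GPU locations open?"))
```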
Developer and platform operations
- Managed Kubernetes: Akamai’s LKE (Linode Kubernetes Engine) for portable, conformant K8s.
- App Platform: A pre-engineered, cloud-native stack to deploy LLMs, agents, and knowledge bases, with pre-integrated components such as KServe, Kubeflow Pipelines, vLLM, and OpenAI-compatible APIs (see the client sketch after this list). See the App Platform overview.
- Edge/serverless: Functions at the edge for traffic steering, policy enforcement, and lightweight pre/post-processing.
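The App Platform bullet lists vLLM and OpenAI-compatible APIs among the pre-integrated components. A sketch of what that compatibility buys you: any plain HTTP client can talk to the served model without a vendor SDK. The endpoint URL, token, and model name below are placeholders; the exact route depends on how your cluster exposes the service.

```python
import requests

# Placeholders: substitute the route your model service exposes and your credentials.
ENDPOINT = "https://llm.example-cluster.internal/v1/chat/completions"
HEADERS = {"Authorization": "Bearer <token>", "Content-Type": "application/json"}

payload = {
    "model": "<deployed-model>",  # as registered with the vLLM/KServe deployment
    "messages": [{"role": "user", "content": "Classify this ticket: 'card declined twice'"}],
    "max_tokens": 64,
    "temperature": 0.0,  # keep output stable for classification-style calls
}

resp = requests.post(ENDPOINT, json=payload, headers=HEADERS, timeout=30)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```

Because the request and response shapes match the OpenAI wire format, the same client code works whether the model is hosted here or behind another compatible gateway, which is the portability point raised later in the data and integrations questions.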
Security and governance
- Model- and API-aware controls: Firewall for AI, App & API Protector, API Security, and bot & abuse controls to mitigate scraping, prompt injection, model abuse, and DDoS.
- Zero Trust segmentation: Akamai Guardicore Segmentation for granular east-west controls around model-serving infrastructure.
If you’re optimizing for consistent sub-100 ms end-to-end experiences, proximity routing plus GPU/DPU acceleration and edge traffic management are the primary levers available in this platform.
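Proximity routing is handled by the platform, but it is easy to see why it matters for a 100 ms budget: round-trip time alone can consume most of it. The sketch below probes a few hypothetical regional endpoints (the URLs are illustrative, not real product endpoints) and reports the lowest observed RTT, which is roughly the floor the network imposes before any GPU work starts.

```python
import time
import requests

# Illustrative regional endpoints; in practice the platform's routing picks these for you.
REGIONS = {
    "us-east": "https://us-east.inference.example.com/healthz",
    "eu-central": "https://eu-central.inference.example.com/healthz",
    "ap-south": "https://ap-south.inference.example.com/healthz",
}

def probe(url: str, attempts: int = 3) -> float:
    """Return the best observed round-trip time in milliseconds across a few attempts."""
    best = float("inf")
    for _ in range(attempts):
        start = time.perf_counter()
        try:
            requests.get(url, timeout=2)
        except requests.RequestException:
            continue
        best = min(best, (time.perf_counter() - start) * 1000)
    return best

latencies = {region: probe(url) for region, url in REGIONS.items()}
closest = min(latencies, key=latencies.get)
print(f"Lowest RTT: {closest} at {latencies[closest]:.0f} ms of the ~100 ms budget")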
For background on how inference differs from training and why proximity matters, see the glossary guide: What is AI inferencing?
Architecture and performance
- What are your TTFT/TPS targets per use case? Can proximity routing meet them across your user geos?
- Do you need single-tenant clusters for predictable concurrency or shared pools for burst?
- Will you use NIM/NeMo for model serving, or KServe/vLLM on K8s?
Security and governance
- How will you mitigate prompt injection, scraping, and model abuse? Which edge controls (WAF/WAAP, Firewall for AI, bot management) are required on day one?
- What’s your segmentation strategy for inference services and data paths (e.g., Guardicore microsegmentation)?
Data and integrations
- Which vector DB and storage tiers do your RAG/agent workloads require? What are the data locality constraints?
- Do you need OpenAI-compatible APIs for portability while migrating to your own models?
Operations and cost
- How will you instrument observability for latency, quality, and cost per token/call? (A minimal cost-accounting sketch follows this list.)
- What does your GPU mix look like across dev/test/prod and regions? How will you manage quotas, rate limits, and backpressure?
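Cost per call is simple arithmetic once you log token usage alongside latency. The rates below are made-up placeholders, not Akamai pricing; substitute your own contract terms.

```python
from dataclasses import dataclass

# Placeholder rates for illustration only; substitute your negotiated pricing.
PRICE_PER_1K_PROMPT_TOKENS = 0.0005
PRICE_PER_1K_COMPLETION_TOKENS = 0.0015

@dataclass
class CallRecord:
    prompt_tokens: int
    completion_tokens: int
    latency_ms: float

    @property
    def cost_usd(self) -> float:
        return (self.prompt_tokens / 1000) * PRICE_PER_1K_PROMPT_TOKENS + \
               (self.completion_tokens / 1000) * PRICE_PER_1K_COMPLETION_TOKENS

# Example records as they might be emitted by your serving layer's logs.
calls = [CallRecord(812, 164, 93.0), CallRecord(1430, 402, 188.0)]
total_cost = sum(c.cost_usd for c in calls)
budget_misses = sum(c.latency_ms > 100 for c in calls)
print(f"{len(calls)} calls, ${total_cost:.4f} total, {budget_misses} over the 100 ms budget")
```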
If you’re evaluating how to move inference from core to edge without adding complexity, Akamai’s combination of NVIDIA Blackwell, distributed routing, Kubernetes-native tooling, and AI-aware security gives you the main primitives to hit latency, scale, and governance targets for modern agentic applications.