Enabling AI Everywhere with Akamai Inference Cloud
Akamai Inference Cloud moves AI from centralized data centers to the edge — where your users, devices, and decisions are. Built with NVIDIA’s latest AI infrastructure, it’s a full‑stack platform to deploy, secure, and scale real‑time inference and agentic applications globally with predictable latency and clear economics.
Executive summary
- What it is: A distributed inference platform combining GPU compute, traffic management, Kubernetes, data services, and AI‑aware security — designed for production AI at global scale.
- Why it matters: Closer‑to‑user inference cuts time to first token, improves tokens per second, and controls costs for high‑frequency, real‑time workloads.
- NVIDIA partnership: Access NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs, BlueField DPUs, and NVIDIA AI Enterprise (including NIM and NeMo microservices) on Akamai’s globally distributed cloud.
- Who it serves: MLOps engineers, AI engineers, and agentic system architects building low‑latency, reliable AI experiences.
- How to start: Create an account, book an AI consult, or join the Blackwell GPU waitlist. Transparent pricing, open APIs, and no‑cost egress help you scale without lock‑in.
Why inference must move to the edge
Training teaches models; inference delivers value. As AI assistants, autonomous agents, and physical systems proliferate, the volume of machine‑initiated inference will far outpace human requests. Shipping every token across continents is costly and slow. Moving inference closer to users reduces latency, improves consistency, and makes the economics work for production.
If you’re feeling the pinch from egress fees, GPU scarcity, or unpredictable response times, the bottleneck isn’t just GPUs — it’s proximity. Edge inference solves for milliseconds, not megawatts.
What Akamai Inference Cloud is
Akamai Inference Cloud is a purpose‑built platform to run intelligent, real‑time applications at the edge. It brings together:
- NVIDIA Blackwell GPUs and BlueField networking for high‑throughput, low‑latency inference
- Kubernetes (managed or bring‑your‑own), with an application platform for rapid AI deployment
- Distributed data services (vector databases, object and block storage, backups)
- AI‑aware routing, rate limiting, semantic caching, and CDN acceleration (a semantic‑caching sketch appears below)
- Model‑aware security (Firewall for AI) plus WAAP, bot management, and microsegmentation
- Unified observability to monitor performance, cost, and reliability
Learn more on the Akamai Inference Cloud product page.
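To make the semantic‑caching idea concrete, here is a minimal, illustrative sketch of the pattern: embed each prompt, compare it against previously cached prompts, and return the stored response when similarity clears a threshold. The `embed_fn` and the threshold are placeholder assumptions for illustration, not Akamai's implementation.

```python
import numpy as np

# Illustrative only: embed_fn is any text-embedding function you already use
# (e.g., a sentence-transformer or an embeddings API); it is a placeholder here.
def semantic_cache_lookup(prompt, cache, embed_fn, threshold=0.92):
    """Return a cached response if a semantically similar prompt was seen before."""
    query = embed_fn(prompt)
    query = query / np.linalg.norm(query)
    best_score, best_response = 0.0, None
    for cached_embedding, cached_response in cache:
        score = float(np.dot(query, cached_embedding))  # cosine similarity of unit vectors
        if score > best_score:
            best_score, best_response = score, cached_response
    if best_score >= threshold:
        return best_response   # cache hit: skip the GPU round trip entirely
    return None                # cache miss: run inference, then add the result to the cache

def semantic_cache_add(prompt, response, cache, embed_fn):
    """Store a normalized prompt embedding alongside its generated response."""
    embedding = embed_fn(prompt)
    cache.append((embedding / np.linalg.norm(embedding), response))
```

In production, the similarity lookup would typically be backed by a managed vector database rather than an in‑memory list, but the hit/miss logic stays the same.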
Architecture and how it works
- Compute and networking: NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs, NVIDIA BlueField DPUs, tiered high‑speed memory, and NVMe storage deliver fast time to first token (TTFT) and high tokens per second (TPS) for LLMs and multimodal models.
- Orchestration: Run on Akamai’s managed Kubernetes or your conformant cluster. The pre‑engineered App Platform on LKE streamlines serving (vLLM, KServe), pipelines (Kubeflow), and NVIDIA integrations (NIM, NeMo, TensorRT‑LLM); see the serving example below.
- Distributed execution: Edge‑native traffic management routes requests to the optimal region or edge location based on proximity, capacity, policy, and cost.
- Enterprise controls: Identity, policy, quotas, and per‑model rate limits help you allocate and govern resources across teams and geographies.
- Data plane: Managed vector databases for retrieval, plus object/block storage and data pipelines to keep knowledge bases fresh and local to users.
For a deeper view of the NVIDIA integration and edge orchestration, see the press release.
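As one illustration of the serving layer, vLLM exposes an OpenAI‑compatible API, so a model deployed behind it can be called with the standard OpenAI Python client. The endpoint URL, API key, and model name below are placeholders for your own deployment.

```python
from openai import OpenAI

# Placeholder endpoint, key, and model; point these at your own vLLM/KServe service.
client = OpenAI(
    base_url="https://your-inference-endpoint.example.com/v1",
    api_key="YOUR_API_KEY",
)

# Stream tokens so the UI can render partial output as soon as it arrives.
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # whichever model you deployed
    messages=[{"role": "user", "content": "Summarize edge inference in one sentence."}],
    stream=True,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```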
Performance you can measure
- Lower latency where it matters: Edge inference reduces network hops, improving response times and consistency for chat, streaming reasoning, and agentic workflows.
- Benchmarked throughput: Independent testing on our platform shows the NVIDIA RTX PRO 6000 Blackwell can deliver up to 1.63× higher inference throughput vs. H100 in select workloads. See the benchmark analysis.
- Optimized for TTFT and TPS: From GPU scheduling to semantic caching at the edge, the stack is tuned for interactive and real‑time experiences.
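Because TTFT and TPS drive interactive quality, it is worth measuring them from the client side as well. Here is a rough sketch against an OpenAI‑compatible streaming endpoint; the endpoint and model are placeholders, and streamed chunks are only an approximation of token counts.

```python
import time
from openai import OpenAI

# Placeholder endpoint and credentials for your own deployment.
client = OpenAI(base_url="https://your-inference-endpoint.example.com/v1", api_key="YOUR_API_KEY")

def measure_ttft_and_tps(prompt, model="meta-llama/Llama-3.1-8B-Instruct"):
    """Return (time to first token in seconds, approximate tokens per second) for one streamed request."""
    start = time.perf_counter()
    first_token_at = None
    chunks = 0
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.perf_counter()
            chunks += 1  # each content chunk roughly corresponds to one generated token
    total = time.perf_counter() - start
    ttft = (first_token_at - start) if first_token_at else float("nan")
    decode_time = total - ttft if first_token_at else float("nan")
    tps = chunks / decode_time if first_token_at and decode_time > 0 else float("nan")
    return ttft, tps
```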
New to inference fundamentals? See What is AI inferencing?
Security, governance, and reliability
- Model‑aware security: Firewall for AI mitigates prompt injection, prompt leakage, jailbreaks, and model abuse.
- WAAP + API protection: App & API Protector and API Security safeguard endpoints, traffic, and data exchange.
- Zero Trust segmentation: Guardicore Segmentation contains blast radius and enforces least privilege for AI services.
- Bot and abuse protection: Control scraping, fraud, and unwanted automated traffic to AI endpoints.
- Observability: End‑to‑end telemetry across models, APIs, and edge to track latency, quality, spend, and SLOs.
Who it’s for
- MLOps engineers: Automate deployment, scaling, retraining, and monitoring of models in production with Kubernetes‑native tooling.
- AI engineers: Build end‑to‑end applications and RAG pipelines using pre‑trained models, vLLM/KServe, managed vector DBs, and NVIDIA NIM microservices.
- Agentic system architects: Design multi‑agent systems with identity, trust, tools, and memory that operate autonomously across regions and edges.
More context on personas and design goals: AI: Edge Is All You Need.
Implementation: from first model to global rollout
- Choose your path:
  - Managed platform: Deploy on App Platform for LKE to get serving, pipelines, vector DBs, and NVIDIA integrations out of the box.
  - Bring your own K8s: Run on any conformant cluster with full control and open APIs.
- Connect enterprise data: Stand up managed vector search and object storage; configure RAG pipelines and data pipelines for freshness (a retrieval sketch follows this list).
- Secure the interaction layer: Apply model‑aware security, WAAP, quotas, and per‑team limits. Configure observability and SLOs.
- Scale to the edge: Use policy‑based routing and edge caching to place inference near users or devices and keep costs predictable.
- Validate and expand: Load test with realistic traffic, monitor TTFT/TPS and cost per request, then roll out region by region.
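For the "connect enterprise data" step, the heart of a RAG pipeline is retrieve‑then‑generate. Below is a minimal sketch assuming an embedding function and an OpenAI‑compatible endpoint of your own (both placeholders); in production the similarity search would be handled by the managed vector database rather than numpy.

```python
import numpy as np
from openai import OpenAI

# Placeholder endpoint and credentials for your own deployment.
client = OpenAI(base_url="https://your-inference-endpoint.example.com/v1", api_key="YOUR_API_KEY")

def answer_with_rag(question, docs, doc_embeddings, embed_fn, model, top_k=4):
    """Retrieve the most relevant docs by cosine similarity, then generate a grounded answer.

    `embed_fn`, `docs`, and `doc_embeddings` (unit-normalized rows) are placeholders
    for whatever embedding model and document store you use.
    """
    # 1. Retrieve: score every document chunk against the question.
    q = embed_fn(question)
    q = q / np.linalg.norm(q)
    scores = doc_embeddings @ q
    top = np.argsort(scores)[::-1][:top_k]
    context = "\n\n".join(docs[i] for i in top)

    # 2. Generate: answer grounded in the retrieved context.
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```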
Need help planning? Book an AI consultation.
Key features and capabilities
- NVIDIA Blackwell GPUs and BlueField DPUs
- Managed Kubernetes and App Platform for agentic apps
- vLLM, KServe, Kubeflow Pipelines, NVIDIA NIM and NeMo
- Managed vector databases and distributed object/block storage
- Edge‑aware traffic management, semantic caching, and quotas (a quota sketch follows this list)
- Firewall for AI, WAAP, API Security, bot and abuse protection
- Unified observability across models, APIs, edge, and cost
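As a simple illustration of how per‑model quotas can be enforced at the request layer, here is a token‑bucket sketch; the model names and limits shown are arbitrary examples, not platform defaults.

```python
import time

class TokenBucket:
    """Token-bucket rate limiter: allow `rate` requests per second with bursts up to `capacity`."""
    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        # Refill proportionally to elapsed time, then spend one token if available.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# One bucket per model lets a chatty team or agent exhaust its own quota
# without starving other models; the numbers here are illustrative only.
quotas = {
    "llama-3.1-8b": TokenBucket(rate=50, capacity=100),
    "llama-3.1-70b": TokenBucket(rate=5, capacity=10),
}

def admit(model_name: str) -> bool:
    bucket = quotas.get(model_name)
    return bucket.allow() if bucket else False
```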
Pricing, trials, and getting started
- Create your account: Create a Cloud Account to provision compute, storage, and networking.
- Access Blackwell GPUs: Be among the first to use NVIDIA RTX PRO 6000 Blackwell optimized for inference — join the waitlist.
- Talk to an expert: Share your use case and performance goals to match the right GPU tiers and deployment model — book an AI consultation.
- Control costs: Build on open APIs and full Kubernetes control with no‑cost egress and clear pricing. For optimization strategies, see the whitepaper AI Inference Efficiency: Spend Less and Do More, a Stable Diffusion case study showing up to 86% cost reduction on Akamai Cloud. Download the whitepaper.
Where it excels
- Agentic AI and assistants: Streaming, low‑latency responses with guardrails and per‑model quotas.
- Financial decisioning and streaming inference: Multi‑step reasoning with millisecond responsiveness.
- Personalization and recommendations: Real‑time, context‑rich interactions with your LLMs and custom models.
- Physical AI: Robots, vehicles, and industrial edges that require on‑device and near‑device inference.
Resources and next steps
Ready to move from pilot to production? Book an AI consultation and we’ll help you design, secure, and scale your inference stack at the edge.