When organizations move artificial intelligence (AI) from experiment to production, they discover something critical: Not every workload needs the biggest GPU you can buy.
The challenge isn’t access to GPUs. It’s having the right GPU shape for the job.
Some teams need just enough GPU to fine-tune a model or power a recommendation engine. Others need significantly more memory and throughput for multimodal inference, 8K video transcoding, or support for AAA game titles.
With NVIDIA RTX PRO™ 6000 Blackwell Server Edition GPUs now available in 1-card, 2-card, and 4-card plans, Akamai Inference Cloud meets customers where their workloads actually are, delivering the right price-to-performance ratio for real AI inference, agentic AI, physical AI, scientific computing, media, and video game use cases.
These plans are designed for teams that don’t just want GPU access. They also want GPU infrastructure that matches how modern applications are built and deployed.
Not sure what GPU you need? Check out our blog post on comparing GPUs on Akamai Cloud.
What different GPU card counts enable
The number of GPU cards in a plan directly determines available memory, parallelism, and throughput, which translates into very different classes of workloads. The three plans are:
- 1-card plans — Enable precision for focused AI, media, and edge inference
- 2-card plans — Expand the class of workloads for multimodal and agentic AI systems
- 4-card plans — Run the largest models without training-class infrastructure
1-card plans — Enable precision for focused AI, media, and edge inference
A single NVIDIA RTX PRO 6000 Blackwell Server Edition GPU is not an entry-level option. It is an extremely efficient shape for a wide range of real production workloads.
With 96 GB of VRAM (see the sizing sketch after this list), this plan enables teams to run:
- Up to 70B parameter models using FP4 quantization
- Up to 40B parameter models using FP8 quantization
- Multiple concurrent instances of 7B–13B parameter models on the same GPU
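To see why these limits hold, here is a back-of-envelope sizing sketch. It covers weight memory only; real deployments also need headroom for the KV cache, activations, and the serving runtime, which is why the quoted model sizes sit comfortably below 96 GB.

```python
def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate weight footprint: parameters x bits per parameter."""
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

# Rough fit checks against a single 96 GB card
for params, bits in [(70, 4), (40, 8), (13, 16)]:
    print(f"{params}B @ FP{bits}: ~{weight_memory_gb(params, bits):.0f} GB of weights")
# 70B @ FP4:  ~35 GB of weights
# 40B @ FP8:  ~40 GB of weights
# 13B @ FP16: ~26 GB of weights (several instances fit on one card)
```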
The 1-card plan is purpose-built for deploying large language models, recommendation engines, multimodal inference, safety systems, transcoding pipelines, and AI services embedded directly into applications.
This is where the price-performance advantage becomes undeniable. Compared with an H100 running similar workloads, the NVIDIA RTX PRO 6000 Blackwell Server Edition GPU provides (a cost sketch follows this list):
- 28% lower cost per million tokens
- Higher throughput (3,140 vs. 2,987 tokens/sec)
- FP4 support delivering a 1.63x performance improvement over H100 at FP8
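For context on how a cost-per-million-tokens figure is derived, here is a minimal sketch. The hourly price below is a hypothetical placeholder, not a published Akamai rate; only the throughput figure comes from the comparison above.

```python
def cost_per_million_tokens(hourly_price_usd: float, tokens_per_sec: float) -> float:
    """Dollars per million generated tokens at full utilization."""
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_price_usd / tokens_per_hour * 1e6

# $2.50/hour is a hypothetical price; 3,140 tokens/sec is the figure above.
print(f"${cost_per_million_tokens(2.50, 3140):.3f} per million tokens")
```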
These gains are a direct result of the Blackwell architecture’s native FP4 support and memory profile, which allow teams to run larger models or more concurrent workloads without scaling up to oversized, expensive GPU instances.
For many AI builders, media platforms, and software companies, this is the operational sweet spot: enough GPU to run serious models efficiently, without paying for infrastructure designed for training clusters or massive parallelism.
2-card plans — Expand the class of workloads for multimodal and agentic AI systems
Two RTX PRO 6000 Blackwell Server Edition GPUs dramatically expand the class of workloads that teams can run without stepping into oversized, data center-class GPU infrastructure.
This configuration is optimized for platform and product teams that are building:
- Multimodal AI systems combining vision, text, and audio
- Agentic AI workflows that maintain context across multistep tasks
- Higher-throughput inference for web, mobile, and enterprise applications
- AI moderation, safety, and real-time decision systems
With combined GPU memory, teams can run (a serving sketch follows this list):
- Models that exceed single-GPU memory limits, even with FP4 quantization
- Larger context windows for inference
- More demanding real-time workloads without sacrificing latency
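As one way this looks in practice, here is a minimal serving sketch assuming a vLLM-based stack; this post doesn't prescribe a serving framework, and the model name is an example placeholder. Tensor parallelism shards the model's weights across both cards so a model too large for one GPU still serves from a single plan.

```python
# A sketch assuming vLLM; swap in your own model and serving framework.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # example model name
    tensor_parallel_size=2,  # shard weights across the plan's 2 GPUs
)
outputs = llm.generate(
    ["Describe the image the user uploaded."],
    SamplingParams(max_tokens=256),
)
print(outputs[0].outputs[0].text)
```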
At this level, performance remains on par with traditional data center GPUs:
- Nearly identical cost per token to H100/H200-class infrastructure
- Strong throughput for real-world inference workloads, while still benefiting from the price-performance and geographic advantages of the Akamai distributed edge and cloud platform
For many AI builders, this is where multimodal and agentic applications move from experimental to production grade.
4-card plans — Run the largest models without training-class infrastructure
Four RTX PRO 6000 Blackwell Server Edition GPUs unlock the ability to run large open models in production, without requiring specialized centralized AI factory environments.
This plan is designed for:
- Large-scale enterprise inference systems
- High-throughput media and transcoding pipelines
- AAA game platforms with AI-driven services
- Software platforms delivering AI capabilities to customers globally
In this configuration, teams can run (a memory sketch follows this list):
- Models requiring up to 384 GB of combined GPU memory, supporting very large parameter counts
- Models such as Qwen3-Coder-480B in FP4, which requires approximately 320 GB of GPU memory across four GPUs
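A quick back-of-envelope check of the Qwen3-Coder-480B example shows where that ~320 GB figure comes from: FP4 stores each weight in half a byte, and the KV cache plus runtime overhead add the rest.

```python
params_billion = 480
fp4_bytes_per_param = 0.5  # FP4 = 4 bits per weight
weights_gb = params_billion * fp4_bytes_per_param  # ~240 GB of raw weights
total_vram_gb = 4 * 96                             # four 96 GB cards = 384 GB

# KV cache, activations, and runtime overhead take the working footprint
# from ~240 GB to roughly 320 GB, still inside the 384 GB budget.
print(f"weights ~{weights_gb:.0f} GB of {total_vram_gb} GB available")
```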
At this scale, the architecture begins to approach the practical limits of PCIe-based GPU communication, and raw throughput starts to favor NVLink-based H100/H200 systems by roughly 30% to 50%.
For inference workloads, however, the economics and deployment flexibility remain compelling:
- The cost efficiency remains nearly identical to that of data center GPUs.
- These large models can be deployed in locations that matter for latency.
- There is no need to move workloads into centralized AI factory regions.
This makes the 4-card plan uniquely suited for teams that need to run very large models in production, closer to users, without investing in specialized training infrastructure.
Who these GPU shapes are built for
These GPU configurations are particularly well suited for:
- Digital-native AI builders developing agentic and multimodal products for real-world use
- Product and platform teams embedding AI inference into web, mobile, device, and enterprise applications
- Media and entertainment companies performing transcoding, supporting 8K streaming, and powering AAA video game titles
- Enterprises deploying real-time AI systems such as recommendations, copilots, analytics, and safety workflows
- Software companies delivering AI-powered platforms that require reliable, low-latency inference at global scale
From GPU instances to a distributed AI platform
Akamai Inference Cloud provides access to NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs across its distributed cloud, forming the foundation of a broader, edge-native AI platform that enables customers to run AI inference and accelerated workloads closer to end users.
Blackwell GPUs on Akamai Inference Cloud are designed for teams ready to run real AI workloads on GPUs with strong price-performance characteristics in the locations that matter for latency, and to explore how this distributed compute layer connects with Akamai's evolving serverless and edge delivery capabilities.
Customers gain direct access to distributed GPU infrastructure paired with a production-ready Kubernetes environment (LKE) and an AI software stack, making it possible to deploy, scale, and operate inference workloads without building custom GPU infrastructure.
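As an illustration of what that looks like, here is a minimal deployment sketch using the standard Kubernetes Python client against an LKE cluster. The container image is an example placeholder, and the nvidia.com/gpu resource name assumes the NVIDIA device plugin, which GPU-enabled Kubernetes clusters typically run.

```python
from kubernetes import client, config

config.load_kube_config()  # uses the kubeconfig downloaded for your LKE cluster

deployment = client.V1Deployment(
    api_version="apps/v1",
    kind="Deployment",
    metadata=client.V1ObjectMeta(name="llm-inference"),
    spec=client.V1DeploymentSpec(
        replicas=1,
        selector=client.V1LabelSelector(match_labels={"app": "llm-inference"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "llm-inference"}),
            spec=client.V1PodSpec(
                containers=[
                    client.V1Container(
                        name="inference",
                        image="vllm/vllm-openai:latest",  # example image
                        resources=client.V1ResourceRequirements(
                            limits={"nvidia.com/gpu": "1"}  # one GPU card
                        ),
                    )
                ]
            ),
        ),
    ),
)
client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)
```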
The focus is on transparent pricing, ongoing performance improvements, and enabling inference-driven workloads that centralized cloud regions often struggle to support efficiently — while signaling the evolution from standalone GPU instances to an integrated, globally distributed intelligence platform.
Why this matters
For organizations that are scaling AI initiatives but constrained by the cost, scarcity, and latency of centralized hyperscalers, Akamai Inference Cloud provides a globally distributed AI compute foundation.
The RTX PRO 6000 Blackwell Server Edition GPUs serve as the bedrock for this AI platform evolution, combining high-performance GPU instances, managed Kubernetes, and an AI-optimized software stack with the geographic reach required for sub–10 ms experiences.
Because in AI infrastructure, as with compute plans generally, the shape of the instance determines whether AI runs efficiently or expensively.