Your AI Cost Model Stops at the Token Price. The Bill Doesn't.

Ari Weil is Vice President of Cloud Computing and Delivery Product Marketing at Akamai.

Jun 25, 2026

Written by

Ari Weil

Ari Weil is a product strategy and go-to-market executive with experience across various management and operational disciplines. He brings more than 20 years of cross-functional enterprise management expertise, across every aspect of the product and marketing lifecycle, to his role. His key areas of focus include data security, compliance, risk management, cloud adoption, digital transformation, and modern application architectures.

AI spending has crossed a line that should change how you budget for it. Inference, not training, is now the dominant cost center, running close to 80% of total AI spend for any team with real production traffic. Training is the headline. Inference is the operating expense, and it shows up every time a model serves a request.

The hidden costs beyond the token price

Most enterprises model that expense as a compute problem. The metric of record is dollars per million tokens, and for good reason: GPU time is the majority of the bill, driven by model size, token volume, and how well you use the hardware. That math is correct as far as it goes; the problem is where it stops.

The line items that teams consistently miss sit one layer down, in data movement and latency. Egress and cross-region transfer are the clearest examples. Enterprises routinely underforecast these costs by a factor of three to five, because the token price is printed on the pricing page and the network charges are not.

Data movement does not rival compute today, and I am not going to tell you it does. But it is the fastest-growing category in the bill, and it behaves differently than compute. It scales with the shape of your architecture, not just with the size of your model.
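One way to make this concrete is to put data movement next to compute in the same model. The sketch below is a minimal illustration, not a pricing tool: the `MonthlyWorkload` fields, the function names, and every number in the example are hypothetical placeholders, not quoted rates.

```python
from dataclasses import dataclass


@dataclass
class MonthlyWorkload:
    """Hypothetical monthly inference workload. All fields are illustrative."""
    tokens_millions: float          # tokens served, in millions
    usd_per_million_tokens: float   # the number printed on the pricing page
    egress_gb: float                # data leaving the provider's network
    usd_per_egress_gb: float
    cross_region_gb: float          # data moved between regions
    usd_per_cross_region_gb: float


def compute_cost(w: MonthlyWorkload) -> float:
    """Token-priced compute: where most cost models stop."""
    return w.tokens_millions * w.usd_per_million_tokens


def data_movement_cost(w: MonthlyWorkload) -> float:
    """Egress plus cross-region transfer: the line items that get missed."""
    return (w.egress_gb * w.usd_per_egress_gb
            + w.cross_region_gb * w.usd_per_cross_region_gb)


def total_cost(w: MonthlyWorkload) -> float:
    return compute_cost(w) + data_movement_cost(w)


def movement_share(w: MonthlyWorkload) -> float:
    """Fraction of the total bill that is data movement rather than compute."""
    return data_movement_cost(w) / total_cost(w)


# Placeholder figures chosen only to exercise the functions.
example = MonthlyWorkload(
    tokens_millions=10_000, usd_per_million_tokens=2.00,
    egress_gb=20_000, usd_per_egress_gb=0.10,
    cross_region_gb=50_000, usd_per_cross_region_gb=0.02,
)
print(f"compute:  ${compute_cost(example):,.2f}")
print(f"movement: ${data_movement_cost(example):,.2f}")
print(f"share:    {movement_share(example):.1%}")
```

Tracking `movement_share` month over month is one way to see whether the category is growing faster than compute in your own bill.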

How agentic and RAG systems reshape the bill

That distinction matters more every quarter, because the workloads are changing shape. A first-generation AI deployment was one prompt, one model call, and one answer. An agentic or retrieval-augmented generation (RAG) system looks nothing like that. A single user request fans out: a vector search for context, several model calls as the agent reasons and revises, tool and API calls to external systems, and then a response.

Every hop that crosses a zone, a region, or a cloud boundary is a charge, and a place where latency compounds. The token price stays roughly flat. However, the number of hops behind each user action increases by an order of magnitude.
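The fan-out can be sketched as a list of hops, each tagged with the boundary it crosses. This is a rough model under stated assumptions: the boundary names, the per-gigabyte prices, the added latencies, and the payload sizes are all hypothetical, and the hops are assumed to run one after another, so their latencies add.

```python
from dataclasses import dataclass

# Hypothetical transfer prices (USD per GB) and added latency (ms) per boundary.
BOUNDARY_USD_PER_GB = {"local": 0.0, "zone": 0.01, "region": 0.02, "cloud": 0.09}
BOUNDARY_LATENCY_MS = {"local": 1.0, "zone": 2.0, "region": 60.0, "cloud": 80.0}


@dataclass
class Hop:
    name: str
    boundary: str       # key into the two tables above
    payload_mb: float   # data moved by this hop


def request_overhead(hops: list[Hop]) -> tuple[float, float]:
    """Return (transfer cost in USD, added latency in ms) for one user request.

    Assumes hops are sequential; parallel hops would add less latency.
    """
    transfer_usd = sum(
        h.payload_mb / 1000 * BOUNDARY_USD_PER_GB[h.boundary] for h in hops
    )
    latency_ms = sum(BOUNDARY_LATENCY_MS[h.boundary] for h in hops)
    return transfer_usd, latency_ms


# First-generation shape: one prompt, one model call, one answer.
single_call = [Hop("model call", "region", 0.05)]

# Agentic/RAG shape: retrieval, several model calls, and a tool call.
agentic = [
    Hop("vector search", "zone", 2.0),
    Hop("model call 1", "region", 0.05),
    Hop("tool call", "cloud", 0.5),
    Hop("model call 2", "region", 0.05),
    Hop("model call 3", "region", 0.05),
]

for label, hops in (("single", single_call), ("agentic", agentic)):
    usd, ms = request_overhead(hops)
    print(f"{label}: {len(hops)} hops, ${usd:.8f} transfer, {ms:.0f} ms added")
```

The token price does not appear anywhere in this function, which is the point: the overhead depends only on how many boundaries the request crosses and how much data each hop carries.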

High availability makes the same point from a different direction. Run inference in a single region and you accept the latency penalty for distant users. Replicate across regions to fix that, and you duplicate the most expensive thing you own. Cross-region replication for latency doubles the GPU cost of the capacity that you replicate. You are now paying twice for compute to paper over a problem that is fundamentally about distance.
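The replication arithmetic is simple enough to write down. In this sketch the function name, the GPU count, the hourly rate, and the 730-hour month are illustrative assumptions, and it assumes each added region gets a full copy of the capacity.

```python
def monthly_gpu_cost(gpus_per_region: int, usd_per_gpu_hour: float,
                     regions: int = 1, hours_per_month: float = 730.0) -> float:
    """GPU spend when the same capacity is fully replicated in every region."""
    return gpus_per_region * usd_per_gpu_hour * hours_per_month * regions


# Hypothetical fleet: 8 GPUs at a placeholder rate of $4.00 per GPU-hour.
one_region = monthly_gpu_cost(gpus_per_region=8, usd_per_gpu_hour=4.0, regions=1)
two_regions = monthly_gpu_cost(gpus_per_region=8, usd_per_gpu_hour=4.0, regions=2)

print(f"one region:  ${one_region:,.2f}")
print(f"two regions: ${two_regions:,.2f}")
print(f"ratio:       {two_regions / one_region:.1f}x")
```

Cost scales linearly with the number of full replicas, regardless of how much traffic each replica actually serves.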

This is the part that the centralized model handles poorly. Concentrating inference in a handful of regions made sense when AI was a batch job that you ran overnight. It does not fit workloads that are interactive, high fan-out, and latency sensitive. 

A fraud system intercepting a transaction, a voice agent that needs to sound human, a retail experience personalizing in real time: None of these can absorb a round trip to a distant region. And none of them should pay a premium to move every byte of that interaction across the network.

Decoupling inference from centralized clouds

The answer is to stop treating inference as something that lives in three places and start running it where the workload actually is. That is the premise behind Akamai Inference Cloud, which we launched in October 2025 on NVIDIA Blackwell infrastructure: Extend inference from core data centers out to the edge, closer to users and devices, and across the distributed footprint Akamai already operates inside networks worldwide. 

Put the model near the request and two costs fall at once. Latency drops because the round trip is shorter. The last-mile transfer cost drops because the data no longer traverses the public internet from a distant region to reach the user.
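The latency half of that claim has a physical floor that is easy to estimate. Light travels through optical fiber at roughly 200,000 km/s, about two-thirds of its speed in a vacuum, so distance alone sets a minimum round-trip time. The sketch below computes only that lower bound; it ignores routing, queuing, and processing time, and the distances and round-trip counts are hypothetical.

```python
FIBER_KM_PER_MS = 200.0  # ~200,000 km/s in optical fiber


def min_round_trip_ms(distance_km: float) -> float:
    """Physical lower bound on round-trip time over a fiber path of this length."""
    return 2 * distance_km / FIBER_KM_PER_MS


def min_request_latency_ms(distance_km: float, sequential_round_trips: int) -> float:
    """Lower bound when a request needs several round trips, one after another."""
    return min_round_trip_ms(distance_km) * sequential_round_trips


# Hypothetical comparison: a distant region versus a nearby location,
# for a request that needs five sequential round trips.
for label, km in (("distant region", 4000.0), ("nearby location", 100.0)):
    print(f"{label}: {min_round_trip_ms(km):.1f} ms per round trip, "
          f"{min_request_latency_ms(km, 5):.1f} ms for five")
```

Real paths are longer than the straight-line distance and add processing delay at every step, so observed latencies will sit above these figures.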

Evaluating an AI vendor: Three critical questions

For enterprises, the practical move is no longer choosing a vendor but fixing the model you use to evaluate one. Three questions are worth posing to any AI infrastructure provider:

  1. What does it cost to move data inside the environment?
  2. What does it cost to leave?
  3. Where does the inference actually run?

What does it cost to move data?

The first question to consider is “What does it cost to move data inside the environment?” Charging for traffic between zones in your own deployment turns every distributed-agent design into a variable, hard-to-forecast bill. That cost should be flat and predictable, not metered at the boundary.

What does it cost to leave?

Second, consider asking “What does it cost to leave?” Hyperscaler egress costs run roughly 8 to 12 cents per gigabyte, and that number is the mechanism that keeps data where it lands. 

A neutral architecture lets you store data in one place, run retrieval in another, and serve inference where it makes sense, without an exit penalty engineered to prevent exactly that. For example, Akamai prices transfer on a flat, pooled model that runs close to an order of magnitude lower per gigabyte.
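To see what those per-gigabyte figures mean at volume, the sketch below applies the 8 to 12 cent range quoted above to a hypothetical monthly transfer. The 50,000 GB volume is a placeholder, and the factor of 10 is only an illustrative reading of "close to an order of magnitude"; it is not a quoted price, and pooled allowances are ignored.

```python
# Range quoted in the text for hyperscaler egress, in USD per GB.
HYPERSCALER_EGRESS_USD_PER_GB = (0.08, 0.12)

# Illustrative stand-in for "close to an order of magnitude lower".
# This is an assumption for the arithmetic, not a published rate.
ASSUMED_REDUCTION_FACTOR = 10.0


def egress_cost_range(gb: float,
                      usd_per_gb: tuple[float, float] = HYPERSCALER_EGRESS_USD_PER_GB
                      ) -> tuple[float, float]:
    """Low and high monthly egress cost for a given transfer volume."""
    low, high = usd_per_gb
    return gb * low, gb * high


def reduced_cost_range(gb: float,
                       factor: float = ASSUMED_REDUCTION_FACTOR) -> tuple[float, float]:
    """The same range divided by an assumed reduction factor."""
    low, high = egress_cost_range(gb)
    return low / factor, high / factor


monthly_gb = 50_000  # hypothetical volume
low, high = egress_cost_range(monthly_gb)
r_low, r_high = reduced_cost_range(monthly_gb)
print(f"metered egress:  ${low:,.0f} to ${high:,.0f} per month")
print(f"assumed reduced: ${r_low:,.0f} to ${r_high:,.0f} per month")
```

Substituting your own measured transfer volume and the rates from an actual quote turns this into a direct comparison.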

By switching from standard hyperscaler setups to an open, distributed cloud platform, enterprises running interactive inference workloads can reduce structural infrastructure costs by up to 86%.

Where does the inference actually run?

Finally, consider the question “Where does the inference actually run?” If the answer is always a central region, you’ve accepted that both the latency and the transfer costs are fixed. They are not fixed. They are architectural choices.

None of this requires believing that the centralized cloud is going away. It is not. Training will keep concentrating in large clusters, and plenty of inference will run happily in a region near the user. 

The argument is narrower and more useful than that. As AI workloads get more interactive and more chatty, the cost of moving data and the cost of distance stop being rounding errors and start being design decisions.

The underlying architecture decides the bill

The token price is what everyone quotes, but the underlying architecture is what decides the bill. The teams that win the next phase of AI will be the ones that model the whole cost and build for where inference actually needs to happen.
