AI gains rarely reach the profit and loss statement, and that is usually an architectural problem, not a model problem. Teams ship a working pilot, then watch the returns get eaten by latency, runaway compute, and security incidents that nobody budgeted for. The model performs. But the underlying infrastructure was never built to carry the workload into production.
Akamai commissioned two studies to pressure test where that breakdown happens. The State of AI Inference 2026 surveyed 200 practitioners who are running inference in production, three-quarters of them engineers and architects, and most of them deployment decision-makers. The 2026 API Security Impact Study surveyed 1,840 security professionals across six industries and 10 countries.
Together, the two reports point to one conclusion: Most teams are scaling on infrastructure designed for training and stretched to cover inference, and the gaps show up as cost, latency, and exposure.
Those gaps close when three things come together:
An architecture that adapts to live conditions
Security built into that architecture instead of bolted on
Inference placed close enough to users to hit real-time targets
None of these is novel on its own. The problem is treating them as separate projects owned by separate teams.
Start with the anatomy. Inference is the live step where a trained model takes new input and returns an output. Every inference call travels as an API call, so the model and the API are not separate concerns. They share the same request path, the same failure modes, and the same attack surface. That thread runs through both surveys.
Where centralization truly breaks
Centralized inference is not doomed. For training, batch jobs, and latency-tolerant workloads, concentrating compute in a few large regions is often the right call — and Akamai runs plenty of workloads that way. The story told by the data is narrower and more useful: Centralization breaks for a specific and growing class of workloads, and many teams have not re-architected for it.
The State of AI Inference report found that 75% of organizations have moved generative AI (GenAI) into production, yet their infrastructure has not kept pace. A growing share of those workloads now carry hard real-time latency requirements that a round trip to a distant region cannot meet, and 60% of practitioners rate proximity to users and data as “important” or “critical.”
Even so, 46% still run inference from a single centralized region. That is the mismatch — not that centralization fails everywhere, but that a rising share of production inference needs to run closer to the user than a centralized footprint allows.
The workloads that break first are predictable: fraud scoring inside a live transaction, voice agents, real-time personalization, and any agentic pipeline that chains several model calls before returning an answer. Each hop inherits the latency of the last. Concentrate that in one region, and queuing delays compound as use climbs — adding hundreds of milliseconds exactly when the workload can least afford it.
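To see why queuing delay grows so sharply as a region fills up, here is a minimal sketch using the textbook M/M/1 queue. The queueing model is an assumption made for illustration (real inference servers batch requests and behave differently), and the 50 ms service time and the function names are hypothetical.

```python
def mean_response_ms(service_ms: float, utilization: float) -> float:
    """Mean time in system for an M/M/1 queue: service_ms / (1 - utilization)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_ms / (1.0 - utilization)


def chain_ms(service_ms: float, utilization: float, hops: int) -> float:
    """Sequential model calls: each hop waits for the one before it."""
    return hops * mean_response_ms(service_ms, utilization)


if __name__ == "__main__":
    # Hypothetical 50 ms model call, three chained calls per answer.
    for rho in (0.5, 0.8, 0.9):
        print(f"utilization {rho:.0%}: {chain_ms(50, rho, hops=3):.0f} ms per answer")
```

Under these assumptions a single 50 ms call averages roughly 100 ms at 50% utilization and roughly 500 ms at 90%, and a three-call chain multiplies that again.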
So, the real question is not centralized vs. distributed as an article of faith. It is which workloads need proximity and which do not. The teams getting this right answer that question deliberately, workload by workload, instead of defaulting everything to one region and absorbing the latency tax.
Adaptable architecture and the cost you cannot see
Weak ROI is usually a sign that a team is compensating operationally instead of architecturally. The tell is unit economics: the cost of a single inference request, measured per token or per query. If you cannot see that number, you cannot optimize it, and you cannot catch the moment it runs away from you.
Most teams cannot see it: 77% of organizations lack consistent unit-level economics tracking for inference, which means the majority cannot say whether a given workload is getting cheaper or more expensive as it scales. That visibility gap is also a security gap. An unexplained spike in token consumption is often the first sign of a denial of wallet (DoW) attack, in which an attacker drives inference volume specifically to run up the bill.
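As a rough illustration of what unit-level economics can look like in practice, the sketch below aggregates per-request records into a cost per 1,000 tokens for each workload. The record fields, workload labels, and function name are hypothetical, and a real pipeline would read these values from billing and telemetry data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class InferenceRecord:
    workload: str    # hypothetical label, e.g. "fraud-scoring"
    model: str
    tokens: int
    cost_usd: float


def cost_per_1k_tokens(records):
    """Return {workload: USD per 1,000 tokens} aggregated over the records."""
    tokens = defaultdict(int)
    cost = defaultdict(float)
    for r in records:
        tokens[r.workload] += r.tokens
        cost[r.workload] += r.cost_usd
    # Skip workloads with no tokens to avoid dividing by zero.
    return {w: 1000 * cost[w] / tokens[w] for w in tokens if tokens[w] > 0}


if __name__ == "__main__":
    window = [
        InferenceRecord("fraud-scoring", "model-a", 1000, 0.5),
        InferenceRecord("fraud-scoring", "model-a", 3000, 1.5),
    ]
    print(cost_per_1k_tokens(window))
```

Computing the same figure for successive time windows shows whether a workload is getting cheaper or more expensive as it scales.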
When the architecture cannot adapt on its own, engineers fall back on manual intervention. They reroute traffic by hand when a region spikes. They degrade response quality to keep a strained server alive. When inference slows, 51% of teams retry the same model, which usually deepens the congestion rather than clearing it. This is triage, and it does not scale. Scaling a rigid system scales its losses.
The fix is programmatic:
Tag every inference request with model and token metadata that streams into real-time monitoring, so a runaway model surfaces before it erodes the budget.
Define fail-open and fail-closed behavior in code, so the system pivots to a cached result or a smaller local model when the primary is unresponsive, without waking an engineer at 2 AM.
Practitioners already see this: 64% rate automated traffic steering as a critical requirement, a signal of where the market is heading.
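The two steps in the fix above can be sketched together. This is a minimal illustration, not a reference implementation: the function and field names are hypothetical, the primary callable is assumed to raise an exception when it times out or is unresponsive, and the whitespace token count is a crude stand-in for a real tokenizer.

```python
import time
from dataclasses import dataclass, field


@dataclass
class TaggedResult:
    text: str
    source: str                      # "primary", "cache", or "fallback"
    tags: dict = field(default_factory=dict)


def infer_with_fallback(prompt, primary, fallback, cache, model_name, emit,
                        fail_closed=False):
    """Call the primary model; on failure either re-raise (fail closed)
    or degrade to a cached answer, then to a smaller model (fail open)."""
    start = time.monotonic()
    try:
        text, source = primary(prompt), "primary"
    except Exception:
        if fail_closed:
            raise                    # policy: an error is better than a degraded answer
        if prompt in cache:
            text, source = cache[prompt], "cache"
        else:
            text, source = fallback(prompt), "fallback"
    tags = {
        "model": model_name,
        "served_by": source,
        "tokens": len(text.split()),  # assumption: whitespace split, not a tokenizer
        "latency_ms": (time.monotonic() - start) * 1000,
    }
    emit(tags)                        # stream metadata to monitoring
    return TaggedResult(text, source, tags)


if __name__ == "__main__":
    def unresponsive(prompt):
        raise TimeoutError("primary model did not answer")

    result = infer_with_fallback(
        "score this transaction", unresponsive,
        fallback=lambda p: "low risk", cache={},
        model_name="large-model", emit=print,
    )
    print(result.source)
```

Whether a given workload should fail open or fail closed is a per-workload decision. The point of the sketch is only that the decision lives in code rather than in a runbook.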
Distribution is the performance mechanism
When poor ROI shows up as high latency and low conversion, geography is usually the root cause. Real-time inference is bursty and latency sensitive, and cannot be reliably served by the public internet plus a distant data center.
The math is unforgiving. If your end-to-end budget is 250 milliseconds, computation takes 100 milliseconds, and the API handshake takes 50 milliseconds, you have 100 milliseconds left for data to travel. Cross a continent and that budget is gone before the model does any work.
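The same arithmetic can be written as a small check to run against each workload. The function names are hypothetical, and the round-trip values in the usage lines are made-up examples rather than measurements.

```python
def network_budget_ms(total_ms, compute_ms, handshake_ms):
    """Milliseconds left for data transit after fixed costs are subtracted."""
    remaining = total_ms - compute_ms - handshake_ms
    if remaining < 0:
        raise ValueError("fixed costs already exceed the end-to-end budget")
    return remaining


def fits_budget(round_trip_ms, total_ms, compute_ms, handshake_ms):
    """True if the measured network round trip fits in what is left."""
    return round_trip_ms <= network_budget_ms(total_ms, compute_ms, handshake_ms)


if __name__ == "__main__":
    print(network_budget_ms(250, 100, 50))   # 100
    print(fits_budget(30, 250, 100, 50))     # hypothetical nearby point of presence
    print(fits_budget(140, 250, 100, 50))    # hypothetical distant region
```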
Placing inference on distributed points of presence keeps the heavy lifting in the same region as the user, bypasses public internet congestion, and removes the speed limit that a centralized footprint imposes. The response feels native because, in network terms, it is local.
AI agents are said to need about six interactions before a task is actually accomplished. If the user is far enough away that each interaction takes 100 milliseconds, that is 600 milliseconds right there. This might be acceptable for some applications, but many applications being built at the edge today are latency sensitive. For an AI system integrated with your vehicle, it is critical that data and information transit in a much shorter time.
Distribution does one more thing that matters for security. When inference and enforcement run in the same location, the network can authenticate the user, inspect the API call, and run the model in one place. You close the loop instead of shipping the request across zones to be checked somewhere else.
Security is a property of the architecture, not a separate track
If you do not secure inference, you cannot scale it. If you treat security as a parallel workstream, you pay for it in performance. Both studies say the real failure mode is not the topology; it’s unsecured, untested, and invisible APIs.
The data is jarring: 87% of organizations experienced an API security incident in the past year, up from 76% in 2022. Among teams that had incidents, attacks on APIs linked to AI technologies were the most commonly cited incident type, reported by 42%.
The average incident now costs US$700,000 a year, with the top quartile of incidents costing more than US$1.8 million. APIs linked to AI are not a future risk; they are a present one.
You can’t defend what you can’t see
Visibility is moving the wrong way while this happens. Only 23% of enterprises know which of their APIs return sensitive data, down from 40% in 2022. You cannot defend an estate you cannot see, and AI is expanding that estate faster than manual inventory can track it. Copilots spin up endpoints that never get a security review. Natural-language interfaces make data extraction through prompt injection trivial for an attacker who finds the right unguarded route.
Critically, the entry point is not the whole story. An attacker who breaches a single exposed API will try to move laterally into the components that hold real value: the feature stores that curate AI data and the repositories that hold model weights and logic.
Microsegmentation, the security best practice that isolates individual workloads, is what contains that blast radius. Most organizations have not implemented it, but the ones that have contain attacks materially faster. Legacy network-based segmentation is widely recognized to be complex, laborious, and ineffective. Modern AI-powered microsegmentation addresses those problems, though adopting it means challenging ingrained assumptions. If you build that segmentation into the same network that delivers the inference, you avoid the trade-off between protection and performance.
The implementation pattern is identity-based. Organizations should:
Define access by workload identity rather than by IP address, so a specific inference service can reach a specific feature store and nothing else
Run continuous API discovery to find the abandoned test endpoints still wired to production data
Security then becomes a standing property of the fabric instead of a point-in-time audit that developers route around.
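The identity-based pattern above can be sketched as a default-deny allow-list plus a set difference for discovery. The identity strings, endpoint paths, and function names are hypothetical. In a real deployment the identities would come from an attested source such as workload certificates, and the observed endpoints from traffic analysis.

```python
# Allow-list of (caller identity, resource identity) pairs; anything absent is denied.
ALLOWED = frozenset({
    ("svc:fraud-inference", "store:fraud-features"),
})


def is_allowed(caller: str, resource: str, policy=ALLOWED) -> bool:
    """Default-deny check keyed on workload identity rather than IP address."""
    return (caller, resource) in policy


def unreviewed_endpoints(observed: set, inventory: set) -> set:
    """Endpoints seen in live traffic that are missing from the reviewed inventory."""
    return observed - inventory


if __name__ == "__main__":
    print(is_allowed("svc:fraud-inference", "store:fraud-features"))  # True
    print(is_allowed("svc:fraud-inference", "repo:model-weights"))    # False
    print(unreviewed_endpoints({"/v1/score", "/test/old-score"}, {"/v1/score"}))
```

Because the policy names workloads rather than addresses, it keeps its meaning when a service moves between regions or hosts.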
A two-track problem
The deeper issue that both studies expose is organizational, not technical. Teams build traffic and security on separate tracks, and the seam between them is where things fail.
That seam shows up as a confidence gap: 40% of C-suite leaders report advanced API testing maturity, while only 28% of the DevSecOps teams doing the work agree. Leaders believe the problem is solved, so the foundation stays under-resourced while spend flows to adjacent tools. The distance between what leadership thinks is protected and what actually is protected keeps widening.
Closing that distance requires the teams managing traffic and the teams securing traffic to operate on the same, or at least a converged, control plane. When detection and containment share that plane, an anomalous request pattern typical of prompt injection or DoW can trigger automated isolation of the affected inference endpoint, with no human in the loop. That synchronization is only possible when distribution and security are not separate systems.
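A toy version of that shared control plane is sketched below: the detection path and the routing path read and write the same state, so a flagged endpoint is isolated without a handoff between teams or tools. The class, the token-volume signal, and the spike threshold are all assumptions made for illustration. Real detection of prompt injection or DoW would rely on much richer signals.

```python
from dataclasses import dataclass, field


@dataclass
class ControlPlane:
    """Detection and containment share one view of endpoint state."""
    baseline_tokens_per_min: dict          # endpoint -> expected tokens per minute
    spike_factor: float = 5.0              # hypothetical threshold
    isolated: set = field(default_factory=set)

    def observe(self, endpoint: str, tokens_last_min: int) -> bool:
        """Record a reading; isolate the endpoint if it exceeds the threshold."""
        baseline = self.baseline_tokens_per_min.get(endpoint)
        if baseline and tokens_last_min > self.spike_factor * baseline:
            self.isolated.add(endpoint)
        return endpoint in self.isolated

    def route(self, endpoint: str) -> str:
        """Routing consults the same state that detection writes to."""
        return "blocked" if endpoint in self.isolated else "forward"


if __name__ == "__main__":
    plane = ControlPlane({"/v1/chat": 1000})
    plane.observe("/v1/chat", 2000)
    print(plane.route("/v1/chat"))   # forward
    plane.observe("/v1/chat", 6000)
    print(plane.route("/v1/chat"))   # blocked
```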
What scales from here
Two capabilities separate teams that scale from teams that stall:
Portability. Mature operators are measurably less locked in because they can move workloads across managed GPUs, hosted APIs, and serverless runtimes as cost and capacity shift.
Runtime governance. Control over both the malicious prompts coming in and the data-leaking responses going out is enforced at the network layer rather than bolted onto the application.
Put those on a single platform where distribution, security, and traffic steering share one control plane, and the trade-off between protection and performance goes away. You get one view of what is running, where it runs, and whether it is under attack.
This is the problem that Akamai has been solving for content delivery and security, and is now solving for inference in production. The same global network that places inference close to users also runs the API security, microsegmentation, and distributed denial-of-service (DDoS) protection around it. That reach is what lets one network do both jobs at once.
Fragile AI ROI is a symptom
The cause is infrastructure that has not yet brought adaptable architecture, security, and distribution onto the same foundation. The performance and protection gap is widening, but it is neither structural nor permanent. That gap closes when the teams running the traffic are the teams securing it on infrastructure built to carry inference rather than retrofitted to tolerate it.
Find out more
To go deeper, read The State of AI Inference 2026 and the API Security Impact Study.