AI Inference Is Swallowing the Cloud

Robert Blumofe

Jul 01, 2026

Written by

Robert Blumofe

Dr. Robert Blumofe is Executive Vice President and Chief Technology Officer at Akamai. As CTO, he guides Akamai’s technology strategy, works with Akamai’s largest customers, and convenes technology leaders within the company to catalyze innovation. Previously, he led Akamai’s Platform organization and Enterprise Division, where he was responsible for developing and operating the distributed system underlying all Akamai products and services, as well as creating solutions for major enterprises to secure and improve performance. He holds a Ph.D. in Computer Science from Massachusetts Institute of Technology and a Bachelor of Science from Brown University.

More than a decade after software began eating the world, AI is now eating software. AI is changing the very nature of software: its role in the human ecosystem and how it serves humankind. The consequences of this change are profound and reach into nearly every aspect of human endeavor. The focus here, though, is on the consequences for infrastructure — the cloud, in particular. 

As AI devours software, is the cloud the final course in that meal?

The short answer is yes, but not because AI will eliminate the cloud. Rather, AI will dramatically and irrevocably alter it. Specifically, it is generative AI (GenAI) — encompassing large language models (LLMs), image- and video-generation models, and their orchestration within autonomous agents — that is the most consequential form of AI for cloud infrastructure.

We all know that GenAI is ravenous: ravenous for power, ravenous for computation, and ravenous for storage. So as AI becomes part of everything we do, how do we feed the beast?

The current dominant approach is essentially brute force: Expend massive amounts of capital to build out massive, centralized data centers that host massive AI models. This approach will fail. It is economically unsustainable. It is ecologically disastrous. Most critically, it is architecturally incapable of scaling to meet the looming demand.

We refuse to accept a future defined by a new “World Wide Wait.” The industry must move beyond the illusion of hypercentralized infrastructure. We must be smart about matching infrastructure to use cases, tailoring the technology, and meeting agents where they actually live. The cloud must adapt, decentralize, and evolve … or it will be consumed.

Fortunately, there is a better way to answer this question. Not with brute force. With intelligence.

History repeating itself

In 1998, the World Wide Web was eating the internet. Before the web, the internet was mostly email, telnet (a remote access protocol), FTP (a file transfer protocol), and Usenet (a message board organized by topics, not unlike today’s forums and subreddits). There was no streaming media, no live video conferencing, and no online shopping, travel planning, banking, or healthcare.

As such, it was used almost exclusively by academics, computer scientists, and military and government personnel. The web changed all that. Though some of the web capabilities that we now use every day took longer to emerge, by 1998 (just 7 years after the web became available) websites were sprouting like weeds and everyone was taking notice.

The architectural flaw of the early web

That wild success, however, led to enormous demand. This, in turn, led to some widely publicized failures, which led to the joking refrain: WWW should stand for “World Wide Wait.” Some pundits even predicted that the web was going to kill the internet. 

The problem was that websites were built and deployed in a centralized hub-and-spoke model where every user request had to route all the way back to a central origin server. The brute-force solution, building out massive amounts of centralized infrastructure, would be wildly expensive. And would it even work?

That’s when Professor Tom Leighton and his graduate student, Danny Lewin, stepped up and founded Akamai Technologies. They proposed an alternative to the brute-force solution: an intelligent solution using math, algorithms, and distributed systems. 

Their solution, now known as a content delivery network (CDN), distributes storage and computation to the edge of the internet so that web applications can be delivered from locations that are near the people, devices, and things that are using those applications. 

About a decade later, Akamai successfully brought this same intelligent approach to the problem of cybersecurity. With the clarity of hindsight, we can safely assert that brute force could never have solved these problems.

Fast forward to today. We see AI transforming the web every bit as profoundly (if not more so) as the web transformed the internet. So how can brute force be the answer? How can massive investment in massive, centralized data centers hosting massive models be the answer? 

Geography doesn’t care about capital, and physics doesn’t care about dollars. If every invocation of a GenAI model has to traverse thousands of miles through a network followed by trillions of weights through an AI model, the result will be even worse than the “World Wide Wait.” We will slide inevitably into the very structural gridlock we’ve warned against: the “Large Language Molasses.”
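To put a rough number on the physics, here is a back-of-the-envelope sketch. It assumes light in optical fiber travels at about two-thirds of its vacuum speed and ignores routing, queuing, and model compute entirely, so the result is only a lower bound. The distances are illustrative, not measurements.

```python
# Back-of-the-envelope propagation delay. All distances are illustrative assumptions.
SPEED_OF_LIGHT_KM_S = 299_792.458  # speed of light in vacuum
FIBER_FACTOR = 2 / 3               # assumption: light in fiber moves at roughly 2/3 of c


def round_trip_ms(distance_km: float) -> float:
    """Lower bound on network round-trip time over fiber, in milliseconds."""
    one_way_s = distance_km / (SPEED_OF_LIGHT_KM_S * FIBER_FACTOR)
    return 2 * one_way_s * 1000


if __name__ == "__main__":
    for km in (50, 1_000, 5_000):
        print(f"{km:>5} km -> at least {round_trip_ms(km):.1f} ms per round trip")
```

No amount of capital changes that floor. Only moving the work closer to the user does.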

We don’t have to. Not if we’re smart.

Generative AI’s true superpower: Tools

For most of us, our first interaction with GenAI came in the form of a chatbot like ChatGPT. This was little more than a simple chat interface on top of an LLM, but it captured the popular imagination like very few other innovations have throughout history. That was just a few years ago, and GenAI has already gone far beyond the chatbot. We’ve swiftly moved from AI being the application to AI powering the application.

When AI is part of an application, we get agents. Agents can write code; agents can conduct research; agents can summarize email and draft replies; agents can find a great restaurant and make dinner reservations; agents can help you plan a trip, make the hotel, airline, and rental-car reservations, and give you turn-by-turn directions from the rental-car center to the hotel; and agents can help you shop for a great shirt that will go with your favorite pants (and make you look a bit younger for your upcoming college reunion). 

Pretty much everything we do today by navigating web pages, filling out forms, and clicking links can and will be replaced by agents.

Deconstructing the anatomy of an agent

But an AI agent is not just a super powerful LLM with advanced LLM-reasoning capabilities. An agent is a system with many components, one of which is an LLM. An LLM typically plays the central role, managing the natural-language communication and making the decisions that guide the conversation and determine the sequence of steps. But an agent may have many other AI models. 

For example, a fashion-consultant agent may use image or video generation to show you how a garment will look on you. And an agent will have tools: non-AI components that allow it to search the web, read and write files, run programs on the command line, and invoke APIs. For example, that fashion-consultant agent might use a tool to access information about your past purchases and preferences so it can show you a shirt in your favorite color paired with those pants you recently bought.

Architectural intelligence dictates a clear rule: When building an agent, put as much functionality as possible into the non-AI tools. We must reject the impulse to use AI for everything. As powerful as today’s AI is, it can’t do everything, and even for the things it can do, it’s almost always wildly inefficient. 

Consider arithmetic. It’s actually quite amazing that LLMs can often produce correct answers to arithmetic problems, but they sometimes get the wrong answer. Moreover, even when they do get the right answer, they’re expending many orders of magnitude more compute and energy than a calculator. 

The same holds elsewhere: we do not need a trillion-parameter model to scan text when a regular expression works perfectly, nor do we need deep learning to map a route when a shortest-path algorithm is already optimized for the job.
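For a sense of how small these non-AI tools can be, here is a sketch of all three using only the Python standard library. The order-ID pattern and the function names are hypothetical, chosen for illustration. The route finder is a plain Dijkstra implementation.

```python
import heapq
import re
from typing import Dict, List, Tuple

Graph = Dict[str, List[Tuple[str, float]]]

# Hypothetical ID format for illustration: "ORD-" followed by exactly six digits.
ORDER_ID = re.compile(r"\bORD-\d{6}\b")


def add_exact(a: int, b: int) -> int:
    """Arithmetic tool: exact integer addition, no model involved."""
    return a + b


def find_order_ids(text: str) -> List[str]:
    """Text-scanning tool: a regular expression instead of a language model."""
    return ORDER_ID.findall(text)


def shortest_distance(graph: Graph, start: str, goal: str) -> float:
    """Routing tool: Dijkstra's algorithm. Returns inf if goal is unreachable."""
    best = {start: 0.0}
    queue: List[Tuple[float, str]] = [(0.0, start)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == goal:
            return dist
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry; a shorter path was already found
        for neighbor, weight in graph.get(node, []):
            candidate = dist + weight
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                heapq.heappush(queue, (candidate, neighbor))
    return float("inf")


if __name__ == "__main__":
    roads: Graph = {"A": [("B", 1.0), ("C", 5.0)], "B": [("C", 1.0)], "C": []}
    print(add_exact(2, 3))
    print(find_order_ids("Refund ORD-123456 and ORD-000042, not ORD-12"))
    print(shortest_distance(roads, "A", "C"))
```

Each of these returns the same answer every time, which is exactly what an LLM cannot promise.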

This is the true superpower of modern GenAI: LLMs can invoke tools. If not for that, the industry would still be blissfully unaware of the word “agentic.” We must give the LLM the tools it needs and let it do only what it is uniquely good at. We must stop expending megawatts to solve problems that can be solved with milliwatts.

The great compute misallocation

From megawatts to milliwatts, when it comes to infrastructure, we need to be smart about matching infrastructure to use cases. The recent AI mania has fueled massive investment in massive, centralized data centers hosting massive models on massive, dense GPU clusters. 

Recent cloud investments have been dominated by GPUs, and we’ve even seen the emergence of a new kind of cloud, the neo-cloud, focused almost exclusively on GPU infrastructure for AI. But not everything requires a dense cluster of GPUs.

Dense GPU clusters made sense and still make sense when the primary AI use case is training, especially the pre-training of foundation LLMs. Of course, this was and continues to be a prerequisite for everything that’s happening in GenAI today. 

But what about actually using the models, otherwise known as inference? That is the way we realize AI’s real-world value. And for many inference use cases, dense, centralized GPU clusters are an architectural mismatch.

Indeed, when doing inference with a massive GenAI model, a dense GPU cluster is likely the most efficient and maybe the only way to get acceptable levels of performance. But not everything requires a massive model. Many workloads function best with specialized models that are significantly smaller than “Ask me anything” models.

An AI application running inside a car to manage climate control and entertainment systems, for instance, does not need to understand theoretical physics. Just as an AI agent running in the cloud whose sole job is to help a patient schedule a dentist appointment has no reason to compose sonnets or summarize the plot of every episode of “M*A*S*H.” 

Using a trillion-parameter model for these narrow tasks is profoundly wasteful. It is far smarter to deploy smaller, specialized models that run efficiently on less expensive GPUs, or even standard CPUs.
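One way to picture that choice is a catalog lookup that picks the smallest model able to handle a task. This is a sketch only: the model names, parameter counts, hardware labels, and skill tags below are invented for illustration and do not describe real products.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class ModelOption:
    name: str
    params_billions: float  # illustrative size, not a real model's figure
    hardware: str
    skills: FrozenSet[str]  # "*" marks a general-purpose model


# Hypothetical catalog: one general model and two narrow specialists.
CATALOG: List[ModelOption] = [
    ModelOption("general-frontier", 1000.0, "dense GPU cluster", frozenset({"*"})),
    ModelOption("scheduling-small", 3.0, "single modest GPU",
                frozenset({"appointment_scheduling"})),
    ModelOption("cabin-tiny", 0.5, "CPU",
                frozenset({"climate_control", "media_control"})),
]


def pick_model(task: str) -> ModelOption:
    """Return the smallest model in the catalog that can handle the task."""
    candidates = [m for m in CATALOG if task in m.skills or "*" in m.skills]
    return min(candidates, key=lambda m: m.params_billions)


if __name__ == "__main__":
    for task in ("climate_control", "appointment_scheduling", "theoretical_physics"):
        choice = pick_model(task)
        print(f"{task} -> {choice.name} on {choice.hardware}")
```

The general model stays in the catalog as a fallback. It simply stops being the default.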

Why the agentic footprint is inherently hybrid

Furthermore, because an agent is an LLM integrated with tools, its infrastructure footprint is inherently hybrid. While the central LLM may require specific acceleration, the tools it invokes are not numerically dense and have no need for a GPU. 

Consider, again, the fashion-consultant agent. It may leverage advanced image or video generation to demonstrate how a garment looks on a person, but it relies heavily on traditional tools to pull past purchase history, access user preferences, and query inventory databases. These tools run on CPUs, demand significant storage, and rely on constant communication with remote services.

An AI agent's infrastructure needs cannot be reduced to a single, one-size-fits-all infrastructure type. They are fundamentally hybrid: a dynamic combination of GPU, CPU, storage, and communication.
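As a rough illustration, the fashion-consultant agent can be written down as a list of components, each tagged with the resource it leans on most. The component names and the one-resource-per-component tagging are simplifying assumptions, since a real tool usually touches several resources at once.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Component:
    name: str
    resource: str  # dominant resource (simplified): "gpu", "cpu", "storage", or "network"


# Hypothetical breakdown of the fashion-consultant agent described above.
FASHION_AGENT: List[Component] = [
    Component("central LLM", "gpu"),
    Component("image/video generation", "gpu"),
    Component("purchase-history lookup", "storage"),
    Component("user-preference lookup", "storage"),
    Component("inventory query to remote service", "network"),
    Component("tool orchestration", "cpu"),
]


def footprint(components: List[Component]) -> Counter:
    """Count how many components depend mainly on each resource type."""
    return Counter(c.resource for c in components)


if __name__ == "__main__":
    for resource, count in sorted(footprint(FASHION_AGENT).items()):
        print(f"{resource}: {count} component(s)")
```

Even in this toy breakdown, GPU-bound components are a minority of the whole.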

The cloud has always triumphed because it was built to be a combination of flexible, hybrid compute and networking. Just as we do not use AI for every task, we must stop forcing GPUs onto every workload. The hardware must bend to the use case, not the other way around.

The myth of Agentlandia

It is tempting to imagine AI agents existing in their own insulated realm, chattering among themselves and getting things done with little direct interaction with us humans. In this myth, the agents are massive LLMs, running on massive GPU clusters, in a handful of hypercentralized data centers. But this myth does not survive scrutiny.

The reality is that today's most popular agent frameworks — like OpenClaw, Hermes Agent, Claude Cowork, and Claude Code — are commonly installed and executed directly on the desktop. While they may query foundation LLMs running in centralized clouds, their other components run on the desktop. 

Moreover, it’s become increasingly popular to use alternative LLMs that also run on the desktop. These agents can also run in the cloud, any cloud, with a strong preference for a location near the user.

A parallel trend is playing out in the enterprise. 

Companies are building and deploying agents for use by their employees and customers, often with the help of agent frameworks such as LangChain, Pydantic AI, n8n, and CrewAI. These agents can run in any cloud of the company’s choosing and may be configured to use massive, centrally hosted LLMs from foundation-model providers. But again, it’s becoming increasingly popular to use alternative LLMs, specialized for the use case and running in the company’s cloud of choice.

There is no isolated “Agentlandia.” AI agents will run anywhere and everywhere: on edge devices, in cars, on desktops, and across highly distributed cloud environments chosen for proximity to the user.

Mapping the new distributed communication web

To get anything done, these agents must interact constantly with a distributed web of services. They search remote websites, query local databases, and invoke remote APIs. A single user request may trigger a flurry of back-and-forth communication between the LLM, local tools, and remote services.
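A simple model shows why that flurry matters. Assume the calls are sequential, so each step waits for the previous one, and assume a fixed round-trip time per call. Both are simplifications, and the numbers below are illustrative, not measurements.

```python
def request_latency_ms(round_trips: int, rtt_ms: float,
                       compute_ms_per_step: float = 0.0) -> float:
    """Total wall-clock time when every step waits for the one before it."""
    return round_trips * (rtt_ms + compute_ms_per_step)


if __name__ == "__main__":
    steps = 20  # assumption: one user request fans out into 20 sequential calls
    far = request_latency_ms(steps, rtt_ms=80.0)   # distant, centralized service
    near = request_latency_ms(steps, rtt_ms=5.0)   # nearby, distributed service
    print(f"far: {far:.0f} ms, near: {near:.0f} ms")
```

Network delay that is tolerable for a single page load is multiplied by every step an agent takes.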

Furthermore, agents must interact with other agents across different cloud ecosystems, all while maintaining rich, multimodal, conversational loops with human beings. The complete picture is a dizzying, highly interconnected, heavily utilized, and massively distributed web of communication channels spanning the globe.

The industry challenge: Meeting agents where they live

Infrastructure must meet agents where they live, and that means everywhere. We cannot force the agentic ecosystem into centralized clusters. We must support it with distributed infrastructure in the exact locations where the models run, where the tools execute, and where the users interact. These locations must provide low-latency, high-bandwidth connections to keep the ecosystem moving.
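As a toy illustration of placing work near the user, the sketch below picks the geographically closest of a few hypothetical locations using the haversine great-circle distance. The location list is made up and the coordinates are approximate. A production system would more likely steer on measured latency and load than on raw distance.

```python
import math
from typing import Dict, Tuple

Coord = Tuple[float, float]  # (latitude, longitude) in degrees

EARTH_RADIUS_KM = 6371.0

# Hypothetical edge locations with approximate city coordinates.
LOCATIONS: Dict[str, Coord] = {
    "frankfurt": (50.11, 8.68),
    "chicago": (41.88, -87.63),
    "singapore": (1.35, 103.82),
}


def haversine_km(a: Coord, b: Coord) -> float:
    """Great-circle distance between two points on a spherical Earth."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def nearest_location(user: Coord) -> str:
    """Name of the location with the smallest great-circle distance to the user."""
    return min(LOCATIONS, key=lambda name: haversine_km(user, LOCATIONS[name]))


if __name__ == "__main__":
    print(nearest_location((48.86, 2.35)))    # a user in Paris
    print(nearest_location((40.71, -74.01)))  # a user in New York
```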

This does not imply that centralized data centers are going away: Dense GPU clusters remain the correct solution for heavy training workloads and specific massive inference tasks. Rather, centralized hubs must be augmented with highly distributed edge infrastructure. The future of the cloud is a flexible continuum stretching from the core to the edge.

The brute-force build-out of hypercentralized megastructures simply does not align with the operational reality of modern GenAI. It is wasteful, it will not scale, and it will inevitably stall progress in the “Large Language Molasses.”

Our challenge to the industry is to build a cloud for the AI era that is smarter, more adaptable, and fiercely flexible. We must demand highly flexible hybrid combinations of CPU, GPU, storage, and connectivity, deployed across a fluid continuum of locations. The cloud must evolve to support the distributed reality of AI — or risk being entirely consumed by it.
