Akamai acquires LayerX, delivering end-to-end security and real-time AI usage control to any browser. Get details

Back Products Close

Cloud Computing

Cybersecurity

Content Delivery

See all products

Our Infrastructure

Global Services

Back Cloud Computing Close

Artificial intelligence (AI)

Akamai Inference Cloud

Storage

Object Storage

Block Storage

Backups

Databases

Managed Databases

compute

GPU

CPU

Kubernetes

App Platform

Accelerated Compute

Serverless

Akamai Functions

Networking

Cloud Firewall

DNS Manager

NodeBalancers

Private Networking

View cloud pricing

Explore plans and pricing that fit your needs — from small projects to global-scale deployments.

See pricing

Get started with Akamai Cloud

Sign up today and unlock cloud computing, edge, and AI tools built for your business.

Sign up

See all Cloud Computing

Back Cybersecurity Close

app and api security

API Security

App & API Protector

Firewall for AI

Client-Side Protection & Compliance

Bot & Agent Control

Account Protector

Content Protector

Bot Manager

AI Brand Presence

Segmentation

Akamai Guardicore Segmentation

zero trust security

Akamai Workforce Protector (formerly LayerX)

Secure Internet Access

Enterprise Application Access

Akamai MFA

Identity, Credential, and Access Management

infrastructure security

Edge DNS

Prolexic

IP Accelerator

DNS Posture Management

Brand Guardian

Get started with Security

Protect the applications that drive your business — every day, every time.

Contact Sales

See all Cybersecurity

Back Content Delivery Close

Application performance

Ion

API Acceleration

IP Accelerator

Media Delivery

Adaptive Media Delivery

Download Delivery

Edge Applications

EdgeWorkers

EdgeKV

Image & Video Manager

Media Services Live

Cloudlets

Cloud Wrapper

Global Traffic Management

Monitoring, reporting and testing

Data Stream

mPulse

CloudTest

Get started with Content Delivery

Trust the agility and scale of Akamai to help you flawlessly deliver extraordinary digital experiences.

Contact Sales

See all Content Delivery

Back Solutions Close

Cloud Computing

Serverless

Media

SaaS

Gaming

See all Cloud Computing

security

Frontier AI Security Risks

Akamai Application Protection Platform

Cybersecurity Compliance

Ransomware Protection

Secure Apps and APIs

DNS Delivery and Security

Zero Trust

DDoS Protection

Bot & Agent Control

Identity, Credential and Access Management

See all Cybersecurity

content delivery

App and API Performance

Media Delivery

See all Content Delivery

industry solutions

Media and Entertainment

Retail, Travel, and Hospitality

Financial Services

Healthcare and Life Sciences

Public Sector

Defense

Games

Online Sports Betting and iGaming

Service Providers

See all Industry Solutions

Back Pricing Close

Security and Delivery

Get started

Contact Sales

Free trials

Cloud pricing

GLOBAL PRICING

North America pricing

Europe pricing

Asia Pacific pricing

South America pricing

SPECIFIC LOCAL PRICING

Jakarta pricing

See all pricing

Cloud pricing

Try Akamai Cloud with US$100 in credits*

Deploy faster with global cloud infrastructure — no surprise bills, no lock-in, and transparent pricing across every data center.

Try now

*See Promotion Redemption Rules & Conditions

Back Developers Close

Cloud developers

Developer hub

Akamai GitHub repo

docs and guides

Cloud docs

Guides and tutorials

cloud marketplace

Developer apps

Get started with Akamai Cloud

Sign up today and unlock cloud computing, edge and AI tools built for your business.

Sign up

Back Resources Close

What’s new

Akamai blog

Events and workshops

Learning

White papers, ebooks, videos, product briefs

Customer stories

Training and certifications

Cybersecurity Research

Akamai Security Intelligence Group (SIG)

State of Internet (SOTI) reports

Partners

Partner with Akamai to innovate, scale, and grow your advantage

Channel Partners

Partner Portal

Partner Stories

Technology Partners

Technology Partners Directory

Log in

Back Log in Close

Cloud Manager
Manage your cloud computing services

Back Log in Close

Control Center
Manage your security and delivery services
- Docs
- Sales
- Support
- Under Attack ?
English
Back Language Close
- English
- Deutsch
- Español
- Français
- Italiano
- Português
- 中文
- 日本語
- 한국어

Create account

Under Attack?

Akamai Cloud

Akamai Security and Delivery

Connect with our Sales team to discuss your business needs and find the right solutions.

Contact Sales

Stop Treating Your LLMs Like Web Servers

Jun 18, 2026

Du’An Lightfoot

Written by

Du’An Lightfoot

Du'An Lightfoot is a Senior AI Engineer at Akamai and international speaker who brings a unique perspective to AI transformation through his unconventional journey from USAF veteran in Afghanistan to technical innovation leader at Akamai.

Self-hosting a large language model (LLM) endpoint is straightforward to start, but production traffic exposes bottlenecks that traditional web service monitoring may miss.
The performance of LLM inference is dictated by GPU memory bandwidth, key value cache availability, and batching dynamics, rather than traditional metrics like CPU use or network request volume.
Under load, an LLM endpoint may continue returning 200 OK while latency quietly climbs to unusable levels.
vLLM’s continuous batching makes GPU use far more efficient, but batching also has a cliff where throughput flattens and tail latency rises.
Production inference needs bounded admission, context limits, and queueing outside the engine so degraded requests can be controlled before they become silent failures.

This year I’ve been focused on self-hosted AI, looking past agent runtimes at the underlying inference layer. My goal has been to understand how to self-host models that actually fit on a GPU along with their context, while providing a great user inference experience to customers. At scale, that gets complicated quickly.

I spoke on this topic at KubeCrash Spring 2026 and got more questions from developers than I had time to answer. Some had privacy requirements, some were watching their per-token bill climb every month, and others had compliance rules that don't allow prompts to leave their own infrastructure. They all boiled down to the same question: What does it actually take to run LLMs in-house?

I’m a builder, so my answer to this challenge was simple: Build it. I developed a real LLM endpoint on a real GPU on Kubernetes. To take it even further, I created fluctuating workloads. Then I broke the whole thing.

Here’s what I learned.

Start with the right hardware assumptions

Another common question I get is about the need to use a GPU for inference versus a CPU, especially when teams already have them and they’re less expensive.

It’s a fair question, and the answer comes down to two things that show up later in this post as failure modes: parallelism and memory bandwidth.

The first is parallelism. A CPU has a handful of powerful cores built to handle different tasks quickly and successively. A GPU operates differently; it has thousands of smaller cores all built to do the same math on a lot of data at the same time. Inference is mostly one operation, matrix multiplication, running billions of times — exactly what a GPU is built for. Because a CPU will try to do that work sequentially, it will quickly fall behind.

The second reason is memory bandwidth. To generate text, the GPU has to read the entire model out of memory for every single token it produces (versus performing math on it).

To put that into context, a modern server CPU pulls approximately 50 to 100 GBps out of its RAM. The GPU used for this build moves 480 GBps, and high-end data center cards move over 3,000 GBps. The difference is a response that streams versus one that crawls.

In fact, memory bandwidth is the thread that runs through every failure mode in this post.

Use vLLM for production inference patterns

Many developers are running Ollama on a laptop. That works for a development environment, but for production workloads, you need high throughput and the ability to serve LLMs to hundreds of concurrent users simultaneously.

A virtual large language model (vLLM) applies a memory management technique called PagedAttention, which pages memory on demand instead of reserving fixed memory chunks for each user. As a result, vLLM can batch many user prompts through the same forward pass without queue bottlenecks.

vLLM also provides parallelism to distribute workloads across multiple GPUs. The main patterns are:

Tensor parallelism (TP) for low latency
- Imagine having a 70B parameter model. Running this model at full precision would require a GPU that has 140 GB of VRAM. Quantization could help make it fit. However, if compute is constrained across smaller GPUs, TP can shard every layer across all available GPUs.
- Then, when a computation happens, all GPUs work on their specific slice of the math simultaneously and synchronize. This reduces the time it takes to generate each token, making the model feel fast.
Pipeline parallelism (PP) for larger models. If a model is so big that it spills over into a second physical server, TP struggles because the network cables connecting the servers aren’t fast enough to handle the constant, microscopic syncing. PP solves this by slicing the model sequentially: Server A handles the first set of layers, then hands the baton to Server B, which handles the remaining layers.
Data parallelism (DP) for high traffic. If a model does fit on one GPU, but 1,000 users are hitting it, DP spins up complete, independent replicas of the model on every available GPU. This is what I used recently in a workshop.
Prefix caching for repeated prompts. Providers often offer prompt caching as a way to save on costs and improve application performance. Imagine an application sends a large system prompt every time the model is invoked. This gets expensive. Prefix caching stores the mathematical state of that prompt and shares it across all users, dropping time to first token (TTFT) to near zero.

Deploy the endpoint, then test beyond the first request

To get started, I deployed vLLM on Linode Kubernetes Engine (LKE). The setup used one GPU node, an NVIDIA RTX™ 4000 Ada Generation with 20 GB of VRAM, running Qwen2.5-7B-Instruct. I used Terraform as the infrastructure as code to bring up the cluster, created a few manifests to deploy the workload, and had an OpenAI-compatible endpoint answering requests about 10 minutes later.

# manifests/vllm-deployment.yaml
args:
  - --model=Qwen/Qwen2.5-7B-Instruct
  - --download-dir=/models
  - --api-key=$(VLLM_API_KEY)
  - --host=0.0.0.0
  - --port=8000

The entire serving config is only five lines of flags. We’ll revisit these at the end of this blog post to see that the fix for everything I am about to demonstrate is two more lines added to this block.

Once it was deployed, I sent a prompt. I received a clean response, and it took about 250 ms to receive the first token. The TTFT is a measurement that matters when serving inference or using a provider's inference in an application. At 250 ms, this felt fast. I thought I was done.

But not so fast. One request tells you almost nothing about how a model serves traffic. The challenges appear when the second, third, and fiftieth users show up, especially as larger contexts start passing to the model.

Watch for silent latency failures

After deploying this and overloading the LLM endpoint, the first thing I noticed was that it didn’t return errors. This is unlike traditional applications.

When a web service fails under load, it throws 500 errors and you figure out how to handle it in your application. But an LLM endpoint under load keeps returning 200 OK and just takes 30 seconds to respond. Dashboards can say everything is healthy with the API and cluster — meanwhile, every user is waiting. While this degradation is real, it’s invisible to traditional metrics monitoring.

There is not a single cause for why this is happening. And if you do not understand why it slows down, you cannot tell a slow endpoint from a broken one, and you cannot fix either of them.

Coming from a networking and infrastructure background, most of my instincts about scaling services helped here. They got in the way, though, when I attempted to troubleshoot the issue.

	Web service	LLM inference
Bottleneck	CPU, network	GPU memory bandwidth
Per-request cost	Roughly constant	Grows with context length
Concurrency	Stateless	Per-request state held in GPU memory
Failure under load	Returns 500	Latency degrades silently
Scaling	Add pods	Add pods and GPUs

A web request is cheap, stateless, and predictable, but an LLM request holds state on the GPU for its entire life. It costs more as the conversation grows, and it competes with every other in-flight request for the same fixed pool of memory. As a result, treating it like a web service leads to expensive mistakes.

Understanding the 2 phases of inference

In AI, inference is the process of feeding a trained machine learning model a prompt and running it through its neural network so the model can make a prediction (generate an output). In generative AI, this runs in two phases — including prefill (key value cache) and decode — and each phase bottlenecks or fails for different reasons (Figure 1).

Prefill (key value cache): This phase processes the entire prompt to create the key value (KV) cache. It turns text into numbers and maps the relationships between tokens, such as the prompt, system prompt, and tools. It runs in parallel across every input token at once. It is compute heavy, and it drives TTFT.
Decode: This phase generates the response one token at a time. This happens after prefill and is sequential, meaning that each new token depends on every token before it. During this phase, decode is not limited primarily by compute, but by memory bandwidth. Every single token requires reading the entire model out of VRAM.

Here’s the math: The RTX™ 4000 GPU has about 480 GBps of memory bandwidth. The Qwen2.5-7B-Instruct model (at 16-bit precision) is roughly 14 GB. That puts the theoretical ceiling at approximately 34 tokens per second for a single request. That ceiling exists because the GPU is bound by memory bandwidth and can’t read the weights any faster.

The math calculates the theoretical ceiling for generating the first token. In real-world scenarios, generation speed will start slightly slower — closer to 28–30 tokens per second — and will continue to drop as the model writes its response.

480 GB/s ÷ 14 GB = 34.28.

This happens because of how decode works. To generate a new token, the GPU has to read the 14 GB of model weights plus the KV cache it built from the initial prompt. Every time the model generates a new token, it appends that token's math to the KV cache.

The cache gets heavier as the response gets longer, and the GPU has to pull more data across that same 480 GBps pipe each step. That’s why generation speed (tokens per second) gradually slows down the longer the output gets.

Failure mode 1: Throughput flattens while latency climbs

If a single request tops out at around 34 tokens per second, how does anyone serve an LLM to real traffic? The answer is continuous batching, which might be the most important optimization in modern inference.

Think of it like this: When one user sends one request, the GPU reads all 14 GB of model weights just to generate a single token for that one person. For the next token, it repeats the entire process. This giant memory read is the most expensive thing the GPU does and is actually one of the worst ways to use the hardware because, in this scenario, the rest of the GPU is underused.

Continuous batching fixes that. The GPU still runs one forward pass per output token, but now vLLM packs every in-flight request into that same pass. The 14 GB read still happens once, but the difference is that it now produces a token for every user in the batch at the same time. The system was already paying for that read, and batching stopped it from going to waste.

The measurements are shown in this table:

Concurrent users	Aggregate throughput	Per-user speed
1	34 tok/s	34 tok/s
32	~525 tok/s	~16 tok/s
64	~1,041 tok/s	~16 tok/s

There are two stories in this table.

The aggregate column is the real win. One user gets 34 tokens per second out of the GPU. Sixty-four users get more than 1,000 with the same GPU. That’s roughly a 30x jump in what the hardware produces and the entire reason self-hosting an LLM makes economic sense today.

The per-user column is the trade-off. A single user inside a batch of 32 only sees about 16 tokens per second, which is slower than the 34 they would get if they had the GPU all to themselves. That means batching doesn't make any one request faster, but it does let the system serve a lot more people for the same amount.

For a real chat product, that trade is typically the goal — 16 tokens per second is still faster than anyone can read, and now the system is serving 32 people instead of one.

Batching does have a limit, though. As users keep arriving, requests will start showing up faster than the GPU can prefill them. This causes aggregate throughput to stop climbing because the GPU is already full. That extra load has nowhere to go, so it backs up into a queue. Once that happens, tail latency (that is,the slowest 5% of requests) starts to climb — and it doesn’t stop.

It’s known as the batching cliff: Throughput goes flat and latency keeps rising. The model and GPU are fine, but the cost shows up in the queue.

Failure mode 2: The KV cache runs out of room

The second failure is about memory.

Every in-flight request needs a KV cache, which is the running attention state for its conversation. That cache lives in GPU memory, in the same 20 GB pool that the 14 GB of model weights already occupy. In this setup, after the model loads, what’s left is a pool of about 38,800 tokens of KV cache. That pool is shared across every request in flight (Figure 2).

That pool is shared across every request in flight (Figure 2).

Fig. 2: Memory allocation: 14 GB for static model weights, 4 GB for the KV cache pool, and 2 GB for transient spike headroom

Here’s the math. Thirty-two concurrent users, each sending a 4,000-token prompt, demand 128,000 tokens of cache. The pool holds 38,800. I asked for more than three times what exists.

curl -s $ENDPOINT/metrics | grep cache_config_info
# vllm:cache_config_info{block_size="16",num_gpu_blocks="2426",...} 1.0
#
# 2426 blocks × 16 tokens per block = 38,816 tokens

This didn’t result in an error, but it was a slow and painful collapse in performance. The first nine or so requests landed in a few seconds, but then the TTFT climbed to 10 seconds, 20 seconds, 30 seconds, and finally flattened out around 55 seconds for everyone else. Every request past the ninth request waited for more than a minute. In production, that’s a P0 incident, but the error rate would have read zero the entire time.

Fix degraded inference with bounded admission

Compute and memory are two different bottlenecks, but they both showed the same symptom of latency degrading quietly.

The fix for both scenarios is the same: Cap what the engine accepts and queue everything else outside it.

In vLLM, that’s mostly one flag: --max-num-seqs. Set it to eight, and the engine admits a maximum of eight requests at a time. The ninth waits at the door, greatly improving the application's performance.

# Before — the naive deployment
args:
  - --model=Qwen/Qwen2.5-7B-Instruct
  - --api-key=$(VLLM_API_KEY)
  - --host=0.0.0.0
  - --port=8000

# After — bounded admission
args:
  - --model=Qwen/Qwen2.5-7B-Instruct
  - --api-key=$(VLLM_API_KEY)
  - --host=0.0.0.0
  - --port=8000
  - --max-num-seqs=8       # cap concurrent in-flight requests
  - --max-model-len=8192   # cap context per request

A queue inside the engine grows without bound, and there isn’t much control over it. A queue outside the engine at the gateway, however, can be managed. It can shed, prioritize, rate-limit, or return a fast, honest “try again” instead of a silent 55-second hang — all while serving the same load on the same hardware. The only difference is where the work waits.

When the bounded config runs against the same workload, the admitted requests stay near baseline latency. The rest wait a predictable amount of time, the P95 stops climbing and plateaus, and the model never goes down. Instead, it just stops accepting work it can’t do well. This is the pattern needed by any production LLM serving, and vLLM gives teams the flags to do it cleanly.

Design for inference failure from day 1

Self-hosting an LLM isn’t hard to start and privacy, cost, and compliance are all valid reasons to take this approach. However, it can be a challenge to run under real traffic at scale.

The difficulty isn’t where many teams expect — it’s not in the deployment

Rather, the difficulty is in understanding that endpoints can be fully degraded while every health check is green. It also includes knowing whether the system is out of compute or memory, because the symptoms are identical but the fix depends on the cause. The recovery pattern is bounded admission and queueing at the gateway. This is something you design on day one, not something you bolt on after the first bad afternoon.

Try it on Akamai Cloud

Everything here is reproducible. The full setup, including the Terraform, Kubernetes manifests, load-testing scripts, and monitoring stack, is in a public repo.

Akamai Cloud has free sign-up with US$300 in credits.

Stand up the cluster, deploy the model, then put it under load and watch it break. That’s where the learning happens.

Get started

Jun 18, 2026

Du’An Lightfoot

Written by

Du’An Lightfoot

View cloud pricing

Get started with Akamai Cloud

Get started with Security

Get started with Content Delivery

Security and Delivery

Cloud pricing

Cloud pricing

Try Akamai Cloud with US$100 in credits*

Get started with Akamai Cloud

Partners

Akamai Cloud

Akamai Security and Delivery

Stop Treating Your LLMs Like Web Servers

Key takeaways

Start with the right hardware assumptions

Use vLLM for production inference patterns

Deploy the endpoint, then test beyond the first request

Watch for silent latency failures

Understanding the 2 phases of inference

Failure mode 1: Throughput flattens while latency climbs

Failure mode 2: The KV cache runs out of room

Fix degraded inference with bounded admission

Design for inference failure from day 1

Try it on Akamai Cloud

Related Blog Posts