Introduction to Disaggregated Inference: Why It Matters¶
The explosive growth of generative AI has placed unprecedented demands on GPU infrastructure. Enterprises and GPU cloud providers are deploying large language models at scale, but the underlying inference serving architecture often can't keep up.
As context windows grow into the millions of tokens, as reasoning models think longer before responding, and as agentic workflows chain multiple model calls together, a fundamental architectural shift is taking hold: disaggregated inference.
In this first blog post on disaggregated inference, we will discuss how it differs from traditional serving, why it matters for platform teams managing GPU infrastructure, and how the ecosystem—from NVIDIA Dynamo to open-source frameworks—is making it production-ready.
The Two Phases of LLM Inference¶
To understand disaggregated inference, you first need to understand how large language models generate responses. Every inference request goes through two distinct phases, each with fundamentally different hardware demands.
Prefill Phase¶
This is also called the "context phase". It processes the entire input prompt at once. This is a compute-bound operation and requires massive parallel processing power to ingest and analyze all the input tokens and produce the first output token. If you've ever noticed a brief pause before a model starts streaming its response, that's the prefill phase at work.
Decode Phase¶
This is also called the "generation phase". It produces output tokens one at a time, sequentially. This is a memory-bandwidth-bound operation—each new token generation requires reading model weights and the KV (key-value) cache from GPU memory, but the actual computation per token is relatively light. The bottleneck shifts from raw compute to how fast data can be moved in and out of memory.
Prefill is compute-bound (it requires raw GPU FLOPs). Decode is memory-bandwidth-bound (it needs fast memory transfers and high-speed interconnects). A single GPU cannot be simultaneously optimized for both phases.
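The compute-bound vs. memory-bound split can be made concrete with a back-of-the-envelope arithmetic-intensity calculation (FLOPs per byte moved from memory). The sketch below uses a rough cost model and illustrative numbers (a hypothetical 70B-parameter model in fp16), not measured values:

```python
# Back-of-the-envelope arithmetic intensity (FLOPs per byte moved) for the
# two inference phases. The cost model and numbers are illustrative.

def arithmetic_intensity(tokens_processed: int, params_b: float) -> float:
    """FLOPs per byte for one forward pass over `tokens_processed` tokens.

    Rough model: ~2 FLOPs per parameter per token, and the full set of
    fp16 (2-byte) weights is read from memory once per pass.
    """
    flops = 2 * params_b * 1e9 * tokens_processed
    bytes_moved = 2 * params_b * 1e9  # weights read once per pass
    return flops / bytes_moved

# Prefill: thousands of tokens amortize one weight read -> compute-bound.
prefill = arithmetic_intensity(tokens_processed=4096, params_b=70)
# Decode: one token per pass -> memory-bandwidth-bound.
decode = arithmetic_intensity(tokens_processed=1, params_b=70)

print(f"prefill intensity: {prefill:.0f} FLOPs/byte")  # 4096
print(f"decode intensity:  {decode:.0f} FLOPs/byte")   # 1
```

Under this simplified model, intensity scales directly with the number of tokens processed per weight read, which is why a GPU that is saturated by prefill sits idle on compute during decode.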
Aggregated vs. Disaggregated Inference: What Changes?¶
In traditional aggregated (co-located) serving, a single GPU or a homogeneous pool of GPUs handles both the prefill and decode phases for every request.
This means that the model weights, the KV cache, and all the computation live in the same place. This is simple to set up, but it creates a fundamental tension: hardware optimized for one phase is inevitably suboptimal for the other.
When prefill and decode run on the same GPU, they compete for the same resources. A compute-heavy prefill job can stall the decode of an already-in-progress response, increasing time-per-output-token (TPOT) and degrading the user experience. Scaling this setup means adding more identical GPUs, regardless of whether the bottleneck is compute or memory—a costly and inefficient approach.
Disaggregated inference takes a fundamentally different approach. It separates the prefill and decode phases onto dedicated, independently scalable pools of GPUs. Compute-optimized GPUs handle the prefill phase, while memory-optimized GPUs handle decode. An orchestration layer manages the transfer of the KV cache between pools and routes requests intelligently.
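The request path described above can be sketched in a few lines. The `PrefillPool` and `DecodePool` classes below are hypothetical abstractions for illustration (not the API of any real framework), with placeholder arithmetic standing in for the model:

```python
# Minimal sketch of the disaggregated request path, assuming hypothetical
# PrefillPool / DecodePool abstractions (not a real framework's API).

from dataclasses import dataclass

@dataclass
class KVCache:
    request_id: str
    num_tokens: int  # stand-in for the real attention key/value tensors

class PrefillPool:
    def run(self, request_id: str, prompt_tokens: list[int]) -> tuple[int, KVCache]:
        # Compute-bound: process the whole prompt at once, emit the first
        # output token and the KV cache that decode will need.
        first_token = prompt_tokens[-1] + 1  # placeholder "model output"
        return first_token, KVCache(request_id, len(prompt_tokens))

class DecodePool:
    def run(self, cache: KVCache, first_token: int, max_new: int) -> list[int]:
        # Memory-bandwidth-bound: extend the sequence one token at a time.
        out = [first_token]
        for _ in range(max_new - 1):
            out.append(out[-1] + 1)  # placeholder autoregressive step
        return out

def serve(prompt_tokens: list[int], max_new: int = 4) -> list[int]:
    prefill, decode = PrefillPool(), DecodePool()
    first, cache = prefill.run("req-1", prompt_tokens)
    # In a real system the KV cache is transferred over NVLink/RDMA here,
    # and an orchestration layer picks the target decode worker.
    return decode.run(cache, first, max_new)

print(serve([10, 11, 12]))  # [13, 14, 15, 16]
```

The key structural point is the hand-off in `serve`: the KV cache produced by prefill is the only state decode needs, which is what makes the two pools independently scalable.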
Why Disaggregated Inference Is Gaining Momentum¶
The concept of disaggregated serving was first introduced in 2024, and within 18 months it has become the default architecture across virtually every major LLM serving framework. Several converging trends are driving this shift.
1. Longer Context Windows¶
Models now routinely support context windows of 128K, 256K, or even 1M+ tokens. Code generation models need to ingest entire codebases. Video understanding models process hours of footage. These long-context workloads create enormous prefill jobs that dominate GPU time in a co-located setup, starving decode operations and cratering utilization.
2. Reasoning Models with Extended Thinking¶
Models like DeepSeek-R1 and similar chain-of-thought architectures generate long internal reasoning traces before producing a final answer. This shifts the prefill-to-decode ratio dramatically, making static resource allocation even more wasteful.
3. Agentic AI and Multi-model Workflows¶
Modern AI applications chain multiple model calls together—an orchestrator model dispatches tasks to specialist models, each with different latency and throughput requirements. Disaggregation makes it possible to independently tune each phase for each model in the pipeline.
4. Cost Pressure¶
GPUs remain the most expensive line item in AI infrastructure. When enterprises move from experimentation (a few GPUs) to production (hundreds or thousands of GPUs), even small efficiency gains compound into significant cost savings. Disaggregation can enable up to 15x better throughput per GPU in certain configurations.
The Infrastructure Challenges¶
Disaggregated inference isn't a free lunch. Splitting prefill and decode across different GPU pools introduces meaningful infrastructure complexity that needs to be addressed. The image below shows the typical request flow for disaggregated inference.
KV Cache Transfer¶
The intermediate state (the KV cache) generated during prefill must be efficiently transferred to the decode pool. This requires low-latency, high-bandwidth interconnects—whether NVLink within a node or RDMA across nodes. The transfer must be fast enough that it doesn't negate the performance gains of disaggregation.
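To see why the interconnect matters, it helps to size the KV cache and estimate transfer time. The sketch below uses illustrative model dimensions (a Llama-3-70B-like shape with grouped-query attention) and rough bandwidth figures, all assumptions rather than measurements:

```python
# Rough KV-cache sizing and transfer-time estimate. Model dimensions and
# bandwidths below are illustrative assumptions, not measured values.

def kv_cache_bytes(seq_len: int, layers: int = 80, kv_heads: int = 8,
                   head_dim: int = 128, dtype_bytes: int = 2) -> int:
    # 2x for keys and values, per layer, per KV head, per token.
    return 2 * layers * kv_heads * head_dim * dtype_bytes * seq_len

def transfer_seconds(nbytes: float, bandwidth_gbps: float) -> float:
    return nbytes / (bandwidth_gbps * 1e9)

cache = kv_cache_bytes(seq_len=128_000)  # one long-context request
print(f"KV cache: {cache / 1e9:.1f} GB")
for name, bw in [("NVLink (~450 GB/s)", 450),
                 ("RDMA   (~50 GB/s)", 50),
                 ("25GbE  (~3 GB/s)", 3)]:
    print(f"{name}: {transfer_seconds(cache, bw) * 1e3:.0f} ms")
```

Under these assumptions a single 128K-token request carries tens of gigabytes of KV cache, and the transfer cost spans roughly two orders of magnitude between NVLink and commodity Ethernet—exactly the gap that decides whether disaggregation pays off.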
Request Routing¶
An LLM-aware router must track which decode GPUs hold which KV caches and route follow-up requests accordingly. This avoids redundant recomputation and is critical for multi-turn conversations and agentic workflows where context is reused across calls.
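A cache-affinity router reduces, at its core, to a session-to-worker placement map. The sketch below is a hypothetical interface (not any real framework's router): it pins follow-up turns to the decode worker that already holds the session's KV cache, and falls back to hash-based placement on a cache miss:

```python
# Sketch of a KV-cache-aware router: a session -> decode-worker map so
# follow-up turns land on the GPU that already holds the conversation's
# KV cache. Hypothetical interface for illustration only.

import hashlib

class CacheAffinityRouter:
    def __init__(self, decode_workers: list[str]):
        self.decode_workers = decode_workers
        self.session_placement: dict[str, str] = {}

    def route(self, session_id: str) -> str:
        # Cache hit: reuse the worker that holds this session's KV cache,
        # avoiding a redundant prefill recomputation.
        if session_id in self.session_placement:
            return self.session_placement[session_id]
        # Cache miss: deterministic hash-based placement. A real router
        # would also weigh load, free KV memory, and prefix-cache overlap.
        digest = int(hashlib.sha256(session_id.encode()).hexdigest(), 16)
        worker = self.decode_workers[digest % len(self.decode_workers)]
        self.session_placement[session_id] = worker
        return worker

router = CacheAffinityRouter(["decode-0", "decode-1", "decode-2"])
first = router.route("chat-42")
assert router.route("chat-42") == first  # multi-turn requests stay put
```

The placement map is the piece that makes multi-turn and agentic traffic cheap: every routed turn after the first skips re-prefilling the shared context.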
Memory Management¶
The KV cache can be enormous for long-context models. The system needs intelligent offloading—moving inactive caches to CPU memory, NVMe, or other tiers—and on-demand retrieval when a request resumes.
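The offload-and-retrieve behavior described above is essentially a tiered cache with an eviction policy. The sketch below simulates it with an LRU policy and byte strings standing in for real KV tensors; the tiers, capacity, and interface are illustrative assumptions:

```python
# Sketch of tiered KV-cache memory management: an LRU policy that evicts
# inactive caches from (simulated) GPU memory to a CPU tier and faults
# them back in on demand. Tiers and capacities are illustrative.

from collections import OrderedDict

class TieredKVStore:
    def __init__(self, gpu_capacity: int):
        self.gpu_capacity = gpu_capacity
        self.gpu: OrderedDict[str, bytes] = OrderedDict()  # hot tier, LRU order
        self.cpu: dict[str, bytes] = {}                    # offload tier

    def put(self, request_id: str, cache: bytes) -> None:
        self.gpu[request_id] = cache
        self.gpu.move_to_end(request_id)  # mark most recently used
        while len(self.gpu) > self.gpu_capacity:
            victim, data = self.gpu.popitem(last=False)  # evict least recent
            self.cpu[victim] = data       # a real system might go to NVMe next

    def get(self, request_id: str) -> bytes:
        if request_id not in self.gpu:    # fault it back in from the CPU tier
            self.put(request_id, self.cpu.pop(request_id))
        self.gpu.move_to_end(request_id)
        return self.gpu[request_id]

store = TieredKVStore(gpu_capacity=2)
store.put("a", b"kv-a"); store.put("b", b"kv-b"); store.put("c", b"kv-c")
# "a" was least recently used -> offloaded to CPU, then restored on access.
assert store.get("a") == b"kv-a"
```

A production system layers more tiers (NVMe, object storage) and moves tensors asynchronously, but the policy question—what to evict, and when to prefetch it back—is the same.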
Auto Scaling¶
The prefill and decode pools need to scale independently based on shifting workload patterns. A burst of new long-context requests requires more prefill capacity, while a steady stream of ongoing generations requires more decode capacity. The ratio between pools is dynamic and workload-dependent.
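The independent-scaling logic can be sketched as a simple queue-pressure rule per pool. The thresholds and signals below are illustrative; a real autoscaler would also consider SLO attainment and scale-down hysteresis:

```python
# Sketch of independent pool scaling driven by queue pressure. The target
# and example numbers are illustrative assumptions.

def desired_replicas(queue_depth: int, target_per_replica: int,
                     min_replicas: int = 1) -> int:
    """Scale a pool so each replica serves ~target_per_replica queued items."""
    needed = -(-queue_depth // target_per_replica)  # ceiling division
    return max(min_replicas, needed)

# A burst of long-context requests stresses prefill, not decode:
prefill = desired_replicas(queue_depth=37, target_per_replica=4)
decode = desired_replicas(queue_depth=12, target_per_replica=4)
print(f"prefill replicas: {prefill}, decode replicas: {decode}")  # 10 and 3
```

Because each pool gets its own signal and its own target, the prefill:decode ratio emerges from the workload rather than being fixed at deployment time.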
Disaggregated inference transforms GPU infrastructure from a flat pool of identical resources into a heterogeneous, multi-tier system that requires orchestration, governance, and intelligent scheduling.
What This Means for GPU Cloud Providers and Enterprises¶
For organizations building and operating GPU infrastructure—whether as an internal platform for AI teams or as a commercial GPU cloud for external customers—disaggregated inference has profound implications.
Higher ROI on GPUs¶
Right-size hardware for each phase, eliminating the waste of one-size-fits-all GPU pools. Allocate compute-dense GPUs where they matter most (prefill) and use more cost-effective, memory-optimized GPUs for decode.
Better Latency SLOs¶
Independently control time-to-first-token (TTFT) and time-per-output-token (TPOT). With disaggregation, you can tune each latency metric without compromising the other—essential for production SLA compliance.
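For teams instrumenting these SLOs, both metrics fall out of per-token timestamps. The sketch below shows the usual derivation, with made-up timestamps:

```python
# Deriving TTFT and TPOT from per-token timestamps. The timestamps in the
# example are made up for illustration.

def latency_metrics(request_start: float, token_times: list[float]) -> tuple[float, float]:
    ttft = token_times[0] - request_start        # time-to-first-token (prefill-dominated)
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    tpot = sum(gaps) / len(gaps)                 # time-per-output-token (decode-dominated)
    return ttft, tpot

ttft, tpot = latency_metrics(0.0, [0.85, 0.90, 0.95, 1.00])
print(f"TTFT={ttft:.2f}s TPOT={tpot * 1000:.0f}ms")  # TTFT=0.85s TPOT=50ms
```

Since TTFT is dominated by prefill and TPOT by decode, disaggregation gives each metric its own capacity knob.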
Heterogeneous Fleet Management¶
Mix GPU generations and types across your fleet. Assign older-generation GPUs to decode workloads and newer GPUs to compute-heavy prefill. This extends the useful life of existing hardware investments.
Multi-Tenant Efficiency¶
Share disaggregated pools across tenants with per-project quotas and isolation policies. A well-orchestrated disaggregated cluster can serve multiple teams and workloads far more efficiently than siloed, per-team GPU allocations.
A Practical Starting Point¶
If you are exploring disaggregated inference, here's a pragmatic path forward.
Step 1: Workload Profiling¶
Not every workload benefits equally from disaggregation. Short-context, low-concurrency workloads may run just fine on co-located serving. Long-context, high-concurrency, and latency-sensitive workloads are where disaggregation shines.
Profile your actual traffic patterns—input sequence lengths, output sequence lengths, concurrency levels, and SLA requirements—before re-architecting.
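A first-pass profile needs nothing more than sequence-length percentiles from request logs. The sketch below uses made-up sample records; in practice you would read these fields from your serving logs:

```python
# Minimal traffic-profiling pass: percentiles of input/output sequence
# lengths (ISL/OSL). The sample records below are made up.

import statistics

requests = [  # (input_tokens, output_tokens) per request, e.g. from logs
    (512, 128), (96_000, 300), (1_024, 256), (200, 64), (128_000, 512),
]

def p95(values: list[int]) -> int:
    return sorted(values)[max(0, round(0.95 * len(values)) - 1)]

isl = [i for i, _ in requests]
osl = [o for _, o in requests]
print(f"ISL median={statistics.median(isl)} p95={p95(isl)}")
print(f"OSL median={statistics.median(osl)} p95={p95(osl)}")
# A heavy ISL tail (long prefills) is the classic signal that a workload
# will benefit from disaggregation.
```

A workload whose median ISL is modest but whose p95 ISL is two orders of magnitude larger is exactly the case where co-located serving lets a few long prefills stall everyone else's decode.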
Step 2: Evaluate your Interconnect¶
KV cache transfer speed is the critical path in disaggregated serving. If your GPUs are connected via high-bandwidth NVLink or InfiniBand with RDMA, you're in good shape.
If your GPU pools communicate over standard Ethernet, the transfer overhead may reduce or eliminate the gains from disaggregation.
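One way to turn this evaluation into a go/no-go check is to ask whether KV-cache transfer fits inside a small fraction of your TTFT budget. The cache size, budget, and 10% threshold below are illustrative assumptions:

```python
# Go/no-go check: does KV-cache transfer fit the TTFT budget on a given
# interconnect? The sizes, budget, and threshold are illustrative.

def transfer_fits_budget(kv_bytes: float, bandwidth_gbps: float,
                         ttft_budget_s: float, max_fraction: float = 0.1) -> bool:
    """Accept only if the transfer consumes <= max_fraction of the TTFT budget."""
    transfer_s = kv_bytes / (bandwidth_gbps * 1e9)
    return transfer_s <= max_fraction * ttft_budget_s

kv = 20e9  # a hypothetical 20 GB cache for one long-context request
print(transfer_fits_budget(kv, bandwidth_gbps=450, ttft_budget_s=2.0))  # True  (NVLink-class)
print(transfer_fits_budget(kv, bandwidth_gbps=3, ttft_budget_s=2.0))    # False (25GbE-class)
```

If the check fails for your typical request size, either the interconnect or the workload mix needs to change before disaggregation will pay off.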
Step 3: Adopt a Platform Approach¶
Disaggregated inference adds infrastructure complexity. A platform-as-a-service layer that provides self-service GPU provisioning, multi-tenant isolation, quota management, and observability becomes essential—not optional—when running heterogeneous GPU pools at scale.
Looking Ahead¶
The takeaway is clear: the era of treating GPU infrastructure as a flat, homogeneous pool is ending. The future is heterogeneous, workload-aware, and dynamically orchestrated. Building—or adopting—the right platform layer to manage this complexity is what will separate AI infrastructure leaders from the rest.
In the next part of the blog series on Disaggregated Inference, we will dive deeper into NVIDIA's Dynamo project.