
NVIDIA Dynamo: Turning Disaggregated Inference Into a Production System

In Part 1, we covered the core idea behind disaggregated inference. That architectural split is no longer just a research pattern: it turns inference from a simple “deploy a container on GPUs” exercise into a distributed systems problem.

Once prefill and decode are separated, the platform has to coordinate routing, GPU-to-GPU KV cache transfer, placement, autoscaling, service discovery, and fault handling across multiple worker pools. NVIDIA Dynamo provides the distributed inference framework for this, and Kubernetes provides the control plane foundation to operate it at scale. 

In this blog post, we will review NVIDIA's Dynamo project with a focus on what it does and when it makes sense to use it.



What Dynamo Is and Why It Matters

NVIDIA Dynamo is an open-source, production-grade foundation for inference at scale. It is positioned as the orchestration layer for complex distributed inference topologies rather than a replacement for the backend runtimes themselves. 

Dynamo sits above engines such as vLLM, SGLang, and TensorRT-LLM, coordinating them into a multi-node inference system with routing, KV-cache-aware scheduling, disaggregation, memory tiering, and autoscaling.

That distinction matters.

If your workload is small, a single-node vLLM or TensorRT-LLM deployment may be enough. But once you need to serve long-context requests, agentic workflows, or high-concurrency conversational traffic, you are no longer optimizing a single model server. You are optimizing how work moves across a GPU fleet.


From Single-Worker Serving to Disaggregated Serving

As we learned in the previous blog, every request still goes through the same two phases: prefill and decode. What changes with Dynamo is how those phases are orchestrated across workers.

In aggregated mode, one worker handles the full lifecycle. In disaggregated mode, Dynamo routes the request first to a prefill worker, then transfers the resulting KV state to a decode worker, which continues generation.

With Dynamo, this becomes a three-step flow:

  1. Prefill computes the KV cache
  2. The KV cache is transferred to the decode worker
  3. Decode continues generation on the target worker

That separation is what eliminates the classic interference problem from Part 1. Long prompt ingestion no longer has to stall token generation for unrelated users on the same GPU. Instead, you get independent pools for compute-heavy prefill and memory-bandwidth-heavy decode, each scaled to the actual bottleneck. 
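The three-step flow above can be sketched in a few lines. This is a conceptual illustration only, not the Dynamo API; the function names and the dictionary standing in for a KV cache are all hypothetical.

```python
# Conceptual sketch of the disaggregated three-step flow.
# All names here are illustrative, not Dynamo's actual interfaces.

def prefill(prompt: str) -> dict:
    """Step 1: the prefill worker processes the full prompt, producing
    the KV cache and the first generated token."""
    kv_cache = {"prompt_tokens": prompt.split(), "state": "attention K/V"}
    return {"kv_cache": kv_cache, "first_token": "Hello"}

def transfer_kv(kv_cache: dict) -> dict:
    """Step 2: the KV cache is shipped to the decode worker. In Dynamo
    this is a GPU-to-GPU copy via NIXL over NVLink or InfiniBand."""
    return kv_cache  # stand-in for the actual network/NVLink transfer

def decode(kv_cache: dict, first_token: str, max_new: int = 3) -> list[str]:
    """Step 3: the decode worker continues generation token by token,
    reusing the transferred KV state instead of recomputing the prompt."""
    tokens = [first_token]
    for i in range(max_new - 1):
        tokens.append(f"tok{i}")  # placeholder for real sampling
    return tokens

def serve(prompt: str) -> list[str]:
    result = prefill(prompt)
    kv = transfer_kv(result["kv_cache"])
    return decode(kv, result["first_token"])
```

The key structural point is that `prefill` and `decode` run on different workers, with the KV cache as the only state that crosses the boundary.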


What Dynamo Adds Beyond “Split Prefill and Decode”

The easiest mistake is to think Dynamo is just a switch that turns on prefill/decode separation. In practice, Dynamo matters because it handles the coordination work that makes disaggregation usable.

1. KV transfer with NIXL

In disaggregated serving, the prefill result is not just a small control message. It is the KV cache state the decode worker needs in order to continue generation. Dynamo uses NIXL to move that state directly between workers, choosing the best available transport such as NVLink or InfiniBand/UCX. This KV transfer needs to be non-blocking, so GPU forward passes can continue while the transfer happens.

Without an efficient transfer layer, separating prefill and decode can simply shift the bottleneck from GPU contention to data movement overhead.
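To see why the non-blocking property matters, here is a minimal sketch using a background thread and a queue: compute hands off KV blobs and immediately continues, while a sender drains them in parallel. This is purely illustrative; NIXL does this at the RDMA/CUDA level, not with Python threads.

```python
# Sketch: overlapping "forward passes" with KV transfer.
# Illustrative only; real transfers happen via NIXL, not Python queues.
import queue
import threading
import time

transfer_queue = queue.Queue()
delivered = []  # what the "decode side" has received

def transfer_loop() -> None:
    """Background sender: drains KV blobs without stalling compute."""
    while True:
        blob = transfer_queue.get()
        if blob is None:          # sentinel: no more work
            break
        time.sleep(0.01)          # stand-in for an RDMA/NVLink copy
        delivered.append(blob)

def prefill_step(request_id: int) -> None:
    """Compute a (fake) KV cache and hand it off asynchronously."""
    kv_blob = f"kv-{request_id}".encode()
    transfer_queue.put(kv_blob)   # returns immediately; compute stays busy

sender = threading.Thread(target=transfer_loop)
sender.start()
for rid in range(3):
    prefill_step(rid)             # never blocked waiting on the network
transfer_queue.put(None)
sender.join()
```

If `prefill_step` instead waited for each transfer to complete, the GPU would sit idle for exactly the data-movement time the previous paragraph warns about.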

2. KV-aware Routing

Dynamo’s router does more than round-robin traffic. In KV mode, it evaluates KV cache overlap and load across workers, picking the lowest-cost target and minimizing redundant computation. The router docs explicitly describe both aggregated KV routing and disaggregated KV routing, where the frontend routes to prefill and decode pools based on cache reuse and worker state. 

This is important for multi-turn chat and agentic applications, where repeated prefixes, system prompts, and intermediate reasoning context can be reused instead of recomputed.
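A toy version of KV-aware routing makes the trade-off concrete: score each worker by how many prompt tokens it would have to recompute, plus a load penalty. This is not Dynamo's actual scoring code; the cost weights and worker representation are assumptions for illustration.

```python
# Illustrative KV-aware routing cost model (not Dynamo's real router).

def prefix_overlap(request_tokens: list[str], cached_tokens: list[str]) -> int:
    """Length of the shared prefix a worker can reuse from its KV cache."""
    n = 0
    for a, b in zip(request_tokens, cached_tokens):
        if a != b:
            break
        n += 1
    return n

def pick_worker(request_tokens: list[str], workers: list[tuple]) -> str:
    """workers: list of (name, cached_tokens, active_requests).
    Lower cost wins: tokens to recompute plus a load penalty."""
    def cost(w):
        name, cached, load = w
        recompute = len(request_tokens) - prefix_overlap(request_tokens, cached)
        return recompute + 2 * load  # the load weight is arbitrary here
    return min(workers, key=cost)[0]
```

With equal load, the worker holding the longest cached prefix wins; under heavy load, the router may still prefer a colder worker, which is exactly the overlap-versus-load balance the docs describe.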

3. Runtime-reconfigurable Worker Ratios

Dynamo’s disaggregation design supports runtime-reconfigurable x prefill / y decode worker topologies, with workers joining or leaving dynamically through the discovery service. Its Planner and AIConfigurator workflows are built around selecting configurations that meet SLA targets such as TTFT and ITL while maximizing throughput. The DGDR workflow is now documented as the primary interface for requesting deployments with performance constraints. 

Not every workload needs the same prefill/decode balance all day. A platform can start with one ratio and then adapt as the traffic mix changes.
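The idea of adapting the worker split can be illustrated with a toy heuristic that divides a fixed GPU budget in proportion to where the work is. Dynamo's Planner is SLA-driven (TTFT/ITL targets) and far more sophisticated; this sketch only shows the shape of the decision.

```python
# Toy heuristic for a prefill/decode split under a fixed worker budget.
# Dynamo's Planner uses SLA-driven profiling; this is only illustrative.

def plan_ratio(avg_prompt_tokens: float, avg_output_tokens: float,
               total_workers: int) -> tuple[int, int]:
    """Split `total_workers` between prefill and decode in proportion
    to the traffic mix, keeping at least one worker in each pool."""
    total = avg_prompt_tokens + avg_output_tokens
    prefill = round(total_workers * avg_prompt_tokens / total)
    prefill = min(max(prefill, 1), total_workers - 1)
    return prefill, total_workers - prefill
```

An input-heavy RAG workload ends up prefill-dominant, while a reasoning-heavy workload flips the ratio toward decode; a runtime-reconfigurable topology lets the platform move between the two without redeploying.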


When Dynamo Is a Good Fit

Not every workload benefits from disaggregation. For many standard chatbot or moderate RAG workloads, aggregated serving is still the simpler and often better default.

Important

Dynamo keeps aggregated topology as a first-class deployment mode, with KV-aware routing available even when prefill and decode stay together. 

Disaggregation becomes more compelling when one phase dominates:

  • Long-context input-heavy workloads where prefill is the bottleneck
  • Long-generation or reasoning-heavy workloads where decode dominates
  • High-concurrency mixed workloads where isolating the phases improves tail latency and utilization
  • Agentic pipelines where cache reuse and dynamic routing deliver compounding benefits

NVIDIA’s AIConfigurator and DGDR workflows are designed specifically to help determine the right configuration automatically rather than relying on hand-tuned ratios. 

Info

A practical rule is still the same: start simple, measure, then disaggregate when interference, queue buildup, or poor GPU efficiency becomes visible.


Why Kubernetes for Disaggregated Inference

Once you adopt Dynamo, inference stops looking like a single deployment and starts looking like a graph of coordinated services.

The Dynamo architecture on Kubernetes includes an operator that reconciles DynamoGraphDeployment resources, Kubernetes-native discovery built on DynamoWorkerMetadata and EndpointSlices, and worker groups modeled for scaling and placement.

Dynamo docs also describe router/frontend replication, request migration, and other resilience features aimed at real production scenarios. That is why Kubernetes fits so naturally here:

Declarative Lifecycle Management

Dynamo’s operator-driven model maps well to Kubernetes custom resources. Teams can define performance goals and deployment intent declaratively, and let the operator and Planner turn that into a working topology. 
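To make "deployment intent" concrete, here is a hypothetical spec in the spirit of a DynamoGraphDeployment, paired with a reconciler-style validity check. The field names are invented for illustration and do not reflect the actual CRD schema.

```python
# Hypothetical declarative deployment intent, loosely modeled on the
# DynamoGraphDeployment idea. Field names are illustrative, NOT the
# real CRD schema.
desired = {
    "model": "example/llm-70b",          # hypothetical model reference
    "backend": "vllm",                    # one of the supported engines
    "sla": {"ttft_ms": 300, "itl_ms": 25},  # performance goals
    "workers": {
        "prefill": {"replicas": 2},
        "decode": {"replicas": 6},
    },
}

def validate(spec: dict) -> bool:
    """Reconciler-style gate: intent must name a model, a backend, and
    at least one worker pool before an operator could act on it."""
    return bool(spec.get("model")) and bool(spec.get("backend")) \
        and bool(spec.get("workers"))
```

The point is the division of labor: the team states goals and topology, and the operator plus Planner are responsible for turning that intent into running pods.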

Separate Scaling Domains

Prefill, decode, frontend, router, and support services are not the same kind of workload. Kubernetes lets them live in separate deployments, node pools, and autoscaling policies, which is exactly what disaggregated inference requires. 

Fault Tolerance and Rolling Updates

Inference at cluster scale needs more than pod restart semantics. Dynamo’s docs call out router/frontend replicas, request migration, and rolling update strategies as part of the production picture. Kubernetes gives the surrounding mechanisms for safe upgrades, restarts, and recovery.

Placement and Proximity Control

Disaggregated inference only works well when data movement is fast and predictable. Kubernetes plus Grove provides the scheduling primitives to place communicating workers near each other and scale them as coordinated groups.  


Conclusion

Disaggregated inference is only the starting point. What matters in production is the system around it: how requests are routed, how KV state is moved, how workers are placed, how SLAs are maintained, and how the entire topology is operated reliably over time.

As models get larger, context windows get longer, and agentic workflows generate more irregular traffic, the bottleneck moves from raw GPU availability to how intelligently the platform coordinates GPUs, memory, and network topology.

In that world, disaggregated inference is not just a serving trick. It is a platform architecture. NVIDIA Dynamo brings together the primitives needed to make disaggregated inference operational at cluster scale: NIXL for transfer, KV-aware routing, dynamic worker planning, DGDR-based deployment workflows, and Kubernetes orchestration. 

In the next blog, we will describe how Rafay's Token Factory seamlessly integrates with NVIDIA Dynamo.