Why CNCF Kubernetes AI Conformance Matters — and Why Rafay Is Leading the Way¶
The industry finally has a standard for running AI workloads on Kubernetes. Rafay's Managed Kubernetes Service (MKS) has achieved CNCF Kubernetes AI Conformance for v1.35 — here's why that matters for every enterprise and neocloud building on GPU infrastructure.
The Problem: AI on Kubernetes Has a Fragmentation Problem¶
Kubernetes won the container orchestration war years ago. But as organizations have raced to deploy AI/ML workloads — training jobs, inference endpoints, and increasingly agentic workflows — they've hit an uncomfortable reality: not all Kubernetes platforms are created equal when it comes to AI.
AI workloads stress clusters in ways that traditional applications never did. They demand fine-grained control over GPUs and accelerators. They require gang scheduling to prevent distributed training jobs from deadlocking. They need high-performance networking fabrics, specialized autoscaling behaviors, and deep observability into hardware-level metrics like GPU utilization, memory pressure, and interconnect bandwidth.
Until recently, how these capabilities were delivered varied wildly across platforms. What worked on one vendor's Kubernetes didn't necessarily work on another's. Platform engineering teams found themselves writing vendor-specific workarounds, building brittle glue code, and accepting a degree of lock-in that went against everything Kubernetes was supposed to stand for.
The cloud native community needed a shared standard — a way to say, definitively, "this platform is AI-ready."
Enter: CNCF Kubernetes AI Conformance¶
In November 2025, at KubeCon + CloudNativeCon North America in Atlanta, the Cloud Native Computing Foundation (CNCF) launched the Certified Kubernetes AI Conformance Program.
The goal is straightforward: define the capabilities a Kubernetes platform needs to reliably and portably run AI and machine learning workloads, and then certify platforms that meet those requirements.
Think of it as an extension of the existing Kubernetes Conformance program. Base Kubernetes conformance ensures that a platform supports the standard Kubernetes APIs. AI Conformance goes further. It validates that a platform can handle the unique infrastructure demands of training, inference, and agentic AI workloads.
The program is governed by a cross-company working group under SIG Architecture, and it's developed entirely in the open via the kubernetes-sigs/ai-conformance project.
What Gets Tested?¶
The conformance checklist — formally known as Kubernetes AI Requirements (KARs) — covers several critical areas:
Accelerator Management.¶
Platforms must support Dynamic Resource Allocation (DRA) for flexible, fine-grained GPU and accelerator requests. They must provide verifiable mechanisms for driver and runtime validation, and support GPU sharing strategies for workloads that don't need a full dedicated device.
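As a rough sketch of what a DRA-based GPU request looks like, the manifest below follows the `resource.k8s.io/v1` API (GA as of Kubernetes v1.34); the DeviceClass name `gpu.example.com` and the image are placeholders — the real class name comes from whichever DRA driver your accelerator vendor provides:

```yaml
# Sketch only: "gpu.example.com" is a hypothetical DeviceClass published by a DRA driver.
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: gpu.example.com
---
apiVersion: v1
kind: Pod
metadata:
  name: trainer
spec:
  restartPolicy: Never
  containers:
  - name: trainer
    image: registry.example.com/trainer:latest   # placeholder image
    resources:
      claims:
      - name: gpu       # consume the claim declared below
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: single-gpu
```

Unlike the classic `nvidia.com/gpu: 1` counted-resource model, the claim can carry selectors for specific device attributes (model, memory, partitioning), which is exactly the fine-grained control the KARs require.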
Scheduling.¶
Conformant platforms must support gang scheduling (through solutions like Kueue or Volcano) to enable all-or-nothing scheduling for distributed training. This is essential for preventing the resource deadlocks that can waste expensive GPU time.
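With Kueue, for example, gang admission is typically expressed by creating the Job suspended and labeling it with a queue; Kueue only unsuspends it once the whole pod group can be admitted at once. A minimal sketch (the queue name `team-a-queue` is an assumption — it must match a LocalQueue in your cluster):

```yaml
# Sketch assuming Kueue is installed and a LocalQueue named "team-a-queue" exists.
apiVersion: batch/v1
kind: Job
metadata:
  name: distributed-train
  labels:
    kueue.x-k8s.io/queue-name: team-a-queue
spec:
  parallelism: 4
  completions: 4
  completionMode: Indexed
  suspend: true        # Kueue unsuspends only when all 4 workers can be admitted together
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: registry.example.com/trainer:latest   # placeholder image
        resources:
          limits:
            nvidia.com/gpu: 1
```

The all-or-nothing admission means no worker holds a GPU while waiting for peers that may never be scheduled.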
Autoscaling.¶
If a platform provides cluster autoscaling, it must be capable of scaling node groups with specific accelerator types based on pending workload demands. HorizontalPodAutoscaler must function correctly for GPU-backed pods using custom metrics.
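To illustrate the HPA side, here is a hedged sketch of an `autoscaling/v2` HorizontalPodAutoscaler scaling an inference Deployment on a per-pod GPU utilization metric; it assumes a custom-metrics adapter (such as prometheus-adapter) is exposing the metric, and the metric name `DCGM_FI_DEV_GPU_UTIL` reflects NVIDIA's DCGM exporter naming:

```yaml
# Sketch: requires a custom-metrics adapter exposing GPU metrics to the HPA.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-server   # placeholder Deployment name
  minReplicas: 1
  maxReplicas: 8
  metrics:
  - type: Pods
    pods:
      metric:
        name: DCGM_FI_DEV_GPU_UTIL   # assumed metric name from DCGM exporter
      target:
        type: AverageValue
        averageValue: "75"           # scale out above ~75% average GPU utilization
```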
Observability.¶
The platform must expose per-accelerator utilization and memory metrics through standardized, machine-readable endpoints aligned with emerging standards like OpenTelemetry. It must also support Prometheus-format metric collection from AI frameworks and model servers.
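In a Prometheus Operator setup, wiring in those hardware metrics often amounts to a ServiceMonitor pointed at the accelerator vendor's exporter. A sketch, assuming NVIDIA's dcgm-exporter deployed by the GPU Operator (the namespace, label selector, and port name are assumptions — match them to your install):

```yaml
# Sketch: selector/namespace/port must match your dcgm-exporter deployment.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
  endpoints:
  - port: gpu-metrics   # assumed service port name
    interval: 15s
```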
Security.¶
Access to accelerators must be properly isolated and mediated through the Kubernetes resource management framework. Containers must not be able to access GPU hardware outside their allocation.
Operator Support.¶
The platform must support installation and operation of AI-focused Kubernetes operators needed for common frameworks and tooling.
Kubernetes v1.35: Raising the Bar¶
The latest v1.35 requirements, announced at KubeCon Europe in Amsterdam in March 2026, introduced significantly stricter standards. The program has expanded to include validation for agentic workloads — multi-step AI workflows that combine tools, memory, and long-running tasks — and mandates alignment with Kubernetes v1.35 primitives like Stable In-Place Pod Resizing (allowing inference models to adjust resources without restarts) and Workload-Aware Scheduling (avoiding resource deadlocks during distributed training).
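In-place resizing is opted into per container via `resizePolicy`; resources can then be adjusted through the pod's `resize` subresource without a restart. A minimal sketch (image name is a placeholder):

```yaml
# Sketch: an inference pod whose CPU and memory can be resized in place.
apiVersion: v1
kind: Pod
metadata:
  name: inference
spec:
  containers:
  - name: server
    image: registry.example.com/model-server:latest   # placeholder image
    resizePolicy:
    - resourceName: cpu
      restartPolicy: NotRequired   # resize CPU without restarting the container
    - resourceName: memory
      restartPolicy: NotRequired
    resources:
      requests:
        cpu: "2"
        memory: 8Gi
```

For a model server, that means right-sizing resources as traffic shifts without tearing down a warm process and reloading weights.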
Since launch, the program has grown from 18 to 31 certified platforms — a 70%+ surge — and the 2026 roadmap includes automated conformance testing via a specialized Verify Conformance Bot, replacing the current self-assessment model with rigorous third-party validation.
Why Does This Matter?¶
AI Conformance provides three things that were previously missing:
1. Portability. If your AI application works on one conformant platform, it should work on another. No more rewriting Helm charts, reconfiguring device plugins, or discovering that your model serving stack silently breaks when you move between clouds.
2. Confidence. Conformance gives procurement and engineering teams a clear, vendor-neutral benchmark. When evaluating Kubernetes platforms for AI workloads, you can compare apples to apples — certified platforms have demonstrated the same baseline capabilities.
3. Future-proofing. The conformance standard evolves with Kubernetes releases. Platforms must recertify annually. This means your infrastructure vendor is committed to staying current with the latest AI-relevant primitives, not just shipping a one-time GPU integration and walking away.
Rafay MKS Achieves AI Conformance for v1.35¶
Rafay's Managed Kubernetes Service (MKS) Distribution has achieved CNCF Kubernetes AI Conformance for v1.35, with its submission publicly available in the cncf/k8s-ai-conformance repository.
This is significant, but not surprising if you've been following Rafay's trajectory.
A Track Record of Conformance¶
Rafay has maintained CNCF Kubernetes Conformance for every Kubernetes release it supports — the foundational prerequisite for AI Conformance. With MKS, Rafay delivers upstream Kubernetes on bare metal and VM-based environments, with non-disruptive in-place upgrades, centralized audit logging, and enterprise-grade multi-tenancy. Every release is validated against the full CNCF conformance test suite before it reaches customers.
Achieving AI Conformance for v1.35 means Rafay has gone beyond base Kubernetes. It has demonstrated, with documented evidence reviewed by CNCF, that MKS meets every mandatory requirement across accelerator management, scheduling, autoscaling, observability, security, and operator support for AI workloads.
What This Means in Practice¶
For organizations using Rafay's platform to run AI workloads, the v1.35 AI Conformance certification provides concrete assurance:
Dynamic Resource Allocation (DRA) is fully supported.¶
Teams can make fine-grained GPU requests that go beyond simple device counts — requesting specific GPU models, memory configurations, or sharing strategies through standard Kubernetes APIs.
Gang scheduling works.¶
Distributed training jobs using frameworks like PyTorch or TensorFlow can be scheduled with all-or-nothing guarantees, eliminating the partial-allocation scenarios that waste GPU hours and stall training pipelines.
GPU autoscaling is validated.¶
Cluster autoscaler correctly handles node groups with specific accelerator types. HorizontalPodAutoscaler works with custom GPU metrics, enabling inference endpoints to scale based on actual accelerator utilization rather than generic CPU metrics.
Deep GPU observability is built in.¶
Per-accelerator utilization, memory usage, temperature, power draw, and interconnect bandwidth are exposed through standardized metrics endpoints, giving operations teams the visibility they need to optimize expensive GPU infrastructure.
Security isolation is enforced.¶
GPU access is properly mediated through Kubernetes resource management, ensuring multi-tenant environments maintain strict isolation between workloads.
The Bigger Picture: Rafay's AI Platform Story¶
Rafay's AI Conformance achievement doesn't exist in isolation. It's part of a broader platform story that includes the GPU PaaS Reference Architecture developed with NVIDIA, Token Factory for monetizing token-metered access to AI models, integrated support for NVIDIA NIM and Dynamo disaggregated inference, and an App Marketplace that makes it straightforward to deploy AI tooling like Kueue, the NVIDIA GPU Operator, and inference serving frameworks.
For organizations building GPU-as-a-Service offerings — whether they're enterprise internal platforms, sovereign clouds, or neoclouds — AI Conformance certification means the Rafay platform meets the industry's most rigorous standard for AI infrastructure.
It's not just a Kubernetes distribution that happens to support GPUs. It's a certified AI-ready platform.
Looking Ahead¶
The Kubernetes AI Conformance program is still evolving. Automated testing is coming in 2026, which will move the program beyond self-assessment toward provable, repeatable validation. The community is also working on Sovereign AI standards with enhanced sandboxing and data privacy requirements.
For platform vendors, achieving and maintaining conformance is becoming table stakes. For enterprises evaluating infrastructure for production AI workloads, it's becoming the first question to ask.
Rafay's early achievement of v1.35 AI Conformance — at the strictest level the program has yet defined — signals a clear commitment: as the standards for AI on Kubernetes continue to rise, Rafay intends to stay ahead of them.
- Free Org: Sign up for a free Org if you want to try this yourself with our Get Started guides.
- Live Demo: Schedule time with us to watch a demo in action.
