Running GPU Infrastructure on Kubernetes: What Enterprise Platform Teams Must Get Right
KubeCon + CloudNativeCon Europe 2026, Amsterdam
If you are at KubeCon this week in Amsterdam, you are likely hearing the same question repeatedly: how do we actually operate GPU infrastructure on Kubernetes at enterprise scale? The announcements from NVIDIA (the DRA Driver donation, the KAI Scheduler entering the CNCF Sandbox, GPU support for Kata Containers) expand what is technically possible. But for enterprise platform teams, the harder problem is not capability. It is operating GPU infrastructure efficiently and responsibly once demand arrives.
This post is written for platform teams building internal GPU platforms — on-premises, in sovereign environments, or in hybrid models. You are not just provisioning infrastructure. You are governing access to some of the most expensive and constrained resources in the organization.
At scale, GPU inefficiency is not accidental. It is structural:
- Idle GPUs that remain allocated but unused
- Over-provisioned workloads that reserve more capacity than they use
- Fragmented capacity that cannot satisfy real workloads
- Lack of cost visibility and accountability
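The first of these failure modes is usually the easiest to measure: idle detection is, at its core, a threshold check over utilization samples. The sketch below assumes per-GPU utilization percentages have already been scraped (for example, from NVIDIA's DCGM exporter); the function name, identifiers, and 5% threshold are illustrative, not taken from any specific tool:

```python
from statistics import mean

def find_idle_gpus(samples: dict[str, list[float]], threshold: float = 5.0) -> list[str]:
    """Return GPU IDs whose mean utilization (%) over the window is below threshold.

    `samples` maps a GPU identifier (e.g. node/GPU index) to utilization
    readings collected over some observation window.
    """
    return [gpu for gpu, readings in samples.items()
            if readings and mean(readings) < threshold]

# Example: one busy GPU, one allocated-but-idle GPU.
utilization = {
    "node-a/gpu-0": [92.0, 88.5, 95.0],   # actively training
    "node-a/gpu-1": [0.0, 1.2, 0.4],      # allocated but unused
}
print(find_idle_gpus(utilization))  # -> ['node-a/gpu-1']
```

In practice the hard part is not this check but what happens next: reclaiming the GPU requires a policy the platform team owns, which is exactly the governance question this post is about.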
Solving this requires more than infrastructure. It requires a governed platform model.
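One governance primitive Kubernetes already ships with is the ResourceQuota, which caps how many GPUs a team's namespace can request in total. A minimal sketch, with an illustrative namespace name and limit:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: team-ml               # illustrative team namespace
spec:
  hard:
    requests.nvidia.com/gpu: "8"   # cap total GPUs this namespace may request
```

A quota alone does not solve idleness or fragmentation, but it is the foundation for the accountability and cost-visibility layers a governed platform model builds on top.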
