
Automated GPU Health Monitoring with NVIDIA NVSentinel on the Rafay Platform

GPU clusters are expensive and GPU failures are costly. In modern AI infrastructure, organizations operate large fleets of NVIDIA GPUs that can cost tens of thousands of dollars each. When a GPU develops a hardware fault (e.g. a double-bit ECC error, a thermal throttle, or a silent data corruption event), the consequences ripple outward: training jobs fail hours into a run, inference latency spikes, and expensive hardware sits idle while engineers scramble to diagnose the root cause.

Traditional monitoring catches these problems eventually, but rarely fixes them. Diagnosing and remediating GPU faults still requires deep expertise, and remediation timelines are measured in hours or days. For organizations running AI workloads at scale — and especially for GPU cloud providers who must deliver uptime SLAs to their tenants — this gap between detection and resolution translates directly into SLA breaches, lost revenue, and eroded customer trust.

NVIDIA's answer to this challenge is NVSentinel — an open-source, Kubernetes-native system that continuously monitors GPU health and automatically remediates issues before they disrupt workloads.

In this blog, we describe how Rafay integrates with NVSentinel, enabling GPU cloud operators and enterprises to deploy intelligent GPU fault detection and self-healing across their entire fleet — consistently, repeatably, and at scale.

Rafay and NVSentinel

OpenClaw and NemoClaw: A Better Way to Consume AI Services Through Token Factory

As AI adoption accelerates, most businesses do not actually want to manage GPU clusters, model serving stacks, or low-level infrastructure. What they want is simple, reliable access to powerful models through tools their teams can use immediately. That is exactly the value of combining OpenClaw and NVIDIA NemoClaw with a service provider’s deployment of Rafay Token Factory.

OpenClaw is the user-facing interface where people interact with models and AI assistants. NemoClaw extends that experience with additional security and control for long-running or always-on agents. In both cases, the user experience can remain simple: connect to the provider, use tokens, and start working.

The complexity of GPUs, inference infrastructure, scaling, and capacity planning stays behind the scenes. OpenClaw is the open-source AI agent platform, while NVIDIA describes NemoClaw as an open-source reference stack for running OpenClaw more safely with policy-based privacy and security guardrails.

OpenClaw with Token Factory

From Docker Image to 1-Click App: Enabling Self-Service for Custom Apps

In the Developer Pods series (part-1, part-2 and part-3), we made a simple point: most users do not want infrastructure. They want outcomes.

They do not want tickets. They do not want YAML. They do not want to think about pods, namespaces, ingress, or DNS. They want a working environment or application, available quickly, through a clean self-service experience. That was the core theme behind Developer Pods: Kubernetes is a powerful engine, but it should not be the user interface.

The next step is just as important: letting end users deploy applications packaged as Docker containers into shared, multi-tenant Kubernetes clusters with a true 1-click experience.

Rafay’s 3rd Party App Marketplace is designed for exactly this. It allows providers to curate and publish containerized apps from Docker Hub, third-party vendors, or open-source communities, package them with defaults, user overrides, and policies, and expose them as a secure, governed self-service experience for users across multiple tenants.

Docker App

OpenClaw on Kubernetes: A Platform Engineering Pattern for Always-On AI

AI is moving beyond chat windows. The next useful form factor is an Always-On AI service that can live behind messaging channels, expose a control surface, invoke tools, and be operated like any other platform workload. OpenClaw is interesting because it is built around that model.

OpenClaw is a Gateway-centric runtime with onboarding, workspace/config, channels, and skills, plus a documented Kubernetes install path for hosting.

For platform teams, that makes OpenClaw more than an AI app. It looks like an AI gateway layer that can be deployed, secured, and managed on Kubernetes using the same operational patterns you would use for internal developer platforms, control planes, or multi-service middleware.

OpenClaw

Developer Pods for Platform Teams: Designing the Right Self-Service GPU Experience

In Part 1, we discussed the core problem: most organizations still deliver GPU access through the wrong abstraction. Developers and data scientists do not want tickets, YAML, and long provisioning cycles. They want a ready-to-use environment with the right amount of compute, available when they need it.

In Part 2, we looked at what that self-service experience feels like for the end user: a familiar, guided workflow that lets them select a profile, launch an environment, and SSH into it in about 30 seconds.

In this part, we shift to the other side of the experience: how platform teams design that experience in the first place. Specifically, we will look at how teams can configure and customize a Developer Pod SKU using the integrated SKU Studio in the Rafay Platform.

SKU in Rafay Platform

Developer Pods: A Self-Service GPU Experience That Feels Instant

In Part 1, we discussed the core problem: most organizations still deliver GPU access through the wrong abstraction. Developers do not want tickets, YAML, and long wait times. They want a working environment with the right tools and GPU access, available when they need it.

In this post, let’s look at the other half of the story: the end-user experience. Specifically, what does self-service actually look like for a developer or data scientist using Rafay Developer Pods?

The answer is simple: a familiar UI, a few guided choices, and a running environment they can SSH into in about 30 seconds.

New Developer Pod

Instant Developer Pods: Rethinking GPU Access for AI Teams

It's the week of KubeCon Europe 2026 in Amsterdam. Much of the conversation will be about Kubernetes, AI and GPUs. Let's have an honest discussion.

We are in 2026 and we’re still handing out infrastructure like it’s 2008. The entire workflow is slow, expensive and wildly inefficient. Meanwhile, your most expensive resource—GPUs—sit idle or underutilized.

The way most enterprises deliver GPU access today is completely misaligned with how developers and data scientists actually work. A developer wants to:

  • Run a PyTorch experiment
  • Fine-tune a model
  • Test a pipeline

What do they get instead?

A ticketing system with a multi-day wait, followed eventually by a bloated VM or an entire bare-metal GPU server.

There has to be a better way. This is the first part of a blog series on Rafay's Developer Pods. In it, we describe why and how many of our customers have completely transformed GPU delivery into a self-service experience for their end users.

Dev Pod

No More SSH: Control Plane Overrides for Rafay MKS Clusters

Customizing a Kubernetes control plane has always been an uncomfortable exercise. You SSH into a master node, carefully edit a static pod manifest, and then hope nothing breaks. With our latest release, we are replacing that workflow entirely. Control Plane Overrides give you a safe, declarative way to customize the API Server, Controller Manager, and Scheduler for MKS (Managed Kubernetes Service) clusters — Rafay's upstream Kubernetes offering for bare metal and VMs — directly from the Rafay Console or cluster specification.
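To make the shift concrete, here is a purely illustrative sketch of what a declarative override might look like. The field names and the `ClusterOverride` kind below are hypothetical placeholders for illustration, not Rafay's actual cluster specification schema; only the component flags shown (`audit-log-maxage` for the API Server, `v` for the Scheduler) are real upstream Kubernetes flags.

```shell
# Illustrative only: stage a declarative override file instead of SSHing
# into a master node and hand-editing static pod manifests.
cat <<'EOF' > control-plane-overrides.yaml
kind: ClusterOverride          # hypothetical kind, for illustration only
controlPlane:
  apiServer:
    extraArgs:
      audit-log-maxage: "30"   # real kube-apiserver flag: retain audit logs 30 days
  scheduler:
    extraArgs:
      v: "2"                   # real component flag: raise log verbosity
EOF
echo "override staged"
```

The point of the declarative form is that a typo is caught by validation before rollout, instead of taking the API Server offline when the kubelet restarts the static pod.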

NVIDIA AICR Generates It. Rafay Runs It. Your GPU Clusters, Finally Under Control

Deploying GPU-accelerated Kubernetes infrastructure for AI workloads has never been simple. Administrators face a relentless compatibility matrix: matching GPU driver versions to CUDA releases, pinning Kubernetes versions to container runtimes, tuning configurations differently for NVIDIA H100s versus A100s, and doing all of it differently again for training versus inference.

One wrong version combination and workloads fail silently, or worse, perform far below hardware capability. For years, the answer was static documentation, tribal knowledge, and hoping that whoever wrote the runbook last week remembered to update it.

NVIDIA's AI Cluster Runtime (AICR) and the Rafay Platform represent a new approach — one where GPU infrastructure configuration is treated as code, generated deterministically, validated against real hardware, and enforced continuously across fleets of clusters.

Together, they cover the full lifecycle from first aicr snapshot to production-grade day-2 operations, with cluster blueprints as the critical bridge between the two.

Baton Pass

From Slurm to Kubernetes: A Guide for HPC Users

If you've spent years submitting batch jobs with Slurm, moving to a Kubernetes-based cluster can feel like learning a new language. The concepts are familiar — resource requests, job queues, priorities — but the vocabulary and tooling are different. This guide bridges that gap, helping HPC veterans understand how Kubernetes handles workloads and what that means day-to-day.
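As a taste of that vocabulary mapping, the sketch below expresses the same batch job both ways. The Slurm command and the Kubernetes Job fields are standard; the container image and script name are placeholder assumptions, not from the guide itself.

```shell
# Slurm: request 1 GPU and 4 CPUs for a batch script (shown for comparison):
#   sbatch --gres=gpu:1 --cpus-per-task=4 train.sh
# Kubernetes: the same request is expressed declaratively in a Job manifest.
cat <<'EOF' > train-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: train
spec:
  backoffLimit: 0                      # roughly: no automatic requeue on failure
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: train
        image: pytorch/pytorch:latest  # placeholder image
        command: ["bash", "train.sh"]
        resources:
          limits:
            nvidia.com/gpu: 1          # analogous to --gres=gpu:1
            cpu: "4"                   # analogous to --cpus-per-task=4
EOF
# kubectl apply -f train-job.yaml   # submit (analogous to sbatch)
# kubectl logs job/train            # view output (analogous to the slurm-<id>.out file)
echo "manifest written"
```

Note the conceptual shift: Slurm resources are flags on a submission command, while Kubernetes resources live inside a declarative object that can be versioned, reviewed, and reapplied.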

Slurm to Kubernetes