
Self-Service Slurm Clusters on Kubernetes with Rafay GPU PaaS

In the previous blog, we discussed how Project Slinky bridges the gap between Slurm, the de facto job scheduler in HPC, and Kubernetes, the standard for modern container orchestration.

Project Slinky and Rafay’s GPU Platform-as-a-Service (PaaS) are a powerful combination for enterprises and cloud providers, enabling secure, multi-tenant, self-service access to Slurm-based HPC environments on shared Kubernetes clusters. Together, they allow cloud providers and enterprise platform teams to offer Slurm-as-a-Service on Kubernetes without compromising on performance, usability, or control.

Design

[Architecture diagram: multiple end users running independent, namespace-isolated Slurm clusters on a shared Kubernetes cluster managed through Rafay GPU PaaS]


How It Works: Rafay + Slinky

The architecture diagram above illustrates a shared Kubernetes cluster where multiple end users run independent Slurm workloads using Rafay’s GPU PaaS. This approach gives users a consistent, HPC-like experience without having to manage any of the underlying infrastructure.

1. User Isolation via Namespaces

Each user (e.g., a research scientist, ML engineer, or analytics team) operates in their own Rafay-managed Kubernetes namespace, ensuring strong tenant isolation.
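
A minimal sketch of what this looks like at the Kubernetes API level, using the official Python client. The namespace name and labels are illustrative; in practice Rafay creates and manages these namespaces when a tenant is onboarded.

```python
# Illustrative only: create an isolated namespace for one tenant.
# The name and labels below are placeholders, not Rafay conventions.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
core = client.CoreV1Api()

tenant_ns = client.V1Namespace(
    metadata=client.V1ObjectMeta(
        name="lab-genomics",
        labels={"tenant": "genomics-lab"},
    )
)
core.create_namespace(tenant_ns)
```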

2. Self-Service Provisioning

Using Rafay’s intuitive console or APIs, users can provision their own personal Slinky-enabled Slurm cluster within their namespace, complete with GPU access and workload scheduling logic.
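
Under the hood, the Slinky operator reconciles a Slurm cluster described as a Kubernetes custom resource. The sketch below is only indicative: the API group, version, kind, and spec fields are placeholders (consult the slurm-operator CRDs for the real schema), and in practice Rafay’s console or APIs drive this step on the user’s behalf.

```python
# Hypothetical custom resource for a per-tenant Slurm cluster; field names are
# placeholders and must be replaced with the real Slinky slurm-operator schema.
from kubernetes import client, config

config.load_kube_config()
crd_api = client.CustomObjectsApi()

slurm_cluster = {
    "apiVersion": "slinky.slurm.net/v1alpha1",  # placeholder group/version
    "kind": "Cluster",                          # placeholder kind
    "metadata": {"name": "my-slurm", "namespace": "lab-genomics"},
    "spec": {
        "controller": {"replicas": 1},          # hypothetical: one controller
        "nodesets": [                           # hypothetical: two GPU compute nodes
            {"name": "gpu", "replicas": 2,
             "resources": {"limits": {"nvidia.com/gpu": 1}}}
        ],
    },
}

crd_api.create_namespaced_custom_object(
    group="slinky.slurm.net", version="v1alpha1",
    namespace="lab-genomics", plural="clusters",  # placeholders, as above
    body=slurm_cluster,
)
```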

3. Rafay Cluster Blueprint

The shared, multi-tenant host Kubernetes cluster is configured using a Rafay Cluster Blueprint, which encapsulates the following add-ons:

  • The Slinky Slurm Operator Add-on to manage Slurm components
  • GPU-specific add-ons (e.g., NVIDIA drivers, device plugin, monitoring agents), as reflected in the node check after this list
  • Logging, observability, and policy integrations
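
A quick way to see the GPU add-ons in effect (not Rafay-specific): once the NVIDIA device plugin is running, each GPU node advertises an nvidia.com/gpu allocatable resource, which can be checked directly against the Kubernetes API.

```python
# List nodes and the number of GPUs each one advertises via the device plugin.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

for node in core.list_node().items:
    gpus = (node.status.allocatable or {}).get("nvidia.com/gpu", "0")
    print(f"{node.metadata.name}: {gpus} allocatable GPU(s)")
```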

4. GPU Infrastructure Abstraction

Rafay GPU PaaS abstracts the complexity of managing GPU-backed worker nodes, enabling seamless autoscaling, intelligent GPU placement, and quota enforcement.
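
For context, the sketch below shows what a GPU request and placement constraint look like in raw Kubernetes terms; with Rafay GPU PaaS these requests, affinity rules, and quotas are applied for you. The container image and node label are hypothetical.

```python
# Illustrative pod spec: request one GPU and pin to a node type via a label.
from kubernetes import client

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="train-job", namespace="lab-genomics"),
    spec=client.V1PodSpec(
        node_selector={"nvidia.com/gpu.product": "A100"},  # hypothetical node label
        containers=[
            client.V1Container(
                name="trainer",
                image="registry.example.com/lab/train:latest",  # hypothetical image
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "1"}  # one GPU, enforced by the device plugin
                ),
            )
        ],
        restart_policy="Never",
    ),
)
# client.CoreV1Api().create_namespaced_pod("lab-genomics", pod) would submit it.
```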

5. Shared Control Plane

All of this runs on a shared Kubernetes cluster, managed either by the enterprise platform team or a cloud provider—optimized for cost and performance.

Watch a video of the end-user self-service experience.


Use Case: Supporting 100s of Researchers at a University

A large research university needs to support hundreds of professors, graduate students, and lab teams, all of whom require Slurm clusters for their work—ranging from genomics and computational physics to deep learning and bioinformatics. Traditionally, this meant managing multiple bare-metal Slurm clusters or maintaining a massive monolithic system prone to resource contention and administrative complexity.

With Rafay GPU PaaS and Project Slinky:

  • Each researcher or lab is provided with their own namespace and self-service Slurm environment.
  • Users can request GPU or CPU resources dynamically, based on their research needs.
  • Quotas and limits ensure fair usage and cost control across departments (see the quota sketch after this list).
  • The university’s central IT team manages one underlying Kubernetes cluster, simplifying upgrades, monitoring, and security.
  • Researchers get faster access to compute, which accelerates innovation and publication timelines.
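
A minimal sketch of the quota piece, assuming one namespace per lab: a Kubernetes ResourceQuota that caps CPU, memory, and GPUs for a department. The numbers are placeholders; Rafay’s quota management applies equivalent limits through its own policies.

```python
# Illustrative ResourceQuota for one lab's namespace; values are placeholders.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

quota = client.V1ResourceQuota(
    metadata=client.V1ObjectMeta(name="lab-quota"),
    spec=client.V1ResourceQuotaSpec(
        hard={
            "requests.cpu": "64",
            "requests.memory": "256Gi",
            "requests.nvidia.com/gpu": "8",  # extended-resource quota for GPUs
        }
    ),
)
core.create_namespaced_resource_quota(namespace="lab-genomics", body=quota)
```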

This setup empowers researchers with flexible, performant infrastructure—without requiring HPC expertise or operations overhead.


Key Benefits

🎛️ Self-Service, Developer-Friendly

Researchers and ML teams can launch their own GPU-accelerated Slurm clusters in minutes. No manual provisioning, no tickets—just an on-demand experience that accelerates time-to-compute.

🧩 Multi-Tenancy with Security and Compliance

Namespaces, policies, and RBAC enforcement ensure strong isolation. Rafay’s governance features make it easy to meet compliance needs in regulated environments.
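
As an illustration of the kind of namespaced RBAC that backs this isolation, the sketch below defines a Role scoped to a single tenant’s namespace. The resource list and the Slinky API group are assumptions; Rafay generates the actual roles and bindings.

```python
# Hypothetical namespaced Role limiting a tenant to their own namespace.
from kubernetes import client, config

config.load_kube_config()
rbac = client.RbacAuthorizationV1Api()

role = client.V1Role(
    metadata=client.V1ObjectMeta(name="slurm-tenant", namespace="lab-genomics"),
    rules=[
        client.V1PolicyRule(
            api_groups=["", "slinky.slurm.net"],  # core API + placeholder Slinky group
            resources=["pods", "pods/log", "clusters"],
            verbs=["get", "list", "watch", "create", "delete"],
        )
    ],
)
rbac.create_namespaced_role(namespace="lab-genomics", body=role)
# A matching RoleBinding would bind this Role to the tenant's user or group.
```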

🚀 Performance and Flexibility

With Rafay GPU PaaS, workloads get access to the right GPU resources—whether for training, inference, or scientific simulations. Rafay supports multi-GPU nodes, node affinity, and dynamic quota management.

📦 Operational Simplicity at Scale

Rafay abstracts the complexities of managing Kubernetes and GPUs, handling cluster lifecycle, upgrades, autoscaling, and observability out of the box.


Conclusion

By combining the power of Slurm with the flexibility of Kubernetes and the manageability of Rafay GPU PaaS, organizations and cloud providers can now deliver HPC-grade, GPU-accelerated Slurm clusters as a service.

Whether you’re a university supporting hundreds of researchers, a cloud provider delivering managed AI compute, or an enterprise enabling data science teams, Rafay and Slinky make HPC modernization simple, secure, and scalable.

It’s cloud-native. It’s multi-tenant. It’s self-service. And it’s finally possible.