
Overview

This document outlines the technical design for enabling self-service access to Kubernetes namespaces backed by shared GPU resources. The primary goal is to empower end users—such as data scientists, ML engineers, and application developers—to independently deploy and manage workloads that require GPU acceleration, while maintaining governance, security, and efficient resource utilization.

Learn More About GPU Sharing Strategies

To better understand how to optimize GPU usage in cloud-native environments:

  • 📘 Demystify GPU Sharing: A deep dive into how GPU sharing works under the hood, including multiplexing approaches, device plugins, and isolation mechanisms.
  • 🧠 Select the Right Fractional Strategy: A comparative guide on strategies used by cloud providers for fractional GPU allocation and how to choose the best one based on your workload.

These articles are great starting points if you're building or evaluating fractional GPU infrastructure for ML workloads.


Design

In this solution, every end user gets access to a Kubernetes namespace via self-service. The namespace is automatically configured with access to GPU resources (either time-sliced or MIG).



GPU Sharing Strategies

To accommodate a variety of GPU workloads and maximize hardware efficiency, the platform will support two GPU sharing strategies:

  1. Time-Sliced GPUs:

Enabled through the NVIDIA GPU Operator, this approach allows multiple pods to share a single physical GPU concurrently, with access managed at the time-slice level by the NVIDIA driver. This is ideal for lightweight or bursty GPU workloads (see the configuration sketch after this list).

  2. Multi-Instance GPU (MIG):

For compatible NVIDIA GPUs, MIG partitions the device into multiple hardware-isolated instances, each with dedicated compute, memory, and cache resources. This ensures strong performance isolation and is suited for more predictable or latency-sensitive workloads.
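As an illustration of the first strategy, time slicing is typically enabled by handing the NVIDIA GPU Operator's device plugin a sharing configuration. The following is a minimal sketch; the ConfigMap name, namespace, config key, and replica count are assumptions and must be adapted to your cluster:

```yaml
# Illustrative time-slicing configuration for the NVIDIA GPU Operator's device plugin.
# The ClusterPolicy's devicePlugin.config section must reference this ConfigMap by name.
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config      # assumed name
  namespace: gpu-operator        # assumed operator namespace
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4          # each physical GPU is advertised as 4 schedulable GPUs
```

With a configuration like this in place, each physical GPU is advertised to the scheduler as multiple nvidia.com/gpu resources. For MIG, the operator's MIG Manager typically reacts to a node label such as nvidia.com/mig.config (for example, all-1g.5gb) to partition compatible GPUs into hardware-isolated instances; consult the operator documentation for the profiles supported by your hardware.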

Administrators retain control over resource quotas, isolation policies, and access boundaries through Kubernetes RBAC, namespace-level limits, and admission controllers. Users interact with GPU resources via a streamlined self-service portal or GitOps workflows, abstracting infrastructure complexity and ensuring compliance with organizational policies.
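As one example of how those boundaries can be enforced, a standard Kubernetes ResourceQuota can cap GPU requests per namespace. This is a minimal sketch; the namespace name, quota name, and limit value are assumptions:

```yaml
# Illustrative namespace-level quota on GPU requests.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: team-a-gpu            # assumed tenant namespace
spec:
  hard:
    requests.nvidia.com/gpu: "4"   # at most 4 GPUs (or GPU slices) requested in this namespace
```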


End User Benefits

The solution is designed to offer a frictionless experience for end users, ensuring they can access and utilize GPU resources without requiring deep platform knowledge or workflow changes. Two primary benefits enable this seamlessness:

Seamless Access via Self-Service

End users gain access to GPU-enabled Kubernetes namespaces through a self-service model, either via Rafay's Self-Service portal or through GitOps workflows. Once access is provisioned:

  • Users can immediately deploy GPU workloads within their dedicated namespace.
  • Resource quotas and access controls are automatically enforced by the platform.
  • There is no need for ticket-based provisioning or admin intervention, accelerating iteration cycles and reducing deployment bottlenecks.

This approach lowers the barrier to entry for AI/ML practitioners and speeds up experimentation and model deployment in shared environments.

No Changes Required to Kubernetes YAML

To further reduce friction, the solution abstracts the complexity of GPU resource management away from end users. Specifically:

  • Users do not need to modify their Kubernetes YAML manifests to request special device plugins or annotations for NVIDIA GPUs.
  • End users continue to use the same, familiar resource requests (e.g., nvidia.com/gpu) even though the underlying GPU is shared across multiple users via either time slicing or MIG.

As a result, users can continue to use familiar YAML structures and deployment workflows, while still benefiting from backend GPU scheduling—whether through time-sliced access or MIG-backed isolation.
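For illustration, a workload in a shared-GPU namespace requests a GPU exactly as it would on a dedicated node. This is a minimal sketch; the pod name, namespace, and container image are assumptions:

```yaml
# Illustrative GPU workload; the manifest is identical for time-sliced and MIG-backed GPUs.
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
  namespace: team-a-gpu                                  # assumed tenant namespace
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vectoradd
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd    # illustrative image reference
      resources:
        limits:
          nvidia.com/gpu: 1    # unchanged whether backed by a time slice or a MIG instance
```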