Self-Service Fractional GPUs with Rafay GPU PaaS¶
Enterprises and GPU Cloud providers are rapidly evolving toward a self-service model for developers and data scientists. They want to provide instant access to high-performance compute — especially GPUs — while keeping utilization high and costs under control.
Rafay GPU PaaS enables enterprises and GPU Clouds to achieve exactly that: developers and data scientists can spin up resources such as Developer Pods or Jupyter Notebooks backed by fractional GPUs, directly from an intuitive self-service interface.
This is Part 1 of a multi-part series on end-user, self-service access to fractional GPU-based AI/ML resources.
Simplifying Access to Fractional GPUs¶
Traditionally, provisioning GPU resources for experimentation or model training meant filing a ticket and waiting for IT or platform teams to allocate hardware. That process doesn’t work in modern, agile AI environments.
When GPUs are allocated as fractions, multiple workloads can share a single physical GPU safely and efficiently. With Rafay GPU PaaS, these fractional GPU classes are abstracted into simple drop-down options such as 0.25 GPU or 0.5 GPU, making them accessible to any developer without needing to understand underlying GPU profiles, MIG instances, or time-slicing policies. These users also do not require any form of privileged or administrative access to the underlying infrastructure.
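To make the abstraction concrete, here is a hypothetical sketch of how a platform layer might translate that drop-down selection into scheduler-specific pod settings. The function, profile table, and mapping are illustrative only, not Rafay's implementation; the MIG and KAI mechanics are covered in more detail later in this post.

```python
# Hypothetical sketch of the abstraction layer: translating a drop-down
# fraction (e.g. 0.25 GPU) into scheduler-specific pod settings. The
# function and mappings are illustrative, not Rafay's implementation.

def fraction_to_pod_settings(fraction: float, use_mig: bool = False) -> dict:
    """Map a UI GPU-fraction selection to pod-level scheduling settings."""
    if use_mig:
        # MIG: pick a pre-defined hardware partition (A100 80GB profiles
        # shown; real fractions depend on the GPU and enabled profiles).
        profiles = {0.125: "mig-1g.10gb", 0.25: "mig-2g.20gb", 0.5: "mig-3g.40gb"}
        if fraction not in profiles:
            raise ValueError(f"no MIG profile for fraction {fraction}")
        return {"resources": {"limits": {f"nvidia.com/{profiles[fraction]}": 1}}}
    # KAI Scheduler: time-sliced sharing requested via a pod annotation.
    return {
        "schedulerName": "kai-scheduler",
        "annotations": {"gpu-fraction": str(fraction)},
    }

print(fraction_to_pod_settings(0.25))
# {'schedulerName': 'kai-scheduler', 'annotations': {'gpu-fraction': '0.25'}}
```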
Developer Pods: A Full Linux Workspace in Seconds¶
A developer simply fills in a few intuitive fields:
- Name and Description
- CPU (e.g. 500m) and Memory (e.g. 512Mi)
- GPU Fraction (e.g. 0.25)
- Base image (e.g. Ubuntu 24.04, preloaded with CUDA and Python)
Based on the user's selection, Rafay dynamically calculates and displays a real-time cost estimate. This gives users immediate visibility and accountability: they know what their environment costs before launching it.
With one click, Rafay provisions the pod, applying all necessary Kubernetes and KAI Scheduler logic behind the scenes:
- KAI assigns a fractional GPU slice (e.g., 25% of an A100 or L40S GPU).
- Rafay enforces quotas, scheduling policies, and isolation.
Developers connect to the remote Ubuntu instance via SSH.
From the user’s perspective, it feels like getting a dedicated Linux server with a full GPU. In reality, it’s a fraction of a GPU, allocated dynamically and efficiently through KAI. In the example below, we have accessed the remote Ubuntu image via SSH and have run the nvidia-smi command.
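For a quick sanity check beyond nvidia-smi, a few lines of Python (assuming PyTorch with CUDA support is present in the base image) confirm what the pod can see:

```python
# Quick GPU visibility check from inside the developer pod.
# Assumes the base image includes PyTorch with CUDA support.
import torch

if torch.cuda.is_available():
    device = torch.cuda.current_device()
    props = torch.cuda.get_device_properties(device)
    print(f"GPU: {props.name}, total memory: {props.total_memory / 1024**3:.1f} GiB")
else:
    print("No CUDA device visible to this pod")
```

Note that under time-slicing, the reported total memory is that of the whole physical GPU; we return to this point in the KAI section below.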
Jupyter Notebooks with Fractional GPUs¶
For data scientists, Rafay provides an equally streamlined workflow for self-service access to Jupyter notebooks with fractional GPUs. Users provide a name and specify resources interactively. For example:
- CPU: e.g., 1000m ($0.25/hour)
- Memory: e.g., 4Gi ($0.80/hour)
- GPU Fraction: e.g., 0.25 ($0.50/hour)
- Base Image (e.g. Minimal, PyTorch, TensorFlow, etc.)
They can select a Notebook Profile (such as Minimal Environment) preconfigured with Jupyter, TensorFlow, PyTorch, or RAPIDS. Rafay will dynamically calculate and present a cost estimate. This clarity allows teams to track compute costs in real time while experimenting freely.
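Since the rates above are listed per hour, the estimate is straightforward arithmetic. A minimal sketch, using the example rates from this post (not actual Rafay pricing) and an assumed 730-hour month:

```python
# Minimal cost-estimate sketch using the example rates listed above.
# These rates are illustrative, not actual Rafay pricing.
HOURLY_RATES = {
    "cpu_1000m": 0.25,   # $/hour for 1000m CPU
    "mem_4gi": 0.80,     # $/hour for 4Gi memory
    "gpu_0_25": 0.50,    # $/hour for a 0.25 GPU fraction
}

hourly = sum(HOURLY_RATES.values())
print(f"Estimated cost: ${hourly:.2f}/hour, ~${hourly * 730:.2f}/month")
# Estimated cost: $1.55/hour, ~$1131.50/month
```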
Once the notebook is launched, Rafay does the following automatically:
- Allocates a fractional GPU to the notebook pod.
- Provisions ingress, TLS certificates, and authentication.
Once the user clicks on the URL for the Jupyter notebook and authenticates using the token, they land in a secure Jupyter environment ready to run GPU-accelerated code. This provides all the convenience of managed notebook services — without vendor lock-in and with full control over costs, policy, and compliance.
In the image below, we have launched the "Terminal" in the Jupyter notebook and typed the "nvidia-smi" command to check what kind of GPU resources are available.
Benefits¶
The benefits for users are apparent: rapid iteration, with experiments run, models tuned, and code tested in minutes rather than days.
Instant Access¶
Users do not have to wait on approvals or infrastructure provisioning.
Right-Sized Resources¶
Developers choose exactly the fraction of GPU needed for their workload.
Predictable Costing¶
Transparent per-hour and monthly estimates are displayed in the UI. Integrated usage tracking enables platform teams and GPU Cloud providers to bill users accurately.
Familiar Environments¶
Users can choose from a list of prebuilt, ready-to-use ML frameworks, making them immediately productive.
Behind the Scenes¶
Rafay GPU PaaS integrates with multiple frameworks and tools behind the scenes. Users get the same seamless, self-service experience regardless of the underlying technology in use. Some common options and key considerations are described below.
NVIDIA MIG¶
Rafay integrates with NVIDIA's Multi-Instance GPU (MIG), a hardware-based partitioning technology that delivers strong isolation through dedicated fractional GPU partitions.
Key Considerations¶
- MIG requires static partitioning of the GPU using pre-defined profiles, which admins may find restrictive.
- MIG is supported only on advanced and expensive datacenter-class GPU models.
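For illustration, here is a minimal sketch of how a workload might request a MIG partition as a Kubernetes extended resource, using the official Kubernetes Python client. The profile name assumes the NVIDIA device plugin's "mixed" strategy on an A100 80GB; names and namespaces are illustrative.

```python
# Minimal sketch: requesting a MIG partition as a Kubernetes extended
# resource. Assumes the NVIDIA device plugin in "mixed" strategy, where
# each MIG profile is exposed as its own resource name.
from kubernetes import client, config

config.load_kube_config()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="mig-example"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="cuda",
                image="nvidia/cuda:12.4.1-base-ubuntu22.04",
                command=["nvidia-smi", "-L"],
                resources=client.V1ResourceRequirements(
                    # One 1g.10gb slice of an A100 80GB.
                    limits={"nvidia.com/mig-1g.10gb": "1"}
                ),
            )
        ],
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```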
NVIDIA KAI Scheduler¶
Rafay integrates with KAI Scheduler for scenarios where MIG is not supported on the GPUs.
Key Considerations¶
KAI leverages time-slicing to implement sharing. With KAI, you can implement several types of sharing: fractional GPUs, fractional GPU memory, queues, and more.
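As a sketch, a pod requesting a quarter of a GPU through KAI looks roughly like the manifest below (expressed here as a Python dict). The annotation and label names follow the KAI Scheduler documentation; the queue name is illustrative.

```python
# Minimal sketch of a pod requesting a GPU fraction via KAI Scheduler.
# Annotation and label names follow the KAI Scheduler documentation;
# the queue name "team-a" is illustrative.
pod_manifest = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {
        "name": "kai-fraction-example",
        "labels": {"kai.scheduler/queue": "team-a"},
        # Request 25% of the GPU via time-slicing. Alternatively, the
        # "gpu-memory" annotation requests an absolute amount of GPU memory.
        "annotations": {"gpu-fraction": "0.25"},
    },
    "spec": {
        "schedulerName": "kai-scheduler",
        "containers": [
            {
                "name": "cuda",
                "image": "nvidia/cuda:12.4.1-base-ubuntu22.04",
                "command": ["sleep", "infinity"],
            }
        ],
    },
}
```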
Although it offers flexible GPU sharing, KAI lacks memory isolation. In the notebook example above, although we requested only a fraction of the GPU for the pod, nvidia-smi reports all 23 GB of GPU memory.
As documented, "In order to make sure the pods share the GPU device nicely, it is important that the running processes will allocate GPU memory up to the requested amount and not beyond that".
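Frameworks must therefore cooperate. For example, a PyTorch process can cap its own allocations to its requested share; a minimal sketch, assuming the pod requested a 0.25 fraction:

```python
# Cooperative memory capping under KAI time-slicing: since the scheduler
# does not enforce memory isolation, each process should limit itself to
# its requested share. Sketch assumes a 0.25 GPU fraction and PyTorch.
import torch

if torch.cuda.is_available():
    # Restrict this process's CUDA caching allocator to 25% of the
    # device's total memory; allocations beyond the cap raise OOM errors.
    torch.cuda.set_per_process_memory_fraction(0.25, device=0)
```

TensorFlow offers similar controls via logical device configurations with an explicit memory limit.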
The New Normal for Seamless GPU Access¶
With Rafay GPU PaaS, both enterprises and GPU Cloud providers can transform their GPU infrastructure into fully self-service, multi-tenant environments. Users no longer need to understand MIG profiles or GPU topology — they simply choose a fractional GPU size and launch. Whether running a Developer Pod for code exploration or a Jupyter Notebook for model experimentation, users get:
- GPU-backed environments in minutes
- Fractional GPU allocation at scale
- Simplified developer access via an intuitive UI
- Visibility into costs
- Scalable, shareable infrastructure that maximizes ROI
The result: a GPU cloud where developers and data scientists move faster — while every GPU dollar works harder.
In the next blog, we will extend this to let users select fractional GPU memory via self-service. Instead of requesting a fraction such as 25% of a GPU, they will be able to specify exactly how much GPU memory should be allocated to their resource.
- Free Org: Sign up for a free Org if you want to try this yourself with our Get Started guides.
- Live Demo: Schedule time with us to watch a demo in action.