Skip to content

Nvidia

NVIDIA Performance Reference Architecture: An Introduction

Artificial intelligence (AI) and high-performance computing (HPC) workloads are evolving at unprecedented speed. Enterprises today require infrastructure that can scale elastically, provide consistent performance, and ensure secure multi-tenant operation. NVIDIA’s Performance Reference Architecture (PRA), built on HGX platforms with Shared NVSwitch GPU Passthrough Virtualization, delivers precisely this capability.

This is the introductory blog in a multi part series. In this blog, we explain why PRA is critical for modern enterprises and service providers, highlight the benefits of adoption, and outline the key steps required to successfully deploy and support the PRA design/architecture.

Deep Dive into nvidia-smi: Monitoring Your NVIDIA GPU with Real Examples

Whether you're training deep learning models, running simulations, or just curious about your GPU's performance, nvidia-smi is your go-to command-line tool. Short for NVIDIA System Management Interface, this utility provides essential real-time information about your NVIDIA GPU’s health, workload, and performance.

In this blog, we’ll explore what nvidia-smi is, how to use it, and walk through a real output from a system using an NVIDIA T1000 8GB GPU.


What is nvidia-smi?

nvidia-smi is a CLI utility bundled with the NVIDIA driver. It enables:

  • Real-time GPU monitoring
  • Driver and CUDA version discovery
  • Process visibility and control
  • GPU configuration and performance tuning

You can execute it using:

nvidia-smi

Choosing Your Engine for LLM Inference: The Ultimate vLLM vs. TensorRT LLM Guide

This is the next blog in the series of blogs on LLMs and Generative AI. When deploying large language models (LLMs) for inference, it is critical to consider: efficiency, scalability, and performance. Users will likely be very familiar with two market leading options: vLLM and Nvidia's TensorRT LLM.

In this blog, we dive into their pros and cons, helping users select the most appropriate option for their use case.

vLLM vs TensorRT LLM

Fractional GPUs using Nvidia's KAI Scheduler

At KubeCon Europe, in April 2025, Nvidia announced and launched the Kubernetes AI (KAI) Scheduler. This is an Open Source project maintained by Nvidia.

The KAI Scheduler is an advanced Kubernetes scheduler that allows administrators of Kubernetes clusters to dynamically allocate GPU resources to workloads. Users of the Rafay Platform can immediately leverage the KAI scheduler via the integrated Catalog.

KAI in Catalog

To help you understand the basics quickly, we have also created a brief video introducing the concepts and a live demonstration showcasing how you can allocate fractional GPU resources to workloads.

Spatial Partitioning of GPUs using Nvidia MIG

In the prior blogs, we discussed why GPUs are managed differently in Kubernetes, how the GPU Operator helps streamline management and various strategies to share GPUs on Kubernetes. In 2020, Nvidia introduced Multi-Instance GPU (MIG) that takes GPU sharing to a different level.

In this blog, we will start by reviewing some common industry use cases where MIG is used and then dive deeper into how MIG is configured and used.

Nvidia MIG

GPU Sharing Strategies in Kubernetes

In the previous blogs, we discussed why GPUs are managed differently in Kubernetes and how the GPU Operator can help streamline management. In Kubernetes, although you can request fractional CPU units for workloads, you cannot request fractional GPU units.

Pod manifests must request GPU resources in integers which results in an entire physical GPU allocated to one container even if the container only requires a fraction of the resources. In this blog, we will describe two popular and commonly used strategies to share a GPU on Kubernetes.

GPU Sharing in Kubernetes

GPU Metrics - SM Clock

In the previous blog, we discussed why tracking and reporting GPU Memory Utilization metrics matters. In this blog, we will dive deeper into another critical GPU metric i.e. GPU SM Clock. The GPU SM clock (Streaming Multiprocessor clock) metric refers to the clock speed at which the GPU's cores (SMs) are running.

The SM is the main processing unit of the GPU, responsible for executing compute tasks such as deep learning operations, simulations, and graphics rendering. Monitoring the SM clock speed can help users assess the performance and health of your GPU during workloads and detect potential bottlenecks related to clock speed throttling.

GPU SM Clock

Important

Navigate to documentation for Rafay's integrated capabilities for Multi Cluster GPU Metrics Aggregation & Visualization.

GPU Metrics - Memory Utilization

In the introductory blog on GPU metrics, we discussed about the GPU metrics that matter and why they matter. In this blog, we will dive deeper into one of the critical GPU metrics i.e. GPU Memory Utilization.

GPU memory utilization refers to the percentage of the GPU’s dedicated memory (i.e. framebuffer) that is currently in use. It measures how much of the available GPU memory is occupied by data such as models, textures, tensors, or intermediate results during computation.

GPU Memory Utilization

Important

Navigate to documentation for Rafay's integrated capabilities for Multi Cluster GPU Metrics Aggregation & Visualization.