Capabilities of the Bare Metal GPU Service

The Bare Metal GPU Service provides a way to consume powerful, pre-configured physical machines that are optimized for advanced AI/ML workloads. These nodes offer full access to GPUs, CPUs, memory, storage, and networking resources with no virtualization overhead.


Key Capabilities

The following capabilities are supported as part of the Bare Metal GPU Service:

| Capability | Description |
| --- | --- |
| Multi-GPU Support | Enables nodes with 1, 4, or 8 high-performance GPUs for scale-out training and inference workloads. |
| Kubernetes Integration | Supports Kubernetes-native workflows; users can deploy workloads using standard manifests and Helm charts. |
| Custom OS Images | Boots bare metal nodes with pre-approved base operating systems such as Ubuntu 22.04 LTS. |
| GPU Sharing (Optional) | Offers full node access by default, but can also support GPU sharing configurations when enabled at the cluster level. |
| High-Speed Interconnects | Nodes are equipped with NVLink, NVSwitch, and NDR InfiniBand for high-bandwidth GPU-to-GPU communication. |
| Dedicated CPU Nodes | Allows provisioning of CPU-only nodes for non-GPU workloads such as orchestration, preprocessing, or storage. |
| User-Controlled Lifecycle | End users can start, stop, and terminate nodes through self-service controls with quota enforcement. |
| Custom Initialization Hooks | Supports bootstrap scripts and environment-specific initialization logic. |
| Telemetry & Monitoring | Integrates with monitoring dashboards and system metrics for observability (requires setup). |
| Networking & Security | Supports workload isolation through Kubernetes namespaces, CNI-based policies, and secure ingress/egress. |
| No Virtualization Overhead | Direct access to hardware ensures maximum performance for demanding AI/ML pipelines. |
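As a sketch of the Kubernetes-native workflow above, the snippet below builds a pod manifest that requests a full eight-GPU node. The manifest is expressed as a Python dict for illustration; the `nvidia.com/gpu` resource name assumes the standard NVIDIA device plugin is installed, and the node selector label and container image are hypothetical.

```python
# Sketch: a pod spec claiming all 8 GPUs on a bare metal node.
# "nvidia.com/gpu" assumes the NVIDIA device plugin; the nodeSelector
# label and image are illustrative, not a required platform convention.
import json

pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "llm-training"},
    "spec": {
        "nodeSelector": {"node-type": "baremetal-gpu"},  # hypothetical label
        "containers": [{
            "name": "trainer",
            "image": "nvcr.io/nvidia/pytorch:24.01-py3",  # illustrative image
            "resources": {
                # Full-node allocation: all 8 GPUs scheduled to this pod.
                "limits": {"nvidia.com/gpu": 8},
            },
        }],
    },
}

print(json.dumps(pod, indent=2))
```

The same manifest could equally be written as YAML and applied with `kubectl apply` or packaged in a Helm chart, per the Kubernetes Integration capability above.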

Supported Workloads

The service is optimized for:

  • Large Language Model (LLM) training and fine-tuning
  • Multi-GPU distributed training jobs
  • High-throughput inference pipelines
  • Data preprocessing and feature engineering
  • Serving orchestration or control plane components
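Multi-GPU distributed training jobs typically follow a data-parallel pattern: each worker computes gradients on its own data shard, then an all-reduce averages the gradients so every worker applies the same update. The CPU-only sketch below illustrates that step with toy numbers; real jobs would run the all-reduce with NCCL over the NVLink/NVSwitch and NDR InfiniBand fabric described above.

```python
# Minimal data-parallel sketch: one "GPU" worker per shard computes a
# local gradient, then an all-reduce averages them across workers.
# Real jobs use NCCL over NVLink / NDR InfiniBand for this exchange.
def local_gradient(shard):
    # Toy gradient: mean of the shard's values.
    return sum(shard) / len(shard)

def all_reduce_mean(grads):
    # What an averaging all-reduce computes across all workers.
    return sum(grads) / len(grads)

shards = [[1.0, 2.0], [3.0, 5.0], [2.0, 2.0], [4.0, 4.0]]  # one shard per GPU
grads = [local_gradient(s) for s in shards]
global_grad = all_reduce_mean(grads)
print(global_grad)  # 2.875 -> identical update applied on every worker
```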

Access Patterns

Users can consume bare metal resources through:

  • Compute Profiles with the baremetal type
  • Environment Templates mapped to supported node types
  • Custom Providers to inject hooks, data, and logic into provisioning workflows
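To make the first access pattern concrete, a Compute Profile with the `baremetal` type might look like the following sketch. The field names are illustrative only, not the exact platform schema; the point is that the profile pins the node class and enforces a per-project quota.

```python
# Sketch of a Compute Profile selecting the baremetal type.
# Field names are illustrative, not the exact Rafay schema.
profile = {
    "name": "gpu-training-large",
    "type": "baremetal",           # selects bare metal nodes, not VMs
    "nodeClass": "gpu-8x",         # hypothetical: 8-GPU node class
    "quota": {"maxNodes": 4},      # self-service limit enforced per project
}

print(profile["type"], profile["quota"]["maxNodes"])
```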

Platform Setup Overview

The platform team is responsible for the initial configuration and enablement of the Bare Metal GPU Service. This setup includes onboarding physical nodes into the Rafay platform, defining system-level resource pools (such as public IP pools and VLANs), configuring networking interfaces (including DPUs), and enabling self-service compute profiles for specific projects.

The architecture typically involves physically provisioned servers with GPU and CPU roles, high-speed interconnects (e.g., NVLink, NDR InfiniBand), and secure tenant-facing network configurations. The platform ensures these resources are exposed to end users in a controlled and quota-enforced manner.

The following sequence diagram outlines the high-level process for preparing the platform for bare metal consumption:

```mermaid
sequenceDiagram
    participant Admin as NCP-Admin
    participant Infra as Bare Metal Infrastructure
    participant Rafay as Rafay Platform

    Admin->>Infra: Rack & Provision Bare Metal Servers
    Admin->>Infra: Install Base OS (e.g., Ubuntu 22.04 LTS)

    Admin->>Infra: Setup Networking (VLANs, IP Pools, DPU Config)
    Admin->>Infra: Attach High-Speed Storage (e.g., NVMe, Ceph)

    Admin->>Rafay: Register Bare Metal Node Resources
    Rafay-->>Infra: Perform Hardware Discovery and Validation

    Admin->>Rafay: Configure Compute Profiles (baremetal type)
    Admin->>Rafay: Setup Environment Templates and Custom Init Hooks

    Admin->>Rafay: Provision Workload Environments using Bare Metal Nodes
    Rafay->>Infra: Bootstrap Kubernetes, System Components, GPU Drivers

    Rafay-->>Admin: Nodes Ready for AI/ML Workloads
```
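The "Perform Hardware Discovery and Validation" step in the diagram can be sketched as a simple inventory check: the platform compares what it discovers on each registered node against the expected specification. The expected values and field names below are illustrative.

```python
# Illustrative sketch of hardware discovery and validation: compare a
# node's discovered inventory against the expected specification.
# Expected values and field names are hypothetical.
EXPECTED = {"gpus": 8, "os": "Ubuntu 22.04 LTS", "interconnect": "NDR InfiniBand"}

def validate_node(inventory):
    """Return the list of fields where discovered hardware mismatches."""
    return [key for key, want in EXPECTED.items() if inventory.get(key) != want]

discovered = {"gpus": 8, "os": "Ubuntu 22.04 LTS", "interconnect": "NDR InfiniBand"}
print(validate_node(discovered))  # [] -> node passes validation
```

A node that fails validation (for example, fewer GPUs than registered) would be flagged before it is exposed to end users through compute profiles.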

Supported Integrations

| Integration | Availability |
| --- | --- |
| GitOps Workflows | ✅ Supported |
| Service Account Injection | ✅ Supported |
| Container Runtime Options (e.g., Kata) | ⚙️ Configurable (on request) |
| GPU Monitoring Dashboards | ⚙️ Requires setup |
| Storage Plugins (e.g., CSI) | ✅ Supported |

Summary

The Bare Metal GPU Service is designed for users who need full hardware access and control to maximize AI/ML performance. It supports highly parallelized workloads with multiple GPUs, dedicated networking, and deep customization for training pipelines and inference systems.