Setup
The following sections outline the steps needed to set up the SLURM template.
Below are the minimum required settings for a reliable, functioning environment. Users can adjust these settings to scale the environment to the performance and capacity required.
Minimum Requirements
We have tested extensively with various types of deployments. We recommend using at least the minimal configuration described below for a reliable, stable system.
Servers
At minimum, one server with the specifications below is required. Additional servers can be added to scale the cluster as needed.
| Requirement | Details |
|---|---|
| Server OS | Ubuntu 22.04 |
| CPU | 16 Cores |
| Memory | 64 GB |
| Storage (Shared StorageClass with RWX) | 200 GB |
| Public IP | 1 address with ports 80, 443, and 30000-32767 open |
Storage
The SLURM cluster nodes (login and compute) provide users with access to a shared file system. Ensure a CSI driver is installed and configured on the host Kubernetes cluster, along with a StorageClass that supports RWX (ReadWriteMany) access. For example, Rook-Ceph with a shared filesystem is a good option. Rook-Ceph can be installed using a custom cluster blueprint; an example blueprint is provided later in this document. If that example blueprint is used for Rook-Ceph storage, the server must be configured with at least one raw block device of at least 200 GB.
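As an illustrative sketch (not part of the template itself), a PersistentVolumeClaim requesting RWX access from a Rook-Ceph shared-filesystem StorageClass could look like the following. The StorageClass name rook-cephfs is an assumption and must match the name created by your CSI installation:

```yaml
# Hypothetical PVC illustrating RWX access via a shared StorageClass.
# The StorageClass name "rook-cephfs" is an assumption; use the name
# created by your CSI installation.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: slurm-shared-fs
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: rook-cephfs
  resources:
    requests:
      storage: 200Gi
```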
Ports
Ensure ports 80, 443, and 30000-32767 are open on the public IPs of the Kubernetes cluster. NodePort services are used to let users log in to the SLURM cluster over SSH.
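As a sketch of how NodePort-based SSH access could be exposed (the names, selector labels, and port value below are hypothetical, not the template's actual manifests), a Service in the 30000-32767 range might look like:

```yaml
# Hypothetical Service exposing SSH on a SLURM login node via NodePort.
# The selector labels and nodePort value are assumptions for illustration.
apiVersion: v1
kind: Service
metadata:
  name: slurm-login-ssh
spec:
  type: NodePort
  selector:
    app: slurm-login
  ports:
    - name: ssh
      protocol: TCP
      port: 22
      targetPort: 22
      nodePort: 32022
```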
DNS
Configure the Ingress Controller with TLS certificates applied for the domain that will be used to present users with a URL for the Grafana-based monitoring dashboard. Ensure a wildcard DNS record for the domain points to the cluster's public IPs.
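For example, in BIND zone-file syntax a wildcard record maps every hostname under the domain to the cluster's public IP. The domain and IP below are placeholders; substitute your own:

```
; Wildcard record (placeholders): replace the domain and IP with your own.
*.slurm.example.com.  300  IN  A  203.0.113.10
```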
Cluster Add-Ons
The following software add-ons must be installed and configured on the host cluster. These are typically packaged as a Rafay Cluster Blueprint for consistency and repeatability.
Ingress Controller
Install an Ingress Controller (e.g., NGINX) on the host cluster.
GPU Operator
Install a GPU operator on the host cluster, for example NVIDIA's GPU Operator.
Storage CSI
Install a storage CSI driver that provides a StorageClass with RWX access, for example Rook-Ceph.
Add Servers to Inventory
First, add the pre-provisioned servers to the console's inventory management layer. Once the servers are added to inventory, they are available for use by SKUs within the platform.
- Login to the Ops Console
- Navigate to Inventory -> Data Centers
- Select the Datacenter to add the servers into
- Navigate to the Servers tab
- Click Add Server
- On the Properties tab, enter the following fields under Basic Information
- Name - The name of the server
- Allocation Status - Set to Available, as this is a new server that has not been allocated yet
- Enter the following fields under Authentication
- Username - The username to access the server via SSH
- Password - The password for the user (OPTIONAL if SSH key is provided)
- SSH Key - The Private SSH key to access the server (OPTIONAL if password is provided)
- Enter the following fields under IP Management
- Public IP - The Public IP address of the server
- Private IP - The Private IP address of the server
- Navigate to the Tags tab
- Click Add Tags under the Tags section
- Enter the Key gpu_type
- Enter a value for the GPU type, such as "H200". Note that this value must match the value used within the SKU during deployment.
- Optionally, add a tag with the key public_ip if the IP used to access the SLURM nodes differs from the server's SSH management IP. This address is used as the public IP for accessing the SLURM nodes; SSH access to the underlying server still uses the public IP set earlier in the IP Management section.
- Click Add Server
Configure Profile
In this section, verify that the SLURM template is loaded into the organization.
- Navigate to PaaS Studio -> Compute Profiles
- You should see a profile named SLURM as a Service
Info
If you do not see this profile loaded, contact Rafay Support to assist in loading the template.
- Click on the profile name to open the profile
- Under the Input Settings section, configure the variable values to match your environment. The following values should be updated at a minimum:
  - API Key
  - Blueprint Name
  - Blueprint Version
  - Controller Endpoint
  - Domain
  - GPU Type
  - Shared Storageclass Name
  - Storageclass Name
- Click Save Changes
Configure Dimension-Based Pricing
Next, configure dimension-based pricing on an existing profile within your default organization. Pricing is configured within the Global Settings of the default Org.
- In your Default Org, navigate to System -> Global Settings
- Add YAML similar to the following, being sure to update the prices for your environment. The dimension names must match the names of the input parameters within the profile.
- Click Save
The base_unit is the divisor used to determine the number of billable units. For example, if the user selects 2 for No of Nodes and the base_unit is 1, the user is billed at 2x the price rate (2/1 = 2). The time_unit can be m, h, or d (minute/hour/day).
- name: mks-oneclick-slurm
  billing:
    currency:
      - USD
    dimensions:
      - GPU Type
      - No Of Nodes
    ratecard:
      No Of Nodes:
        - price: 2
          time_unit: h
          base_unit: 1
          currency: USD
      GPU Type:
        - price: 1
          time_unit: h
          base_value: 1
          count_from: "GPU Count"
          currency: USD
          value: A40
        - price: 1.5
          time_unit: h
          base_value: 1
          count_from: "GPU Count"
          currency: USD
          value: H100
        - price: 2.5
          time_unit: h
          base_value: 1
          count_from: "GPU Count"
          currency: USD
          value: H200
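The base_unit arithmetic described above can be sketched in a few lines. This is only an illustration of the described calculation, not the platform's actual billing code:

```python
def billable_cost(quantity: float, price: float, base_unit: float, duration: float) -> float:
    """Illustrative sketch of the rate-card arithmetic: (quantity / base_unit)
    gives the number of billable units, each charged `price` per time_unit,
    for `duration` time units."""
    units = quantity / base_unit
    return units * price * duration

# 2 nodes at $2/h with base_unit 1, running for 3 hours:
print(billable_cost(quantity=2, price=2, base_unit=1, duration=3))  # 12.0
```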
From the Developer Hub, users will now see pricing details when deploying the SKU.