Setup
The following sections outline the steps needed to set up the SLURM template.
Below are the minimum required settings for a reliable, functioning environment. Users can adjust these settings to scale the environment to the performance and capacity required.
Minimum Requirements
We have tested extensively with various types of deployments. We recommend using at least the minimal configuration described below for a reliable, stable system.
Servers
At minimum, one server with the specifications below is required. Additional servers can be added to scale the cluster as needed.
| Requirement | Details |
|---|---|
| Server OS | Ubuntu 22.04 |
| CPU | 16 Cores |
| Memory | 64 GB |
| Storage (Shared StorageClass with RWX) | 200 GB |
| Public IP | 1 address with ports 80, 443, and 30000-32767 open |
Storage
The SLURM cluster nodes (login and compute) provide users with access to a shared file system. Ensure a CSI driver is installed and configured on the host Kubernetes cluster, along with a StorageClass that supports RWX (ReadWriteMany) access. For example, Rook-Ceph with a shared filesystem is a good option. Rook-Ceph can be installed using a custom cluster blueprint; an example blueprint is provided later in this document. If that example blueprint is used for Rook-Ceph storage, the server must be configured with at least one raw block device of at least 200 GB.
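As an illustrative sketch (not part of the template itself), a PersistentVolumeClaim requesting RWX access from a Rook-Ceph shared-filesystem StorageClass could look like the following. The StorageClass name rook-cephfs is an assumption and must match the name created by your CSI installation:

```yaml
# Hypothetical PVC illustrating RWX access via a shared StorageClass.
# The StorageClass name "rook-cephfs" is an assumption; use the name
# created by your CSI installation.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: slurm-shared-fs
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: rook-cephfs
  resources:
    requests:
      storage: 200Gi
```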
Ports
Ensure ports 80, 443, and 30000-32767 are open on the public IPs of the Kubernetes cluster. NodePort services are used to let users log in to the SLURM cluster over SSH.
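As a sketch of how NodePort-based SSH access could be exposed (the names, selector labels, and port value below are hypothetical, not the template's actual manifests), a Service in the 30000-32767 range might look like:

```yaml
# Hypothetical Service exposing SSH on a SLURM login node via NodePort.
# The selector labels and nodePort value are assumptions for illustration.
apiVersion: v1
kind: Service
metadata:
  name: slurm-login-ssh
spec:
  type: NodePort
  selector:
    app: slurm-login
  ports:
    - name: ssh
      protocol: TCP
      port: 22
      targetPort: 22
      nodePort: 32022
```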
DNS
Configure the Ingress Controller with TLS certificates applied for the domain that will be used to present users with a URL for the Grafana-based monitoring dashboard. Ensure a wildcard DNS record for the domain points to the cluster's public IPs.
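For example, in BIND zone-file syntax a wildcard record maps every hostname under the domain to the cluster's public IP. The domain and IP below are placeholders; substitute your own:

```
; Wildcard record (placeholders): replace the domain and IP with your own.
*.slurm.example.com.  300  IN  A  203.0.113.10
```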
Cluster Add-Ons
The following software add-ons must be installed and configured on the host cluster. These are typically packaged as a Rafay Cluster Blueprint for consistency and repeatability.
Ingress Controller
Install an Ingress Controller (e.g., NGINX) on the host cluster.
GPU Operator
Install a GPU operator on the host cluster, for example NVIDIA's GPU Operator.
Storage CSI
Install a storage CSI driver that provides a StorageClass with RWX access, for example Rook-Ceph.
Add Servers to Inventory
First, add the pre-provisioned servers to the console's inventory management layer. Once the servers are added to inventory, they are available for use by SKUs within the platform.
- Login to the Ops Console
- Navigate to Inventory -> Data Centers
- Select the Datacenter to add the servers into
- Navigate to the Servers tab
- Click Add Server
- On the Properties tab, enter the following fields under Basic Information
- Name - The name of the server
- Allocation Status - Set to Available, as this is a new server that has not been allocated yet
- Enter the following fields under Authentication
- Username - The username to access the server via SSH
- Password - The password for the user (OPTIONAL if SSH key is provided)
- SSH Key - The Private SSH key to access the server (OPTIONAL if password is provided)
- Enter the following fields under IP Management
- Public IP - The Public IP address of the server
- Private IP - The Private IP address of the server
- Navigate to the Tags tab
- Click Add Tags under the Tags section
- Enter the Key gpu_type
- Enter a value for the GPU type, such as "H200". Note that this value must match the value used within the SKU during deployment.
- Optionally, add a tag with the key public_ip if the IP used to access the SLURM nodes differs from the server's SSH management IP. This address is used as the public IP for accessing the SLURM nodes; SSH access to the underlying server still uses the public IP set earlier in the IP Management section.
- Click Add Server
Configure Profile
In this section, verify that the SLURM template is loaded into the organization.
- Navigate to PaaS Studio -> Compute Profiles
- You should see a profile named SLURM as a Service
Info
If you do not see this profile loaded, contact Rafay Support to assist in loading the template.
- Click on the profile name to open the profile
- Under the Input Settings section, configure the variable values to match your environment. The following values should be updated at a minimum:
  - API Key
  - Blueprint Name
  - Blueprint Version
  - Controller Endpoint
  - Domain
  - GPU Type
  - Shared Storageclass Name
  - Storageclass Name
- Click Save Changes
Configure Dimension-Based Pricing
Next, configure dimension-based pricing on an existing profile within your default organization. Pricing is configured within the Global Settings of the default Org.
- In your Default Org, navigate to System -> Global Settings
- Add YAML similar to the following, being sure to update the prices for your environment. The dimension names must match the names of the input parameters within the profile.
- Click Save
The base_unit is the divisor used to determine the number of billable units. For example, if the user selects 2 for No of Nodes and the base_unit is 1, the user is billed at 2x the price rate (2/1 = 2). The time_unit can be m, h, or d (minute/hour/day).
- name: mks-oneclick-slurm
  billing:
    currency:
      - USD
    dimensions:
      - GPU Type
      - No Of Nodes
    ratecard:
      No Of Nodes:
        - price: 2
          time_unit: h
          base_unit: 1
          currency: USD
      GPU Type:
        - price: 1
          time_unit: h
          base_value: 1
          count_from: "GPU Count"
          currency: USD
          value: A40
        - price: 1.5
          time_unit: h
          base_value: 1
          count_from: "GPU Count"
          currency: USD
          value: H100
        - price: 2.5
          time_unit: h
          base_value: 1
          count_from: "GPU Count"
          currency: USD
          value: H200
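The base_unit arithmetic described above can be sketched in a few lines. This is only an illustration of the described calculation, not the platform's actual billing code:

```python
def billable_cost(quantity: float, price: float, base_unit: float, duration: float) -> float:
    """Illustrative sketch of the rate-card arithmetic: (quantity / base_unit)
    gives the number of billable units, each charged `price` per time_unit,
    for `duration` time units."""
    units = quantity / base_unit
    return units * price * duration

# 2 nodes at $2/h with base_unit 1, running for 3 hours:
print(billable_cost(quantity=2, price=2, base_unit=1, duration=3))  # 12.0
```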
From the Developer Hub, users will now see pricing details when deploying the SKU.