Introduction to Slurm: The Backbone of HPC
This is the first part of a two-part blog series on Slurm, in which we cover some introductory concepts. We are not talking about the fictional soft drink from the world of Futurama. Instead, this blog is about Slurm (Simple Linux Utility for Resource Management), an open-source, fault-tolerant, and highly scalable job scheduler and cluster resource manager used in high-performance computing (HPC) environments.
Slurm was originally conceived in 2002 at Lawrence Livermore National Laboratory (LLNL) and has since been actively developed and maintained, primarily by SchedMD. Over that time, Slurm has become the de facto workload manager for HPC, with more than 50% of the TOP500 supercomputers using it.
Need for Workload Management
At its core, Slurm is essentially a workload manager. So, what problems do we expect a workload manager like Slurm to solve?
1. Unified Entry Point
A workload manager needs to act as the entry point for users so that they can access the compute resources in the data center.
2. Manage Hardware Resources
A workload manager needs to manage (i.e. delegate) hardware resources effectively, ensuring that all tasks follow the specified rules.
3. Scheduling
A workload manager will implement scheduling algorithms for allocating resources and executing user workloads.
4. Logic Delegation
Use cases such as model training require meticulous management of a vast stack of hardware and software. It does not make sense for every data scientist to implement and manage this logic in their own code. Instead, they can delegate the complexity to a common workload manager such as Slurm.
5. Resource Sharing
In order to perform HPC tasks or train a model, you need access to extremely expensive hardware (e.g. servers, GPUs, InfiniBand switches, low-latency networks, high-speed storage). A workload manager ensures that these resources are shared efficiently by all users.
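To make the delegation idea concrete, here is a minimal sketch of a Slurm batch script. The partition name, resource counts, and `train.py` are hypothetical; the point is that the user declares requirements and lets Slurm handle placement.

```shell
#!/bin/bash
#SBATCH --job-name=train-model     # name shown in the queue
#SBATCH --partition=gpu            # hypothetical partition name
#SBATCH --nodes=2                  # request two nodes
#SBATCH --ntasks-per-node=4       # four tasks per node
#SBATCH --gres=gpu:4               # four GPUs per node
#SBATCH --time=04:00:00            # wall-clock limit
#SBATCH --output=%x-%j.out         # log file: jobname-jobid.out

# The user only declares *what* they need; Slurm decides *where*
# and *when* the job runs.
srun python train.py
```

The script would be submitted with `sbatch train.sh`; Slurm queues the job until the requested resources become free.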
Architecture & Components
Users interact with Slurm by connecting to one of the nodes in the cluster. Unlike Kubernetes, where users can interact with the cluster remotely using the kubectl CLI, Slurm users typically connect to the cluster via SSH and issue commands there.
Organizations may prefer to set up separate login nodes, which require users to authenticate before they can use Slurm.
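Once logged in, a typical session involves a handful of CLI commands. A quick sketch (output and job IDs will of course vary per cluster; `12345` is an example):

```shell
sinfo                        # list partitions and node states
squeue --me                  # show your own queued/running jobs
srun --ntasks=1 hostname     # run a quick one-task job interactively
scancel 12345                # cancel a job by its (example) job ID
```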
In the image below, you can see the various components of Slurm and how they interact with each other.
- The control plane (i.e. the master) in Slurm comprises the controller daemon slurmctld and the database daemon slurmdbd.
- A SQL database is used to persist job history and statistics. The database is also required for advanced functionality (e.g. QoS and resource limits) in multi-tenant clusters.
- The compute nodes run the slurmd daemon and are roughly equivalent to Kubernetes worker nodes.
One of Slurm's superpowers is that it splits a job into multiple steps, which can be configured to execute either in parallel or sequentially. This parallelism allows complex tasks to be completed quickly.
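As an illustration, the sketch below runs two job steps in parallel inside a single allocation, then a sequential final step (the program names are hypothetical):

```shell
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=8

# Each srun invocation is a job step within the same allocation.
# Backgrounding the steps with '&' lets them run in parallel.
srun --ntasks=4 ./preprocess-shard-a &
srun --ntasks=4 ./preprocess-shard-b &
wait                              # block until both steps finish

srun --ntasks=8 ./merge-results   # a sequential final step
```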
Where Slurm Shines
The Slurm scheduler can handle immense scale and has been battle-tested on massive supercomputers. Handling ~10,000 nodes and hundreds of jobs per second is considered routine for Slurm.
Users can configure Slurm in a very fine-grained manner. For example, Slurm can even distinguish between CPU sockets, cores, and hyperthreads.
- As a user, you can select a CPU that is in proximity to a PCI bus (reducing latency)
- Slurm also supports GPU sharding
- Slurm understands network topology (i.e. you can ask for a group of nodes that are closest to each other from a network-latency point of view, ensuring that data transfer traverses the fewest possible switches).
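These controls surface as flags on srun and sbatch. A hedged sketch (the exact gres names depend on how the cluster administrator has configured the nodes):

```shell
# Spell out the CPU layout and pin tasks to physical cores.
srun --sockets-per-node=2 --cores-per-socket=16 \
     --threads-per-core=1 --cpu-bind=cores ./cpu_job

# Request a shard of a GPU (requires gres/shard to be configured).
srun --gres=shard:1 ./small_job

# Ask for nodes reachable through at most one switch for low latency.
sbatch --switches=1 --nodes=16 job.sh
```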
Slurm for the New World
Slurm is extremely well suited for time-bound, HPC-type workloads running on infrastructure where capacity is known and fixed. In today's world, however, the vast majority of modern applications are containerized and operated in Kubernetes. Let's now discuss whether Slurm is a good fit for some of the new use cases, specifically for AI/ML.
Training vs Inference
For AI/ML use cases, although Slurm is extremely effective for training, it is not well suited for inference. Teams that train models usually want to serve them as well; since inference does not work well on Slurm, they have to maintain a separate serving system that also needs expensive hardware resources.
Identical Nodes Requirement
In Slurm, all nodes are expected to be identical (i.e. same software, OS, and configuration). This can be operationally difficult to set up and maintain, especially in bare-metal environments. Common administrative activities such as OS patching may require scheduling downtime.
Native Python Support
The user experience in Slurm is primarily via a CLI and bash scripts. ML engineers who are trained on Python and expect a native Pythonic experience may find this jarring, because they have to context-switch out of their IDE of choice.
Auto Scaling
Slurm is primarily designed for fixed-scale environments, with limited support for auto scaling.
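What elasticity exists comes from Slurm's power-saving/cloud-node mechanism, where the administrator supplies scripts that provision and tear down nodes on demand. A hedged sketch of the relevant slurm.conf fragment (the script paths, node names, and sizes below are hypothetical):

```
# slurm.conf excerpt: power-save based "cloud bursting"
SuspendProgram=/opt/slurm/bin/node-teardown.sh   # hypothetical script
ResumeProgram=/opt/slurm/bin/node-provision.sh   # hypothetical script
SuspendTime=600          # idle seconds before a node is suspended
ResumeTimeout=300        # seconds allowed for a node to come up
NodeName=cloud[001-064] State=CLOUD CPUs=32 RealMemory=131072
PartitionName=elastic Nodes=cloud[001-064] Default=NO
```

This is node power management pressed into service as auto scaling, not the demand-driven elasticity Kubernetes users expect.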
Conclusion
In this blog, we discussed the basics of Slurm, its design and components, why it is so popular, and its similarities to Kubernetes. As Kubernetes becomes pervasive and the de facto default in organizations, the big question on everyone's mind is whether Slurm can be made to work on/with Kubernetes.
In the next blog, we will look at how organizations can use Rafay's PaaS with Project Slinky to provide their users with a self-service experience for accessing Slurm clusters operating on Kubernetes.
- Free Org: Sign up for a free Org if you want to try this yourself with our Get Started guides.
- Live Demo: Schedule time with us to watch a demo in action.