GPU Benchmarks
The SLURM cluster comes preloaded with the NCCL tests. These tests measure networking performance between GPUs both on a single node and across multiple nodes, using NVLink or InfiniBand.
Run NCCL Tests¶
Follow these steps to run a sample NCCL test. Additional tests are available on the node and can be run as well.
- SSH into the Login node
- Create the following batch script with the name nccl_test.sbatch
- Use SCP to copy the file to the /mnt/data directory on the Login node (an example command is shown after the script below)
#!/bin/bash
#SBATCH --job-name=nccl_multi_node
#SBATCH --output=results/nccl_multi_node-%j.out
#SBATCH --error=results/nccl_multi_node-%j.out
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-node=4
#SBATCH --cpus-per-task=16
# Use 2 InfiniBand queue pairs per connection between ranks
export NCCL_IB_QPS_PER_CONNECTION=2
# Use NVLink SHARP to offload all-reduce to NVSwitch
export NCCL_NVLS_ENABLE=1
# Double buffer size for NCCL communications
export NCCL_BUFFSIZE=8388608
# Prevent MPI from using InfiniBand
export UCX_NET_DEVICES=eth0
# Run a multi-node MPI NCCL test
srun --mpi=pmix \
all_reduce_perf_mpi -b 512M -e 8G -f 2 -g 1
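If you created nccl_test.sbatch on your local machine, step 3 above can be done with a single SCP command. The address login.example.com and the username user below are placeholders for your actual Login node address and account:
scp nccl_test.sbatch user@login.example.com:/mnt/data/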
Now, submit the NCCL test job by running the following command:
sbatch --nodes=4 nccl_test.sbatch
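The script writes its output to the results directory relative to where the job is submitted; create it with mkdir -p results if it does not already exist, since Slurm does not create it automatically. A minimal sketch for monitoring the job, where <jobid> is a placeholder for the job ID printed by sbatch:
# Check whether the job is pending or running
squeue -u $USER
# Follow the benchmark output as the job runs; <jobid> is the ID printed by sbatch
tail -f results/nccl_multi_node-<jobid>.out
The output file reports the measured bandwidth for each message size in the sweep.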
NCCL Test Configuration¶
Worker Nodes¶
The number of worker nodes is controlled by the --nodes option passed when submitting the job (--nodes=4 in the example above). Update the value as needed.
If you run the test on multiple nodes, it uses NVLink for communications between GPUs on the same node and InfiniBand for GPUs on different nodes. To benchmark NVLink specifically, run the test on one node.
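For example, to benchmark NVLink only, submit the same script to a single node:
sbatch --nodes=1 nccl_test.sbatch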
The number of vCPUs used for the test can be set by updating the following parameters in the script. Note that with these values, the test uses 4 x 16 = 64 vCPUs on each worker node.
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=16
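For example, a hypothetical variant that halves the vCPU usage keeps one task per GPU but assigns 8 vCPUs to each task, for 4 x 8 = 32 vCPUs per worker node:
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=8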
Environment variables¶
The script uses the following NCCL and UCX environment variables:
- NCCL_IB_QPS_PER_CONNECTION=2 makes each connection between two ranks (GPU processes) use two InfiniBand queue pairs.
- NCCL_NVLS_ENABLE=1 explicitly enables NVLink SHARP (NVLS), which offloads the all-reduce operation to the NVSwitch domain.
- NCCL_BUFFSIZE=8388608 increases the buffer size for NCCL communications between pairs of GPUs from 4 MiB (default) to 8 MiB.
- UCX_NET_DEVICES=eth0 makes MPI use the eth0 network interface instead of InfiniBand.
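To confirm which transports NCCL actually selects at run time (for example, whether NVLS is used), NCCL's standard debug logging can be enabled. NCCL_DEBUG is not set by the script above; adding it before the srun line is optional:
# Optional: print NCCL initialization details, including the selected transports
export NCCL_DEBUG=INFO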
Test Parameters¶
The all_reduce_perf_mpi test uses the following parameters that you can customize:
- -b, -f and -e: The start size, the increment factor, and the end size of the data that the test uses. For example, -b 512M -f 2 -e 8G means that the first iteration works with 512 MiB of data, which then doubles in size at each following iteration (1 GiB, 2 GiB, 4 GiB) until it reaches 8 GiB.
- -g: The number of GPUs per task.
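As an illustration, a hypothetical run that sweeps smaller messages, starting at 8 bytes and doubling up to 128 MiB, changes only the size parameters in the srun line:
srun --mpi=pmix \
    all_reduce_perf_mpi -b 8 -e 128M -f 2 -g 1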
For more parameters, see the NCCL tests documentation.