Setup

This document describes how a GPU Cloud admin prepares a stock Ubuntu OS image for automated GPU and system metrics collection in the Rafay GPU PaaS environment. On the prepared Ubuntu VM, GPU and VM metrics are exported via DCGM Exporter and forwarded by the OpenTelemetry Collector to Rafay’s Time Series backend.

Info

Once you have completed the steps below, your Ubuntu OS image will be fully equipped to collect, export, and transmit GPU and VM telemetry. Both dcgm-exporter and otelcol will run as background services. The data will automatically appear in your Rafay GPU PaaS dashboards for monitoring and alerting.


1. Add NVIDIA CUDA Repository Keyring

curl -s -L https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb -o cuda-keyring.deb
dpkg -i cuda-keyring.deb
apt-get update

What & Why:

This adds the official NVIDIA CUDA software repository and its GPG key to your system so that you can securely install the NVIDIA GPU drivers, toolkit, and other utilities.
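
The keyring URL above hardcodes the ubuntu2404 repository path. As a minimal sketch, the helper below builds the same URL from an Ubuntu version string. The function name cuda_keyring_url is hypothetical, and the sketch assumes NVIDIA publishes the same keyring package version (1.1-1) under the directory for your release, which you should confirm before relying on it.

```shell
# Hypothetical helper: build the cuda-keyring URL for an Ubuntu release.
cuda_keyring_url() {
  # $1: Ubuntu VERSION_ID such as "24.04"; $2: optional repo architecture.
  distro="ubuntu$(printf '%s' "$1" | tr -d '.')"   # "24.04" -> "ubuntu2404"
  arch="${2:-x86_64}"
  printf 'https://developer.download.nvidia.com/compute/cuda/repos/%s/%s/cuda-keyring_1.1-1_all.deb\n' "$distro" "$arch"
}

# Print the URL for the running system, but only when it is Ubuntu.
if [ -r /etc/os-release ]; then
  (
    . /etc/os-release
    if [ "${ID:-}" = "ubuntu" ]; then
      cuda_keyring_url "$VERSION_ID"
    fi
  )
fi
```

For version 24.04 the helper reproduces the URL used in the curl command above.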


2. Install GPU Drivers and Container Support

apt-get install -y nvidia-driver-550
apt-get install -y nvidia-container-toolkit
apt-get install -y cuda-toolkit
reboot

What & Why:

  • nvidia-driver-550: Installs the 550-series NVIDIA production GPU driver.
  • nvidia-container-toolkit: Enables GPU usage within containers (for future containerized telemetry if needed).
  • cuda-toolkit: Required for low-level GPU support and development tools.
  • reboot: Applies driver and kernel module changes.
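
After the reboot, it can help to confirm that the driver actually loaded before installing DCGM. The following is a small sketch; check_gpu_driver is a hypothetical helper name, and it relies only on the standard nvidia-smi query options.

```shell
# Hypothetical helper: report the GPU model and driver version, or explain
# why they could not be read.
check_gpu_driver() {
  if ! command -v nvidia-smi >/dev/null 2>&1; then
    echo "nvidia-smi not found: the driver is not installed or not on PATH"
    return 1
  fi
  # One line per GPU: "<name>, <driver version>"
  nvidia-smi --query-gpu=name,driver_version --format=csv,noheader \
    || { echo "nvidia-smi failed: the kernel module may not be loaded"; return 1; }
}

check_gpu_driver || true
```

If the command lists each GPU with a 550.x driver version, the installation from this step took effect.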

3. Install and Start NVIDIA DCGM (Data Center GPU Manager)

apt-get install -y datacenter-gpu-manager
systemctl enable --now nvidia-dcgm
dcgmi profile --resume

What & Why:

  • DCGM provides a high-performance interface for monitoring NVIDIA GPUs.
  • The dcgmi profile --resume command resumes DCGM's collection of profiling metrics in case it was previously paused.
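
To check that the DCGM host engine is running and can see the GPUs, something like the sketch below may be useful. check_dcgm is a hypothetical helper name; the service name nvidia-dcgm is the one enabled in this step.

```shell
# Hypothetical helper: show the DCGM service state and the GPUs it discovers.
check_dcgm() {
  if ! command -v dcgmi >/dev/null 2>&1; then
    echo "dcgmi not found: install datacenter-gpu-manager first"
    return 1
  fi
  # systemctl is-active prints the unit state (active, inactive, failed, ...).
  echo "nvidia-dcgm service: $(systemctl is-active nvidia-dcgm 2>/dev/null)"
  # List the GPUs known to the DCGM host engine.
  dcgmi discovery -l
}

check_dcgm || true
```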

4. Download and Set Up the DCGM Exporter

mkdir tmp
cd tmp/
wget https://dev-rafay-controller.s3.us-west-1.amazonaws.com/dcgm-exporter-files/default-counters.csv
wget https://dev-rafay-controller.s3.us-west-1.amazonaws.com/dcgm-exporter-files/dcgm-exporter
chmod +x dcgm-exporter

What & Why:

  • dcgm-exporter is a lightweight binary that exposes GPU metrics over a Prometheus-compatible endpoint.
  • default-counters.csv defines which GPU metrics to export.
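
To see which metrics the downloaded file enables, the sketch below prints the field names it contains. It assumes the file follows the upstream dcgm-exporter layout (lines starting with # are comments, other lines are "DCGM field, Prometheus metric type, help text"); the file hosted at the URL above may differ. list_counters is a hypothetical helper name.

```shell
# Hypothetical helper: print the DCGM field names enabled in a counters file.
# Assumes the upstream format: "FIELD, type, help text", with # for comments.
list_counters() {
  awk -F',' '!/^[ \t]*#/ && NF >= 2 {
    gsub(/^[ \t]+|[ \t]+$/, "", $1)   # trim whitespace around the field name
    print $1
  }' "$1"
}

# Usage: run from the directory that holds the downloaded file.
if [ -f default-counters.csv ]; then
  list_counters default-counters.csv
fi
```

Each name printed here (for example DCGM_FI_DEV_GPU_UTIL in the upstream file) is a metric the exporter will publish.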

5. Move Exporter Files to Correct Locations

mkdir -p /etc/dcgm-exporter/
cp -rp default-counters.csv /etc/dcgm-exporter/default-counters.csv
cp -rp dcgm-exporter /usr/bin/

What & Why:

  • Copies the configuration file and the executable to their system locations.
  • Ensures the service can be run from /usr/bin and configured via /etc/dcgm-exporter.

6. Set Up DCGM Exporter as a Systemd Service

DCGM_EXPORTER_SERVICE="/etc/systemd/system/dcgm-exporter.service"
cat <<EOF | tee $DCGM_EXPORTER_SERVICE
[Unit]
Description=DCGM Exporter Service
After=network.target

[Service]
ExecStart=/usr/bin/dcgm-exporter
Restart=always
User=root
WorkingDirectory=/usr/bin
StandardOutput=syslog
StandardError=syslog
SyslogIdentifier=dcgm-exporter

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable --now dcgm-exporter

What & Why:

  • Defines a persistent systemd service for dcgm-exporter so it starts on boot and restarts automatically on failure.
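  • Logs are tagged dcgm-exporter for traceability. On current systemd releases, where the syslog output target is deprecated, they land in the journal and can be read with journalctl -u dcgm-exporter.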
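
Once the service is running, a quick way to confirm that metrics are being exposed is to query the exporter's HTTP endpoint. This sketch assumes the binary keeps the upstream dcgm-exporter default of listening on port 9400 at /metrics; count_dcgm_samples is a hypothetical helper name.

```shell
# Assumption: upstream default listen address and path.
METRICS_URL="http://localhost:9400/metrics"

# Hypothetical helper: count metric sample lines (not # HELP / # TYPE lines)
# read from standard input.
count_dcgm_samples() {
  grep -c '^DCGM_FI_'
}

if metrics="$(curl -fsS --max-time 5 "$METRICS_URL" 2>/dev/null)"; then
  echo "DCGM samples exposed: $(printf '%s\n' "$metrics" | count_dcgm_samples || true)"
else
  echo "exporter not reachable at $METRICS_URL; check: systemctl status dcgm-exporter"
fi
```

A non-zero sample count indicates that the exporter is reading from DCGM and publishing the counters defined in default-counters.csv.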

7. Install OpenTelemetry Collector (OTel)

wget https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/v0.123.0/otelcol_0.123.0_linux_amd64.deb
dpkg -i otelcol_0.123.0_linux_amd64.deb

What & Why:

  • otelcol collects and forwards metrics from the DCGM exporter (and optionally host metrics) to the Rafay observability backend.
  • This is the telemetry agent that integrates the metrics pipeline end-to-end.
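
The package installs the collector but does not yet tell it what to scrape or where to send data. The sketch below writes an illustrative configuration to a local file. It is only an outline under stated assumptions: the scrape target assumes the exporter listens on the upstream default port 9400, the otlphttp exporter and the RAFAY_METRICS_ENDPOINT placeholder are hypothetical stand-ins, and the real exporter type, endpoint, and credentials must come from Rafay.

```shell
# Write an illustrative collector config to the current directory.
cat > otelcol-config.yaml <<'EOF'
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: dcgm-exporter
          scrape_interval: 30s
          static_configs:
            - targets: ["localhost:9400"]   # assumed exporter address
  hostmetrics:                              # optional VM metrics
    collection_interval: 30s
    scrapers:
      cpu:
      memory:

exporters:
  otlphttp:
    # Placeholder only: replace with the endpoint provided by Rafay.
    endpoint: "https://RAFAY_METRICS_ENDPOINT"

service:
  pipelines:
    metrics:
      receivers: [prometheus, hostmetrics]
      exporters: [otlphttp]
EOF

# Next steps once the placeholder is replaced (run as root):
#   otelcol validate --config=otelcol-config.yaml
#   cp otelcol-config.yaml /etc/otelcol/config.yaml
#   systemctl restart otelcol
```

The pipeline mirrors the flow described above: the prometheus receiver scrapes dcgm-exporter, hostmetrics adds VM-level data, and a single exporter forwards both to the backend.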