Requirements

This section describes the prerequisites that must be in place before you can deploy and operate the Serverless Inference offering.


Rafay Control Plane

The Operations Console in the Rafay Controller is where the administrator configures and deploys models, manages their lifecycle, and shares them with tenant organizations. This is also where usage (token counts) is aggregated and persisted for billing and related purposes. The serverless inference components can be installed as an add-on to the customer's existing Rafay Controller deployment.

Info

To ensure compatibility, the Rafay Controller must be version v3.1-36 or higher.


Data Plane for Serverless Inferencing

The data plane is where inference requests from users are handled and processed. It consists of a number of GPU servers running in a datacenter. Rafay MKS (upstream Kubernetes) is deployed on these servers and acts as the substrate for the data plane software components of the Serverless Inference solution.

Kubernetes Master

For an HA deployment, the Kubernetes control plane needs to comprise at least three CPU nodes (8 CPUs, 16 GB memory, and 100 GB raw storage each).
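
Below is a minimal, illustrative preflight sketch (not part of the Rafay tooling) that can be run on a candidate control plane node to sanity-check it against these minimums. The disk check assumes the relevant storage is mounted at /; adjust the thresholds and path for your environment.

```python
# Preflight sketch: verify a candidate control plane node against the
# sizing guideline above (8 CPUs, 16 GB memory, 100 GB storage).
import os
import shutil

MIN_CPUS = 8
MIN_MEM_GB = 16
MIN_DISK_GB = 100

def mem_total_gb() -> float:
    # /proc/meminfo reports MemTotal in kB.
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemTotal:"):
                return int(line.split()[1]) / (1024 * 1024)
    raise RuntimeError("MemTotal not found in /proc/meminfo")

cpus = os.cpu_count() or 0
mem_gb = mem_total_gb()
disk_gb = shutil.disk_usage("/").total / (1024 ** 3)  # assumes storage on /

print(f"CPUs: {cpus} (need >= {MIN_CPUS})")
print(f"Memory: {mem_gb:.1f} GB (need >= {MIN_MEM_GB})")
print(f"Disk on /: {disk_gb:.1f} GB (need >= {MIN_DISK_GB})")

if cpus >= MIN_CPUS and mem_gb >= MIN_MEM_GB and disk_gb >= MIN_DISK_GB:
    print("Node meets the control plane sizing guideline.")
else:
    print("Node is below the recommended control plane sizing.")
```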

Worker Nodes

These are GPU servers that will be converted into Kubernetes worker nodes. The number of worker nodes depends on the LLMs to be deployed and the scale at which the operator wishes to operate the service.

Info

Another consideration is whether some models need to be deployed as dedicated endpoints for specific customers/tenants.


Operating System

Ensure that the bare metal servers (nodes) are installed with standard 64-bit Ubuntu 24.04 LTS.

Important

Please do not install any GPU drivers on the Linux servers. These will be automatically installed and configured by the GPU Operator.
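
The following sketch is an illustrative preflight check, assuming a standard Ubuntu /etc/os-release, that confirms a node runs 64-bit Ubuntu 24.04 LTS and has no NVIDIA driver already present (the GPU Operator installs and manages the driver).

```python
# Preflight sketch: check OS release and confirm no preinstalled NVIDIA driver.
import platform
import shutil

def os_release() -> dict:
    info = {}
    with open("/etc/os-release") as f:
        for line in f:
            if "=" in line:
                key, _, value = line.strip().partition("=")
                info[key] = value.strip('"')
    return info

release = os_release()
is_ubuntu_2404 = release.get("ID") == "ubuntu" and release.get("VERSION_ID") == "24.04"
is_64bit = platform.machine() in ("x86_64", "aarch64")

# A preinstalled driver usually shows up as the nvidia-smi binary or a loaded
# nvidia kernel module; either means the node should be cleaned up first.
driver_binary = shutil.which("nvidia-smi") is not None
with open("/proc/modules") as f:
    driver_module = any(line.startswith("nvidia") for line in f)

print(f"Ubuntu 24.04: {is_ubuntu_2404}, 64-bit: {is_64bit}")
if driver_binary or driver_module:
    print("WARNING: an NVIDIA driver appears to be installed; "
          "remove it and let the GPU Operator manage drivers.")
else:
    print("No preinstalled NVIDIA driver detected.")
```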


Networking

The cluster's nodes (control plane and worker nodes) need to communicate with each other over a local, high-speed network. Please ensure that all servers can reach one another on all ports.

Important

Deploying firewalls or proxies between these nodes is not recommended because they will significantly impact the performance and latency of the end-user facing service.
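
The sketch below is an illustrative way to spot-check TCP reachability between nodes before the cluster is installed. The node IPs and the sampled Kubernetes ports are placeholders; the actual requirement is unrestricted connectivity on all ports.

```python
# Connectivity spot-check sketch: run from one node against the others.
import socket

NODES = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]  # placeholder node IPs
PORTS = [6443, 10250, 2379]                      # sample Kubernetes ports

for host in NODES:
    for port in PORTS:
        try:
            with socket.create_connection((host, port), timeout=2):
                print(f"{host}:{port} open")
        except ConnectionRefusedError:
            # Refused means the network path is open but nothing is listening yet.
            print(f"{host}:{port} path open (no listener)")
        except OSError as exc:
            # A timeout here usually indicates filtering between the nodes.
            print(f"{host}:{port} blocked or unreachable ({exc})")
```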


Internet Connectivity

For providers planning to offer the serverless inferencing service to customers over the Internet, please ensure that all worker nodes have access to the Internet on port 443.
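
A quick way to validate this from a worker node is sketched below. The target hosts are placeholders; substitute whichever external endpoints (model repositories, container registries, the Rafay Controller) your deployment actually needs to reach.

```python
# Outbound HTTPS (port 443) check sketch: run on each worker node.
import socket
import ssl

TARGETS = ["huggingface.co", "registry-1.docker.io"]  # placeholder endpoints

context = ssl.create_default_context()
for host in TARGETS:
    try:
        with socket.create_connection((host, 443), timeout=5) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                print(f"{host}:443 reachable (TLS {tls.version()})")
    except OSError as exc:
        print(f"{host}:443 NOT reachable ({exc})")
```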


Local Object Storage

Ensure the GPU servers have network access to low-latency, S3-compatible object storage. The required capacity depends on the size and number of LLMs to be deployed. As a guideline, more than 2 TB of storage with the ability to expand later is a good starting point.

Note

The solution can optionally be configured to dynamically download and cache model weights from repositories such as Hugging Face and NVIDIA NGC. This option is not recommended because it can be error prone for large models. Admins are strongly recommended to download the models and host them in a local storage namespace backed by high-speed storage.
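
The sketch below, assuming the boto3 library is installed, verifies that a node can reach the S3-compatible endpoint and that the bucket intended for model weights is accessible. The endpoint URL, credentials, and bucket name are placeholders.

```python
# Object storage check sketch: confirm the S3-compatible endpoint and
# model-weights bucket are reachable from the data plane nodes.
import boto3
from botocore.exceptions import ClientError, EndpointConnectionError

s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.local.example.com",  # placeholder endpoint
    aws_access_key_id="ACCESS_KEY",               # placeholder credentials
    aws_secret_access_key="SECRET_KEY",
)

BUCKET = "model-weights"  # placeholder bucket for locally hosted models

try:
    s3.head_bucket(Bucket=BUCKET)
    objects = s3.list_objects_v2(Bucket=BUCKET, MaxKeys=5)
    keys = [obj["Key"] for obj in objects.get("Contents", [])]
    print(f"Bucket '{BUCKET}' reachable; sample keys: {keys}")
except EndpointConnectionError as exc:
    print(f"Cannot reach object storage endpoint: {exc}")
except ClientError as exc:
    print(f"Bucket check failed: {exc}")
```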


Load Balancer

All user requests will be serviced via a load balancer (MetalLB or alternative) on port 443.
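
Once the cluster is up, a sketch like the following, assuming the official kubernetes Python client and a working kubeconfig, can confirm that Services of type LoadBalancer have been assigned an external IP by MetalLB (or your alternative). Service and namespace names will vary per deployment.

```python
# Load balancer check sketch: list LoadBalancer Services and their external IPs.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

for svc in v1.list_service_for_all_namespaces().items:
    if svc.spec.type != "LoadBalancer":
        continue
    ingress = svc.status.load_balancer.ingress or []
    addresses = [i.ip or i.hostname for i in ingress]
    status = ", ".join(a for a in addresses if a) or "PENDING (no external IP assigned)"
    print(f"{svc.metadata.namespace}/{svc.metadata.name}: {status}")
```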


Public IP Pool

All inference endpoints will need at least one public IP (three preferred) so that multiple models can be served from the same unified endpoint.


TLS Certificates

The endpoints serving the serverless inferencing offering terminate HTTPS connections. They require trusted TLS certificates for the domain on which the endpoint is served.
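
The sketch below is an illustrative check that an endpoint presents a certificate trusted by the system CA bundle, matches the domain, and is not close to expiry. The domain name is a placeholder.

```python
# TLS certificate check sketch for the serving domain.
import socket
import ssl
from datetime import datetime, timezone

DOMAIN = "inference.example.com"  # placeholder endpoint domain

context = ssl.create_default_context()  # enables CA and hostname verification
try:
    with socket.create_connection((DOMAIN, 443), timeout=5) as sock:
        with context.wrap_socket(sock, server_hostname=DOMAIN) as tls:
            cert = tls.getpeercert()
            expiry = datetime.fromtimestamp(
                ssl.cert_time_to_seconds(cert["notAfter"]), tz=timezone.utc
            )
            days_left = (expiry - datetime.now(timezone.utc)).days
            print(f"Certificate for {DOMAIN} is trusted; expires in {days_left} days")
except ssl.SSLCertVerificationError as exc:
    print(f"Certificate for {DOMAIN} failed verification: {exc}")
except OSError as exc:
    print(f"Could not connect to {DOMAIN}:443 ({exc})")
```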