Requirements
This section describes the prerequisites that must be in place before you can deploy and operate the Serverless Inference offering.
Rafay Control Plane
The Operations Console in the Rafay Controller is where the administrator configures and deploys models, manages their lifecycle, and shares the models with tenant orgs. This is also where usage (token counts) is aggregated and persisted for purposes such as billing. The serverless inference components can be installed as an add-on to the customer’s existing Rafay Controller deployment.
Info
To ensure compatibility, the Rafay Controller version needs to be v3.1-36 or higher.
Data Plane for Serverless Inferencing
The data plane is where the actual inference requests from users are handled and processed. It consists of a number of GPU servers running in a datacenter. We will deploy Rafay MKS (upstream Kubernetes) on these servers, and it will act as the substrate for the data plane software components of the Serverless Inference solution.
Kubernetes Master
For an HA deployment, the Kubernetes control plane needs to comprise at least three CPU nodes (8 CPUs, 16 GB memory, and 100 GB raw storage each).
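As a quick sanity check, a short script along the following lines can be run on each candidate control plane node to confirm it meets this sizing guideline. This is only a sketch: the thresholds simply mirror the numbers above, the memory comparison allows a small tolerance because the kernel reports slightly less than the nominal capacity, and the disk check assumes the relevant storage is mounted at `/`.

```python
# Sizing sanity check for a candidate control plane node (sketch only).
# Thresholds mirror the guideline above: 8 CPUs, 16 GB memory, 100 GB storage.
import os
import shutil

MIN_CPUS, MIN_MEM_GB, MIN_DISK_GB = 8, 16, 100

cpus = os.cpu_count() or 0

# Total memory from /proc/meminfo (Linux only); MemTotal is reported in kB.
with open("/proc/meminfo") as f:
    mem_kb = next(int(line.split()[1]) for line in f if line.startswith("MemTotal:"))
mem_gb = mem_kb / (1024 * 1024)

# Capacity of the root filesystem; adjust the path if your data lives on a dedicated disk.
disk_gb = shutil.disk_usage("/").total / (1024 ** 3)

print(f"CPUs: {cpus}  Memory: {mem_gb:.1f} GB  Disk: {disk_gb:.1f} GB")
checks = [
    cpus >= MIN_CPUS,
    mem_gb >= MIN_MEM_GB * 0.95,  # small tolerance for kernel-reported memory
    disk_gb >= MIN_DISK_GB,
]
print("Node meets the sizing guideline" if all(checks) else "WARNING: node is under-sized")
```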
Worker Nodes
These will be GPU servers that are converted into Kubernetes worker nodes. The number of worker nodes depends on the LLMs the operator wishes to deploy and the expected scale at which they plan to operate their service.
Info
Another consideration is whether some models need to be deployed as dedicated endpoints for specific customers/tenants.
Operating System
Ensure that the bare metal servers (nodes) are installed with standard 64-bit Ubuntu 24.04 LTS.
Important
Please do not install any GPU drivers on the Linux servers. The drivers will be automatically installed and configured via the GPU Operator.
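To confirm each node matches this baseline before installation, a quick check along these lines can be run on the node. It is a sketch that reads the standard `/etc/os-release` file and assumes an x86_64 platform.

```python
# Quick OS baseline check for a node: 64-bit Ubuntu 24.04 LTS (sketch only).
import platform

os_release = {}
with open("/etc/os-release") as f:
    for line in f:
        key, sep, value = line.strip().partition("=")
        if sep:
            os_release[key] = value.strip('"')

is_ubuntu_2404 = os_release.get("ID") == "ubuntu" and os_release.get("VERSION_ID") == "24.04"
is_64bit = platform.machine() == "x86_64"  # adjust if you run another 64-bit architecture

print(os_release.get("PRETTY_NAME", "unknown OS"), "/", platform.machine())
if not (is_ubuntu_2404 and is_64bit):
    print("WARNING: node does not match the expected baseline (64-bit Ubuntu 24.04 LTS)")
```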
Networking
The cluster's nodes (control plane and worker nodes) need to interact with each other over a local, high-speed network. Please ensure that all the servers can communicate with each other over all ports on this network.
Important
It is not recommended to deploy firewalls or proxies between these nodes because they will significantly impact the performance and latency of the end-user-facing service.
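A minimal reachability probe such as the one below can be run from each node against its peers. The node IPs are placeholders, and the sampled ports are common Kubernetes defaults (kube-apiserver, etcd, kubelet); passing on a handful of ports does not prove that every port is open, it only flags obvious firewalling between nodes.

```python
# Probe TCP reachability from this node to its peers on a few sample ports (sketch only).
# NODE_IPS are placeholders; the ports are common Kubernetes defaults.
import socket

NODE_IPS = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]  # peer control plane / worker nodes
PORTS = [6443, 2379, 10250]                          # kube-apiserver, etcd, kubelet

for ip in NODE_IPS:
    for port in PORTS:
        try:
            with socket.create_connection((ip, port), timeout=3):
                print(f"{ip}:{port} open")
        except ConnectionRefusedError:
            # Refused means the packet reached the host: the path is not blocked,
            # there is simply nothing listening on that port yet.
            print(f"{ip}:{port} reachable (nothing listening yet)")
        except OSError as exc:
            # A timeout here usually points to a firewall or routing problem.
            print(f"{ip}:{port} BLOCKED or unreachable ({exc})")
```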
Internet Connectivity
For providers planning to offer the serverless inferencing service to customers over the Internet, please ensure that all worker nodes have access to the Internet on port 443.
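A simple way to verify this from a worker node is to open an outbound TLS connection to a well-known public host; huggingface.co is used below purely as a convenient test target.

```python
# Confirm outbound Internet access on port 443 from this node (sketch only).
# huggingface.co is used purely as a reachable public test target.
import socket
import ssl

HOST = "huggingface.co"
context = ssl.create_default_context()

try:
    with socket.create_connection((HOST, 443), timeout=5) as sock:
        with context.wrap_socket(sock, server_hostname=HOST) as tls:
            print(f"Outbound 443 OK: negotiated {tls.version()} with {HOST}")
except OSError as exc:
    print(f"Outbound 443 check failed: {exc}")
```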
Local Object Storage
Ensure the GPU servers are configured to have network access to low-latency, S3-compatible object storage. The size/capacity will depend on the size and number of LLMs to be deployed. As a guideline, more than 2 TB of storage with the ability to expand later is a good starting point.
Note
The solution can also be optionally configured to "dynamically download and cache" the model's weights from repos such as HuggingFace and Nvidia's NGC. This option is not recommended because it can be error-prone for large models. Admins are strongly recommended to download the models and host them in their local storage namespace backed by high-speed storage.
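To confirm that a GPU node can reach the object store with read/write access, a short boto3 script along these lines can round-trip a small test object. The endpoint URL, credentials, and bucket name are placeholders for your environment.

```python
# Verify access to the local S3-compatible object store from a GPU node (sketch only).
# ENDPOINT, credentials, and BUCKET are placeholders for your environment.
import boto3

ENDPOINT = "https://s3.local.example.com"
BUCKET = "model-weights"

s3 = boto3.client(
    "s3",
    endpoint_url=ENDPOINT,
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# Confirm the bucket exists and is reachable over the network.
s3.head_bucket(Bucket=BUCKET)

# Round-trip a small test object to verify read/write access.
s3.put_object(Bucket=BUCKET, Key="connectivity-check.txt", Body=b"ok")
print(s3.get_object(Bucket=BUCKET, Key="connectivity-check.txt")["Body"].read())
s3.delete_object(Bucket=BUCKET, Key="connectivity-check.txt")
print("Object storage check passed")
```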
Load Balancer
All user requests will be serviced via a load balancer (MetalLB or an alternative) on port 443.
Public IP Pool
The inference endpoints will need at least one public IP address (three preferred) so that multiple models can be served from the same, unified endpoint.
TLS Certificates
The endpoints serving the serverless inferencing offering will terminate HTTPS. They will require trusted TLS certificates for the domain on which the endpoint is served.
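Once a certificate is installed, a check along the following lines confirms that the endpoint presents a certificate that is trusted by the system trust store and matches the serving domain. The hostname is a placeholder for your actual inference endpoint.

```python
# Verify that the inference endpoint serves a trusted TLS certificate
# matching its domain (sketch only). HOST is a placeholder.
import socket
import ssl

HOST = "inference.example.com"

# The default context uses the system trust store and enforces hostname matching,
# so the handshake raises ssl.SSLCertVerificationError on an untrusted or mismatched cert.
context = ssl.create_default_context()

with socket.create_connection((HOST, 443), timeout=5) as sock:
    with context.wrap_socket(sock, server_hostname=HOST) as tls:
        cert = tls.getpeercert()
        print("Issuer: ", dict(item[0] for item in cert["issuer"]))
        print("Expires:", cert["notAfter"])
        print(f"Certificate for {HOST} is trusted and matches the domain")
```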