Model Deployments
Model Deployments are "running instances" of an already configured model. When a new model is created and configured, it has zero active model deployments by default. For example, in the image below, the Facebook OPT 125m model has no active model deployments.
Administrators can deploy and operate multiple model deployments for a given model. In the image below, the "llama-8b-instruct" model has one active model deployment.
New Deployment¶
Click on "Deploy" to start a new model deployment.
- Provide a name (unique in your environment) and an optional description
- The "model" field will auto-populate since this is a deployment of a specific model
- Select the "endpoint" from the dropdown list that will service requests to the model deployment
Select Inference Engine¶
In this step, the admin selects their preferred inference engine. Three options are available: vLLM, NIM, and Nvidia Dynamo. The default engine is vLLM.
- Select the preferred Inference engine
- Specify number of replicas
- Specify number of GPUs
Admins can also fine-tune/optimize the inference engine by providing "custom environment variables". For example, vLLM's environment variable documentation is available here
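As an illustration, the sketch below shows what supplying custom environment variables to a vLLM engine amounts to. The specific variables shown (`VLLM_LOGGING_LEVEL`, `VLLM_ATTENTION_BACKEND`) are documented by vLLM, but the local-launch command is a hypothetical example, not how the Rafay platform applies them; in the console you would simply enter the same key/value pairs in the deployment form.

```python
import os

# Example custom environment variables for a vLLM deployment.
# Both keys are documented by vLLM; the values here are illustrative.
custom_env = {
    "VLLM_LOGGING_LEVEL": "DEBUG",           # verbose engine logging
    "VLLM_ATTENTION_BACKEND": "FLASH_ATTN",  # force a specific attention backend
}

# The platform would merge these on top of the engine's base environment,
# conceptually equivalent to:
env = {**os.environ, **custom_env}

# Hypothetical local equivalent (not run here):
# subprocess.run(["vllm", "serve", "facebook/opt-125m"], env=env)
```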
Info
NIM requires a license from Nvidia. Please work with your Nvidia team for this.
Specify Pricing¶
- Select the currency used for billing (default = USD)
- Specify the cost per 1M input and output tokens.
The Rafay platform allows you to charge different rates for input and output tokens.
Info
Input tokens are the text you send to an LLM, while output tokens are the text the LLM generates back. Output tokens are typically more expensive because they require more computational power to generate one by one, whereas input tokens are processed in a single pass.
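The per-1M-token billing model described above can be sketched as a small cost function. The rates used below are hypothetical placeholders, not actual platform pricing.

```python
def deployment_cost(input_tokens: int, output_tokens: int,
                    input_price_per_1m: float,
                    output_price_per_1m: float) -> float:
    """Cost of a request given separate per-1M-token rates for input and output."""
    return (input_tokens * input_price_per_1m
            + output_tokens * output_price_per_1m) / 1_000_000

# e.g. 2,000 input tokens at $0.50/1M plus 500 output tokens at $1.50/1M
cost = deployment_cost(2_000, 500, 0.50, 1.50)
print(f"${cost:.6f}")  # → $0.001750
```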
Once you have specified all the required inputs, click on Save.
View Deployment¶
To view a deployment, click on the name. You will be presented with the details of the deployment.
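The details page includes the endpoint serving the deployment. As a sketch of how a client might call it, the snippet below builds an OpenAI-compatible chat-completion request (the API style vLLM and NIM expose); the endpoint URL, API key, and model name are placeholder assumptions to be replaced with the values from your deployment.

```python
import json
from urllib import request

# Placeholders -- substitute your deployment's actual endpoint and credentials.
ENDPOINT = "https://your-endpoint.example.com/v1/chat/completions"
API_KEY = "YOUR_API_KEY"

payload = {
    "model": "llama-8b-instruct",  # the deployed model's served name
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 64,
}
req = request.Request(
    ENDPOINT,
    data=json.dumps(payload).encode(),
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
)
# Network call omitted in this sketch:
# with request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```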
Edit Deployment¶
Click on the "ellipsis" under Actions and select "Edit Configuration". Make the updates you require and save.
Delete Deployment¶
Click on the "ellipsis" under Actions and select "Delete" to delete the deployment.
Share Deployment¶
Click on the "ellipsis" under Actions. Now, click on "Manage Sharing" to initiate a workflow to share the deployment with All or Select tenant orgs.
- By default, a newly created deployment is not shared with any tenant org.
- Select "All Orgs" to make the deployment available to all tenant orgs under management
- Select "Select Orgs" to make the deployment available to selected tenant orgs.