Model Deployments
Model Deployments are "running instances" of an already configured model. When a new model is created and configured, it has zero active model deployments by default. For example, in the image below, the Facebook OPT 125m model has no active model deployments.
Administrators can deploy and operate multiple model deployments for a given model. In the image below, the "llama-8b-instruct" model has one active model deployment.
New Deployment¶
Click on "Deploy" to start a new model deployment.
- Provide a name (unique in your environment) and an optional description
- The "model" field will auto-populate since this is a deployment of a specific model
- Select the "endpoint" from the dropdown list that will service requests to the model deployment
Select Inference Engine¶
In this step, the admin selects their preferred inference engine. Three options are available: vLLM, NIM, and Nvidia Dynamo. The default engine is vLLM.
- Select the preferred Inference engine
- Specify number of replicas
- Specify number of GPUs
Admins can also fine-tune/optimize the inference engine by providing "custom environment variables". For example, vLLM's environment variable documentation is available here
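As an illustration, the sketch below shows what supplying custom environment variables to a vLLM engine amounts to. The specific variables shown (`VLLM_LOGGING_LEVEL`, `VLLM_ATTENTION_BACKEND`) are documented by vLLM, but the local-launch command is a hypothetical example, not how the Rafay platform applies them; in the console you would simply enter the same key/value pairs in the deployment form.

```python
import os

# Example custom environment variables for a vLLM deployment.
# Both keys are documented by vLLM; the values here are illustrative.
custom_env = {
    "VLLM_LOGGING_LEVEL": "DEBUG",           # verbose engine logging
    "VLLM_ATTENTION_BACKEND": "FLASH_ATTN",  # force a specific attention backend
}

# The platform would merge these on top of the engine's base environment,
# conceptually equivalent to:
env = {**os.environ, **custom_env}

# Hypothetical local equivalent (not run here):
# subprocess.run(["vllm", "serve", "facebook/opt-125m"], env=env)
```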
Info
NIM requires a license from Nvidia. Please work with your Nvidia team for this.
Specify Pricing¶
- Select the currency used for billing (default = USD)
- Specify the cost per 1M input and output tokens.
The Rafay platform allows you to charge different rates for input and output tokens.
Info
Input tokens are the text you send to an LLM, while output tokens are the text the LLM generates back. Output tokens are typically more expensive because they require more computational power to generate one by one, whereas input tokens are processed in a single pass.
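The per-1M-token billing model described above can be sketched as a small cost function. The rates used below are hypothetical placeholders, not actual platform pricing.

```python
def deployment_cost(input_tokens: int, output_tokens: int,
                    input_price_per_1m: float,
                    output_price_per_1m: float) -> float:
    """Cost of a request given separate per-1M-token rates for input and output."""
    return (input_tokens * input_price_per_1m
            + output_tokens * output_price_per_1m) / 1_000_000

# e.g. 2,000 input tokens at $0.50/1M plus 500 output tokens at $1.50/1M
cost = deployment_cost(2_000, 500, 0.50, 1.50)
print(f"${cost:.6f}")  # → $0.001750
```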
Once you have specified all the required inputs, click on Save.
View Deployment¶
To view a deployment, click on the name. You will be presented with the details of the deployment.
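The details page includes the endpoint serving the deployment. As a sketch of how a client might call it, the snippet below builds an OpenAI-compatible chat-completion request (the API style vLLM and NIM expose); the endpoint URL, API key, and model name are placeholder assumptions to be replaced with the values from your deployment.

```python
import json
from urllib import request

# Placeholders -- substitute your deployment's actual endpoint and credentials.
ENDPOINT = "https://your-endpoint.example.com/v1/chat/completions"
API_KEY = "YOUR_API_KEY"

payload = {
    "model": "llama-8b-instruct",  # the deployed model's served name
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 64,
}
req = request.Request(
    ENDPOINT,
    data=json.dumps(payload).encode(),
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
)
# Network call omitted in this sketch:
# with request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```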
Edit Deployment¶
Click on the "ellipsis" under Actions and select "Edit Configuration". Make the updates you require and save.
Delete Deployment¶
Click on the "ellipsis" under Actions and select "Delete" to delete the deployment.
Share Deployment¶
Click on the "ellipsis" under Actions. Now, click on "Manage Sharing" to initiate a workflow to share the deployment with All or Select tenant orgs.
- By default, a newly created deployment is not shared with any tenant org.
- Select "All Orgs" to make the deployment available to all tenant orgs under management
- Select "Select Orgs" to make the deployment available to selected tenant orgs.