
2025

Why Inventory Management is Table Stakes for GPU Clouds

In the world of GPU clouds, where speed, scalability, and efficiency are paramount, it’s surprising how many “Neo cloud” providers still manage their infrastructure the old-fashioned way—through spreadsheets.

As laughable as it sounds, this is the harsh reality. Inventory management, one of the most foundational aspects of a reliable cloud platform, is often overlooked or underbuilt. And for modern GPU clouds, that's a deal-breaker.

Inventory Management

Introducing Platform Versioning for Rafay MKS clusters.

Our upcoming release introduces support for a number of new features and enhancements. One such enhancement is Platform Versioning for Rafay MKS clusters, a major feature in our v3.5 release. This new capability is designed to simplify and standardize the upgrade lifecycle of critical components in upstream Kubernetes clusters managed by Rafay MKS.

Why Platform Version?

Upgrading Kubernetes clusters is essential, but core components such as etcd, CRI, and Salt Minion also require updates for:

  • Security patches
  • Compatibility with new Kubernetes features
  • Performance improvements

Platform Versioning introduces a structured, reliable, and repeatable upgrade path for these foundational components, reducing risk and operational overhead.

What is a Platform Version?

A Platform Version defines a tested and validated set of component versions that can be safely upgraded together. This ensures compatibility and stability across your clusters.

We are introducing v1.0.0 as the very first Platform Version for new clusters. This version includes:

  • CRI: v2.0.4
  • etcd: v3.5.21
  • Salt Minion: v3006.9
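One way to picture a Platform Version is as a pinned, validated set of component versions that a cluster can be checked against. The sketch below is purely illustrative (the class and function names are made up, and this is not Rafay's actual implementation); the version numbers mirror the v1.0.0 example above.

```python
from dataclasses import dataclass

# Illustrative model of a Platform Version: a named, pinned set of
# component versions that were tested and validated together.
@dataclass(frozen=True)
class PlatformVersion:
    name: str
    components: dict  # component name -> pinned version

PLATFORM_V1 = PlatformVersion(
    name="v1.0.0",
    components={"cri": "v2.0.4", "etcd": "v3.5.21", "salt-minion": "v3006.9"},
)

def drift(cluster_components: dict, target: PlatformVersion) -> dict:
    """Return components whose installed version differs from the target set,
    mapped to (installed, expected) pairs."""
    return {
        name: (installed, target.components[name])
        for name, installed in cluster_components.items()
        if name in target.components and installed != target.components[name]
    }
```

For example, a cluster running an older etcd would report `{"etcd": ("v3.5.12", "v3.5.21")}` as its drift from v1.0.0, making it clear which foundational components an upgrade would touch.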

Note

For existing clusters, the initial platform version is shown as v0.1.0, a reference version assigned to clusters created before platform versioning was introduced. Please perform the upgrade to v1.0.0 during scheduled downtime, as it updates core components such as etcd and CRI.


How Does Platform Versioning Work?

You can upgrade the Platform Version in two ways:

  • During a Kubernetes version upgrade
  • As a standalone platform upgrade

This flexibility allows you to keep your clusters secure and up to date, regardless of your Kubernetes upgrade schedule.

Platform Version


Controlled and Responsive Update Cadence

Platform Versions are not released frequently. New versions are published only when:

  • A high severity CVE or vulnerability is addressed
  • A major performance or compatibility feature is introduced
  • There are significant version changes in core components

This approach ensures that upgrades are meaningful and necessary, minimizing disruption.

Whenever a new Platform Version is released, existing clusters can seamlessly upgrade to the latest version, ensuring they benefit from the latest security patches and improvements without manual intervention.

Evolving Platform Versions and Expanding Coverage

We are committed to continuously improving Platform Versioning. In future releases, we will expand its scope by including more critical components in each platform version. For this initial release, we started with three foundational components, etcd, CRI, and Salt Minion, because of their critical importance to cluster stability. Over time, we will extend coverage to additional components, ensuring your clusters remain robust, secure, and up to date.

In Summary

Platform Versioning makes it easier than ever to keep your clusters current and secure by managing the upgrade lifecycle of foundational components like etcd, CRI, and Salt Minion.

Whether you apply it alongside a Kubernetes version bump or independently, Platform Versioning ensures your infrastructure remains stable, secure, and optimized now and in the future.


Comparing HPA and KEDA: Choosing the Right Tool for Kubernetes Autoscaling

In Kubernetes, autoscaling is key to ensuring application performance while managing infrastructure costs. Two powerful tools that help achieve this are the Horizontal Pod Autoscaler (HPA) and Kubernetes Event-Driven Autoscaling (KEDA). While they share the goal of scaling workloads, their approaches and capabilities differ significantly.

In this introductory blog, we will provide a bird's eye view of how they compare, and when you might choose one over the other.
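A rough feel for the core difference: HPA reacts to observed resource or custom metrics using a documented replica formula, while KEDA drives scaling from external event sources (queue depth, stream lag, and so on) and can scale workloads down to zero. The sketch below implements only the HPA replica formula from the Kubernetes documentation, desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric); the function name is ours, and the 10% tolerance matches the default `--horizontal-pod-autoscaler-tolerance`.

```python
import math

def hpa_desired_replicas(current_replicas: int,
                         current_metric: float,
                         target_metric: float,
                         tolerance: float = 0.1) -> int:
    """HPA scaling rule: desiredReplicas = ceil(currentReplicas * ratio),
    where ratio = currentMetric / targetMetric. If the ratio is within the
    tolerance band around 1.0, the replica count is left unchanged."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)
```

So 4 replicas averaging 200m CPU against a 100m target scale to 8, while 4 replicas at 105m stay at 4 because the 5% deviation sits inside the tolerance band. KEDA's scalers feed comparable math, but from event-source metrics rather than pod resource usage.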

HPA vs KEDA

Support for Parallel Execution with Rafay's Integrated GitOps Pipeline

At Rafay, we are continuously evolving our platform to deliver powerful capabilities that streamline and accelerate the software delivery lifecycle. One such enhancement is the recent update to our GitOps pipeline engine, designed to optimize execution time and flexibility — enabling a better experience for platform teams and developers alike.

Integrated Pipeline for Diverse Use Cases

Rafay provides a tightly integrated pipeline framework that supports a range of common operational use cases, including:

  • System Synchronization: Use Git as the single source of truth to orchestrate controller configurations
  • Application Deployment: Define and automate your app deployment process directly from version-controlled pipelines
  • Approval Workflows: Insert optional approval gates to control when and how specific pipeline stages are triggered, offering an added layer of governance and compliance

This comprehensive design empowers platform teams to standardize delivery patterns while still accommodating organization-specific controls and policies.

From Sequential to Parallel Execution with DAG Support

Historically, Rafay’s GitOps pipeline executed all stages sequentially, regardless of interdependencies. While effective for simpler workflows, this model imposed time constraints for more complex operations.

With our latest update, the pipeline engine now supports Directed Acyclic Graphs (DAGs) — allowing stages to execute in parallel, wherever dependencies allow.
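The benefit of DAG execution can be sketched with a small scheduling exercise: group stages into "waves" so that every stage in a wave has all of its dependencies satisfied by earlier waves, letting stages within a wave run concurrently. This is a generic Kahn-style illustration of the idea, not Rafay's actual pipeline engine.

```python
def execution_waves(deps: dict) -> list:
    """Group pipeline stages into parallelizable waves.
    `deps` maps each stage to the set of stages it depends on."""
    remaining = {stage: set(d) for stage, d in deps.items()}
    waves = []
    while remaining:
        # Stages with no unsatisfied dependencies can run together.
        ready = sorted(s for s, d in remaining.items() if not d)
        if not ready:
            raise ValueError("cycle detected: pipeline is not a DAG")
        waves.append(ready)
        for s in ready:
            del remaining[s]
        for d in remaining.values():
            d -= set(ready)
    return waves
```

For a pipeline where two test stages both depend on a build stage and a deploy stage depends on both tests, this yields three waves, with the two tests running in parallel in the middle wave instead of back to back.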

Simplifying Blueprint and Add-on Management with Draft Versions

Managing infrastructure at scale demands both agility and precision—especially when it comes to version control. At Rafay, we have long supported versioning for key configuration objects such as Blueprints and Add-ons, enabling platform teams to roll out changes systematically and maintain operational consistency.

However, as many teams have discovered, managing these versions during testing and validation phases can introduce unnecessary complexity. We are excited to announce a major usability enhancement: Support for Draft Versions.

Why Versioning Matters

Versioning in Rafay’s platform delivers several key advantages:

  • Change Tracking: Keep a historical record of changes made to Blueprints and Add-ons over time
  • Staged Rollouts: Gradually deploy updates across environments and clusters to minimize risk
  • Compliance Assurance: Demonstrate adherence to organizational policies and track Day-2 changes in a controlled way

These capabilities are especially crucial for teams responsible for maintaining secure, production-grade Kubernetes environments.

The Challenge: Version Sprawl During Testing

While versioning is powerful, it has traditionally introduced friction during the testing and validation phase. Each time a platform engineer made a minor change to an Add-on or Blueprint, a new version needed to be created—even if the version wasn’t production-ready.

This led to:

  • Version fatigue, with large volumes of partially validated versions cluttering the system
  • Increased manual overhead and inefficiency for platform teams
  • Risk of accidental usage of incomplete configurations in downstream projects
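The draft-version idea can be sketched as follows: a single mutable draft absorbs iterative edits during testing, and only an explicit publish step mints an immutable version that downstream projects can consume. This is a hypothetical model of the concept, not Rafay's actual API.

```python
class VersionedObject:
    """Illustrative Blueprint/Add-on model with draft support."""

    def __init__(self, name: str):
        self.name = name
        self.published = []   # immutable (version, config) pairs, usable downstream
        self.draft = None     # work in progress, invisible to downstream projects

    def edit(self, config: dict) -> None:
        # Repeated edits overwrite the draft instead of minting new versions,
        # avoiding the version sprawl described above.
        self.draft = config

    def publish(self, version: str) -> None:
        if self.draft is None:
            raise ValueError("nothing to publish")
        self.published.append((version, self.draft))
        self.draft = None

    def latest_published(self):
        return self.published[-1] if self.published else None
```

Ten experimental edits now produce one published version rather than ten partially validated ones, and downstream projects can never pick up an unpublished draft by accident.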

A Fresh New Look: Rafay Console Gets a UI/UX Makeover for Enhanced Usability

At Rafay, we believe that user experience is as critical as the powerful automation capabilities we deliver. With that commitment in mind, we’ve been working for the last few months on a revamp of the Rafay Console User Interface (UI). The changes are purposeful and designed to streamline navigation, increase operational clarity, and elevate your productivity. Whether you’re managing clusters, deploying workloads, or orchestrating environments, the new interface will put everything you need right at your fingertips.

The new UI will launch as part of our v3.5 release, scheduled to roll out at the end of May 2025. We understand change can be hard, and it may take users a little time to get used to the new experience. Note that existing projects and configurations remain unchanged, and users can continue managing their infrastructure and applications without interruption.

In this blog, we provide a closer look at the most impactful improvements and how they will benefit our users.

Migration

Introduction to User Namespaces in Kubernetes

In Kubernetes, some features arrive quietly but leave a massive impact, and v1.33 is shaping up to be one such release. In the previous blog, my colleague described how you can provision and operate Kubernetes v1.33 clusters on bare-metal and VM-based environments using Rafay.

In this blog, we will discuss a new feature in v1.33 called User Namespaces. It is not a headline grabber like a service mesh, but it is a game changer for container security.
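At its core, a user namespace remaps UIDs: a process running as root (UID 0) inside the container corresponds to an unprivileged UID on the host, so escaping the container no longer yields host root. A minimal sketch of that translation, following the Linux `/proc/<pid>/uid_map` format of (inside-start, outside-start, length) ranges; the pod's host range below is made up for illustration.

```python
def map_to_host_uid(container_uid: int, uid_map: list) -> int:
    """Translate a UID as seen inside the container to the host UID,
    using /proc/<pid>/uid_map-style (inside, outside, length) entries."""
    for inside, outside, length in uid_map:
        if inside <= container_uid < inside + length:
            return outside + (container_uid - inside)
    raise ValueError(f"UID {container_uid} is not mapped")

# Hypothetical mapping for one pod: container UIDs 0-65535 map onto
# host UIDs 100000-165535, so in-container root is unprivileged on the host.
POD_UID_MAP = [(0, 100000, 65536)]
```

With this mapping, container root maps to host UID 100000, and an app running as container UID 1000 maps to 101000, which is the property that makes user namespaces such a strong isolation boundary.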

Container in a Jail

Kubernetes v1.33 for Rafay MKS

As part of our upcoming May release, alongside other enhancements and features, we are adding support for Kubernetes v1.33 to Rafay MKS (i.e., upstream Kubernetes for bare-metal and VM-based environments).

Both new cluster provisioning and in-place upgrades of existing clusters are supported. As with most Kubernetes releases, v1.33 deprecates and removes several features. To ensure zero impact to our customers, we have validated every feature of the Rafay Kubernetes Operations Platform on this Kubernetes version.

Kubernetes v1.33 Release

Powering Multi-Tenant, Serverless AI Inference for Cloud Providers

The AI revolution is here, and Large Language Models (LLMs) are at its forefront. Cloud providers are uniquely positioned to offer powerful AI inference services to their enterprise and retail customers. However, delivering these services in a scalable, multi-tenant, and cost-effective serverless manner presents significant operational challenges.

Rafay enables cloud providers to deliver Serverless Inference to hundreds of users and enterprises.

Info

Earlier this week, we announced our Multi-Tenant Serverless Inference offering for GPU & Sovereign Cloud Providers. Learn more about this here.

Multi Tenant

Family vs. Lineage: Unpacking Two Often-Confused Ideas in the LLM World

LLMs have begun to resemble sprawling family trees. Those relatively new to LLMs will notice two words that appear constantly in technical blogs: "family" and "lineage".

They sound interchangeable, and users frequently conflate them, but they describe different slices of an LLM's life story.

Important

Understanding the difference is more than trivia: it determines how you pick models, tune them, and keep inference predictable at scale.

LLM Family vs Lineage