Get Started with Auto Mode for Amazon EKS with Rafay

This is Part 3 in our series on Amazon EKS Auto Mode. In the previous posts, we explored:

  1. Part 1: An Introduction: Learn the core concepts and benefits of EKS Auto Mode.
  2. Part 2: Considerations: Understand the key considerations before configuring EKS Auto Mode.

In this post, we will dive into the steps required to build and manage an Amazon EKS Auto Mode cluster template using the Rafay Platform. This exercise is particularly well suited for platform teams that want to provide their end users with a controlled self-service experience backed by centralized governance.

EKS Auto Mode Cluster in Rafay

EKS Auto Mode - Considerations

In the introductory blog on Auto Mode for Amazon EKS, we described the basics of this new capability that was announced at AWS re:Invent 2024. In this blog, we will review considerations that organizations need to factor in before using EKS in Auto Mode.

Note

Please consider this a living, evolving document. EKS Auto Mode is relatively new, and we will update this blog as we gather new learnings and findings.

Considerations for EKS Auto Mode

EKS Auto Mode - An Introduction

The Rafay team just got back late last week from an incredibly busy AWS re:Invent 2024. Congratulations to the EKS Product team, led by our friend Nate Taber, on the launch of Auto Mode for EKS.

Since the announcement last week, several customers have reached out to ask for our thoughts on the newly launched EKS Auto Mode service. There are already several blogs that describe how Auto Mode for EKS works. In this blog series, I will instead attempt to provide perspective on the "Why," the "Why Now?" and "What this means for the industry."

EKS Auto Mode

Deploying Custom CNI (Kube-OVN) in Rafay MKS Upstream Kubernetes Cluster Using the Blueprint Add-On Approach

In continuation of our Part 1 intro blog on the Kube-OVN CNI, this is Part 2, where we will cover how easy it is to manage CNI configurations using Rafay's Blueprint Add-On approach.

In the evolving cloud-native landscape, networking requirements are becoming more complex, with platform teams needing enhanced control and customization over their Kubernetes clusters. Rafay's support for custom, compatible CNIs allows organizations to select and deploy advanced networking solutions tailored to their needs. While there are several options available, this blog will focus specifically on deploying the Kube-OVN CNI. Using Rafay's Blueprint Add-On approach, we will guide you through the steps to seamlessly integrate Kube-OVN into an upstream Kubernetes cluster managed by Rafay's Managed Kubernetes Service.

Our upcoming release, scheduled for December in the production environment, introduces several new features and enhancements. Each of these will be covered in separate blog posts. This particular blog focuses on the support and process for deploying Kube-OVN as the primary CNI on an upstream Kubernetes cluster.

Kube-OVN

Watch a video showcasing how users can customize and configure Kube-OVN as the primary CNI on Rafay MKS Kubernetes clusters.

The Kube-OVN CNI: A Powerful Networking Solution for Kubernetes

Kubernetes has become the de facto standard for orchestrating containerized applications, but efficient networking remains one of the biggest challenges. For Kubernetes networking, Container Network Interface (CNI) plugins handle the essential task of managing the network configuration between pods, nodes, and external systems. Among these CNI plugins, Kube-OVN stands out as a feature-rich and enterprise-ready solution, designed for cloud-native applications requiring robust networking features.

In this blog, we will discuss how Kube-OVN differs from popular CNI plugins such as Calico and Cilium, and the use cases where it is particularly useful.

Kube-OVN Logo

Introducing "Schedules" on the Rafay Platform: Simplifying Cost Optimization and Compliance for Platform Teams

Platform teams today are increasingly tasked with balancing cost efficiency, compliance, and operational agility across complex cloud environments. Actions such as cost-optimization measures and compliance-related tasks are critical, yet executing these tasks consistently and effectively can be challenging.

With the recent introduction of the “Schedules” capability on the Rafay Platform, platform teams can now orchestrate one-time or recurring actions across environments in a standardized, centralized manner. This new feature enables teams to implement cost-saving policies, manage compliance actions, and ensure operational efficiency—all from a single interface. Here’s a closer look at how this feature can streamline your workflows and add value to your platform operations.

Schedules

Spatial Partitioning of GPUs using Nvidia MIG

In the prior blogs, we discussed why GPUs are managed differently in Kubernetes, how the GPU Operator helps streamline management, and various strategies to share GPUs on Kubernetes. In 2020, Nvidia introduced Multi-Instance GPU (MIG), which takes GPU sharing to a different level.

In this blog, we will start by reviewing some common industry use cases where MIG is used and then dive deeper into how MIG is configured and used.
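To preview what MIG enables: once a GPU is partitioned and the Nvidia device plugin advertises the slices, each MIG instance appears as its own schedulable resource that pods can request instead of a whole GPU. Below is a minimal, hedged sketch of a pod requesting a single 1g.5gb slice on an A100; the pod name and image are illustrative, and the exact resource name depends on the MIG profile and the device plugin's MIG strategy configured on your cluster.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mig-demo                       # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: inference
      image: nvidia/cuda:12.2.0-base-ubuntu22.04   # illustrative image
      command: ["nvidia-smi", "-L"]    # list the GPU instance visible to the container
      resources:
        limits:
          nvidia.com/mig-1g.5gb: 1     # one MIG slice rather than a full GPU
```

With this approach, seven such pods could share a single A100 in the 1g.5gb profile, each with hardware-isolated compute and memory.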

Nvidia MIG

GPU Sharing Strategies in Kubernetes

In the previous blogs, we discussed why GPUs are managed differently in Kubernetes and how the GPU Operator can help streamline management. In Kubernetes, although you can request fractional CPU units for workloads, you cannot request fractional GPU units.

Pod manifests must request GPU resources in integers, which means an entire physical GPU is allocated to a single container even if that container only needs a fraction of its resources. In this blog, we will describe two popular and commonly used strategies for sharing a GPU on Kubernetes.
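The integer-only constraint above can be seen in the pod spec itself. Here is a minimal sketch (pod name and image are illustrative) of how a GPU is requested via the extended resource exposed by the Nvidia device plugin; a fractional value such as `0.5` would be rejected by the API server.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload                   # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: cuda-app
      image: nvidia/cuda:12.2.0-base-ubuntu22.04   # illustrative image
      resources:
        limits:
          nvidia.com/gpu: 1            # whole GPUs only; fractions are not allowed
```

Note that `nvidia.com/gpu` is specified under `limits` only; for extended resources, Kubernetes treats the limit as the request.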

GPU Sharing in Kubernetes

Amazon EKS v1.31 using Rafay

Our recent release update in October to our Production environment adds support for a number of new features and enhancements. We will cover the other new features in separate blogs. This blog is focused on our turnkey support for Amazon EKS v1.31.

Both new cluster provisioning and in-place upgrades of existing EKS clusters are supported. As with most Kubernetes releases, this version also deprecates and removes a number of features. To ensure there is zero impact to our customers, we have made sure that every feature in the Rafay Kubernetes Operations Platform has been validated on this Kubernetes version.

Kubernetes v1.31

Why do we need a GPU Operator for Kubernetes

This is a follow-up to the previous blog, where we discussed device plugins for GPUs in Kubernetes and reviewed why the Nvidia device plugin is necessary for GPU support. A GPU Operator is needed in Kubernetes to automate and simplify the management of GPUs for workloads running on the cluster.

In this blog, we will look at how a GPU operator helps automate and streamline operations through the lens of a market leading implementation by Nvidia.

Without and With GPU Operator