
Mohan Atreya

EKS Auto Mode - Considerations

In the introductory blog on Auto Mode for Amazon EKS, we described the basics of this new capability that was announced at AWS re:Invent 2024. In this blog, we will review considerations that organizations need to factor in before using EKS in Auto Mode.

Note

Please treat this as a living document. EKS Auto Mode is relatively new, and we will update this blog as we learn more.

Considerations for EKS Auto Mode

EKS Auto Mode - An Introduction

The Rafay team got back late last week from an incredibly busy AWS re:Invent 2024. Congratulations to the EKS Product team, led by our friend Nate Taber, on the launch of Auto Mode for EKS.

Since the announcement last week, several customers have reached out to ask for our thoughts on the newly launched EKS Auto Mode. Several blogs already describe how Auto Mode for EKS works. In this blog series, I will attempt to provide perspective on "Why", "Why Now?" and "What this means for the industry".

EKS Auto Mode

The Kube-OVN CNI: A Powerful Networking Solution for Kubernetes

Kubernetes has become the de facto standard for orchestrating containerized applications, but efficient networking remains one of the biggest challenges. For Kubernetes networking, Container Network Interface (CNI) plugins handle the essential task of managing the network configuration between pods, nodes, and external systems. Among these CNI plugins, Kube-OVN stands out as a feature-rich and enterprise-ready solution, designed for cloud-native applications requiring robust networking features.

In this blog, we will discuss how Kube-OVN differs from popular CNI plugins such as Calico and Cilium, and the use cases where it is particularly useful.

Kube-OVN Logo

Spatial Partitioning of GPUs using Nvidia MIG

In the prior blogs, we discussed why GPUs are managed differently in Kubernetes, how the GPU Operator helps streamline management, and various strategies to share GPUs on Kubernetes. In 2020, Nvidia introduced Multi-Instance GPU (MIG), which takes GPU sharing to a different level.

In this blog, we will start by reviewing some common industry use cases where MIG is used and then dive deeper into how MIG is configured and used.

Nvidia MIG

GPU Sharing Strategies in Kubernetes

In the previous blogs, we discussed why GPUs are managed differently in Kubernetes and how the GPU Operator can help streamline management. In Kubernetes, although you can request fractional CPU units for workloads, you cannot request fractional GPU units.

Pod manifests must request GPU resources as integers, which results in an entire physical GPU being allocated to one container even if the container requires only a fraction of its resources. In this blog, we will describe two popular and commonly used strategies to share a GPU on Kubernetes.
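To make the integer-only constraint concrete, here is a minimal Python sketch that builds a Pod manifest as a plain dictionary and rejects fractional GPU requests before the manifest is ever submitted. The resource name `nvidia.com/gpu` is the one advertised by the Nvidia device plugin; the helper names (`parse_quantity`, `is_valid_gpu_request`, `build_pod`) and the image name are hypothetical and exist only for this illustration. The quantity parser handles only plain numbers and the milli (`m`) suffix, not the full Kubernetes quantity syntax.

```python
from fractions import Fraction

# Extended resource name advertised by the Nvidia device plugin.
GPU_RESOURCE = "nvidia.com/gpu"


def parse_quantity(value):
    """Parse a small subset of Kubernetes quantities: plain numbers and "m" (milli)."""
    text = str(value)
    if text.endswith("m"):
        return Fraction(int(text[:-1]), 1000)
    return Fraction(text)


def is_valid_gpu_request(value):
    """Return True only for whole, non-negative GPU counts."""
    quantity = parse_quantity(value)
    return quantity >= 0 and quantity.denominator == 1


def build_pod(name, image, cpu, gpus):
    """Build a Pod manifest (hypothetical helper); CPU may be fractional, GPUs may not."""
    if not is_valid_gpu_request(gpus):
        raise ValueError(f"{GPU_RESOURCE} must be a whole number, got {gpus!r}")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [
                {
                    "name": name,
                    "image": image,
                    "resources": {"limits": {"cpu": cpu, GPU_RESOURCE: str(gpus)}},
                }
            ]
        },
    }


if __name__ == "__main__":
    # Half a CPU core is fine; the GPU count must be a whole number.
    pod = build_pod("trainer", "example.com/trainer:latest", cpu="500m", gpus=1)
    print(pod["spec"]["containers"][0]["resources"]["limits"])
```

Calling `build_pod(..., gpus="0.5")` raises a `ValueError` in this sketch, which mirrors the behavior described above: a container that needs only part of a GPU still has to ask for a whole one.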

GPU Sharing in Kubernetes

Why do we need a GPU Operator for Kubernetes

This is a follow-up to the previous blog, where we discussed device plugins for GPUs in Kubernetes and reviewed why the Nvidia device plugin is necessary for GPU support. A GPU Operator is needed to automate and simplify the management of GPUs for workloads running on Kubernetes.

In this blog, we will look at how a GPU Operator helps automate and streamline operations through the lens of a market-leading implementation by Nvidia.

Without and With GPU Operator

Using GPUs in Kubernetes

Unlike CPU and memory, GPUs are not natively supported in Kubernetes. Because Kubernetes manages CPU and memory natively, it can automatically schedule containers based on these resources, allocate them to Pods, and handle resource isolation and over-subscription.

GPUs are considered specialized hardware and require device plugins to be usable in Kubernetes. Device plugins make Kubernetes GPU-aware, allowing it to discover, allocate, and schedule GPUs for containerized workloads. Without a device plugin, Kubernetes is unaware of the GPUs available on the nodes and cannot assign them to Pods. In this blog, we will discuss why GPUs are not natively supported and how device plugins help address this gap.

Device Plugin K8s

Rafay Newsletter-September 2024

Welcome to the September 2024 edition of the Rafay customer newsletter. This month, we’re excited to bring you the latest product enhancements and insightful content crafted to help you make the most of your AI/ML, Kubernetes, and cloud-native operations.

Every month, we push out a number of incremental updates to our product documentation, new functionality, our YouTube channel, tech blogs, and more. Our users tell us that it would be great if we summarized all the updates for the month in a newsletter that they can read or listen to in 10 minutes.

Newsletter Sep 2024

Why do we need Custom Schedulers for Kubernetes?

The Kubernetes scheduler is the brain that is responsible for assigning pods to nodes based on resource availability, constraints, and affinity/anti-affinity rules. For small to medium-sized clusters running simple stateless applications like web services or APIs, the default Kubernetes scheduler is a great fit. The default Kubernetes scheduler manages resource allocation, ensures even distribution of workloads across nodes, and supports features like node affinity, pod anti-affinity, and automatic rescheduling.
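As a rough illustration of what "assigning pods to nodes based on resource availability" means, the sketch below walks through a simplified filter-then-score pass in Python. It is not the kube-scheduler's implementation: the `Node` and `Pod` classes, the field names, and the scoring formula (prefer the node with the most headroom left after placement) are assumptions made for this example, and real scheduling also accounts for affinity rules, taints, topology, and much more.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    cpu_capacity_millis: int
    cpu_free_millis: int
    memory_capacity_mib: int
    memory_free_mib: int


@dataclass
class Pod:
    name: str
    cpu_millis: int
    memory_mib: int


def feasible(node, pod):
    """Filter step: the node must have enough free CPU and memory."""
    return (
        node.cpu_free_millis >= pod.cpu_millis
        and node.memory_free_mib >= pod.memory_mib
    )


def score(node, pod):
    """Score step: average fraction of capacity still free after placing the pod."""
    cpu_left = (node.cpu_free_millis - pod.cpu_millis) / node.cpu_capacity_millis
    mem_left = (node.memory_free_mib - pod.memory_mib) / node.memory_capacity_mib
    return (cpu_left + mem_left) / 2


def schedule(pod, nodes):
    """Return the name of the best-scoring feasible node, or None if nothing fits."""
    candidates = [node for node in nodes if feasible(node, pod)]
    if not candidates:
        return None
    return max(candidates, key=lambda node: score(node, pod)).name


if __name__ == "__main__":
    nodes = [
        Node("node-a", 4000, 1000, 8192, 2048),
        Node("node-b", 4000, 3000, 8192, 6144),
    ]
    print(schedule(Pod("web", 500, 512), nodes))
    print(schedule(Pod("too-big", 3500, 512), nodes))
```

Spreading load toward the emptiest node works well for independent, long-running services, which is the kind of workload this toy model (and the paragraph above) has in mind.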

The default scheduler is extremely well-suited for long-running applications like web services, APIs, and microservices. Learn more about the scheduling framework.

Unfortunately, AI/ML workloads have very different requirements that the default scheduler cannot satisfy!

k8s Scheduling Framework