Skip to content

Mohan Atreya

Break Glass Workflows for Developer Access to Kubernetes Clusters - Introduction

In any large-scale, production-grade Kubernetes setup, maintaining the security and integrity of the clusters is critical. However, there are exceptional circumstances—such as production outages or critical bugs—where developers need emergency access to a Kubernetes cluster to resolve issues.

This is where a "Break Glass" process comes into play. It is a controlled procedure that grants temporary, elevated access to developers in critical situations, with the appropriate safeguards in place to minimize risks.

Break Glass

Pod Identity versus IRSA for Amazon EKS - Part 1

When managing containerized applications on Amazon Elastic Kubernetes Service (EKS), a critical concern is securely granting permissions to your applications so that they can securely access AWS resources. Traditionally, AWS has provided mechanisms like IAM Roles for Service Accounts (IRSA) to enable fine-grained permissions management within EKS clusters. However, EKS Pod Identity, a newer feature, offers a more refined and efficient solution.

In this blog, we’ll explore how EKS Pod Identity differs from IRSA, and why it represents a significant improvement for identity management in Amazon EKS based environments. Let's assume our EKS cluster resident application needs to securely access data in an AWS s3 bucket.

App Accessing AWS S3

Bringing DevOps and Automation to Machine Learning via MLOps

The vast majority of organizations are new to AI/ML. As a result, most in-house systems and processes supporting this is likely ad-hoc. Industry analysts like Gartner forecast that organizations will need to quickly transition from Pilots to Production with AI/ML in order to make it across the chasm.

Most organizations already have reasonably mature DevOps processes and systems in place. So, going mainstream with AI should be a walk in the park. Correct? Turns out that this is not really true “IT leaders responsible for AI are discovering the AI pilot paradox, where launching pilots is deceptively easy but deploying them into production is notoriously challenging.” by Chirag Dekate, Gartner

In this blog, we will try and answer the following question:

Why do we need a new process called MLOps when most organizations already have reasonably mature DevOps practices? How is MLOps different from DevOps?

DevOps vs MLOps

GPU Metrics - SM Clock

In the previous blog, we discussed why tracking and reporting GPU Memory Utilization metrics matters. In this blog, we will dive deeper into another critical GPU metric i.e. GPU SM Clock. The GPU SM clock (Streaming Multiprocessor clock) metric refers to the clock speed at which the GPU's cores (SMs) are running.

The SM is the main processing unit of the GPU, responsible for executing compute tasks such as deep learning operations, simulations, and graphics rendering. Monitoring the SM clock speed can help users assess the performance and health of your GPU during workloads and detect potential bottlenecks related to clock speed throttling.

GPU SM Clock

Important

Navigate to documentation for Rafay's integrated capabilities for Multi Cluster GPU Metrics Aggregation & Visualization.

GPU Metrics - Memory Utilization

In the introductory blog on GPU metrics, we discussed about the GPU metrics that matter and why they matter. In this blog, we will dive deeper into one of the critical GPU metrics i.e. GPU Memory Utilization.

GPU memory utilization refers to the percentage of the GPU’s dedicated memory (i.e. framebuffer) that is currently in use. It measures how much of the available GPU memory is occupied by data such as models, textures, tensors, or intermediate results during computation.

GPU Memory Utilization

Important

Navigate to documentation for Rafay's integrated capabilities for Multi Cluster GPU Metrics Aggregation & Visualization.

What GPU Metrics to Monitor and Why?

With the increasing reliance on GPUs for compute-intensive tasks such as machine learning, deep learning, data processing, and rendering, both infrastructure administrators and users of GPUs (i.e. data scientists, ML engineers and GenAI app developers) require timely access and insights into performance, efficiency, and overall health of their GPU resources.

In order to make data driven, logical decisions, it is critical for these users to have access to critical metrics for their GPUs. This is the first blog in a series where we will describe the GPU metrics that you should track and monitor. In subsequent blogs, we will do a deep dive into each metric, why it matters and how to use it effectively.

Intro to GPU Metrics

Important

Navigate to documentation for Rafay's integrated capabilities for Multi Cluster GPU Metrics Aggregation & Visualization.

PyTorch vs. TensorFlow: A Comprehensive Comparison in 2024

Note

Listen to a conversation based on this blog post. Tell us what you think about it.

When it comes to deep learning frameworks, PyTorch and TensorFlow are two of the most prominent tools in the field. Both have been widely adopted by researchers and developers alike, and while they share many similarities, they also have key differences that make them suitable for different use cases.

We thought this blog would be timely especially with the PyTorch 2024 Conference right around the corner.

In this blog, we’ll explore the main differences between PyTorch and TensorFlow across several dimensions such as ease of use, dynamic vs. static computation, ecosystem, deployment, community, and industry adoption. In a follow-on blog, we will describe how Rafay’s customers use both PyTorch and TensorFlow for their AI/ML projects.

PyTorch vs TensorFlow

Secure Access to Azure Services using Workload Identity for Azure AKS

Although Azure Kubernetes Service (AKS) allows you to deploy containerized workloads in a managed Kubernetes environment, developers still need to deal with the challenge of securely managing access to Azure resources (e.g. Key Vault or Azure Storage). Traditionally, secrets like API keys or service account credentials are used to authenticate and authorize workloads, but this approach presents security risks and operational overhead.

In Azure for AKS clusters, developers have access to something similar called Workload Identity. It is a modern, secure, and scalable way to manage access without the hassle of managing secrets. In this blog post, we'll dive deep into what Workload Identity is, how it works in AKS, and why it's a game-changer for Kubernetes clusters on Azure.

App Accessing Azure Service

Note

In a related blog, we will see how users can achieve something similar in Amazon EKS clusters using EKS Pod Identity.