Troubleshooting

The issues documented here span deployment failures, cluster-level misconfigurations, UI access errors, and authentication challenges. The goal is to help users identify and resolve problems quickly.

A common theme across templates is capacity-related issues, often involving resource allocation, node scaling, and workload identity settings. Addressing these correctly ensures stable cluster operations and a smooth Kubeflow deployment.

Each issue is documented with its error message, possible cause, recommended workaround, and additional comments where necessary.


Kubeflow - Errors and Troubleshooting Guide

Here are some scenarios that may arise when using the Kubeflow template.

a. YAML Parse Error Due to Hidden macOS Metadata Files

Error Message

Reason:
2 problems:

- activity in progress: group.res-gke-feast-gcp.output
- activity failed: group.res-gke-kubeflow-gcp.output: activity failed: group.res-gke-kubeflow-gcp.output: exit status 1

Error: YAML parse error on istio/templates/Secret/._kubeflow-gateway-tls-secret.yaml: error converting YAML to JSON: yaml: control characters are not allowed

  with helm_release.istio,
  on main.tf line 108, in resource "helm_release" "istio":
  108: resource "helm_release" "istio" {

Possible Cause

  • The Docker image was built and pushed to ArtifactDriver from a Mac machine.
  • macOS tooling adds hidden AppleDouble metadata files (prefixed with ._, as in ._kubeflow-gateway-tls-secret.yaml above) alongside the chart files; when the GitOps Agent pulls the Helm charts from the image, these binary files are parsed as YAML and fail because they contain control characters.

Resolution Steps

  • Rebuild the Docker image from an agent running on an Ubuntu machine.
  • Push the newly built image to ArtifactDriver.
  • Retry deploying the Helm chart.

Note: This issue typically does not occur in production but may happen in development environments.
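
If rebuilding on Linux is not immediately possible, one hedged workaround (not part of the original guidance) is to strip the hidden AppleDouble files before packaging the chart. The chart path and archive name below are illustrative:

# Remove hidden macOS metadata files from the chart directory
find ./istio -name '._*' -type f -delete

# When archiving on macOS, COPYFILE_DISABLE=1 stops the bundled bsdtar from
# emitting AppleDouble entries in the first place
COPYFILE_DISABLE=1 tar -czf istio-chart.tgz istio/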

b. Deployment Failure - Missing Required Variable (rafay_project)

Error Message

Reason:
activity failed: group.res-gke-infra-gcp.destroy: activity failed: group.res-gke-infra-gcp.destroy: exit status 1

Error: No value for required variable

  on variables.tf line 71:
  71: variable "rafay_project" {

The root module input variable "rafay_project" is not set and has no default
value. Use a -var or -var-file command line argument to provide a value for
this variable.

Possible Cause

  • The error indicates that the rafay_project variable is not set and has no default value in the Terraform configuration.
  • This may occur if the GitOps Agent is out of date and missing the necessary updates.

Resolution Steps

  • Update the GitOps Agent by pulling the latest Docker image:

docker pull <latest-agent-image>

  • Redeploy after pulling the updated image.
  • If the issue persists, manually verify that the rafay_project variable is correctly set in the Terraform configuration.
  • Pass the required variable explicitly using:

terraform apply -var="rafay_project=<project_name>"

Best Practices

If pulling and deploying again does not fix the issue, there may be an underlying problem in the agent or Terraform configuration.
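
As a longer-term guard, the variable can also be supplied through a .tfvars file or given a default. A minimal sketch, assuming the variables.tf declaration shown in the error above; the description and commented-out default are illustrative:

variable "rafay_project" {
  type        = string
  description = "Rafay project the environment is deployed into"

  # Optional: a default avoids the "No value for required variable" error,
  # at the cost of masking a missing -var/-var-file argument.
  # default = "defaultproject"
}

# Supply the value at apply time from a file instead of -var:
# terraform apply -var-file="project.tfvars"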

c. Invalid TLS and Domain Selectors

Error Message

handle failed: unable to build run config for trigger 01JHNBMC8S7FCZYKRJ1JHKB6GY:  
environment template kubeflow-gcp-template variable TLS Certificate selector is invalid;  
environment template kubeflow-gcp-template variable TLS Key selector is invalid;  
environment template kubeflow-gcp-template variable Rafay Domain selector is invalid.

Possible Cause

  • The Config Context for system-dns-config is out of date.
  • The GitOps Agent is not running the latest image, leading to invalid selector references.

Resolution Steps

  • Update the GitOps Agent by pulling the latest Docker image:

docker pull <image_name>

  • Redeploy the environment template after updating the agent.

Note: This issue should not occur in production but may appear in development environments.
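
To confirm the agent actually picked up a newer image, a quick check (the image name is a placeholder) is to compare the resolved digest before and after the pull:

docker pull <image_name>
docker inspect --format='{{index .RepoDigests 0}}' <image_name>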

d. Deployment Failure - Invalid API Key in Function Call

Error Message

Error: Error in function call

  on outputs.tf line 9, in output "host":
   9:   value = yamldecode(data.rafay_download_kubeconfig.kubeconfig_cluster.kubeconfig).clusters[0].cluster.server
    ├────────────────
    │ while calling yamldecode(src)
    │ data.rafay_download_kubeconfig.kubeconfig_cluster.kubeconfig is ""

Call to function "yamldecode" failed: on line 1, column 1: missing start of
document.

Possible Cause

  • This error occurs when an incorrect API Key is used for the deployment.
  • The API Key provided does not match the organization the deployment is running in, leading to a failure when retrieving the Kubernetes configuration.

Resolution Steps

  • Verify that the API Key being used is valid and active.
  • Ensure the API Key corresponds to the correct organization where the deployment is running.
  • If necessary, generate a new API Key from the Rafay Controller UI and update the deployment configuration.

Note: This issue is commonly caused by incorrect API Key usage; ensuring the API Key matches the deployment organization will prevent this error.
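
Where the template can be modified, a hedged option (not part of the original outputs.tf) is to wrap the decode in Terraform's try() so an empty kubeconfig surfaces as a null output rather than a hard yamldecode failure:

output "host" {
  # Falls back to null when the kubeconfig comes back empty (e.g., wrong API Key)
  # instead of failing with "missing start of document"
  value = try(
    yamldecode(data.rafay_download_kubeconfig.kubeconfig_cluster.kubeconfig).clusters[0].cluster.server,
    null
  )
}

This trades a hard failure for a silently null output, so it is better suited to debugging than to production use.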

e. CSRF Check Failed

Error Message

CSRF check failed. This may happen if you opened the login form in more than 1 tab. Please try to log in again.

Possible Cause

  • This error occurs when attempting to access the Kubeflow UI and logging in via Okta immediately after deployment.
  • The DNS entry may not have fully propagated, causing temporary login failures.

Resolution Steps

  • Wait a few minutes and try logging in again; DNS propagation after deployment can take several minutes to complete.
  • Clear browser cache and cookies before retrying.

Note: This issue is temporary and usually resolves once DNS propagation is complete.
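
To check whether the DNS entry has propagated before retrying, a simple lookup is enough; the hostname below is a placeholder for the deployed Kubeflow domain:

# An empty answer means the record has not propagated yet
dig +short kubeflow.example.com

# nslookup is an alternative where dig is unavailable
nslookup kubeflow.example.com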

f. Access Denied After Successful Environment Deployment

Error Message

The Kubeflow UI shows the following error after a successful environment deployment.

Access Denied

Possible Cause

  • The oidc-authservice-0 Pod in the cluster has not initialized properly.
  • This prevents proper authentication, leading to an access denial.

Resolution Steps

  • Navigate to Infrastructure → Clusters → <underlying_cluster_name> → Resources → Pods.
  • Locate oidc-authservice-0 in the istio-system namespace.
  • Delete the Pod to force a restart.

Best Practices

  • The Pod restarts automatically, cycling through container initialization.
  • Once it reaches a Running status (1/1), the Kubeflow UI should be accessible.
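
For those working from the Kubectl console rather than the Resources view, a hedged equivalent of the delete-and-watch sequence (Pod and namespace names taken from the steps above) is:

# Delete the stuck auth service Pod; its controller recreates it
kubectl -n istio-system delete pod oidc-authservice-0

# Watch until the new Pod reports Running 1/1
kubectl -n istio-system get pod oidc-authservice-0 -w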

Capacity Issues

Below are some capacity-related issues that can occur across templates.

a. Helm Release Name Already in Use

Error Message

Error: cannot re-use a name that is still in use

  with helm_release.feast,
  on main.tf line 73, in resource "helm_release" "feast":
  73: resource "helm_release" "feast" {

time=2025-01-07T01:15:00.755Z level=ERROR msg="failed to run open tofu job" error-source=provider error="exit status 1"

Possible Cause

  • The Helm release name is already in use, preventing reinstallation.
  • This issue often occurs when the same environment has been redeployed multiple times without proper cleanup.

Resolution Steps

  • Navigate to Infrastructure → Clusters → <underlying_cluster_name> → Kubectl.
  • Run the following command to list Helm releases:

helm ls -A

  • Identify the conflicting release name (feast).
  • Uninstall the existing Helm release:

helm uninstall feast -n feast

  • In this case, feast is both the release name and the namespace.
  • If unsure about the namespace, locate it under Infrastructure → Clusters → <underlying_cluster_name> → Resources → Pods/Deployments.

Note: This issue commonly occurs when an environment deployment is redeployed multiple times without cleaning up previous Helm releases.
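
Before redeploying, it is worth confirming the release is actually gone; helm ls accepts a regex filter, so a targeted check looks like this:

# Should return no rows once the release has been removed
helm ls -A --filter '^feast$'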

b. EOF Error Preventing Request Execution

Error Message

Error: 1 error occurred:
    * an error on the server ("EOF") has prevented the request from succeeding (post serviceaccounts)

Possible Cause

  • The Google account or service account used for deployment has been signed out.
  • A connectivity issue interrupted the deployment process.

Resolution Steps

  • Redeploy the Environment run to resume the process from where it left off.
  • Ensure that the Google account/service account is active and signed in.
  • Check for any network connectivity issues before redeploying.

Note: This is a temporary issue that can be resolved by redeploying.
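
To verify the account state before redeploying (assuming the gcloud CLI is in use for this GKE deployment; the key file path is a placeholder):

# Show which accounts are known and which is active
gcloud auth list

# For service accounts, re-activate the credentials if the session was dropped
gcloud auth activate-service-account --key-file=/path/to/key.json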

c. Cluster Configuration Issue - Workload Identity Not Enabled

Error Message

Option: Enable Workload Identity

Must be configured to True if the underlying cluster is deployed via the `system-gke-cluster` template, or the `Enable Workload Identity` checkbox must be checked if deployed via Rafay Controller UI's `New Cluster` Provisioning.

Possible Cause

  • Workload Identity was not enabled during cluster creation.
  • If deployed using the system-gke-cluster template, the Enable Workload Identity option must be set to True.
  • If deployed via the Rafay Controller UI, the Enable Workload Identity checkbox must be checked.

Resolution Steps

  • Verify cluster settings: check whether Workload Identity is enabled in the GCP Kubernetes cluster settings.
  • If Workload Identity is not enabled, delete the existing cluster and create a new one with Workload Identity enabled.
  • Ensure the required IAM roles are assigned for authentication.

Best Practices

If Workload Identity is not enabled, Kubeflow deployment will fail to bring up MLOps services properly.
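
To check the current state from the command line (cluster name and location are placeholders), inspect the cluster's workload pool; empty output means Workload Identity is disabled:

# Prints "<project-id>.svc.id.goog" when Workload Identity is enabled
gcloud container clusters describe <cluster_name> \
  --region <region> \
  --format="value(workloadIdentityConfig.workloadPool)"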