Troubleshooting

The issues documented here span deployment failures, cluster-level misconfigurations, UI access errors, and authentication challenges. The goal is to help users identify and resolve problems quickly.

A common theme across templates is capacity-related issues, often involving resource allocation, node scaling, and workload identity settings. Addressing these correctly ensures stable cluster operations and a smooth Kubeflow deployment.

Each issue is documented with its error message, possible cause, recommended workaround, and additional comments where necessary.


Kubeflow - Errors and Troubleshooting Guide

Here are some scenarios that may arise when using the Kubeflow template.

a. YAML Parse Error Due to Hidden macOS Metadata Files

Error Message

Reason:
2 problems:

- activity in progress: group.res-gke-feast-gcp.output
- activity failed: group.res-gke-kubeflow-gcp.output: activity failed: group.res-gke-kubeflow-gcp.output: exit status 1

Error: YAML parse error on istio/templates/Secret/._kubeflow-gateway-tls-secret.yaml: error converting YAML to JSON: yaml: control characters are not allowed

  with helm_release.istio,
  on main.tf line 108, in resource "helm_release" "istio":
  108: resource "helm_release" "istio" {

Possible Cause

  • The Docker image was built and pushed to ArtifactDriver from a Mac machine.
  • macOS tooling adds hidden AppleDouble metadata files (prefixed with ._, as in ._kubeflow-gateway-tls-secret.yaml above) alongside the chart files; when the GitOps Agent pulls the Helm charts from the image, these binary files are parsed as YAML and fail because they contain control characters.

Resolution Steps

  • Rebuild the Docker image from an agent running on an Ubuntu machine.
  • Push the newly built image to ArtifactDriver.
  • Retry deploying the Helm chart.

Note: This issue typically does not occur in production but may happen in development environments.
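
If rebuilding on Linux is not immediately possible, one hedged workaround (not part of the original guidance) is to strip the hidden AppleDouble files before packaging the chart. The chart path and archive name below are illustrative:

# Remove hidden macOS metadata files from the chart directory
find ./istio -name '._*' -type f -delete

# When archiving on macOS, COPYFILE_DISABLE=1 stops the bundled bsdtar from
# emitting AppleDouble entries in the first place
COPYFILE_DISABLE=1 tar -czf istio-chart.tgz istio/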

b. Deployment Failure - Missing Required Variable (rafay_project)

Error Message

Reason:
activity failed: group.res-gke-infra-gcp.destroy: activity failed: group.res-gke-infra-gcp.destroy: exit status 1

Error: No value for required variable

  on variables.tf line 71:
  71: variable "rafay_project" {

The root module input variable "rafay_project" is not set and has no default
value. Use a -var or -var-file command line argument to provide a value for
this variable.

Possible Cause

  • The error indicates that the rafay_project variable is not set and has no default value in the Terraform configuration.
  • This may occur if the GitOps Agent is out of date and missing the necessary updates.

Resolution Steps

  • Update the GitOps Agent by pulling the latest Docker image:

docker pull <latest-agent-image>

  • Redeploy after pulling the updated image.
  • If the issue persists, manually verify that the rafay_project variable is correctly set in the Terraform configuration.
  • Pass the required variable explicitly using:

terraform apply -var="rafay_project=<project_name>"

Best Practices

If pulling and deploying again does not fix the issue, there may be an underlying problem in the agent or Terraform configuration.
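
As a longer-term guard, the variable can also be supplied through a .tfvars file or given a default. A minimal sketch, assuming the variables.tf declaration shown in the error above; the description and commented-out default are illustrative:

variable "rafay_project" {
  type        = string
  description = "Rafay project the environment is deployed into"

  # Optional: a default avoids the "No value for required variable" error,
  # at the cost of masking a missing -var/-var-file argument.
  # default = "defaultproject"
}

# Supply the value at apply time from a file instead of -var:
# terraform apply -var-file="project.tfvars"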

c. Invalid TLS and Domain Selectors

Error Message

handle failed: unable to build run config for trigger 01JHNBMC8S7FCZYKRJ1JHKB6GY:  
environment template kubeflow-gcp-template variable TLS Certificate selector is invalid;  
environment template kubeflow-gcp-template variable TLS Key selector is invalid;  
environment template kubeflow-gcp-template variable Rafay Domain selector is invalid.

Possible Cause

  • The Config Context for system-dns-config is out of date.
  • The GitOps Agent is not running the latest image, leading to invalid selector references.

Resolution Steps

  • Update the GitOps Agent by pulling the latest Docker image:

docker pull <image_name>

  • Redeploy the environment template after updating the agent.

Note: This issue should not occur in production but may appear in development environments.
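
To confirm the agent actually picked up a newer image, a quick check (the image name is a placeholder) is to compare the resolved digest before and after the pull:

docker pull <image_name>
docker inspect --format='{{index .RepoDigests 0}}' <image_name>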

d. Deployment Failure - Invalid API Key in Function Call

Error Message

Error: Error in function call

  on outputs.tf line 9, in output "host":
   9:   value = yamldecode(data.rafay_download_kubeconfig.kubeconfig_cluster.kubeconfig).clusters[0].cluster.server
    ├────────────────
    │ while calling yamldecode(src)
    │ data.rafay_download_kubeconfig.kubeconfig_cluster.kubeconfig is ""

Call to function "yamldecode" failed: on line 1, column 1: missing start of
document.

Possible Cause

  • This error occurs when an incorrect API Key is used for the deployment.
  • The API Key provided does not match the organization the deployment is running in, leading to a failure when retrieving the Kubernetes configuration.

Resolution Steps

  • Verify that the API Key being used is valid and active.
  • Ensure the API Key corresponds to the correct organization where the deployment is running.
  • If necessary, generate a new API Key from the Rafay Controller UI and update the deployment configuration.

Note: This issue is commonly caused by incorrect API Key usage; ensuring the API Key matches the deployment organization will prevent this error.
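
Where the template can be modified, a hedged option (not part of the original outputs.tf) is to wrap the decode in Terraform's try() so an empty kubeconfig surfaces as a null output rather than a hard yamldecode failure:

output "host" {
  # Falls back to null when the kubeconfig comes back empty (e.g., wrong API Key)
  # instead of failing with "missing start of document"
  value = try(
    yamldecode(data.rafay_download_kubeconfig.kubeconfig_cluster.kubeconfig).clusters[0].cluster.server,
    null
  )
}

This trades a hard failure for a silently null output, so it is better suited to debugging than to production use.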

e. CSRF Check Failed

Error Message

CSRF check failed. This may happen if you opened the login form in more than 1 tab. Please try to log in again.

Possible Cause

  • This error occurs when attempting to access the Kubeflow UI and logging in via Okta immediately after deployment.
  • The DNS entry may not have fully propagated, causing temporary login failures.

Resolution Steps

  • Wait a few minutes and try logging in again; DNS propagation after deployment can take several minutes to complete.
  • Clear browser cache and cookies before retrying.

Note: This issue is temporary and usually resolves once DNS propagation is complete.
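
To check whether the DNS entry has propagated before retrying, a simple lookup is enough; the hostname below is a placeholder for the deployed Kubeflow domain:

# An empty answer means the record has not propagated yet
dig +short kubeflow.example.com

# nslookup is an alternative where dig is unavailable
nslookup kubeflow.example.com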

f. Access Denied After Successful Environment Deployment

Error Message

The Kubeflow UI shows the following error after a successful environment deployment.

Access Denied

Possible Cause

  • The oidc-authservice-0 Pod in the cluster has not initialized properly.
  • This prevents proper authentication, leading to an access denial.

Resolution Steps

  • Navigate to Infrastructure → Clusters → <underlying_cluster_name> → Resources → Pods.
  • Locate oidc-authservice-0 in the istio-system namespace.
  • Delete the Pod to force a restart.

Best Practices

  • The Pod restarts automatically, cycling through container initialization.
  • Once it reaches a Running status (1/1), the Kubeflow UI should be accessible.
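
For those working from the Kubectl console rather than the Resources view, a hedged equivalent of the delete-and-watch sequence (Pod and namespace names taken from the steps above) is:

# Delete the stuck auth service Pod; its controller recreates it
kubectl -n istio-system delete pod oidc-authservice-0

# Watch until the new Pod reports Running 1/1
kubectl -n istio-system get pod oidc-authservice-0 -w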

Capacity Issues

Below are some capacity-related issues that can occur across templates.

a. Helm Release Name Already in Use

Error Message

Error: cannot re-use a name that is still in use

  with helm_release.feast,
  on main.tf line 73, in resource "helm_release" "feast":
  73: resource "helm_release" "feast" {

time=2025-01-07T01:15:00.755Z level=ERROR msg="failed to run open tofu job" error-source=provider error="exit status 1"

Possible Cause

  • The Helm release name is already in use, preventing reinstallation.
  • This issue often occurs when the same environment has been redeployed multiple times without proper cleanup.

Resolution Steps

  • Navigate to Infrastructure → Clusters → <underlying_cluster_name> → Kubectl.
  • Run the following command to list Helm releases:

helm ls -A

  • Identify the conflicting release name (feast).
  • Uninstall the existing Helm release:

helm uninstall feast -n feast

  • In this case, feast is both the release name and the namespace.
  • If unsure about the namespace, locate it under Infrastructure → Clusters → <underlying_cluster_name> → Resources → Pods/Deployments.

Note: This issue commonly occurs when an environment deployment is redeployed multiple times without cleaning up previous Helm releases.
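
Before redeploying, it is worth confirming the release is actually gone; helm ls accepts a regex filter, so a targeted check looks like this:

# Should return no rows once the release has been removed
helm ls -A --filter '^feast$'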

b. EOF Error Preventing Request Execution

Error Message

Error: 1 error occurred:
    * an error on the server ("EOF") has prevented the request from succeeding (post serviceaccounts)

Possible Cause

  • The Google account or service account used for deployment has been signed out.
  • A connectivity issue interrupted the deployment process.

Resolution Steps

  • Redeploy the Environment run to resume the process from where it left off.
  • Ensure that the Google account/service account is active and signed in.
  • Check for any network connectivity issues before redeploying.

Note: This is a temporary issue that can be resolved by redeploying.
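
To verify the account state before redeploying (assuming the gcloud CLI is in use for this GKE deployment; the key file path is a placeholder):

# Show which accounts are known and which is active
gcloud auth list

# For service accounts, re-activate the credentials if the session was dropped
gcloud auth activate-service-account --key-file=/path/to/key.json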

c. Cluster Configuration Issue - Workload Identity Not Enabled

Error Message

Option: Enable Workload Identity

Must be configured to True if the underlying cluster is deployed via the `system-gke-cluster` template, or the `Enable Workload Identity` checkbox must be checked if deployed via Rafay Controller UI's `New Cluster` Provisioning.

Possible Cause

  • Workload Identity was not enabled during cluster creation.
  • If deployed using the system-gke-cluster template, the Enable Workload Identity option must be set to True.
  • If deployed via the Rafay Controller UI, the Enable Workload Identity checkbox must be checked.

Resolution Steps

  • Verify cluster settings: check whether Workload Identity is enabled in the GCP Kubernetes cluster settings.
  • If Workload Identity is not enabled, delete the existing cluster and create a new one with Workload Identity enabled.
  • Ensure the required IAM roles are assigned for authentication.

Best Practices

If Workload Identity is not enabled, Kubeflow deployment will fail to bring up MLOps services properly.
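
To check the current state from the command line (cluster name and location are placeholders), inspect the cluster's workload pool; empty output means Workload Identity is disabled:

# Prints "<project-id>.svc.id.goog" when Workload Identity is enabled
gcloud container clusters describe <cluster_name> \
  --region <region> \
  --format="value(workloadIdentityConfig.workloadPool)"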