ML engineers running AI workflows on Kubernetes face a fundamentally different problem than teams running standard web services. GPU scarcity, multi-hour training jobs, distributed coordination across nodes, and the split between training and serving infrastructure all demand tooling that was purpose-built for this environment. By 2026 the market has settled around a short list of tools that have proven themselves at scale. This guide covers each one with real deploy commands, status checks, and debugging patterns you can use today.
What Makes AI Workflow Orchestration on Kubernetes Different
Standard pipeline orchestration is mostly about sequencing tasks and handling retries. AI workflow orchestration on Kubernetes adds several layers of complexity that general-purpose tools handle poorly.
GPU scheduling. GPUs are not scheduled like CPU or memory. The NVIDIA device plugin exposes nvidia.com/gpu as an extended resource that is requested in whole units and cannot be overcommitted, and getting a pod onto the right node requires combining resource requests, node selectors, and tolerations. A misconfigured pod sits in Pending, with the reason visible only in its scheduling events, while a GPU node stays idle.
# Check GPU capacity across all nodes
kubectl get nodes -o json | \
jq '.items[] | {name: .metadata.name, gpu: .status.capacity["nvidia.com/gpu"]}'
# Verify the NVIDIA device plugin is running
kubectl get daemonset -n kube-system nvidia-device-plugin-daemonset
# Find all pods currently requesting GPU
# (the jq program is single-quoted, so it must not contain shell line-continuation backslashes)
kubectl get pods --all-namespaces -o json | \
  jq '.items[]
      | select(any(.spec.containers[]; .resources.requests["nvidia.com/gpu"] != null))
      | {namespace: .metadata.namespace, name: .metadata.name,
         gpu: [.spec.containers[].resources.requests["nvidia.com/gpu"] // empty]}'
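The three pieces named above (resource request, node selector, toleration) can be sketched as a small manifest builder. This is a minimal illustration, not a recommended production spec: the node label `nvidia.com/gpu.present=true` and the taint key `nvidia.com/gpu` are assumptions about how your GPU nodes are labelled and tainted, and the function name is hypothetical.

```python
import json


def gpu_pod_manifest(name, image, gpus=1,
                     node_label=("nvidia.com/gpu.present", "true"),
                     taint_key="nvidia.com/gpu"):
    """Build a Pod manifest that combines the three pieces GPU scheduling needs.

    node_label and taint_key are assumptions; check your own nodes with
    `kubectl get nodes --show-labels` and `kubectl describe node <name>`.
    """
    label_key, label_value = node_label
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            # 1. Node selector: only consider nodes labelled as GPU nodes.
            "nodeSelector": {label_key: label_value},
            # 2. Toleration: allow scheduling onto nodes tainted for GPU use.
            "tolerations": [
                {"key": taint_key, "operator": "Exists", "effect": "NoSchedule"}
            ],
            "containers": [{
                "name": "main",
                "image": image,
                # 3. Resource limit: GPUs are requested as whole integers.
                "resources": {"limits": {"nvidia.com/gpu": str(gpus)}},
            }],
        },
    }


if __name__ == "__main__":
    # kubectl accepts JSON manifests, so the output can be piped to `kubectl apply -f -`.
    print(json.dumps(gpu_pod_manifest("gpu-smoke-test", "my-training-image:latest"), indent=2))
```

If any one of the three is missing or mismatched, the pod stays Pending, which is why checking all three together is usually faster than checking them one at a time.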
Long-running jobs. A training run that takes eight hours is not a batch job in the traditional sense. It needs checkpointing, graceful restart on preemption, and observability at the replica level — not just at the pod level.
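As a rough sketch of what checkpointing with graceful restart can look like inside the training container: Kubernetes sends SIGTERM before it kills a pod, so the loop below saves state when it sees that signal and resumes from the last saved step on the next start. The `Checkpointer` class, the JSON state format, and the `train` function are all hypothetical stand-ins; a real job would save model and optimizer state to durable storage.

```python
import json
import os
import signal
import tempfile


class Checkpointer:
    """Hypothetical helper: persists a small state dict and tracks stop requests."""

    def __init__(self, path):
        self.path = path
        self.stop_requested = False

    def install_signal_handler(self):
        # Kubernetes sends SIGTERM on preemption, node drain, or pod deletion.
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame):
        self.stop_requested = True

    def load(self):
        if not os.path.exists(self.path):
            return {"step": 0}
        with open(self.path) as f:
            return json.load(f)

    def save(self, state):
        # Write to a temp file in the same directory, then rename atomically,
        # so a kill in the middle of a save never leaves a truncated checkpoint.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, self.path)


def train(ckpt, total_steps, checkpoint_every=100):
    """Run (or resume) training and return the last completed step."""
    state = ckpt.load()
    while state["step"] < total_steps:
        state["step"] += 1  # placeholder for one real training step
        if ckpt.stop_requested or state["step"] % checkpoint_every == 0:
            ckpt.save(state)
        if ckpt.stop_requested:
            break
    ckpt.save(state)
    return state["step"]
```

In a real pod you would call `install_signal_handler()` once at startup and make sure `terminationGracePeriodSeconds` is long enough for one checkpoint write to finish.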
Multi-node coordination. Distributed training using NCCL or MPI requires all replicas to start together, communicate over high-bandwidth interconnects, and fail gracefully if any replica crashes. The controller managing the job needs to understand this topology, not just restart individual pods.
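The "all replicas start together" requirement is often called gang scheduling. The sketch below illustrates only the all-or-nothing idea with a simple first-fit placement; it is not how any particular scheduler or operator implements it, and the function name and inputs are hypothetical.

```python
def can_gang_schedule(free_gpus_per_node, replicas, gpus_per_replica):
    """Return a node name for every replica, or None if the whole gang does not fit.

    free_gpus_per_node: dict of node name -> currently free GPUs.
    Placement is first-fit, purely for illustration.
    """
    free = dict(free_gpus_per_node)  # copy so the caller's view is not mutated
    placement = []
    for _ in range(replicas):
        node = next((n for n, g in free.items() if g >= gpus_per_replica), None)
        if node is None:
            # One replica cannot be placed, so nothing should start:
            # a partially started job would hold GPUs while waiting forever.
            return None
        free[node] -= gpus_per_replica
        placement.append(node)
    return placement
```

For example, `can_gang_schedule({"a": 2, "b": 1}, 3, 1)` places all three replicas, while asking for four replicas on the same nodes returns `None` rather than starting three and leaving the fourth Pending.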
Model serving vs. training. Inference workloads have completely different scaling patterns: they are latency-sensitive, need horizontal autoscaling, and often require GPU time-slicing or MIG partitioning. Mixing serving and training into the same orchestration layer creates resource contention.
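To make the serving-side scaling pattern concrete, here is the core ratio that the Kubernetes Horizontal Pod Autoscaler documents for its replica calculation, written as a small function. It is a simplified sketch: the real controller also applies a tolerance band, stabilization windows, and readiness handling, and the `min_replicas`/`max_replicas` defaults here are arbitrary.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """desired = ceil(current_replicas * current_metric / target_metric), clamped."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))
```

With two inference replicas averaging 90 in-flight requests against a target of 60, this asks for three replicas. A training job has no equivalent signal, which is one reason the two workload types scale so differently.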
With that context, here are the tools that ML and platform teams are actually running in production.
Tool 1: KubeFlow — Most Complete ML Platform
KubeFlow is the most comprehensive open-source ML platform for Kubernetes. It ships with Pipelines (DAG execution), Training Operator (PyTorchJob, TFJob, MXJob, XGBoostJob), and KServe for model serving. The tradeoff is footprint: a full KubeFlow install runs dozens of pods across multiple namespaces.
Installation
# Install KubeFlow Pipelines v2.2.0 via Kustomize
export PIPELINE_VERSION=2.2.0
kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/cluster-scoped-resources?ref=$PIPELINE_VERSION"
kubectl wait --for condition=established --timeout=60s crd/applications.app.k8s.io
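# Deploy the platform-agnostic overlay (check the manifests directory for your release if this path differs)
kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/env/platform-agnostic?ref=$PIPELINE_VERSION"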
# Confirm all components are running
kubectl get pods -n kubeflow
kubectl get svc -n kubeflow
# Access the Pipelines UI locally
kubectl port-forward -n kubeflow svc/ml-pipeline-ui 8080:80
Submitting Distributed Training Jobs
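The Training Operator manages multi-replica jobs as a single resource. It is a separate component from Pipelines, so it must be installed (on its own or as part of a full KubeFlow distribution) before the job below will be accepted. Here is a four-replica PyTorchJob (one master and three workers, one GPU each) for LLM fine-tuning: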
cat > pytorch-training.yaml << 'EOF'
apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: llm-fine-tune
  namespace: kubeflow
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: my-training-image:latest
              resources:
                limits:
                  nvidia.com/gpu: "1"
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: my-training-image:latest
              resources:
                limits:
                  nvidia.com/gpu: "1"
EOF
kubectl apply -f pytorch-training.yaml
# Check job status
kubectl get pytorchjob -n kubeflow
kubectl describe pytorchjob llm-fine-tune -n kubeflow
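# Stream logs from the master replica
# (recent Training Operator releases prefix pod labels with training.kubeflow.org/)
kubectl logs -n kubeflow \
  -l training.kubeflow.org/job-name=llm-fine-tune,training.kubeflow.org/replica-type=master \
  --follow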
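Inside each replica, the training script has to discover its place in the job. The sketch below assumes the Training Operator's usual behavior of injecting `MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE`, and `RANK` into PyTorchJob pods; the `DistEnv` dataclass and `read_dist_env` function are hypothetical helpers, and in practice `torch.distributed` reads the same variables directly.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DistEnv:
    master_addr: str
    master_port: int
    world_size: int
    rank: int

    @property
    def is_master(self):
        # By convention rank 0 coordinates checkpoints and logging.
        return self.rank == 0


def read_dist_env(environ=None):
    """Parse the rendezvous settings from the environment, failing loudly if any is missing."""
    env = os.environ if environ is None else environ
    required = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK")
    missing = [key for key in required if key not in env]
    if missing:
        raise RuntimeError("missing distributed-training variables: " + ", ".join(missing))
    return DistEnv(
        master_addr=env["MASTER_ADDR"],
        master_port=int(env["MASTER_PORT"]),
        world_size=int(env["WORLD_SIZE"]),
        rank=int(env["RANK"]),
    )
```

For the job above, each of the four replicas would see a world size of 4. Failing fast on a missing variable makes a misconfigured replica show up in its logs immediately rather than as a silent hang at rendezvous.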
Strengths: Complete ecosystem with no gaps between training, pipelines, and serving. Originated at Google and is now a CNCF project, with a large community and active releases. KServe handles production model serving with canary rollouts and autoscaling.
Weaknesses: Complex to upgrade across minor versions. YAML sprawl is significant at scale. The full platform footprint is not appropriate for teams that only need one component.
Tool 2: KubeRay (Ray on Kubernetes)
KubeRay is the Kubernetes operator for Ray, the distributed Python framework built at UC Berkeley's RISELab. In 2026, KubeRay is the dominant choice for LLM inference workloads because of the tight integration between Ray and vLLM, the leading open-source inference engine.
Installation and Cluster Deployment
# Install the KubeRay operator via Helm
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm install kuberay-operator kuberay/kuberay-operator \
-n ray-system --create-namespace
# Deploy a GPU-enabled Ray cluster
# (the ray-jobs namespace must exist first; KubeRay 1.x serves the CRDs as ray.io/v1)
kubectl create namespace ray-jobs
cat > ray-cluster.yaml << 'EOF'
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-gpu
  namespace: ray-jobs
spec:
  rayVersion: '2.40.0'
  headGroupSpec:
    rayStartParams:
      dashboard-host: '0.0.0.0'
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray-ml:2.40.0-gpu
            resources:
              limits:
                nvidia.com/gpu: "1"
                cpu: "4"
                memory: "16Gi"
  workerGroupSpecs:
    - groupName: gpu-workers
      replicas: 4
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray-ml:2.40.0-gpu
              resources:
                limits:
                  nvidia.com/gpu: "2"
                  cpu: "8"
                  memory: "32Gi"
EOF
kubectl apply -f ray-cluster.yaml
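Before applying a spec like this, it helps to add up what it asks for: one GPU on the head plus four workers with two GPUs each is nine GPUs. The function below is a small sketch of that arithmetic over a RayCluster manifest loaded as a dict (for example from `kubectl get raycluster raycluster-gpu -n ray-jobs -o json`); the function name is hypothetical and it only counts `nvidia.com/gpu` limits.

```python
def total_gpus(ray_cluster):
    """Sum nvidia.com/gpu limits across the head pod and all worker replicas."""

    def pod_gpus(template):
        return sum(
            int(c.get("resources", {}).get("limits", {}).get("nvidia.com/gpu", 0))
            for c in template["spec"]["containers"]
        )

    spec = ray_cluster["spec"]
    total = pod_gpus(spec["headGroupSpec"]["template"])
    for group in spec.get("workerGroupSpecs", []):
        # Each worker group contributes replicas x GPUs-per-pod.
        total += group.get("replicas", 0) * pod_gpus(group["template"])
    return total
```

Comparing this number with the cluster's GPU capacity up front explains most cases where some Ray workers start and the rest stay Pending.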
Submitting a Ray Job
kubectl apply -f - << 'EOF'
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: llm-inference-job
  namespace: ray-jobs
spec:
  rayClusterSpec:
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - image: rayproject/ray-ml:2.40.0-gpu
              name: ray-head
  entrypoint: "python /app/inference.py"
  runtimeEnvYAML: |
    pip:
      - vllm==0.4.0
EOF
# Check cluster and job status
kubectl get raycluster -n ray-jobs
kubectl get rayjob -n ray-jobs
# Stream head node logs
kubectl logs -n ray-jobs \
-l ray.io/cluster=raycluster-gpu,ray.io/node-type=head
Strengths: Best option for vLLM inference on Kubernetes. Ray Serve supports multi-model serving with traffic splitting, and the distributed Python programming model maps naturally to LLM workloads.
Weaknesses: Python-centric; Ray's Java and C++ APIs see far less use. Debugging distributed Ray tasks requires understanding Ray's actor and task model, which has a learning curve distinct from standard Kubernetes debugging.
Tool 3: Argo Workflows
Argo Workflows is the most widely deployed workflow engine for Kubernetes AI pipelines. It does not have GPU-specific abstractions — training jobs are just containers with nvidia.com/gpu in their resource spec. That simplicity is also its strength: Argo Workflows composes well with any containerized workload and is the backbone of many MLOps platforms.
Installation and Pipeline Submission
# Install Argo Workflows
kubectl create namespace argo
kubectl apply -n argo \
-f https://github.com/argoproj/argo-workflows/releases/latest/download/install.yaml
# Define and submit a multi-step training pipeline
cat > training-workflow.yaml << 'EOF'
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: model-training-pipeline
spec:
  entrypoint: train-and-evaluate
  templates:
    - name: train-and-evaluate
      steps:
        - - name: data-prep
            template: preprocess
        - - name: train
            template: train-model
        - - name: evaluate
            template: evaluate-model
    - name: preprocess
      container:
        image: my-data-prep:latest
        command: [python, /app/preprocess.py]
    - name: train-model
      container:
        image: my-training:latest
        command: [python, /app/train.py]
        resources:
          limits:
            nvidia.com/gpu: "1"
    - name: evaluate-model
      container:
        image: my-eval:latest
        command: [python, /app/evaluate.py]
EOF
argo submit -n argo training-workflow.yaml --watch
Debugging Failed Steps
# Inspect overall workflow status
argo get -n argo model-training-pipeline
# Get logs for all containers in the workflow
kubectl logs -n argo \
-l workflows.argoproj.io/workflow=model-training-pipeline \
-c main
# Retry a failed workflow from the point of failure
argo retry -n argo model-training-pipeline
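When a workflow has many steps, it can be quicker to filter the workflow's JSON status than to read the full tree. The sketch below assumes the `status.nodes` map that Argo Workflows records for each workflow, where pod nodes carry `displayName`, `type`, `phase`, and `message`; the `failed_steps` function itself is a hypothetical helper.

```python
import json
import sys


def failed_steps(workflow):
    """Return the pod-level steps whose phase is Failed or Error."""
    nodes = workflow.get("status", {}).get("nodes") or {}
    return [
        {
            "step": node.get("displayName"),
            "phase": node["phase"],
            "message": node.get("message", ""),
        }
        for node in nodes.values()
        # Skip Steps/StepGroup nodes: they only mirror the failure of their children.
        if node.get("type") == "Pod" and node.get("phase") in ("Failed", "Error")
    ]


if __name__ == "__main__":
    # Usage: argo get -n argo model-training-pipeline -o json | python failed_steps.py
    for step in failed_steps(json.load(sys.stdin)):
        print(step["step"], step["phase"], step["message"])
```

Filtering on pod nodes keeps the output to the steps that actually ran a container, which is where the logs worth reading live.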
Strengths: Low Kubernetes complexity. Mature, widely adopted, excellent UI. Works with any container image and any cloud provider.
Weaknesses: No built-in distributed training abstractions. For multi-node GPU jobs, Argo typically wraps a KubeFlow Training Operator job, adding another layer.
Tool 4: Flyte
Flyte is a strongly typed, reproducibility-focused workflow platform that started at Lyft and is now maintained by Union.ai. Its key differentiator is built-in caching: if a task with caching enabled has already run with the same inputs and cache version, Flyte reuses the stored output instead of running the task again. For ML pipelines where data preprocessing can take hours, this matters.
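The idea behind input-keyed caching can be sketched in a few lines. This is not Flyte's implementation or API, only an illustration of the principle: the cache key is derived from the task name, a version string, and the inputs, so changing any of them forces a re-run. `StepCache` is a hypothetical in-memory class, and inputs are assumed to be JSON-serializable.

```python
import hashlib
import json


class StepCache:
    """Hypothetical in-memory cache keyed by task name, version, and inputs."""

    def __init__(self):
        self._store = {}
        self.hits = 0

    def run(self, fn, version, **inputs):
        # Serialize with sorted keys so the same inputs always hash identically.
        payload = json.dumps(
            {"task": fn.__name__, "version": version, "inputs": inputs},
            sort_keys=True,
        )
        key = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        if key in self._store:
            self.hits += 1
            return self._store[key]
        result = fn(**inputs)
        self._store[key] = result
        return result
```

Bumping the version string is the escape hatch: it invalidates earlier results when the task's code changes even though its inputs have not.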
Installation and Workflow Execution
# Install Flyte via Helm (EKS values as example)
helm repo add flyteorg https://flyteorg.github.io/flyte
helm install flyte-core flyteorg/flyte-core \
-n flyte --create-namespace \
--values https://raw.githubusercontent.com/flyteorg/flyte/master/charts/flyte-core/values-eks.yaml
# Register and run a workflow remotely
pyflyte run \
--remote \
--project my-ml-project \
--domain production \
my_workflow.py train_model \
--dataset s3://my-bucket/training-data \
--epochs 10
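# Check workflow pods for a specific execution
# (by default Flyte runs executions in a <project>-<domain> namespace, not in the flyte namespace)
kubectl get flyteworkflows -n my-ml-project-production
kubectl get pods -n my-ml-project-production -l "execution-id=<execution-id>"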
Strengths: Type-safe Python SDK catches errors at registration time rather than at runtime. Built-in data lineage and artifact versioning. Caching makes iterative development significantly faster.
Weaknesses: Smaller community than KubeFlow or Argo. Initial cluster setup is more involved than Argo, and the mental model of typed tasks and workflows requires some onboarding time.
Tool 5: Prefect with Kubernetes Worker
Prefect takes the most developer-friendly approach to Kubernetes AI pipelines. Flows are plain Python functions decorated with @flow and @task. Local execution and Kubernetes execution use the same code path, which eliminates a class of environment parity bugs common in other tools.
Installation and Job Submission
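# Install the Prefect worker, which polls the work pool and launches each flow run as a Kubernetes Job
helm repo add prefect https://prefecthq.github.io/prefect-helm
helm repo update
helm install prefect-worker prefect/prefect-worker \
  -n prefect --create-namespace \
  --set worker.config.workPool=k8s-gpu-pool
# GPU limits belong in the work pool's base job template (the pods that run flows),
# not on the worker pod, which only polls for work
# Verify the worker is running
kubectl get pods -n prefect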
# Trigger a GPU training deployment from the CLI
prefect deployment run 'train-model/production' \
--param epochs=50 \
--param learning_rate=0.001
Strengths: Best developer experience of any tool on this list. Local testing mirrors Kubernetes production closely. The Prefect UI provides excellent run history and scheduling.
Weaknesses: The smoothest experience depends on Prefect Cloud for the orchestration backend. Self-hosted Prefect Server works but requires additional operational effort.
Common Kubernetes AI Workflow Debugging Patterns
These commands apply across all five tools. When a training job misbehaves, this is the standard diagnostic sequence.
# Find out why a training pod is stuck in Pending
kubectl describe pod -n ml-training <pod-name> | grep -A20 "Events:"
# List all failed jobs across all namespaces with their failure reason
kubectl get jobs --all-namespaces -o json | \
  jq '.items[]
      | select(any(.status.conditions[]?; .type == "Failed" and .status == "True"))
      | {ns: .metadata.namespace, name: .metadata.name,
         reason: (.status.conditions | map(select(.type == "Failed")) | first | .reason)}'
# Debug GPU node affinity and taint mismatches
kubectl get pod <pod> -n ml-training -o jsonpath='{.spec.affinity}'
kubectl get nodes -o json | \
jq '.items[] | select(.spec.taints != null) | {name: .metadata.name, taints: .spec.taints}'
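# Stream logs from all replicas in a distributed training job simultaneously
# (with a label selector, kubectl follows at most 5 pods unless --max-log-requests is raised)
kubectl get pods -n ml-training -l training.kubeflow.org/job-name=my-training-job
kubectl logs -n ml-training \
  -l training.kubeflow.org/job-name=my-training-job \
  --all-containers=true --prefix=true \
  --follow --max-log-requests=10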
# Get wall-clock training duration for a completed job
kubectl get job <job-name> -n ml-training \
-o jsonpath='{.status.startTime} -> {.status.completionTime}'
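# Check for OOM kills across all namespaces (node-level events, typically emitted by node-problem-detector)
kubectl get events --all-namespaces --field-selector reason=OOMKilling
# Container-level OOM kills are recorded in pod status
kubectl get pods --all-namespaces -o json | \
  jq -r '.items[]
         | select(any(.status.containerStatuses[]?;
             .state.terminated.reason == "OOMKilled" or .lastState.terminated.reason == "OOMKilled"))
         | "\(.metadata.namespace)/\(.metadata.name)"'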
# GPU utilization via DCGM exporter
kubectl get pods -n gpu-operator -l app=nvidia-dcgm-exporter
kubectl exec -n gpu-operator <dcgm-pod> -- dcgmi dmon -e 203,204
The GPU utilization check is particularly useful for diagnosing idle GPU nodes — a common issue in Kubernetes ML environments where pod scheduling fails silently and the GPU node sits at zero utilization while jobs queue.
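A complementary check is to look for GPU nodes that have no GPU-requesting pods bound to them at all. The sketch below works on the parsed output of `kubectl get nodes -o json` and `kubectl get pods --all-namespaces -o json`; it measures allocation rather than actual utilization, so it complements the DCGM reading instead of replacing it. The `idle_gpu_nodes` function is a hypothetical helper.

```python
def idle_gpu_nodes(nodes, pods):
    """Return names of nodes that advertise GPUs but have none allocated to active pods."""
    allocated = {}
    for pod in pods.get("items", []):
        # Only pods that are bound to a node and not yet finished hold GPUs.
        if pod.get("status", {}).get("phase") not in ("Pending", "Running"):
            continue
        node_name = pod.get("spec", {}).get("nodeName")
        if not node_name:
            continue
        for container in pod["spec"].get("containers", []):
            limits = container.get("resources", {}).get("limits", {})
            allocated[node_name] = allocated.get(node_name, 0) + int(limits.get("nvidia.com/gpu", 0))

    idle = []
    for node in nodes.get("items", []):
        capacity = int(node.get("status", {}).get("capacity", {}).get("nvidia.com/gpu", 0))
        name = node["metadata"]["name"]
        if capacity > 0 and allocated.get(name, 0) == 0:
            idle.append(name)
    return idle
```

If this returns nodes while training pods are Pending, the cause is usually a selector or toleration mismatch rather than a capacity shortage.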
ClankerCloud.ai for ML Workflow Operations
Running these debugging commands manually works, but it requires knowing which namespace a job is in, which label selectors apply to which tool, and how to correlate logs across replicas. ClankerCloud.ai is a local-first AI workspace that connects directly to your Kubernetes cluster and answers operational questions in plain language.
For the workflows covered in this article, the equivalent clanker ask commands are:
# Find all KubeFlow training failures in the last 24 hours
clanker ask "show me all KubeFlow training jobs that failed in the last 24 hours with their error reason"
# Identify underutilized Ray cluster nodes
clanker ask "which Ray cluster nodes are underutilized right now"
# Surface repeatedly retrying Argo workflow steps
clanker ask "find Argo workflow steps that have been retrying more than 3 times this week"
# Track average training job completion time
clanker ask "what is the average training job completion time for my PyTorchJob workloads this month"
For broader infrastructure audits, the Deep Research feature fans out across all connected providers simultaneously:
clanker ask "run a deep scan of my ML training infrastructure — find stuck jobs, GPU scheduling failures, and idle GPU nodes"
This runs parallel agent swarms across your cluster and returns severity-ranked findings exportable as JSON or Markdown. Full documentation is at docs.clankercloud.ai.
ClankerCloud.ai integrates with AI agent frameworks as an MCP server. If you are building AI-driven DevOps pipelines, see the AI agents integration page and the AI DevOps for teams overview. For platform engineering teams taking vibe-coded prototypes to production Kubernetes, the vibe-coding-to-production guide covers the full path.
Tool Comparison
| Tool | GPU-native | Distributed training | Model serving | K8s complexity | Best for |
|---|---|---|---|---|---|
| KubeFlow | Yes | Yes (Training Operator) | Yes (KServe) | High | Full ML platform |
| KubeRay | Yes | Yes (Ray distributed) | Yes (Ray Serve) | Medium | LLM inference and fine-tuning |
| Argo Workflows | Via pod spec | Via template steps | No | Low | General AI pipelines |
| Flyte | Yes | Via Python SDK | No | Medium | Reproducible ML |
| Prefect | Via job template | Via flow steps | No | Low | Python-first teams |
Frequently Asked Questions
What is the best Kubernetes orchestration tool for AI workflows in 2026?
There is no single best tool — the right choice depends on your workload. KubeFlow is the best fit for teams that need a complete ML platform covering training, pipelines, and serving. KubeRay is the leading choice for LLM inference and distributed fine-tuning. Argo Workflows is the most portable and widely adopted option for general ML pipelines. Flyte wins on reproducibility and caching. Prefect wins on developer experience.
Most production ML platforms at scale use two of these tools in combination: commonly Argo Workflows or Flyte for pipeline DAGs, and KubeFlow Training Operator or KubeRay for distributed training jobs.
How do I run distributed GPU training on Kubernetes with KubeFlow?
Install the KubeFlow Training Operator and define a PyTorchJob resource specifying Master and Worker replica counts with nvidia.com/gpu resource limits in each container spec. The operator manages pod scheduling, inter-replica networking, and restart behavior. Use kubectl get pytorchjob -n kubeflow and kubectl describe pytorchjob <name> -n kubeflow to monitor status. Print logs from all replicas with kubectl logs -n kubeflow -l training.kubeflow.org/job-name=<job-name> --all-containers=true --prefix=true, adding --follow to stream them.
What is KubeRay and how does it compare to KubeFlow for LLM workloads?
KubeRay is the Kubernetes operator for the Ray distributed computing framework. Compared to KubeFlow, KubeRay is easier to operate at the cluster level (fewer components) but requires understanding the Ray programming model at the application level. For LLM inference specifically, KubeRay with vLLM is the most mature path in 2026: the RayJob and RayCluster CRDs make it straightforward to deploy multi-GPU inference endpoints, with Ray Serve handling traffic routing and autoscaling. KubeFlow's KServe is the alternative for teams already on the KubeFlow platform.
How do I debug a stuck Kubernetes ML training job?
Start with kubectl describe pod -n <namespace> <pod-name> and read the Events section. The most common causes of stuck training pods are: missing GPU resources on schedulable nodes (check node capacity with kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, gpu: .status.capacity["nvidia.com/gpu"]}'), node taints that the pod spec does not tolerate, and image pull errors. For jobs that start but stall mid-training, check for OOM kills with kubectl get events --all-namespaces --field-selector reason=OOMKilling and GPU utilization with the DCGM exporter. For distributed jobs, stream logs from all replicas simultaneously with --all-containers=true --prefix=true to identify which replica is blocking.
Get Started
If you want to see how ClankerCloud.ai connects to your existing Kubernetes ML infrastructure, book a demo or create a free account. The beta tier is free and supports AWS, GCP, Azure, and self-hosted Kubernetes clusters.
For common questions about setup and supported providers, see the FAQ.
