You have already evaluated the tools, weighed Airbyte against Meltano, and decided what to run. Now the work begins: getting your ETL stack onto Kubernetes without triggering a 2 a.m. incident on the first sync.
This guide is purely operational. It covers namespace and RBAC setup, deploying Airbyte with Helm, deploying Meltano as a Kubernetes CronJob, setting resource limits and HorizontalPodAutoscaler rules, and diagnosing the failure modes that appear most often in production. Every command is ready to copy and run against a real cluster.
If you are still evaluating tools, the AI DevOps for Teams guide covers the earlier decision stage.
1. Pre-Deployment Checklist
Before applying any manifests, confirm these prerequisites:
- Kubernetes version ≥ 1.27. Airbyte's Helm chart and the CronJob batch/v1 API require it. Check with kubectl version (the --short flag was removed in kubectl 1.28; plain kubectl version reports both client and server versions).
- Helm 3.x installed. Version 2 is end-of-life. Confirm with helm version.
- A StorageClass is available. Airbyte requires persistent storage for MinIO and PostgreSQL. Run kubectl get storageclass and note the name of your default class. On EKS this is typically gp2 or gp3; on GKE it is standard-rwo. If no default is set, the Airbyte PersistentVolumeClaims will remain in Pending state indefinitely.
- Sufficient node capacity. A minimal Airbyte install requires roughly 4 vCPU and 8 GB RAM across the namespace. Worker pods add to this per sync.
- Outbound network access from pods if your sources or destinations are external.
Skipping any of these is the most common reason an ETL tools Kubernetes deployment fails silently at the PVC or scheduler layer.
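The checklist above can be scripted as a preflight check. This is a minimal sketch that assumes kubectl and helm are on the PATH and your current kube-context points at the target cluster:

```shell
#!/usr/bin/env bash
# preflight.sh -- fail fast if a prerequisite is missing
set -euo pipefail

# Client and server versions (kubectl >= 1.28 dropped --short)
kubectl version
helm version

# A default StorageClass must exist or Airbyte's PVCs will stay Pending;
# kubectl marks the default with "(default)" next to its name
if ! kubectl get storageclass | grep -q '(default)'; then
  echo "ERROR: no default StorageClass set" >&2
  exit 1
fi

# Rough capacity check: allocatable CPU/memory per node
kubectl get nodes -o custom-columns='NODE:.metadata.name,CPU:.status.allocatable.cpu,MEM:.status.allocatable.memory'
```

Run it before every fresh deployment; a failed check here is far cheaper than a Pending PVC discovered after helm install.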
2. Namespace and RBAC Setup
Isolate your data pipeline workloads in a dedicated namespace. This keeps resource quotas, network policies, and RBAC scoped correctly and prevents noisy-neighbor effects from application workloads.
# Namespace + RBAC
kubectl create namespace data-pipeline
kubectl create serviceaccount etl-runner -n data-pipeline
For production clusters, attach a Role and RoleBinding to that ServiceAccount. Create a file rbac.yaml:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: etl-runner-role
  namespace: data-pipeline
rules:
  - apiGroups: ["batch"]
    resources: ["jobs", "cronjobs"]
    verbs: ["get", "list", "watch", "create", "delete"]
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: etl-runner-rolebinding
  namespace: data-pipeline
subjects:
  - kind: ServiceAccount
    name: etl-runner
    namespace: data-pipeline
roleRef:
  kind: Role
  name: etl-runner-role
  apiGroup: rbac.authorization.k8s.io
Apply it:
kubectl apply -f rbac.yaml
The etl-runner ServiceAccount now has permission to manage batch jobs and read pod logs within data-pipeline, and nothing beyond that. This matters when you later attach a CI pipeline or an AI agent that drives deployments — it cannot accidentally touch production namespaces. For teams wiring up AI-assisted workflows, see the notes on agent-compatible tooling.
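You can confirm the scoping with kubectl auth can-i, impersonating the ServiceAccount via the standard --as=system:serviceaccount:&lt;namespace&gt;:&lt;name&gt; form:

```shell
# Should print "yes": allowed inside data-pipeline
kubectl auth can-i create jobs \
  --as=system:serviceaccount:data-pipeline:etl-runner -n data-pipeline

# Should print "no": deployments are outside the Role's scope
kubectl auth can-i delete deployments \
  --as=system:serviceaccount:data-pipeline:etl-runner -n data-pipeline

# Should print "no": no access in other namespaces
kubectl auth can-i list pods \
  --as=system:serviceaccount:data-pipeline:etl-runner -n default
```

If any of these answers surprises you, check for ClusterRoleBindings that may be granting broader access than the Role above.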
3. Deploying Airbyte with Helm
Airbyte is the most common choice for a Helm-deployed, self-hosted ETL platform on Kubernetes. The official Helm chart handles the server, worker, Temporal, MinIO, and PostgreSQL components as a single release.
Add the repo and update:
helm repo add airbyte https://airbytehq.github.io/helm-charts
helm repo update
Install with resource overrides:
helm install airbyte airbyte/airbyte \
--namespace airbyte \
--create-namespace \
--set global.deploymentMode=oss \
--set server.resources.requests.memory=512Mi \
--set server.resources.limits.memory=2Gi \
--set worker.resources.requests.memory=1Gi \
--set worker.resources.limits.memory=4Gi
The --set flags shown here are the minimum overrides for a non-trivial workload. The default chart values are conservative; the worker pod OOMKills that appear in most support threads come from leaving worker.resources.limits.memory at the chart default (often 1Gi) while running syncs against large Postgres sources.
For repeatable deployments, extract these into a values-override.yaml and use -f values-override.yaml instead of individual --set arguments. Keep that file in version control.
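A values-override.yaml equivalent to the flags above might look like the following. This is a sketch; confirm the exact key paths against your chart version, since Helm value layouts change between chart releases:

```yaml
# values-override.yaml -- keep in version control
global:
  deploymentMode: oss
server:
  resources:
    requests:
      memory: 512Mi
    limits:
      memory: 2Gi
worker:
  resources:
    requests:
      memory: 1Gi
    limits:
      memory: 4Gi
```

Install with helm install airbyte airbyte/airbyte --namespace airbyte --create-namespace -f values-override.yaml, and diff this file in code review the same way you would any other manifest.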
Verify the release:
kubectl get pods -n airbyte
kubectl get svc -n airbyte
All pods should reach Running or Completed within 3–5 minutes on a healthy cluster. If airbyte-server or airbyte-temporal stays in ContainerCreating, the most common cause is a PVC that has not bound. Check with:
kubectl get pvc -n airbyte
kubectl describe pvc airbyte-minio -n airbyte
The describe output will show StorageClass and FailedBinding events if there is a mismatch.
Full configuration options are documented at docs.clankercloud.ai alongside Kubernetes observability patterns for Airbyte-managed namespaces.
4. Deploying Meltano as a Kubernetes CronJob
Meltano's architecture is different from Airbyte's: it is a command-line EL runner rather than a long-running service. That makes it a natural fit for Kubernetes CronJobs — the cluster schedules the pod, runs the sync, and terminates it. You pay for compute only during the sync window.
Create a file named meltano-cronjob.yaml:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: meltano-el-job
  namespace: data-pipeline
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: meltano
              image: meltano/meltano:latest
              command: ["meltano", "run", "tap-postgres", "target-bigquery"]
              resources:
                requests:
                  memory: "512Mi"
                  cpu: "250m"
                limits:
                  memory: "2Gi"
                  cpu: "1000m"
          restartPolicy: OnFailure
Apply it and confirm:
kubectl apply -f meltano-cronjob.yaml
kubectl get cronjobs -n data-pipeline
A few production notes on this manifest:
- Pin the image tag: set image: meltano/meltano:2.x.x instead of latest. Floating tags cause hard-to-reproduce failures when upstream pushes a breaking change.
- Add serviceAccountName: etl-runner under spec.template.spec if the job needs to read Kubernetes secrets for connector credentials.
- If your BigQuery target requires a service account JSON, mount it as a Kubernetes Secret volume rather than baking it into the image.
- Set successfulJobsHistoryLimit: 3 and failedJobsHistoryLimit: 5 at the CronJob spec level to prevent job history from accumulating indefinitely.
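Applying those notes, the hardened CronJob spec might look like the following sketch. The pinned tag placeholder and the secret name bq-sa-key are illustrative; substitute your own:

```yaml
spec:
  schedule: "0 2 * * *"
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 5
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: etl-runner
          containers:
            - name: meltano
              image: meltano/meltano:2.x.x   # pin a concrete release tag
              env:
                # Standard variable read by Google client libraries
                - name: GOOGLE_APPLICATION_CREDENTIALS
                  value: /secrets/bq/key.json
              volumeMounts:
                - name: bq-key
                  mountPath: /secrets/bq
                  readOnly: true
          volumes:
            - name: bq-key
              secret:
                secretName: bq-sa-key   # created via kubectl create secret generic
          restartPolicy: OnFailure
```

Create the secret once with kubectl create secret generic bq-sa-key --from-file=key.json -n data-pipeline; the JSON never enters the image or the manifest.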
For a broader look at how this pattern fits into a production data platform strategy, the vibe coding to production guide walks through the full lifecycle from prototype to hardened deployment.
5. Resource Limits and HorizontalPodAutoscaler
Static resource limits work for predictable single-table syncs. For variable workloads — Airbyte workers running parallel syncs across dozens of connections — you need HPA.
The Airbyte worker pool is the primary candidate. A basic HPA targeting CPU:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: airbyte-worker-hpa
  namespace: airbyte
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: airbyte-worker
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
Apply with kubectl apply -f airbyte-worker-hpa.yaml.
For memory-bound workloads (wide-column schemas, large JSON payloads), consider targeting memory utilization instead, or use custom metrics from Prometheus if your cluster runs it. The key rule: HPA requires that resources.requests be set on the target pods — without requests, the metrics server has no baseline to calculate utilization against. If you set limits without requests in the Helm install step above, the HPA will report "missing request" conditions for the targeted resource and refuse to scale.
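Before applying the HPA, it is worth confirming that the target Deployment actually carries requests, using plain kubectl jsonpath:

```shell
# Print the worker's resource requests; empty output means the HPA
# will have no baseline and cannot compute utilization
kubectl get deployment airbyte-worker -n airbyte \
  -o jsonpath='{.spec.template.spec.containers[*].resources.requests}'

# After applying, watch computed utilization vs. the 60% target
kubectl get hpa airbyte-worker-hpa -n airbyte --watch
```

If the watch column shows &lt;unknown&gt;/60% for more than a minute or two, check that metrics-server is running and that the requests query above returned a value.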
6. Monitoring and Debugging
Real-time resource usage:
kubectl top pods -n airbyte
kubectl top pods -n data-pipeline
This requires the metrics-server to be running in the cluster. On EKS, it is not installed by default.
Inspecting a specific pod:
kubectl describe pod <pod-name> -n airbyte
kubectl logs <pod-name> -n airbyte --previous
The --previous flag is essential: it returns logs from the terminated container instance, not the restarted one. Without it, you see the fresh restart's startup logs and miss the crash context entirely.
Patterns to recognize in kubectl describe output:
- OOMKilled in the Last State section → memory limit was hit. The Exit Code: 137 confirms it.
- Back-off pulling image → registry authentication issue or the tag does not exist.
- 0/3 nodes are available: 3 Insufficient memory → cluster is out of capacity; the pod is pending, not crashing.
- MountVolume.SetUp failed → PVC issue, StorageClass problem, or the backing volume was deleted.
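Rather than describing pods one at a time, you can scan a whole namespace for the OOMKilled pattern. A sketch assuming jq is installed locally:

```shell
# List pod/container pairs whose last termination reason was OOMKilled
kubectl get pods -n airbyte -o json | \
  jq -r '.items[]
         | . as $p
         | .status.containerStatuses[]?
         | select(.lastState.terminated.reason? == "OOMKilled")
         | "\($p.metadata.name)/\(.name)"'
```

Empty output means no container in the namespace is currently reporting an OOM termination; any line printed is a candidate for the memory-limit fix in the next section.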
7. Common Failure Modes and How to Resolve Them
These are the four failure patterns that appear most frequently in production ETL Kubernetes deployments.
OOMKilled: Worker Pod Hits Memory Limit Mid-Sync
Symptom: kubectl describe pod <pod-name> -n airbyte shows OOMKilled under Last State. Exit Code 137.
Cause: The worker pod is loading a full table snapshot or large batch into memory and exceeding worker.resources.limits.memory.
Fix:
helm upgrade airbyte airbyte/airbyte \
--namespace airbyte \
--set worker.resources.limits.memory=8Gi \
--reuse-values
Increase in 2Gi increments until syncs complete cleanly. For syncs against very large tables, use Airbyte's incremental sync mode to reduce per-sync payload size.
ImagePullBackOff: Private Registry Missing imagePullSecrets
Symptom: kubectl describe pod <pod-name> -n airbyte shows Failed to pull image ... unauthorized.
Cause: The pod spec does not reference a registry credential secret.
Fix:
kubectl create secret docker-registry regcred \
--docker-server=your.registry.io \
--docker-username=<username> \
--docker-password=<password> \
--namespace data-pipeline
Then add imagePullSecrets: [{name: regcred}] to the pod spec or CronJob template.
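In the CronJob manifest, the reference sits at the pod-spec level, alongside containers. A minimal fragment (the registry path here is a placeholder):

```yaml
jobTemplate:
  spec:
    template:
      spec:
        imagePullSecrets:
          - name: regcred
        containers:
          - name: meltano
            image: your.registry.io/meltano:2.x.x
```

Note that the secret must exist in the same namespace as the pod; a regcred in data-pipeline does nothing for pods in airbyte.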
PVC Binding Issues: StorageClass Mismatch
Symptom: kubectl get pvc -n airbyte shows PVCs in Pending state indefinitely.
Fix:
kubectl get storageclass
kubectl describe pvc airbyte-minio -n airbyte
The describe output lists the requested StorageClass. If it does not match any available class, override it in the Helm values:
helm upgrade airbyte airbyte/airbyte \
--namespace airbyte \
--set global.storageClass=gp3 \
--reuse-values
CrashLoopBackOff in Airbyte Server: DB Connection String Wrong
Symptom: airbyte-server pod is in CrashLoopBackOff. Logs show connection refused or role does not exist.
Fix: Inspect the ConfigMap:
kubectl get configmap -n airbyte
kubectl describe configmap airbyte-airbyte-env -n airbyte
Check the DATABASE_URL value. If you overrode the default PostgreSQL with an external database, the hostname, port, username, or database name in the ConfigMap is incorrect. Edit via kubectl edit configmap or pass the correct values via Helm --set and run helm upgrade.
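To separate a bad connection string from a network problem, test reachability from inside the cluster with a throwaway pod. A sketch assuming a PostgreSQL database; the host and user placeholders are yours to fill in:

```shell
# One-off pod with psql; deleted automatically on exit
kubectl run pg-debug -n airbyte --rm -it --restart=Never \
  --image=postgres:16 -- \
  psql "host=<db-host> port=5432 user=<db-user> dbname=airbyte" -c "SELECT 1;"
```

If this also fails with connection refused, the problem is network or DNS, not the ConfigMap; if it fails with role does not exist, the credentials in DATABASE_URL are wrong.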
8. Clanker Cloud: Ask About Your Deployment Instead of Hunting Logs
The debugging workflow above works, but it is slow. kubectl describe followed by kubectl logs --previous followed by cross-referencing Helm values is a three-window operation that takes longer when the incident is at 2 a.m.
Clanker Cloud connects to your Kubernetes cluster — EKS, GKE, AKS — and reads live pod logs and resource metrics directly. Instead of running kubectl top pods -n airbyte and then manually correlating output against your Helm values, you ask:
"Why is my Airbyte worker pod OOMKilled?"
Clanker Cloud reads the pod's Last State, current resource limits, recent log output, and current node capacity, then returns a plain-English diagnosis: worker limit set to 2Gi, peak usage was 3.1Gi during the tap-postgres sync on the orders table, suggested fix is to raise to 4Gi or enable incremental mode.
The Deep Research feature extends this further. Instead of auditing a single pod, it fans out across the entire airbyte and data-pipeline namespaces — checking PVC health, resource utilization, CronJob success history, RBAC gaps — and returns a severity-graded health report. You can export the findings as JSON or Markdown for a post-deployment review.
BYOK model support means the analysis runs on whatever model fits your context: Gemma 4 via Ollama (gemma4:31b) if you are running fully local, Hermes (hermes3:70b) for open-weight analysis, Claude Code or Codex when you want a reasoning model to suggest specific Helm value changes. No vendor lock-in on the AI layer.
The local-first architecture means your kubeconfig and cluster credentials never leave your machine. Clanker Cloud reads the cluster through a local agent process; the AI model sees resource data, not raw credentials.
If your team is working through the broader AI DevOps adoption curve, the AI DevOps for Teams guide covers how to structure that rollout. For agents already integrated into your pipeline, Clanker Cloud's agent compatibility layer documents the MCP interface. Start at the demo to see live cluster queries against a sample Kubernetes environment, and review the FAQ for common setup questions.
Create or sign into your account at clankercloud.ai/account.
9. FAQ
What Kubernetes version do I need to deploy Airbyte with Helm in 2026?
Kubernetes 1.27 or later. Airbyte's Helm chart relies on the batch/v1 CronJob and autoscaling/v2 HPA APIs; their beta predecessors (batch/v1beta1, autoscaling/v2beta2) were removed in Kubernetes 1.25–1.26, so manifests targeting the old API versions will fail with API version mismatches during helm install.
Why is my Airbyte worker pod stuck in OOMKilled after increasing memory limits?
If you ran helm upgrade with new limits but the pod was not restarted, the old limits are still active. Confirm the rollout completed with kubectl rollout status deployment/airbyte-worker -n airbyte. Also check that the node itself has enough allocatable memory — kubectl describe node <node-name> shows the allocatable field. A 4Gi limit on a node with 3.5Gi remaining will still OOMKill.
How do I run Meltano on a schedule without Kubernetes?
Meltano supports native scheduling via meltano schedule with Airflow or the built-in meltano run command in CI. However, the Kubernetes CronJob approach has lower operational overhead: no Airflow installation, no scheduler database, and the pod terminates when the job finishes so you pay for compute only during the sync window.
Can I deploy both Airbyte and Meltano on the same cluster?
Yes. Keep them in separate namespaces (airbyte and data-pipeline) with separate resource quotas. Define a ResourceQuota per namespace to prevent one workload from consuming all cluster memory. Use Airbyte for complex multi-destination syncs with its UI-managed connections, and Meltano CronJobs for lightweight scheduled extracts where the connector ecosystem (Singer taps) is sufficient.
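A ResourceQuota for the Meltano namespace might look like the following. The numbers are illustrative; size them to your actual sync workloads:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: data-pipeline-quota
  namespace: data-pipeline
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
```

With a quota in place, every container in the namespace must declare requests and limits (the CronJob manifest above already does), or the pod will be rejected at admission.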
10. Get Started
A production-ready ETL tools Kubernetes deployment is not just running helm install — it is namespace isolation, correct resource limits, HPA configuration, and a debugging workflow that works under pressure.
Clanker Cloud connects to your cluster today. Query pod state, diagnose OOMKilled workers, and run a full pipeline health audit with Deep Research — all from a local desktop app where your credentials never leave the machine.
Sign in or create your account to connect your first cluster. Full setup documentation is at docs.clankercloud.ai.
Need the product-level answer?
Use the DevOps page for the canonical product answer on Kubernetes operations, live context, and reviewed infrastructure workflows.
