Which ETL Tools Support Containerized or Kubernetes-Based Deployment?

Merged into the canonical ETL Kubernetes deployment guide to keep one stable operational page for the topic.

Not every ETL tool can run on Kubernetes. The market splits between managed SaaS tools, which handle infrastructure for you, and open-source, self-hosted tools that ship with Helm charts, operators, or at minimum a Docker image you can run anywhere. This guide covers the Kubernetes deployment patterns that work for self-hosted ETL tools in 2026, with kubectl and helm commands for each.


The Split in the ETL Market

ETL tooling divides into two distinct camps when it comes to deployment model.

Managed SaaS tools, such as Fivetran and Stitch (a Talend product, now part of Qlik), abstract the infrastructure entirely. You connect your sources and configure your destinations, and the vendor runs the sync jobs on its own cloud. There is typically no Kubernetes option because there is nothing for you to deploy. Data flows through the vendor's cloud, which creates real constraints around data residency, compliance, and cost at scale.

Self-hosted and open-source tools — Airbyte, dbt, Meltano, Apache Spark, and Kafka Connect — give you full deployment control. Several of these have first-class Kubernetes support: official Helm charts, Kubernetes-native job spawning, or dedicated operators. Others run cleanly as containerized CronJobs or batch Jobs that fit natively into any K8s workload model.

Why run ETL on Kubernetes at all? The practical reasons:

  • Data residency: your data never leaves your cluster or VPC — critical for HIPAA, GDPR, and financial sector compliance
  • Cost at scale: at high sync volumes, managed SaaS per-row or per-connector pricing gets expensive fast; self-hosted on spot/preemptible nodes can cut costs significantly
  • Custom connectors: build and deploy connectors you own, with no waiting for a SaaS provider to add your source
  • Orchestration integration: plug ETL jobs directly into Argo Workflows, Apache Airflow on Kubernetes, or your existing CI/CD pipelines

What "Kubernetes support" actually means varies by tool. Airbyte spawns each sync as a real Kubernetes Job — genuinely K8s-native. dbt is a process that runs and exits; you wrap it in a K8s Job or CronJob. Meltano containerizes well but has no native operator. Spark has a full operator. Kafka Connect runs best under the Strimzi operator. Understanding these distinctions matters before you choose your stack.

For context on moving data pipelines to production, see the vibe coding to production guide — the same principles apply to ETL pipeline lifecycle management.


Airbyte — Strongest Kubernetes Support

Airbyte has the most complete Kubernetes story of any ETL tool in this list. It ships an official Helm chart, and its job execution model is genuinely K8s-native: each sync run spawns as a Kubernetes Job, not a thread or subprocess inside a monolithic server.

# Add Airbyte Helm repo
helm repo add airbyte https://airbytehq.github.io/helm-charts
helm repo update

# Deploy Airbyte on K8s
helm install airbyte airbyte/airbyte \
  --namespace airbyte \
  --create-namespace \
  --values airbyte-values.yaml

# Check deployment
kubectl get pods -n airbyte
kubectl get svc -n airbyte

# Access Airbyte UI via port forward
kubectl port-forward svc/airbyte-airbyte-webapp-svc 8000:80 -n airbyte

# Check sync job pods (each sync = a K8s Job)
kubectl get pods -n airbyte -l airbyte=job

# Tail logs for a specific sync job pod
kubectl logs -n airbyte -l airbyte=job --follow

# Check persistent volume claims used by Airbyte
kubectl get pvc -n airbyte

The airbyte-values.yaml file controls the most important deployment configuration. For production, you want to point Airbyte at an external database rather than the bundled PostgreSQL:

global:
    storageClass: standard
    database:
        type: external
        host: your-rds-endpoint.us-east-1.rds.amazonaws.com
        port: 5432
        database: airbyte
        user: airbyte
        passwordSecretRef:
            name: airbyte-db-secret
            key: password
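The passwordSecretRef above expects a Secret named airbyte-db-secret, with a password key, to exist in the airbyte namespace before the chart is installed. A minimal sketch of rendering that manifest follows. The render_db_secret helper and the AIRBYTE_DB_PASSWORD variable are illustrative names, not part of Airbyte or its chart.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints a Secret manifest matching the passwordSecretRef
# in airbyte-values.yaml (name: airbyte-db-secret, key: password).
render_db_secret() {
  local namespace="$1"
  local password="$2"
  local encoded
  # Secret "data" values are base64-encoded; strip line wraps from the output.
  encoded="$(printf '%s' "$password" | base64 | tr -d '\n')"
  cat <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: airbyte-db-secret
  namespace: ${namespace}
type: Opaque
data:
  password: ${encoded}
EOF
}

# Usage (AIRBYTE_DB_PASSWORD is an assumed environment variable):
#   render_db_secret airbyte "$AIRBYTE_DB_PASSWORD" | kubectl apply -f -
```

Encoding the value into data rather than writing it raw keeps passwords with quotes or other special characters from breaking the YAML.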

Strengths: official Helm chart, K8s-native job spawning (each sync = K8s Job), scales to hundreds of connectors, active upstream Helm chart maintenance, 300+ connectors available out of the box.

Weaknesses: resource-heavy (a minimum of 4 vCPU and 8 GB RAM for a functional cluster); the Helm values surface is large, so initial configuration takes time; and the bundled Temporal service adds operational complexity.

To monitor your Airbyte deployment with ClankerCloud.ai, try:

clanker ask "find all Airbyte sync job pods that are stuck or pending in the airbyte namespace"
clanker ask "show me PVC usage in the airbyte namespace and flag anything over 80% capacity"

dbt — Containerized as a Kubernetes Job or CronJob

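dbt (data build tool) does not have a Kubernetes controller or operator. It is a command-line process that runs transformations and exits. That said, it containerizes cleanly and fits naturally into the Kubernetes Job and CronJob patterns. The official dbt-labs images on GitHub Container Registry are the standard base. Those images contain dbt itself but not your project, so in practice you build a derived image that adds your models and dbt_project.yml, or mount the project into the container some other way.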

# Run dbt as a one-off Kubernetes Job
cat > dbt-job.yaml << 'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: dbt-run
  namespace: data-pipelines
spec:
  template:
    spec:
      containers:
      - name: dbt
        image: ghcr.io/dbt-labs/dbt-bigquery:1.8.0
        command: ["dbt", "run", "--profiles-dir", "/app/profiles", "--target", "prod"]
        envFrom:
        - secretRef:
            name: dbt-credentials
        volumeMounts:
        - name: profiles
          mountPath: /app/profiles
      volumes:
      - name: profiles
        configMap:
          name: dbt-profiles
      restartPolicy: Never
  backoffLimit: 2
EOF
kubectl apply -f dbt-job.yaml

# Watch job logs
kubectl logs -n data-pipelines -l job-name=dbt-run --follow

# Check job status
kubectl get job dbt-run -n data-pipelines

# Create a CronJob for scheduled dbt runs (6 AM daily)
cat > dbt-cronjob.yaml << 'EOF'
apiVersion: batch/v1
kind: CronJob
metadata:
  name: dbt-daily
  namespace: data-pipelines
spec:
  schedule: "0 6 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: dbt
            image: ghcr.io/dbt-labs/dbt-bigquery:1.8.0
            command: ["dbt", "run", "--profiles-dir", "/app/profiles", "--target", "prod"]
            envFrom:
            - secretRef:
                name: dbt-credentials
            volumeMounts:
            - name: profiles
              mountPath: /app/profiles
          volumes:
          - name: profiles
            configMap:
              name: dbt-profiles
          restartPolicy: OnFailure
EOF
kubectl apply -f dbt-cronjob.yaml

# Check CronJob history
kubectl get cronjob dbt-daily -n data-pipelines
kubectl get jobs -n data-pipelines --sort-by=.metadata.creationTimestamp

For production dbt on Kubernetes, most teams orchestrate the Job through Argo Workflows or Airflow with the Kubernetes executor — this gives you dependency management, retry logic, and alerting on top of the basic CronJob pattern. dbt Cloud is the managed alternative, but it introduces vendor lock-in and data egress if your warehouse is in a private VPC.
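The Job and CronJob manifests above mount a ConfigMap named dbt-profiles at /app/profiles and load dbt-credentials as environment variables, but neither object is created in the commands shown. Here is a rough sketch of one way to produce the ConfigMap. The profile name my_project, the DBT_GCP_PROJECT and DBT_DATASET variable names, and the oauth method (which assumes Application Default Credentials are available to the pod, for example via GKE Workload Identity) are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: writes a BigQuery profiles.yml that reads its settings
# from environment variables, so no credentials live in the ConfigMap.
write_dbt_profiles() {
  local out_dir="$1"
  mkdir -p "$out_dir"
  # The quoted heredoc delimiter keeps the Jinja braces and quotes literal.
  cat > "$out_dir/profiles.yml" <<'EOF'
my_project:            # assumed profile name; must match dbt_project.yml
  target: prod
  outputs:
    prod:
      type: bigquery
      method: oauth
      project: "{{ env_var('DBT_GCP_PROJECT') }}"
      dataset: "{{ env_var('DBT_DATASET') }}"
      threads: 4
EOF
}

# Usage:
#   write_dbt_profiles ./profiles
#   kubectl create configmap dbt-profiles \
#     --from-file=profiles.yml=./profiles/profiles.yml \
#     --namespace data-pipelines
```

Keeping only env_var() lookups in profiles.yml means the ConfigMap can be committed to git, while the values themselves stay in the dbt-credentials Secret.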

Check your dbt run history without leaving your terminal:

clanker ask "what is my dbt CronJob history in the data-pipelines namespace and did it run successfully today"

Meltano — Containerized EL Pipelines on K8s

Meltano is a CLI-first EL (Extract-Load) framework built on the Singer tap/target ecosystem. It does not have a native Kubernetes operator, but it containerizes cleanly and deploys as a K8s CronJob for scheduled pipeline runs.

# Build a Meltano Docker image
cat > Dockerfile << 'EOF'
FROM python:3.11-slim
RUN pip install meltano
WORKDIR /project
COPY meltano.yml .
RUN meltano install
EOF

docker build -t your-registry/meltano-project:latest .
docker push your-registry/meltano-project:latest

# Deploy as a K8s CronJob (every 6 hours)
cat > meltano-cronjob.yaml << 'EOF'
apiVersion: batch/v1
kind: CronJob
metadata:
  name: meltano-extract-load
  namespace: data-pipelines
spec:
  schedule: "0 */6 * * *"
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: meltano
            image: your-registry/meltano-project:latest
            command: ["meltano", "run", "tap-postgres", "target-bigquery"]
            envFrom:
            - secretRef:
                name: meltano-secrets
            env:
            - name: MELTANO_ENVIRONMENT
              value: production
          restartPolicy: OnFailure
  successfulJobsHistoryLimit: 5
  failedJobsHistoryLimit: 3
EOF
kubectl apply -f meltano-cronjob.yaml

# Check CronJob status and last run
kubectl get cronjob meltano-extract-load -n data-pipelines
kubectl describe cronjob meltano-extract-load -n data-pipelines

# Get logs from the most recent completed job
kubectl get pods -n data-pipelines -l job-name --sort-by=.metadata.creationTimestamp
kubectl logs -n data-pipelines <most-recent-pod-name>
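The two-step lookup above can be wrapped so the pod name does not have to be copied by hand. This is a small sketch, and latest_job_pod and latest_job_logs are hypothetical helper names. It relies on the job-name label that Kubernetes adds to pods created by Jobs.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the newest Job-created pod in a namespace,
# in the pod/<name> form that kubectl accepts directly.
latest_job_pod() {
  local namespace="$1"
  # --sort-by orders oldest to newest, so the last line is the newest pod.
  kubectl get pods -n "$namespace" -l job-name \
    --sort-by=.metadata.creationTimestamp -o name | tail -n 1
}

# Hypothetical helper: prints logs for that pod, or fails if none exist.
latest_job_logs() {
  local namespace="$1"
  local pod
  pod="$(latest_job_pod "$namespace")"
  if [ -z "$pod" ]; then
    echo "no job pods found in namespace $namespace" >&2
    return 1
  fi
  kubectl logs -n "$namespace" "$pod"
}

# Usage:
#   latest_job_logs data-pipelines
```

Note that the selector matches pods from every Job in the namespace, so a namespace that also runs dbt Jobs would need a narrower label selector.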

Strengths: CLI-first design makes it easy to containerize, 600+ Singer taps and targets available, clean separation of environment configuration via meltano.yml, works well for straightforward EL pipelines where you do not need a full orchestration layer.

Weaknesses: no native Kubernetes operator, no built-in DAG support — complex multi-step pipelines require Argo or Airflow on top, limited native retry and alerting compared to Airbyte.

The ai-devops-for-teams workflow fits Meltano deployments well when you have multiple pipelines running as CronJobs across namespaces.


Apache Spark on Kubernetes — spark-operator

Apache Spark has mature Kubernetes support via the Spark operator (originally GoogleCloudPlatform/spark-on-k8s-operator, now maintained under Kubeflow as kubeflow/spark-operator). The operator introduces a SparkApplication CRD and handles driver/executor lifecycle, scaling, and retry.

# Install Spark Operator via Helm
# (the chart is now published from the Kubeflow repo)
helm repo add spark-operator https://kubeflow.github.io/spark-operator
helm repo update

# spark.jobNamespaces is the value name in the 2.x chart;
# older 1.x charts used sparkJobNamespace instead
helm install spark-operator spark-operator/spark-operator \
  --namespace spark-operator \
  --create-namespace \
  --set "spark.jobNamespaces={spark-jobs}" \
  --set webhook.enable=true

# Verify the operator is running
kubectl get pods -n spark-operator
kubectl get crd | grep spark

# Submit a Spark ETL job using the SparkApplication CRD
cat > spark-etl.yaml << 'EOF'
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: postgres-to-s3-etl
  namespace: spark-jobs
spec:
  type: Python
  pythonVersion: "3"
  mode: cluster
  image: "your-registry/spark-etl:latest"
  imagePullPolicy: Always
  mainApplicationFile: "local:///app/etl.py"
  sparkVersion: "3.5.0"
  restartPolicy:
    type: OnFailure
    onFailureRetries: 3
    onFailureRetryInterval: 10
  driver:
    cores: 2
    coreLimit: "2000m"
    memory: "4g"
    serviceAccount: spark
  executor:
    cores: 4
    instances: 3
    memory: "8g"
EOF
kubectl apply -f spark-etl.yaml

# Monitor Spark application state
kubectl get sparkapplication -n spark-jobs
kubectl describe sparkapplication postgres-to-s3-etl -n spark-jobs

# Get driver logs
kubectl logs -n spark-jobs postgres-to-s3-etl-driver

# List executor pods for a running job
kubectl get pods -n spark-jobs -l spark-role=executor

# Check events for the SparkApplication
kubectl get events -n spark-jobs --field-selector involvedObject.name=postgres-to-s3-etl

Best for: large-scale data transformation where PySpark is already part of the team's skill set, petabyte-scale ETL, teams migrating from on-premise Hadoop clusters to Kubernetes.
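When a SparkApplication runs as a step in a CI/CD pipeline or a scheduler, it helps to block until the operator reports a terminal state. The following is a minimal polling sketch. The spark_app_state and wait_for_spark_app helpers are hypothetical names, and the sketch treats only COMPLETED and FAILED as terminal, so retries configured through restartPolicy are allowed to play out.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads the operator-reported state of a SparkApplication.
spark_app_state() {
  local name="$1"
  local namespace="$2"
  kubectl get sparkapplication "$name" -n "$namespace" \
    -o jsonpath='{.status.applicationState.state}'
}

# Hypothetical helper: polls until the application is COMPLETED or FAILED,
# or until max_checks polls have been made.
# Returns 0 on COMPLETED, 1 on FAILED, 2 if the checks run out.
wait_for_spark_app() {
  local name="$1"
  local namespace="$2"
  local max_checks="${3:-60}"
  local interval="${4:-10}"
  local state=""
  local i=0
  while [ "$i" -lt "$max_checks" ]; do
    state="$(spark_app_state "$name" "$namespace")"
    case "$state" in
      COMPLETED) echo "$name completed"; return 0 ;;
      FAILED)    echo "$name failed" >&2; return 1 ;;
    esac
    i=$((i + 1))
    sleep "$interval"
  done
  echo "$name still in state '$state' after $max_checks checks" >&2
  return 2
}

# Usage (60 checks, 10 seconds apart):
#   wait_for_spark_app postgres-to-s3-etl spark-jobs 60 10
```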

clanker ask "show me all failed Spark jobs in the spark-jobs namespace in the last 24 hours"
clanker ask "list any SparkApplication resources in spark-jobs that are in FAILING or UNKNOWN state"

Kafka Connect on Kubernetes — Strimzi Operator

Kafka Connect is the streaming EL layer of the Kafka ecosystem. For Kubernetes deployments, the Strimzi operator is the standard: it manages Kafka, Kafka Connect, and related CRDs as Kubernetes-native resources.

# Install Strimzi Operator
kubectl create namespace kafka

kubectl apply -f 'https://strimzi.io/install/latest?namespace=kafka' -n kafka

# Wait for the operator to become ready
kubectl rollout status deployment/strimzi-cluster-operator -n kafka

# Deploy a KafkaConnect cluster (3 replicas)
cat > kafka-connect.yaml << 'EOF'
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnect
metadata:
  name: my-connect
  namespace: kafka
  annotations:
    strimzi.io/use-connector-resources: "true"
spec:
  version: 3.7.0
  replicas: 3
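  # Assumes a Strimzi Kafka cluster named my-cluster in this namespace;
  # Strimzi names the bootstrap Service <cluster>-kafka-bootstrap
  # and the CA Secret <cluster>-cluster-ca-cert
  bootstrapServers: my-cluster-kafka-bootstrap:9093
  tls:
    trustedCertificates:
      - secretName: my-cluster-cluster-ca-cert
        certificate: ca.crt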
  config:
    group.id: connect-cluster
    offset.storage.topic: connect-cluster-offsets
    config.storage.topic: connect-cluster-configs
    status.storage.topic: connect-cluster-status
    config.storage.replication.factor: 3
    offset.storage.replication.factor: 3
    status.storage.replication.factor: 3
EOF
kubectl apply -f kafka-connect.yaml

# Check KafkaConnect cluster status
kubectl get kafkaconnect -n kafka
kubectl describe kafkaconnect my-connect -n kafka

# List connectors via the REST API from inside the cluster
kubectl exec -n kafka my-connect-connect-0 -- \
  curl -s http://localhost:8083/connectors | jq .

# Check status of a specific connector
# (postgres-cdc-source is the KafkaConnector created below)
kubectl exec -n kafka my-connect-connect-0 -- \
  curl -s http://localhost:8083/connectors/postgres-cdc-source/status | jq .

# Deploy a KafkaConnector resource (CDC from Postgres)
cat > postgres-cdc-connector.yaml << 'EOF'
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnector
metadata:
  name: postgres-cdc-source
  namespace: kafka
  labels:
    strimzi.io/cluster: my-connect
spec:
  class: io.debezium.connector.postgresql.PostgresConnector
  tasksMax: 1
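  config:
    database.hostname: postgres-svc
    database.port: "5432"
    database.user: replicator
    # Resolved at runtime by Strimzi's KubernetesSecretConfigProvider, which
    # must be enabled via config.providers on the KafkaConnect resource and
    # needs RBAC that lets the Connect service account read the Secret
    database.password: "${secrets:kafka/postgres-cdc-secret:password}"
    database.dbname: production
    # Debezium 2.x uses topic.prefix; 1.x used database.server.name
    topic.prefix: prod-postgres
    plugin.name: pgoutput
    table.include.list: public.orders,public.customers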
EOF
kubectl apply -f postgres-cdc-connector.yaml
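Because the KafkaConnect resource enables strimzi.io/use-connector-resources, Strimzi copies the connector status reported by the Connect REST API onto each KafkaConnector object, so health can be read with kubectl alone. The following is a sketch of such a check. The three helper names are hypothetical, and it treats anything other than RUNNING as unhealthy, which will also flag deliberately paused connectors.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the connector-level state recorded by Strimzi
# on the KafkaConnector resource (for example RUNNING, PAUSED, or FAILED).
connector_state() {
  local name="$1"
  local namespace="$2"
  kubectl get kafkaconnector "$name" -n "$namespace" \
    -o jsonpath='{.status.connectorStatus.connector.state}'
}

# Hypothetical helper: prints one state per task, space-separated.
connector_task_states() {
  local name="$1"
  local namespace="$2"
  kubectl get kafkaconnector "$name" -n "$namespace" \
    -o jsonpath='{.status.connectorStatus.tasks[*].state}'
}

# Hypothetical helper: returns 0 only if the connector and every task
# report RUNNING.
connector_is_healthy() {
  local name="$1"
  local namespace="$2"
  local state task
  state="$(connector_state "$name" "$namespace")"
  [ "$state" = "RUNNING" ] || return 1
  for task in $(connector_task_states "$name" "$namespace"); do
    [ "$task" = "RUNNING" ] || return 1
  done
  return 0
}

# Usage:
#   connector_is_healthy postgres-cdc-source kafka || echo "connector unhealthy"
```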

Best for: real-time streaming ETL, Change Data Capture (CDC) from relational databases, event-driven architectures where data must flow continuously rather than in scheduled batches.

clanker ask "check if my Kafka Connect cluster is healthy in the kafka namespace and list any failed connectors"

The Managed-Only Tools — No Kubernetes Option

Some tools in the ETL market are SaaS-only by design. Knowing this upfront saves you from spending time looking for a Helm chart that does not exist.

Fivetran's standard SaaS offering has no self-hosted deployment option. All sync jobs run on Fivetran's cloud infrastructure. Your source credentials leave your environment, and pricing is based on monthly active rows, which scales poorly at high data volume. If your compliance requirements prohibit data leaving your VPC, the standard Fivetran offering is not a viable option.

Stitch (a Talend product, now part of Qlik) similarly runs as managed SaaS. Talend Open Studio is a separate on-premise product, but it is a different codebase and not a drop-in self-hosted version of Stitch.

AWS Glue runs managed PySpark on AWS infrastructure. You cannot deploy it on your own Kubernetes cluster, though you can run Spark on EKS separately using the spark-operator described above. Glue is a good choice when your data stays entirely within AWS and your team wants to avoid managing Spark clusters — but it is not Kubernetes ETL in the self-hosted sense.

The right call for managed tools: small data engineering teams, compliance frameworks that permit third-party data processors, and organizations that want to pay for uptime rather than manage infrastructure. For teams that need data residency, custom connectors, or control over compute costs, self-hosted K8s is the appropriate path. The FAQ covers how to evaluate this tradeoff in more detail.


ClankerCloud.ai for ETL Pipeline Operations

Running ETL workloads on Kubernetes means managing sync job pods, CronJob schedules, PVCs, connector health, and Spark application state across namespaces. ClankerCloud.ai connects to your Kubernetes cluster and lets you query pipeline state in natural language instead of stitching together kubectl commands.

# Install Clanker CLI
brew tap clankercloud/tap && brew install clanker

# Inspect your data pipeline namespace
clanker ask "show me all failed Spark jobs in the spark-jobs namespace in the last 24 hours"
clanker ask "check if my Kafka Connect cluster is healthy and list any failed connectors"
clanker ask "find all Airbyte sync job pods that are stuck or pending in the airbyte namespace"
clanker ask "what is my dbt CronJob history in data-pipelines and did it run successfully today"

# Deep Research scan across the entire data pipeline stack
clanker ask "scan my data pipeline namespace — find stuck jobs, PVC issues, resource bottlenecks, and failed CronJob runs"

The Deep Research feature fans out across your connected providers simultaneously — scanning Kubernetes, AWS, GCP, or Azure in parallel — and returns severity-ranked findings. Results export as Markdown or JSON, useful for incident postmortems or pipeline health reports. This is particularly valuable for the ai-devops-for-teams workflow where multiple data engineers share responsibility for pipeline uptime.

For teams that want AI agents to monitor pipelines autonomously, Clanker exposes an MCP endpoint:

clanker mcp --transport http --listen 127.0.0.1:39393

This lets orchestration tools, agents, or CI/CD pipelines call clanker_run_command and clanker_route_question to inspect ETL job health programmatically. See /for-ai-agents.md and the full docs for integration details.


Comparison Table

| Tool | Official Helm Chart | K8s-native Jobs | Operator Pattern | Self-host Complexity | Best For |
| --- | --- | --- | --- | --- | --- |
| Airbyte | Yes (official) | Yes (each sync = K8s Job) | No dedicated operator | High (4+ vCPU, 8 GB RAM min) | 300+ connectors, batch EL at scale |
| dbt (as K8s Job) | No | Via Job/CronJob | No | Low | SQL transformations, scheduled runs |
| Meltano | No | Via CronJob | No | Low-medium | Singer tap/target EL pipelines |
| Spark (spark-operator) | Yes (via spark-operator) | Yes (driver/executor pods) | Yes (spark-operator) | High | Large-scale PySpark ETL, petabyte-scale |
| Kafka Connect (Strimzi) | Yes (Strimzi) | Yes (connect pods) | Yes (Strimzi) | Medium-high | Real-time streaming, CDC |
| Fivetran | None | N/A (SaaS only) | N/A | None (managed) | Small teams, fast connector coverage |

FAQ

Which ETL tools can be deployed on Kubernetes?

The tools with genuine Kubernetes deployment support are Airbyte (official Helm chart, K8s-native job spawning), Apache Spark via the spark-operator, Kafka Connect via the Strimzi operator, Meltano (containerized CronJob pattern), and dbt (K8s Job or CronJob). Managed SaaS tools like Fivetran and Stitch do not offer self-hosted Kubernetes deployment.

How do I deploy Airbyte on Kubernetes with Helm?

Add the Airbyte Helm repository with helm repo add airbyte https://airbytehq.github.io/helm-charts, then install with helm install airbyte airbyte/airbyte --namespace airbyte --create-namespace --values airbyte-values.yaml. In airbyte-values.yaml, configure an external database (RDS or CloudSQL) and your storage class. After deployment, access the UI by port-forwarding svc/airbyte-airbyte-webapp-svc on port 8000. Each sync Airbyte runs will appear as a Kubernetes Job pod — visible with kubectl get pods -n airbyte -l airbyte=job.

Can dbt run on Kubernetes?

Yes, though dbt does not have a native Kubernetes operator. dbt is a CLI process that runs and exits, which maps directly to the Kubernetes Job resource type. For scheduled runs, wrap it in a CronJob. Use the official dbt-labs images from GitHub Container Registry (ghcr.io/dbt-labs/dbt-bigquery:1.8.0 for BigQuery, with equivalents for Snowflake, Postgres, and other adapters). For complex pipeline orchestration, run the dbt Job as a step inside Argo Workflows or Apache Airflow with the Kubernetes executor.

What is the difference between Airbyte and Meltano for Kubernetes ETL?

Airbyte is the heavier option with a full web UI, 300+ officially maintained connectors, and K8s-native job spawning — each sync becomes a real Kubernetes Job. It requires meaningful cluster resources (4+ vCPU, 8+ GB RAM) and has a more complex Helm configuration. Meltano is CLI-first, lighter weight, and relies on the Singer tap/target ecosystem (600+ community connectors). It has no native Kubernetes operator — you deploy it as a containerized CronJob. Airbyte is the better choice when you need a UI, many connectors, and built-in monitoring. Meltano fits teams that prefer code-first configuration, want to manage pipelines as YAML files in git, and are comfortable with simpler CronJob orchestration.


Get Started

Running ETL on Kubernetes gives you data residency, cost control, and integration flexibility that managed SaaS cannot match — at the cost of real infrastructure ownership.

Book a demo to see how ClankerCloud.ai fits into a Kubernetes-native data pipeline stack, or create an account to connect your cluster and start querying your pipeline health today.

For platform teams managing production data pipelines, the vibe coding to production guide covers the operational patterns that apply once your ETL jobs are running in cluster. The full documentation covers Kubernetes provider setup and MCP integration for autonomous pipeline monitoring.
