Which ETL Tools Support Containerized or Kubernetes-Based Deployment? The 2026 Tier Guide

Merged into the canonical ETL Kubernetes deployment guide to keep one stable operational page for the topic.

Every ETL vendor's documentation has a section titled "Deploy on Kubernetes." Most of those sections install the tool in a container. Few of them describe a tool that was designed for Kubernetes. The difference matters enormously at production scale — and the confusion between the two has cost engineering teams months of unplanned work.

This guide defines three deployment tiers for ETL tools in 2026, walks through actual Helm install commands, and tells you which tools will hold up when your data-pipeline namespace starts consuming serious resources. For teams already running Kubernetes and trying to understand what they have deployed, Clanker Cloud's plain-English infrastructure queries — "show me all ETL jobs running in namespace data-pipeline" — are covered at the end.


1. Why "Runs in Docker" Does Not Mean "Kubernetes-Ready"

A container image and a Kubernetes-native architecture are different things. Running an ETL tool in a Docker container means the vendor packaged their application as an OCI image. That is a prerequisite for K8s, not a qualification.

Kubernetes-readiness requires additional properties:

  • Resource contracts. Pods must declare requests and limits for CPU and memory so the scheduler can place them correctly. Pods that declare neither fall into the BestEffort QoS class and are the first to be evicted when a node comes under resource pressure.
  • Lifecycle management. Production workloads need liveness and readiness probes. Without them, Kubernetes cannot distinguish a healthy pod from one that is deadlocked.
  • Stateful workload handling. ETL tools with state (checkpoints, metadata databases) require StatefulSet semantics or PersistentVolumeClaim management — not just a bare Deployment.
  • Operator-driven reconciliation. Tools that provide Custom Resource Definitions and a controller can express their entire configuration in Kubernetes YAML. The cluster reconciles toward the declared state continuously. Tools without this require manual Helm upgrades for every config change.

A tool deployed with docker run and then lifted into a pod spec via kubectl run has none of these properties unless someone added them manually. In practice, that manual work is where production incidents happen.
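As an illustration of the first two properties, here is a minimal sketch of a Deployment that declares resource requests, limits, and both probes. The workload name, image, port, and /healthz path are hypothetical placeholders, not taken from any specific ETL tool; the probe endpoint in particular assumes the application exposes an HTTP health check.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
    name: etl-worker                 # hypothetical workload name
    namespace: data-pipeline
spec:
    replicas: 2
    selector:
        matchLabels:
            app: etl-worker
    template:
        metadata:
            labels:
                app: etl-worker
        spec:
            containers:
                - name: worker
                  image: registry.example.com/etl-worker:1.0.0   # placeholder image
                  resources:
                      requests:      # what the scheduler reserves on a node
                          cpu: "250m"
                          memory: "512Mi"
                      limits:        # hard ceiling enforced at runtime
                          cpu: "1"
                          memory: "2Gi"
                  readinessProbe:    # gates traffic until the pod reports ready
                      httpGet:
                          path: /healthz   # assumed health endpoint
                          port: 8080
                      periodSeconds: 10
                  livenessProbe:     # restarts the container if it stops responding
                      httpGet:
                          path: /healthz
                          port: 8080
                      initialDelaySeconds: 30
                      periodSeconds: 20
```

A tool that ships none of these fields in its own manifests leaves every one of them for the platform team to write and maintain.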


2. The Three Tiers Defined

Tier 1: Kubernetes-Native

These tools were designed with Kubernetes as the runtime. They ship Custom Resource Definitions, reconciliation controllers (operators), and manage workload lifecycle without human intervention. Configuration is expressed as Kubernetes objects. Failure recovery, scaling, and upgrades happen through the Kubernetes control plane.

Defining characteristics: CRDs, operator pattern, Helm chart with operator deployment, admission webhooks, GitOps-compatible by default.

Tier 2: Helm-Deployable, Not Operator-Managed

These tools have official Helm charts and produce valid Kubernetes workloads. They support resource limits, probes, and namespace isolation. However, their runtime behavior is not governed by a Kubernetes operator — configuration changes require re-running helm upgrade, and there is no controller watching for drift or reconciling desired state. They are production-viable with careful operational discipline.

Defining characteristics: official Helm chart, no custom controller, manual upgrade workflow, Helm values drive configuration.

Tier 3: Container-Compatible, Not Kubernetes-Ready

These tools can run inside a container but were designed for single-host deployment. They rely on Docker Compose for service orchestration, use host-mounted volumes without PVC abstraction, and expose no readiness semantics. Running them on Kubernetes is possible but involves wrapping a tool that is fundamentally resisting the runtime model.

Defining characteristics: Docker Compose as the primary deployment mechanism, no Helm chart, no resource declarations, no probe support.


3. Tier 1 Tools: Kubernetes-Native ETL

Apache Spark Operator

The Spark Operator, maintained by the Kubeflow project and now widely adopted independently, introduces the SparkApplication CRD. A Spark job is declared as a Kubernetes object:

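apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
    name: daily-customer-transform
    namespace: data-pipeline
spec:
    type: Python
    pythonVersion: "3"
    mode: cluster
    image: "gcr.io/spark-operator/pyspark:v3.1.1"
    # Placeholder path: point this at the PySpark entry script inside your image
    mainApplicationFile: "local:///opt/spark/app/transform.py"
    sparkVersion: "3.1.1"
    driver:
        cores: 1
        memory: "512m"
        serviceAccount: spark    # must exist in this namespace and be allowed to create pods
    executor:
        cores: 2
        instances: 3
        memory: "2g"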

The operator watches for SparkApplication objects and handles driver/executor pod lifecycle, retry logic, and status reporting. From the control plane's perspective, a Spark job is a first-class Kubernetes resource.
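The retry behaviour is itself declared on the resource. A minimal sketch of the restartPolicy block of a SparkApplication spec follows; the retry counts and intervals (in seconds) are illustrative values, not recommendations.

```yaml
# Fragment of a SparkApplication spec: merge into the spec shown above
spec:
    restartPolicy:
        type: OnFailure                         # alternatives: Never, Always
        onFailureRetries: 3                     # retries after the application fails
        onFailureRetryInterval: 10              # seconds between those retries
        onSubmissionFailureRetries: 5           # retries when submission itself fails
        onSubmissionFailureRetryInterval: 20    # seconds between submission retries
```

Without a restartPolicy, a failed run typically stays failed until someone resubmits it, so this block is worth setting explicitly for scheduled transforms.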

Install the operator:

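helm repo add spark-operator https://kubeflow.github.io/spark-operator
helm repo update
helm install spark-operator spark-operator/spark-operator \
  --namespace spark-operator \
  --create-namespace \
  --set webhook.enable=true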

Enabling the webhook allows the operator to intercept pod creation and inject Spark-specific configuration. In production, you also want --set metrics.enable=true to expose Prometheus metrics per application.

Strimzi (Kafka Connect for ETL)

Strimzi is a Kubernetes operator for Apache Kafka. For ETL workloads, the relevant CRDs are KafkaConnect and KafkaConnector, alongside KafkaTopic for the topics that pipelines read from and write to. A topic declaration looks like:

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
    name: raw-orders
    labels:
        strimzi.io/cluster: production-kafka
spec:
    partitions: 12
    replicas: 3
    config:
        retention.ms: 604800000

The Strimzi operator reconciles Kafka cluster state, manages rolling upgrades, handles certificate rotation, and scales brokers without manual intervention.
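Connectors are declared in the same style. The following is a minimal sketch of a KafkaConnector, assuming a KafkaConnect cluster named etl-connect already exists in the kafka namespace and carries the strimzi.io/use-connector-resources: "true" annotation. The connector name, cluster name, and file path are hypothetical, and the file source class is only a stand-in for whichever connector plugin is actually present in your Connect image.

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnector
metadata:
    name: orders-file-source             # hypothetical connector name
    namespace: kafka
    labels:
        # Must match the name of the KafkaConnect resource that runs this connector
        strimzi.io/cluster: etl-connect  # hypothetical KafkaConnect cluster
spec:
    # Stand-in class: replace with a plugin installed in your Connect image
    class: org.apache.kafka.connect.file.FileStreamSourceConnector
    tasksMax: 1
    config:
        file: /opt/kafka/LICENSE         # placeholder input file
        topic: raw-orders                # lands records in the topic declared above
```

Once applied, kubectl get kafkaconnectors -n kafka reports the connector's state alongside other cluster resources.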

Install:

helm repo add strimzi https://strimzi.io/charts/
helm install strimzi-kafka-operator strimzi/strimzi-kafka-operator \
  --namespace kafka \
  --create-namespace

For teams treating Kafka as an ETL backbone (reading from sources, transforming in-stream, landing to data warehouses), Strimzi is the most widely adopted open-source K8s-native path.


4. Tier 2 Tools: Helm-Deployable ETL

Airbyte

Airbyte ships an official Helm chart and is among the most commonly deployed open-source ETL tools on Kubernetes. It runs as a collection of pods. The exact set varies by chart version, but it typically includes airbyte-server, airbyte-worker, airbyte-webapp, and an internal airbyte-temporal workflow engine; older releases also ran a separate airbyte-scheduler.

Install:

helm repo add airbyte https://airbytehq.github.io/helm-charts
helm install airbyte airbyte/airbyte \
  --namespace airbyte \
  --create-namespace

Verify the deployment:

kubectl get pods -n airbyte

In production, the default resource allocations will not hold. A values.yaml override is required:

worker:
    resources:
        requests:
            cpu: "500m"
            memory: "1Gi"
        limits:
            cpu: "2"
            memory: "4Gi"
    replicaCount: 3

server:
    resources:
        requests:
            cpu: "250m"
            memory: "512Mi"
        limits:
            cpu: "1"
            memory: "2Gi"

Apply with:

helm upgrade airbyte airbyte/airbyte \
  --namespace airbyte \
  --values values.yaml

Airbyte is Tier 2 because there is no operator watching pod state and reconciling toward declared intent. If a worker pod drifts from its expected configuration, no controller corrects it. Helm upgrade is the change mechanism. That is workable with GitOps tooling — using kubectl describe helmrelease airbyte -n flux-system with Flux CD gives you drift visibility at the Helm release level — but it is a human-driven loop, not an automated one.
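With Flux, the release can be declared as an object so that changes go through Git rather than ad-hoc helm upgrade runs. The following is a minimal sketch under several assumptions: a recent Flux version that serves the helm.toolkit.fluxcd.io/v2 API, a HelmRepository named airbyte already defined in flux-system, and the same values shown earlier. Older Flux installations use a v2beta API version instead.

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
    name: airbyte
    namespace: flux-system
spec:
    interval: 10m                    # how often Flux re-checks the release
    targetNamespace: airbyte
    chart:
        spec:
            chart: airbyte
            sourceRef:
                kind: HelmRepository
                name: airbyte        # assumed to be defined separately
                namespace: flux-system
    driftDetection:
        mode: enabled                # flag and correct in-cluster changes to release objects
    values:
        worker:
            replicaCount: 3
            resources:
                requests:
                    cpu: "500m"
                    memory: "1Gi"
                limits:
                    cpu: "2"
                    memory: "4Gi"
```

This narrows the gap at the release level only. It does not give Airbyte an application-aware controller, which is why the tool still sits in Tier 2.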

Meltano

Meltano is an open-source ELT framework built around Singer taps and targets. Its Kubernetes deployment model is a CronJob:

apiVersion: batch/v1
kind: CronJob
metadata:
    name: meltano-gitlab-to-postgres
    namespace: data-pipeline
spec:
    schedule: "0 4 * * *"
    jobTemplate:
        spec:
            template:
                spec:
                    containers:
                        - name: meltano
                          image: meltano/meltano:latest
                          command:
                              [
                                  "meltano",
                                  "run",
                                  "tap-gitlab",
                                  "target-postgres",
                              ]
                          resources:
                              requests:
                                  cpu: "250m"
                                  memory: "512Mi"
                              limits:
                                  cpu: "1"
                                  memory: "2Gi"
                    restartPolicy: OnFailure

Meltano does not have a Helm chart in the same form as Airbyte — the deployment model is a containerized CLI invocation scheduled as a K8s job. This is perfectly valid for batch ELT workloads and fits naturally into existing Kubernetes CronJob infrastructure. The limitation is that there is no central control plane; each pipeline is its own CronJob, and visibility requires querying individual job statuses.
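Because each pipeline is its own CronJob, a few standard batch/v1 fields and a shared label do much of the work a control plane would otherwise do. A sketch of the additions follows; the label value and all numeric settings are illustrative assumptions.

```yaml
# Fragment: fields to merge into the CronJob shown above
metadata:
    labels:
        app.kubernetes.io/part-of: meltano   # shared label so all pipelines can be listed together
spec:
    concurrencyPolicy: Forbid                # skip a run if the previous one is still going
    startingDeadlineSeconds: 600             # give up on a run that cannot start within 10 minutes
    successfulJobsHistoryLimit: 3            # keep recent successes for inspection
    failedJobsHistoryLimit: 5                # keep more failures for debugging
    jobTemplate:
        spec:
            backoffLimit: 2                  # pod retries before the Job is marked failed
            activeDeadlineSeconds: 7200      # hard stop for a run stuck longer than 2 hours
```

With the label in place, kubectl get cronjobs -n data-pipeline -l app.kubernetes.io/part-of=meltano lists every Meltano pipeline in one query.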


5. Tier 3 Tools: What Does Not Make the Cut

Several tools with significant market presence fall into Tier 3 for Kubernetes production workloads:

Fivetran / Stitch (managed SaaS): In their standard SaaS form, these are not deployable on your cluster at all. Infrastructure is fully managed by the vendor. Kubernetes deployment is not a concept that applies.

Apache NiFi (traditional deployment): NiFi has cluster support but relies on ZooKeeper coordination and host-based configuration. The nifi-registry and nifi nodes are typically deployed as StatefulSets with significant manual configuration. The operator ecosystem is immature compared to Spark or Strimzi. It is possible to run NiFi on K8s, but the operational burden is high and the failure modes are poorly documented for containerized environments.

Pentaho Data Integration (Kettle): The primary deployment model is a standalone Java application or a server jar. Docker images exist but are community-maintained. There is no Helm chart, no PVC management, and no liveness probe support. Running Kettle on Kubernetes means writing all of that infrastructure yourself.

Older Luigi or Oozie deployments: These tools were built for Hadoop-era architectures. Containerizing them is an exercise in wrapping a tool that assumes a very different runtime model.

The practical cost of Tier 3 tools on Kubernetes is not the initial deployment — that is usually achievable with enough custom YAML. The cost is everything that happens afterward: rolling updates that require manual pod cycling, no automated recovery from node failure, no resource-aware scheduling, and no GitOps integration.
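To give a sense of the custom YAML involved, here is a minimal sketch of one piece a Tier 3 tool leaves to you: replacing a host-mounted directory with a PersistentVolumeClaim. The claim name, size, and mount path are hypothetical, and the access mode assumes a single pod mounts the volume.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
    name: kettle-repository          # hypothetical claim name
    namespace: data-pipeline
spec:
    accessModes:
        - ReadWriteOnce              # assumes one pod mounts the volume at a time
    resources:
        requests:
            storage: 20Gi            # illustrative size
---
# Fragment of the pod spec that would consume the claim
spec:
    containers:
        - name: kettle
          volumeMounts:
              - name: repository
                mountPath: /data/repository   # placeholder path
    volumes:
        - name: repository
          persistentVolumeClaim:
              claimName: kettle-repository
```

Probes, resource declarations, and an upgrade strategy each need the same treatment, which is where the afterward cost accumulates.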


6. 2026 Verdict Table

| Tool | Tier | Deployment Model | Production-Ready Rating | Minimum K8s Expertise |
|---|---|---|---|---|
| Spark Operator | 1 (K8s-Native) | CRD + operator | High (operator-managed lifecycle) | Intermediate |
| Strimzi / Kafka Connect | 1 (K8s-Native) | CRD + operator | High (production-grade operator) | Intermediate |
| Airbyte | 2 (Helm-Deployable) | Helm chart | Medium-High (requires tuned values.yaml) | Basic |
| Meltano | 2 (Helm-Deployable) | CronJob / container | Medium (batch workloads only) | Basic |
| Apache Airflow (K8s Executor) | 2 (Helm-Deployable) | Helm chart + KubernetesExecutor | Medium-High (executor model is solid) | Intermediate |
| Apache NiFi | 2/3 (Borderline) | StatefulSet, manual config | Medium (high operational burden) | Advanced |
| Pentaho / Kettle | 3 (Container-Compatible) | Custom YAML required | Low (no native K8s support) | Advanced |
| Luigi / Oozie | 3 (Container-Compatible) | Dockerfile only | Low (not designed for K8s) | Advanced |

Teams starting new data pipeline infrastructure in 2026 should default to Tier 1 or Tier 2 tools. Tier 3 is a migration project, not a deployment option.


7. Clanker Cloud: Live Visibility Into Deployed ETL Workloads

Once your ETL tools are deployed across namespaces — airbyte, data-pipeline, kafka, spark-operator — the operational question becomes observability. What is running. What is consuming resources. What failed overnight.

Clanker Cloud connects to your Kubernetes clusters (EKS, GKE, AKS) and answers plain-English infrastructure queries without console-hopping. The query "show me all ETL jobs running in namespace data-pipeline" returns running pod status, resource consumption, and recent events for every workload in that namespace. No context-switching between kubectl sessions and the AWS console.

The Deep Research feature fans out across namespaces for cross-namespace cost attribution — useful when you need to understand whether your airbyte-worker pods or your Spark executor pods are driving up the compute bill. It produces severity-graded findings and exports as JSON or Markdown for your runbooks.

For GitOps workflows, the query "describe HelmRelease airbyte in flux-system" surfaces the same information as kubectl describe helmrelease airbyte -n flux-system but with plain-English synthesis: which version is deployed, whether the release is in a failed state, and what the last reconciliation event was.

BYOK model support means you choose the inference engine. Local and private workloads fit well with Gemma 4 via Ollama (gemma4:31b, gemma4:26b) or Hermes (hermes3:70b). For complex cross-namespace analysis requiring extended context, Claude Code or Codex via their respective APIs are available within the same interface. Credentials stay local — they never leave your machine.

Teams moving fast from prototype to production will find the vibe-coding-to-production workflow relevant: query your cluster state, inspect what is deployed, generate a reviewed plan for changes, and apply only after explicit approval. The AI DevOps for Teams guide covers multi-team namespace management with Clanker Cloud.

Full documentation is at docs.clankercloud.ai. To connect your Kubernetes cluster, start at clankercloud.ai/account.


8. FAQ

What is the difference between a container-native ETL tool and one that is just container-compatible?

A container-native tool was designed around Kubernetes primitives: Custom Resource Definitions, operator-based reconciliation, and native support for pod lifecycle management. Container-compatible tools can run in a container but require significant manual YAML configuration to behave reliably on Kubernetes. The distinction becomes visible during upgrades, node failures, and resource-constrained conditions.

Is Airbyte production-ready on Kubernetes?

Yes, with the right configuration. The default Helm install is not production-ready — it uses shared resource limits and a single worker replica. Overriding worker.resources, server.resources, and replicaCount in values.yaml, combined with a GitOps workflow (Flux or Argo CD) for managing Helm releases, produces a deployment that holds up under sustained load. Airbyte is Tier 2, not Tier 1, because there is no operator watching for drift.

Does Meltano support Kubernetes-native scheduling?

Meltano itself is a CLI tool. Its Kubernetes deployment model is a container image invoked as a CronJob or Job. This is a valid and lightweight pattern for batch ELT pipelines. There is no Meltano operator and no CRD. For teams with existing CronJob infrastructure, Meltano fits well. For teams wanting operator-managed state and failure recovery, Meltano requires supplementing with external workflow tooling.

How do I know which ETL tools are actually deployed in my cluster?

kubectl get pods --all-namespaces -l app=airbyte-worker surfaces Airbyte workers (the exact label depends on your chart version). kubectl get sparkapplications --all-namespaces lists Spark jobs if the operator is installed. kubectl get kafkaconnectors --all-namespaces lists Strimzi connectors. For a unified view across namespaces without running multiple kubectl commands, Clanker Cloud's live infrastructure queries provide a plain-English summary from a single interface. See the "for AI agents" guide for how agents can query cluster state programmatically via MCP.


9. CTA

The tier classification is a starting point. What it reveals — in practice — depends on how your cluster is actually configured today. Pods running without resource limits, Helm releases that have drifted from their declared values, Spark applications with no retry policy: these are the conditions that produce 2 AM incidents.

Clanker Cloud connects to your cluster and surfaces that state in plain English. Query your data-pipeline namespace, inspect your Helm releases, and identify which workloads are running outside their intended configuration — before a node failure does it for you.

Connect your cluster at clankercloud.ai/account. For questions about what Clanker Cloud can query on Kubernetes, see the FAQ. For a live walkthrough, the demo covers namespace-level workload inspection and cross-namespace cost attribution.
