
Kubernetes Debugging Without Memorizing kubectl: An AI-First Approach

Complete Kubernetes debugging guide: kubectl commands for every scenario, plus AI-first shortcuts. CrashLoopBackOff, Pending, RBAC, networking, and more.

Kubernetes debugging is one of the most searched topics in DevOps — and for good reason. Kubernetes is the most powerful container orchestration platform ever built. It is also one of the most verbose to investigate. A single pod crash can require 6–8 separate kubectl commands to fully understand what went wrong, which namespace, which node, which log line.

This article is a debugging reference guide. For every common failure scenario — CrashLoopBackOff, Pending pods, unreachable services, failed rollouts, RBAC errors — you will find the exact kubectl commands that surface the problem, followed by the equivalent Clanker Cloud plain-English question that gets you the same picture in one step. Use it as a bookmark, share it with your team, and treat the kubectl sections as a standalone reference even if you never touch Clanker Cloud.

Whether you are debugging Kubernetes production at 2 a.m. or building a K8s troubleshooting runbook for your platform team, every scenario below covers the real commands, real error states, and real root causes.


The kubectl Debugging Toolkit

Before diving into scenarios, here is the core set of commands every Kubernetes engineer has saved somewhere. These alone will get you most of the way through any investigation.

# List all pods in a namespace with node placement
kubectl get pods -n <namespace> -o wide

# Full pod spec, conditions, events, and resource requests
kubectl describe pod <pod-name> -n <namespace>

# Logs from the previous (crashed) container instance
kubectl logs <pod-name> -n <namespace> --previous

# Open a shell inside a running container
kubectl exec -it <pod-name> -n <namespace> -- /bin/sh

# Events sorted by time — often the fastest path to root cause
kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace>

# Live CPU and memory usage per pod
kubectl top pods -n <namespace>

Bookmark these. The scenarios below build on them.


Debugging Scenario 1: Pod Stuck in CrashLoopBackOff

CrashLoopBackOff means the container keeps starting and then exiting, so the kubelet is delaying each restart instead of retrying immediately. The delay grows exponentially up to a five-minute cap, so the longer the pod sits in this state, the longer you wait between restart attempts.

kubectl approach

# Step 1: Identify which pods are crashing
kubectl get pods -n staging -o wide

# Step 2: Get the full picture — Events section is critical here
kubectl describe pod <pod-name> -n staging

# Step 3: Read the logs from the container that just crashed
kubectl logs <pod-name> -n staging --previous

# Step 4: Check recent events across the namespace
kubectl get events --sort-by=.metadata.creationTimestamp -n staging

# Step 5: If OOMKilled, check resource limits and usage
kubectl top pods -n staging

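The exit code in the describe output is your first clue. Exit code 137 means the container was killed with SIGKILL (128 + 9); when the Reason field shows OOMKilled, the container exceeded its memory limit and the kernel killed it. Exit code 1 typically means an application error. Exit code 126 means the entrypoint was found but could not be executed, and 127 means the command was not found.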
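The exit-code rules above can be sketched as a small shell helper. This is a minimal illustration, not part of kubectl: the function names explain_exit_code and last_exit are hypothetical, and last_exit assumes you care about the first container in the pod.

```shell
# Translate a container exit code into a likely cause.
# Codes above 128 conventionally mean "killed by signal (code - 128)".
explain_exit_code() {
  code=$1
  case "$code" in
    0)   echo "exited cleanly" ;;
    1)   echo "application error" ;;
    126) echo "command found but not executable" ;;
    127) echo "command not found" ;;
    137) echo "SIGKILL (often OOMKilled)" ;;
    143) echo "SIGTERM" ;;
    *)
      if [ "$code" -gt 128 ] 2>/dev/null; then
        echo "killed by signal $((code - 128))"
      else
        echo "application-defined exit code"
      fi
      ;;
  esac
}

# Read the last terminated state of the first container in a pod.
# Prints "<exitCode> <reason>", for example "137 OOMKilled".
last_exit() {
  # $1 = pod name, $2 = namespace
  kubectl get pod "$1" -n "$2" \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode} {.status.containerStatuses[0].lastState.terminated.reason}'
}
```

For example, last_exit <pod-name> staging followed by explain_exit_code 137 gives you the code, the reason Kubernetes recorded, and a plain reading of both.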

Clanker Cloud equivalent

"Which pods in the staging namespace are crashing and what are the error messages?"

Clanker Cloud reads pods, events, and the previous container logs together and returns a correlated answer: which pods are in CrashLoopBackOff, the most recent exit code, the last lines of the previous log, and any relevant events — in a single response.

Common causes to look for

Symptom | Root cause
Exit code 137 | OOMKilled: increase memory limit or fix memory leak
Exit code 1, application stack trace | Bug or missing config: check env vars and config maps
exec format error | Wrong image architecture (arm64 vs. amd64)
Error: could not find command | Bad ENTRYPOINT or CMD in Dockerfile
Liveness probe failing | Probe misconfigured: endpoint not ready before deadline
Error: missing required env var | Missing environment variable in pod spec or ConfigMap/Secret

Debugging Scenario 2: Pod Stuck in Pending

A Pending pod has been accepted by the API server but the scheduler has not placed it on a node yet. The Events section of kubectl describe is the fastest path to understanding why.

kubectl approach

# Step 1: Confirm the pod is Pending (the NODE column shows <none>)
kubectl get pods -n production -o wide

# Step 2: Read the scheduler's reason in the Events section
kubectl describe pod <pod-name> -n production

# Step 3: Check actual node capacity and allocatable resources
kubectl describe nodes

# Step 4: If the pod needs storage, check PVC binding status
kubectl get pvc -n production
kubectl describe pvc <pvc-name> -n production

# Step 5: Verify scheduler events
kubectl get events --sort-by=.metadata.creationTimestamp -n production | grep -i "FailedScheduling"

The Events section of kubectl describe pod usually states exactly why the scheduler rejected the pod, for example "0/3 nodes are available: 3 Insufficient memory" or "1 node(s) had untolerated taint".
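If grepping the event list feels brittle, a field selector can narrow the output on the server side. The sketch below assumes nothing beyond standard kubectl flags; the wrapper names failed_scheduling and pending_pods are hypothetical.

```shell
# Show only scheduler rejections in a namespace, oldest first.
failed_scheduling() {
  # $1 = namespace
  kubectl get events -n "$1" \
    --field-selector reason=FailedScheduling \
    --sort-by=.metadata.creationTimestamp
}

# List pods that are still waiting for a node.
pending_pods() {
  # $1 = namespace
  kubectl get pods -n "$1" --field-selector status.phase=Pending -o wide
}
```

Running pending_pods production and then failed_scheduling production pairs each stuck pod with the scheduler's stated reason.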

Clanker Cloud equivalent

"Why is the payments pod stuck in Pending?"

The response traces the scheduling decision: node resource pressure, any unbound PVCs, taint/toleration mismatches, and node selector or affinity rules that could not be satisfied.

Common causes

  • Insufficient node resources — no node has enough CPU or memory to satisfy the pod's requests. Either add nodes, reduce requests, or free capacity.
  • PVC not bound — the pod is waiting for a PersistentVolumeClaim to find a matching PersistentVolume or for a dynamic provisioner to create one. Check StorageClass and provisioner logs.
  • Missing tolerations — a node has a taint (e.g., dedicated=gpu:NoSchedule) and the pod has no matching toleration.
  • Node selector / affinity mismatch: nodeSelector or nodeAffinity rules require a label that no available node carries.
  • Resource quota exceeded — the namespace ResourceQuota is at its limit for CPU, memory, or pod count.

Debugging Scenario 3: Service Not Reachable

One of the most common Kubernetes networking problems: a service exists, the pods seem healthy, but traffic is not getting through. The chain to investigate is: Service → Endpoints → Pod labels → NetworkPolicy.

kubectl approach

# Step 1: Confirm the service exists and note its selector
kubectl get svc -n production
kubectl describe svc api-service -n production

# Step 2: Check if the service has any endpoints (if empty, no pods matched the selector)
kubectl get endpoints api-service -n production

# Step 3: Check pod labels — do they match the service selector exactly?
kubectl get pods -n production --show-labels

# Step 4: Test connectivity directly, bypassing DNS and load balancer
kubectl port-forward svc/api-service 8080:80 -n production

# Step 5: Check for NetworkPolicies that might be blocking traffic
kubectl get networkpolicy -n production
kubectl describe networkpolicy <policy-name> -n production

The most common trap: kubectl get endpoints returns <none> — which means no pods match the service's label selector. Check the selector in the service against the actual labels on the pods. A single typo (app: api vs. app: api-service) breaks the whole thing.
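One way to take the guesswork out of the selector comparison is to feed the Service's own selector back into a pod query. This is a sketch under the assumption that the Service uses a plain key/value selector; svc_selector and pods_for_svc are hypothetical helper names.

```shell
# Print the Service's label selector as a comma-separated key=value list.
svc_selector() {
  # $1 = service name, $2 = namespace
  kubectl get svc "$1" -n "$2" \
    -o go-template='{{range $k, $v := .spec.selector}}{{$k}}={{$v}},{{end}}' \
    | sed 's/,$//'
}

# List the pods that the Service's selector actually matches.
pods_for_svc() {
  # $1 = service name, $2 = namespace
  selector=$(svc_selector "$1" "$2")
  if [ -z "$selector" ]; then
    echo "service $1 has no selector" >&2
    return 1
  fi
  kubectl get pods -n "$2" -l "$selector" -o wide
}
```

If pods_for_svc api-service production returns nothing while the pods are clearly running, the selector and the pod labels disagree.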

Clanker Cloud equivalent

"Why can't the frontend service reach the API service in the production namespace?"

Clanker Cloud traces the full chain: service spec and selector, endpoint objects, pod labels, and any NetworkPolicy rules that could block ingress or egress between the two. It flags label mismatches and policy gaps in the explanation.

Common causes

Problem | What to look for
Empty endpoints | Label selector in Service does not match pod labels
NetworkPolicy blocking | Ingress/egress rules do not permit traffic between the two namespaces or pods
Wrong target port | targetPort in the Service does not match the container's containerPort
DNS resolution failure | CoreDNS pods unhealthy: check kubectl get pods -n kube-system
Pod not ready | Pods exist but readiness probe is failing, so they are excluded from endpoints

Debugging Scenario 4: High Memory or CPU Usage

Performance problems and unexpected scaling events are harder to debug because there is no single error state — you are hunting for a pattern across resource metrics, HPA decisions, and pod lifecycle events.

kubectl approach

# Step 1: Find the heaviest pods
kubectl top pods -n production --sort-by=memory
kubectl top pods -n production --sort-by=cpu

# Step 2: Check node-level pressure
kubectl top nodes
kubectl describe nodes | grep -A5 "Conditions:"

# Step 3: Review what limits are set (or missing)
kubectl describe pod <pod-name> -n production | grep -A10 "Limits:"

# Step 4: Check if the HPA is active and what it's doing
kubectl get hpa -n production
kubectl describe hpa <hpa-name> -n production

# Step 5: Look for recent scaling events
kubectl get events -n production --sort-by=.metadata.creationTimestamp | grep -i "scale"

Missing resource limits are the silent killer here. Without limits, a single misbehaving pod can consume all available memory on a node, which leads to OOM kills and evictions for the other pods scheduled there.
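Checking limits pod by pod with describe gets slow in a busy namespace. A rough sketch of a namespace-wide check follows; the helper names are hypothetical, and it only flags pods where no container reports a memory limit, so a pod with one limited and one unlimited container would slip through.

```shell
# Print the names of pods that report no memory limit on any container.
# Reads "NAME MEM_LIMIT" rows on stdin and skips the header line.
without_memory_limit() {
  awk 'NR > 1 && $2 == "<none>" { print $1 }'
}

pods_without_memory_limit() {
  # $1 = namespace
  kubectl get pods -n "$1" \
    -o custom-columns='NAME:.metadata.name,MEM_LIMIT:.spec.containers[*].resources.limits.memory' \
    | without_memory_limit
}
```

Usage: pods_without_memory_limit production prints one pod name per line.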

Clanker Cloud equivalent

"Which pods are using the most memory in the production cluster right now?"

"Is the HPA for the API deployment behaving correctly?"

The second question is particularly useful: it tells you whether the HPA is scaling up, pinned at its maximum, unable to scale (for example, because no metrics server is available), or holding steady inside its stabilization window.
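The same HPA states can be read from the object's status conditions with kubectl alone. The sketch below assumes the cluster serves the autoscaling/v2 API, where conditions live under .status.conditions; hpa_conditions and hpa_warnings are hypothetical names.

```shell
# Print each HPA condition as "Type=Status (Reason)".
hpa_conditions() {
  # $1 = HPA name, $2 = namespace
  kubectl get hpa "$1" -n "$2" \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status} ({.reason}){"\n"}{end}'
}

# Flag the condition values that usually need attention.
hpa_warnings() {
  awk -F'[= ]' '
    $1 == "AbleToScale"    && $2 == "False" { print "cannot scale: " $0 }
    $1 == "ScalingActive"  && $2 == "False" { print "no usable metrics: " $0 }
    $1 == "ScalingLimited" && $2 == "True"  { print "at min/max replicas: " $0 }
  '
}
```

Usage: hpa_conditions <hpa-name> production | hpa_warnings. No output means none of the three conditions is in a state worth investigating.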

Common causes

  • Memory leak — pod memory climbs steadily over hours or days. Compare current usage against the past 24h if you have metrics history.
  • Missing resource limits — pod spec has no resources.limits block; add them to protect the node.
  • HPA misconfiguration — target CPU percentage too high, wrong metric source, or minReplicas / maxReplicas set too conservatively.
  • Runaway Job or CronJob — a batch job consuming all CPU. Check kubectl get jobs -n production.
  • Node pressure — if nodes show MemoryPressure or DiskPressure conditions, Kubernetes will start evicting pods.

Debugging Scenario 5: Failed Rollout

A deployment rollout that stalls or fails leaves your cluster in a mixed state: some pods on the new version, some still on the old. The Deployment controller provides a rich audit trail if you know where to look.

kubectl approach

# Step 1: See if the rollout is progressing or stuck
kubectl rollout status deployment/api -n production

# Step 2: Review rollout history
kubectl rollout history deployment/api -n production

# Step 3: Inspect the Deployment spec and conditions
kubectl describe deployment api -n production

# Step 4: Find the new ReplicaSet and check its events
kubectl get rs -n production
kubectl describe rs <new-replicaset-name> -n production

# Step 5: Check if the image can actually be pulled
kubectl get pods -n production | grep -i "ImagePullBackOff\|ErrImagePull"
kubectl describe pod <failing-pod> -n production | grep -A5 "Events:"

ErrImagePull and ImagePullBackOff are immediate suspects: the tag does not exist, the registry is private and the pull secret is missing, or there is a typo in the image name. kubectl describe on the failing pod will show the exact registry error.
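Grepping the STATUS column works, but the waiting reason is also available as structured data. A minimal sketch, assuming you want one line per affected pod; the function names are hypothetical.

```shell
# Keep only pods with a container waiting on an image pull.
# Reads "<pod><TAB><waiting reasons>" lines on stdin.
image_pull_failures() {
  awk -F'\t' '$2 ~ /ImagePullBackOff|ErrImagePull/ { print $1 ": " $2 }'
}

pods_failing_image_pull() {
  # $1 = namespace
  kubectl get pods -n "$1" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].state.waiting.reason}{"\n"}{end}' \
    | image_pull_failures
}
```

Usage: pods_failing_image_pull production, then run kubectl describe pod on each name it prints to see the registry error.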

Clanker Cloud equivalent

"What happened with the last rollout of the API deployment?"

Returns a correlated view: the deployment's rollout status, which ReplicaSet is new vs. old, whether pods are failing due to image pull errors or readiness probe failures, and the full event trail — without running five separate commands.

Common causes

Error | Root cause
ImagePullBackOff / ErrImagePull | Bad image tag, missing pull secret, or registry unavailable
Rollout stuck, pods not ready | Readiness probe failing on new version: check probe endpoint and timeout
ReplicaSet not scaling up | Resource quota exceeded in namespace: check kubectl describe quota -n production
Insufficient replicas available | maxUnavailable budget means old pods must stay up until new ones pass readiness
Rollout progressing forever | progressDeadlineSeconds set too high (the default is 600 seconds): lower it in the Deployment spec

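To recover: kubectl rollout undo deployment/api -n production rolls the Deployment back to its previous revision. The rollback is itself a rolling update, so confirm that it completes with kubectl rollout status.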
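A rollback is easier to trust when it is followed by a status check. The sketch below wraps both steps; rollback_and_wait is a hypothetical name, and the 120-second timeout is an arbitrary assumption you should tune to your readiness probes.

```shell
rollback_and_wait() {
  # $1 = deployment name, $2 = namespace, $3 = optional revision number
  if [ -n "$3" ]; then
    kubectl rollout undo "deployment/$1" -n "$2" --to-revision="$3" || return 1
  else
    kubectl rollout undo "deployment/$1" -n "$2" || return 1
  fi
  # Block until the rollback finishes or the timeout expires.
  kubectl rollout status "deployment/$1" -n "$2" --timeout=120s
}
```

Usage: rollback_and_wait api production returns to the previous revision, while rollback_and_wait api production 3 targets a revision number taken from kubectl rollout history.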


Debugging Scenario 6: RBAC / Permissions Error

RBAC errors surface as Forbidden responses in pod logs, failed service account API calls, or operators that silently do nothing because they cannot list or patch resources. They are easy to confirm with one command.

kubectl approach

# Step 1: Test whether a specific service account can perform an action
kubectl auth can-i get secrets \
  --as=system:serviceaccount:production:api-service \
  -n production

# Step 2: List RoleBindings in the namespace
kubectl get rolebindings -n production -o wide

# Step 3: Inspect the specific binding and what Role it references
kubectl describe rolebinding <binding-name> -n production

# Step 4: Check the Role's rules
kubectl describe role <role-name> -n production

# Step 5: Check ClusterRoleBindings if namespace-scoped binding is not the source
kubectl get clusterrolebindings -o wide | grep api-service

kubectl auth can-i is the fastest diagnostic tool for RBAC. If it returns no, the service account lacks the permission. Trace backward through the RoleBinding → Role chain to find what is missing.
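When a workload needs several verbs on the same resource, it can help to test them in one pass. A minimal sketch, with sa_permissions as a hypothetical helper name:

```shell
sa_permissions() {
  # $1 = namespace, $2 = service account, $3 = resource, remaining args = verbs
  ns=$1; sa=$2; resource=$3
  shift 3
  for verb in "$@"; do
    # can-i exits non-zero on "no", so capture the text instead of the status.
    answer=$(kubectl auth can-i "$verb" "$resource" \
      --as="system:serviceaccount:$ns:$sa" -n "$ns" 2>/dev/null) || true
    echo "$verb $resource: ${answer:-unknown}"
  done
}
```

Usage: sa_permissions production api-service secrets get list patch prints one yes/no line per verb, which makes a verb mismatch easy to spot.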

Clanker Cloud equivalent

"Does the api-service service account have permission to read secrets in the production namespace?"

Returns the effective permissions for that service account, which RoleBindings apply, and what the Role or ClusterRole actually grants — including whether access is coming from a ClusterRoleBinding rather than a namespaced one.

Common causes

  • No RoleBinding — the service account exists but has never been bound to a Role that includes the needed verb/resource pair.
  • Wrong namespace — the RoleBinding exists in staging but the service account operates in production.
  • ClusterRole vs. Role — the ClusterRole exists but no ClusterRoleBinding or RoleBinding attaches it to the service account.
  • Verb mismatch — Role grants get and list but the pod is trying to patch — add the missing verb.
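Once you know which verb/resource pair is missing, kubectl can generate the Role and RoleBinding for review without touching the cluster. This is a sketch only: preview_grant and the "<service-account>-<resource>" naming scheme are assumptions made for illustration.

```shell
# Print a Role and RoleBinding as YAML without applying them.
preview_grant() {
  # $1 = namespace, $2 = service account, $3 = resource, $4 = comma-separated verbs
  kubectl create role "$2-$3" -n "$1" \
    --verb="$4" --resource="$3" --dry-run=client -o yaml
  echo "---"
  kubectl create rolebinding "$2-$3" -n "$1" \
    --role="$2-$3" --serviceaccount="$1:$2" --dry-run=client -o yaml
}
```

Usage: preview_grant production api-service secrets get,list prints both manifests. Review them, then pipe the output to kubectl apply -f - if the grant is what you intended.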

How Clanker Cloud Works with Kubernetes

Clanker Cloud is a local-first AI workspace for infrastructure. The Kubernetes integration works against your existing cluster setup with no credential migration.

Connection model: Clanker Cloud reads your existing kubeconfig and inherits all cluster contexts you have configured — EKS, GKE, AKS, self-hosted, or any combination. Credentials never leave your machine. There is no hosted layer that your API tokens pass through.

Multi-cluster support: If you manage multiple clusters (a common setup for staging/production or multi-region), you can query across all contexts: "Which namespace across my clusters is consuming the most CPU right now?" or "Show me all deployments with no resource limits set."

Model options (BYOK): Clanker Cloud supports bring-your-own-key inference:

  • Gemma 4 (local) — for teams with strict data residency requirements; all K8s context stays on your machine with no outbound API calls
  • Claude / Codex agent workflows — run K8s investigations directly from Claude Code or Codex agent sessions via MCP integration
  • Hermes (local agent) — for autonomous, on-machine agent tasks without cloud inference

Read-first by default: Every investigation starts in read mode — Clanker Cloud queries your cluster but makes no changes. When you want to act (restart a pod, update a deployment, patch a ConfigMap, scale a deployment), you explicitly enable maker mode. Changes are shown as a plan for your review before anything is applied.

Full-stack context: The same workspace that answers your Kubernetes questions also has access to your AWS, GCP, Azure, Cloudflare, and GitHub contexts. This matters for real debugging — a pod OOMKilled might trace back to an RDS query taking 30 seconds, or a NetworkPolicy blocking traffic might interact with a Cloudflare WAF rule. See the AI DevOps for teams guide for full-stack debugging workflows, the AI researchers guide if you manage GPU or ML training clusters, and check the docs for full integration setup.


Quick Reference Card

Bookmark this. Problem → kubectl command → Clanker Cloud question.

Problem | kubectl command | Clanker Cloud question
Pod crashing (CrashLoopBackOff) | kubectl logs <pod> -n <ns> --previous + kubectl describe pod | "Which pods in staging are crashing and what are the errors?"
Pod stuck in Pending | kubectl describe pod → Events section | "Why is the payments pod stuck in Pending?"
Service not reachable | kubectl get endpoints <svc> | "Why can't frontend reach the API service in production?"
High memory usage | kubectl top pods -n <ns> --sort-by=memory | "Which pods are using the most memory in production?"
HPA not scaling | kubectl describe hpa <name> | "Is the HPA for the API deployment behaving correctly?"
Failed rollout | kubectl rollout status deployment/<name> | "What happened with the last rollout of the API deployment?"
Image pull error | kubectl describe pod → Events: ErrImagePull | "Why is the API pod failing to start in production?"
RBAC / Forbidden error | kubectl auth can-i <verb> <resource> --as=<sa> | "Does the api-service account have permission to read secrets?"
OOMKilled | kubectl describe pod → exit code 137 | "Which pods have been OOMKilled in the last 24 hours?"
Missing resource limits | kubectl describe pod → Limits section | "Which pods in production have no memory limits set?"
Taint/toleration mismatch | kubectl describe node + kubectl describe pod | "Why is the GPU job not scheduling onto the GPU node pool?"
NetworkPolicy blocking | kubectl get networkpolicy -n <ns> | "Is there a NetworkPolicy blocking traffic from frontend to API?"

FAQ

How do I debug a Kubernetes pod crash?

Start with kubectl get pods -n <namespace> -o wide to confirm which pods are in a bad state. Then run kubectl describe pod <name> -n <namespace> and read the Events section at the bottom, where the scheduler and kubelet record the reason for failures. For application-level crashes, use kubectl logs <pod> -n <namespace> --previous to read the logs from the last crashed container instance. The exit code in the describe output is the fastest signal: 137 means the container was killed with SIGKILL (most often OOMKilled), and 1 is usually an application error.

What does CrashLoopBackOff mean?

CrashLoopBackOff means Kubernetes started the container, the container exited (usually with a non-zero code), and Kubernetes is now backing off before trying again. The backoff doubles each cycle (10s, 20s, 40s, up to 5 minutes). The container is not necessarily crashing right now; it may be in the waiting period between restarts. Run kubectl logs <pod> --previous to read what the last instance printed before it exited. Common causes: OOMKilled (exit 137), application startup error (missing config, bad DB connection string), failed liveness probe, or a broken ENTRYPOINT in the container image.
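To make the doubling concrete, the schedule can be computed directly from the figures above (10-second start, doubling, 5-minute cap). The function name backoff_schedule is hypothetical and the sketch models only the arithmetic, not the kubelet's actual timer.

```shell
# Print the restart delays for the first N restarts.
backoff_schedule() {
  # $1 = number of restarts to show
  delay=10   # initial delay in seconds
  cap=300    # five-minute ceiling
  i=0
  out=""
  while [ "$i" -lt "$1" ]; do
    out="$out${out:+ }${delay}s"
    delay=$((delay * 2))
    if [ "$delay" -gt "$cap" ]; then
      delay=$cap
    fi
    i=$((i + 1))
  done
  echo "$out"
}

# backoff_schedule 7 prints: 10s 20s 40s 80s 160s 300s 300s
```

After the sixth restart the delay stays at the cap, which is why a pod that has been crash-looping for a while appears to do nothing for minutes at a time.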

How do I debug Kubernetes networking?

The standard chain: check that the Service exists and has a selector (kubectl describe svc <name>), then check that the Service has Endpoints (kubectl get endpoints <name>). Empty endpoints means no pods matched the selector — compare the selector labels against actual pod labels with kubectl get pods --show-labels. If endpoints exist but traffic is still failing, check NetworkPolicy rules with kubectl get networkpolicy -n <namespace>. For DNS issues, verify CoreDNS is healthy: kubectl get pods -n kube-system. Use kubectl port-forward to test connectivity directly against a pod, bypassing the service and any load balancer.

What is the fastest way to troubleshoot Kubernetes?

The single fastest command is kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace>. Events aggregate scheduler decisions, kubelet actions, image pull results, and probe failures in one place, sorted by time. Combine with kubectl describe pod <name> for the full picture. For production incidents where you need to investigate multiple resources at once — pods, nodes, services, HPA, events — a tool like Clanker Cloud reduces a 6-command investigation to a single question and surfaces the correlated answer in seconds.


Conclusion

Kubernetes gives you fine-grained control over every aspect of your container workloads. That control comes with a steep debugging surface: five commands to investigate one crash, five more to understand a pending pod, another set for networking. This guide covers the commands you actually need for every common failure mode. Save it, share it with your team, and use the reference table when you are under pressure.

If you want to move faster on K8s investigations — especially across multiple clusters or when a problem cuts across Kubernetes, your cloud provider, and your application layer — Clanker Cloud connects to your existing kubeconfig, reads your cluster in real time, and answers debugging questions in plain English. All your credentials stay local. See the documentation for the full Kubernetes integration guide.

Download the desktop app and try it free →
