Continuing our series of articles about troubleshooting, this time we are talking about Kubernetes, one of the platform tools I really enjoy working with every day.

Knowing how to troubleshoot efficiently is not about memorizing commands; it is about understanding how the control plane, scheduler, kubelet, and networking interact and, of course, knowing how to use kubectl as an inspection tool.

This guide focuses on real-world debugging using:

  • kubectl get
  • kubectl describe
  • kubectl logs
  • kubectl events
  • kubectl exec
  • kubectl top
  • kubectl port-forward
  • kubectl debug

All examples assume a namespace called prod.


1. kubectl get

Always begin by checking object state.

## always start by checking the pod state
kubectl get pods -n prod

## add `-o wide` to see node placement and pod IPs
kubectl get pods -n prod -o wide

## for deeper inspection, show the full yaml
kubectl get pod api-exchange-893edf -n prod -o yaml

Generally, what we are looking for is one of these things:

  • CrashLoopBackOff
  • ImagePullBackOff
  • Pending
  • Node placement
  • Restart count
  • Resource requests/limits
  • Environment variables
  • Mounted volumes
  • Node affinity and tolerations
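When a namespace has many pods, it helps to filter directly instead of scanning the whole list. Here is a sketch; the `app=api` label and the pod states are illustrative assumptions:

```shell
## list only pods that are not yet scheduled
kubectl get pods -n prod --field-selector=status.phase=Pending

## filter by label, e.g. only the api pods (label name is an assumption)
kubectl get pods -n prod -l app=api

## show only names and restart counts with custom columns
kubectl get pods -n prod -o custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount
```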

2. kubectl describe

describe provides structured, event-driven details about a pod or any other resource; when you need those details, this is the subcommand to use. You will be evaluating sections like:

  • Events
  • Container state
  • Last termination reason
  • Probes status
  • Conditions

kubectl describe pod api-exchange-893edf -n prod

Here is a common example of a health check probe failure:

Liveness probe failed: HTTP probe failed with statuscode: 500
Back-off restarting failed container

This immediately indicates:

  • The container starts
  • The health endpoint fails
  • Kubernetes restarts it

This is a common situation, and it will generally drive you crazy if you do not tune the correct timeouts for the readiness and liveness probes. Now the focus shifts to logs.
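As a sketch of how such a tune could be applied without editing YAML by hand, a JSON patch can bump the probe timeout in place. The deployment name, container index, and the 5-second value are all assumptions for illustration:

```shell
## raise the liveness probe timeout on the first container of the deployment
kubectl patch deployment api-exchange -n prod --type='json' \
  -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/timeoutSeconds", "value": 5}]'
```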

3. kubectl logs

As the subcommand name says, it is basically used to read the output logs from the application or service you are debugging. Here are the basic commands:

## to read the logs of a unique pod
kubectl logs api-exchange-893edf -n prod

## target a specific container in a multi-container pod:
kubectl logs api-exchange-893edf -c api-container -n prod

## read the previous crash log:
kubectl logs api-exchange-893edf -p -n prod

## follow the live log, like a `tail -f`:
kubectl logs -f api-exchange-893edf -n prod

These are the typical patterns to look for in the logs:

  • DB connection refused: networking or service issue
  • TLS handshake failure: certificate issue
  • OOMKilled: memory limit too low
  • Well-structured logs, which make all of the above easier to spot
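To narrow down the output, kubectl logs supports time and size filters, and the result can be piped through grep. The pod name and the search term come from the examples above:

```shell
## only the last 100 lines
kubectl logs api-exchange-893edf -n prod --tail=100

## only logs from the last 15 minutes
kubectl logs api-exchange-893edf -n prod --since=15m

## grep for one of the typical patterns
kubectl logs api-exchange-893edf -n prod | grep -i "connection refused"
```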

4. kubectl events

The events subcommand shows messages from the cluster itself, ordered by timestamp, along with the event that was triggered.

## to check events in the whole namespace `prod`
kubectl events -n prod

## if you want to sort by timestamp:
kubectl events -n prod --sort-by='.lastTimestamp'

Events reveal:

  • Scheduling failures
  • Insufficient CPU/memory
  • Volume mount errors
  • Image pull failures
  • Node pressure

Here is an example of an event:

0/3 nodes are available: 3 Insufficient memory.

This is not an app problem. It is a scheduling constraint problem. It basically shows that none of the nodes has enough free memory to schedule the application.
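When you hit this kind of event, it often helps to narrow the list to warnings and then check what the nodes actually have allocated. The node name here is the same hypothetical worker used later in this guide:

```shell
## show only warning events in the namespace
kubectl events -n prod --types=Warning

## check how much of the node's capacity is already allocated
kubectl describe node worker-us-08 | grep -A 8 "Allocated resources"
```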

5. kubectl exec

Sometimes reading events and logs and analyzing a container from the outside is not enough; you need to go inside it and do your stuff. That is why we have exec.

Use it to verify:

  • DNS resolution from CoreDNS or any other service
  • Service connectivity between pods
  • Environment variables
  • Mounted file systems, secrets, and file permissions

## start a /bin/sh shell inside the container
kubectl exec -it api-exchange-893edf -n prod -- /bin/sh

## from there you can use diagnostic tools to check real problems
curl http://another-service.prod.svc.cluster.local:8080
cat /etc/resolv.conf
nc database 5432

If exec fails because the container crashes too fast, use kubectl debug.

6. kubectl debug

Attach an ephemeral debug container. This is useful for:

  • Images that have no shell
  • Distroless containers
  • Minimal production images

kubectl debug -it api-exchange-893edf -n prod --image=busybox --target=api-container
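kubectl debug can also target a node instead of a pod, which drops you into a pod on that node with the host filesystem mounted under /host. The node name is hypothetical:

```shell
## open a debugging shell on a worker node
kubectl debug node/worker-us-08 -it --image=busybox
```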

7. kubectl top

One of the main day-to-day issues is the amount of resources used by the containers. To check it you will need to deploy the metrics-server; this subcommand then shows the memory and CPU used by pods and nodes.

kubectl top pods -n prod
kubectl top nodes
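When you are hunting for the heaviest consumers, the output can be sorted by cpu or memory, and broken down per container:

```shell
## sort pods by memory usage
kubectl top pods -n prod --sort-by=memory

## include individual container usage per pod
kubectl top pods -n prod --containers
```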

8. kubectl port-forward

And of course, sometimes you need to make network requests to a specific container to validate a few things. For that you can open a local port to a remote service and make your calls:

## open the service `api` pod locally
kubectl port-forward svc/api 8080:80 -n prod

## execute a curl test
curl http://localhost:8080/health
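Forwarding also works against pods and deployments, not only services; this is handy when you want to bypass the Service and hit one replica directly. The container port 8080 is an assumption:

```shell
## forward directly to a single pod
kubectl port-forward pod/api-exchange-893edf 8080:8080 -n prod

## or to whichever pod of a deployment kubectl picks
kubectl port-forward deploy/api 8080:8080 -n prod
```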

9. Service and Node Level Debugging

Check endpoints:

## check endpoints and look for label mismatches and incorrect selectors
kubectl get endpoints api -n prod

## check the service definition
kubectl describe svc api -n prod

## look at node status to find disk and memory pressure
kubectl get nodes
kubectl describe node worker-us-08
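A quick way to spot a label mismatch is to print the selector the service is using and compare it against the pod labels, as a sketch:

```shell
## print the selector the service is using
kubectl get svc api -n prod -o jsonpath='{.spec.selector}'

## list pods with their labels and compare
kubectl get pods -n prod --show-labels
```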

Troubleshooting Flow

  1. Is the pod running?
  2. If not, start with kubectl describe.
  3. If running but failing, check the logs with kubectl logs.
  4. If unstable, check the previous logs with kubectl logs -p.
  5. If networking, try kubectl port-forward or kubectl exec.
  6. If scheduling, check with kubectl events.
  7. If resource-related, memory and CPU will show up in kubectl top.
  8. If deeper isolation is required, use kubectl debug.
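The first steps of the flow above can be sketched as a small triage helper. Everything here is illustrative, not a definitive runbook; the namespace and pod name are passed in as arguments:

```shell
#!/bin/sh
## minimal triage sketch; usage: ./triage.sh <namespace> <pod>
NS="$1"
POD="$2"

## step 1: is the pod running?
PHASE=$(kubectl get pod "$POD" -n "$NS" -o jsonpath='{.status.phase}')
echo "phase: $PHASE"

if [ "$PHASE" != "Running" ]; then
  ## step 2: not running, describe it and read the Events section
  kubectl describe pod "$POD" -n "$NS"
else
  ## steps 3 and 4: running, check current and previous logs
  kubectl logs "$POD" -n "$NS" --tail=50
  kubectl logs "$POD" -n "$NS" -p --tail=50 2>/dev/null
fi
```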

Remember, troubleshooting always follows an order of execution, and it is a discipline you need to master if you want to be a good systems administrator and Kubernetes guru.