I’ve been using Linux since I was a teenager. The first challenge I had to pass was getting my US Robotics 36K modem to work without any internet access, just by reading the Linux HOWTOs and using minicom. That experience made me love not only Linux but also troubleshooting.

Linux troubleshooting is not about memorizing commands. It is about understanding how the system behaves, identifying abnormal signals, and isolating problems methodically. Strong troubleshooting skills make a real difference between senior and junior engineers, because production systems fail from time to time, and in different ways.


1. The mindset for dealing with problems

Before running commands like crazy, establish a mental model.

  • What exactly is broken?
  • When did it start?
  • What changed recently?
  • Is the problem reproducible?

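Below is an example of bad troubleshooting, restarting the service before looking at anything:

systemctl restart nginx

Good troubleshooting collects evidence first (the log path below is the default on most distributions):

systemctl status nginx
tail -n 50 /var/log/nginx/error.log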

2. Checking the operating system health

CPU and Load

uptime

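The output looks like the line below. The three numbers are the 1, 5, and 15 minute load averages. If the load average stays above the CPU core count, the system is likely overloaded. On Linux the figure also counts processes blocked in uninterruptible I/O, so a high load can point at disk as well as CPU.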

18:32:10 up 7 days,  7:45,  3 users,  load average: 8.21, 7.90, 7.50
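
The comparison against the core count can be scripted. This is a minimal sketch, assuming a Linux system where /proc/loadavg and nproc are available. The function names load_per_core and check_load are my own, not part of any tool:

```shell
# load_per_core LOAD CORES -> prints LOAD divided by CORES, two decimals.
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# check_load -> reads the 1-minute load average and the core count from
# the running system and prints them next to the per-core ratio.
check_load() {
    load=$(cut -d ' ' -f 1 /proc/loadavg)
    cores=$(nproc)
    echo "load=$load cores=$cores load_per_core=$(load_per_core "$load" "$cores")"
}
```

With the sample output above on a 4-core machine, load_per_core 8.21 4 prints 2.05, meaning roughly twice as much runnable work as there are CPUs. A ratio that stays above 1.00 is the signal to dig deeper.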

Check CPU usage:

top
htop

Look for:

  • Processes that are consuming excessive CPU or Memory

Memory

free -m

Example:

               total        used        free      shared  buff/cache   available
Mem:           15708        5153        6585         594        5171       10555
Swap:          15255        1904       13351

Warning signs:

  • Low available memory
  • High swap usage

Check swap:

swapon --show
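
To turn "low available memory" into a number, the sketch below reads MemAvailable and MemTotal from /proc/meminfo. It assumes a kernel recent enough to expose MemAvailable (3.14 or later). The function name and the optional file argument are hypothetical and exist only to make the function easy to test:

```shell
# mem_available_pct [FILE] -> percentage of RAM the kernel estimates is
# available for new workloads, truncated to an integer.
# FILE defaults to /proc/meminfo.
mem_available_pct() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%d\n", avail * 100 / total }' "${1:-/proc/meminfo}"
}
```

For the example above (15708 total, 10555 available) this prints 67. What counts as "low" depends on the workload, so treat any threshold as a starting point rather than a rule.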

Disk Space

df -h

Example:

Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        50G   49G  200M  99% /

A full filesystem causes:

  • Application crashes and logging failures
  • Database corruption risk

Find large directories:

du -hs /* 2>/dev/null
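
When a host has many mount points, it helps to list only the filesystems that are close to full. A minimal sketch, assuming the POSIX output format of df -P (six columns, usage in the fifth). The function name full_filesystems is made up for this example, and mount points containing spaces are not handled:

```shell
# full_filesystems THRESHOLD -> reads `df -P` output on stdin and prints
# "mountpoint use%" for every filesystem at or above THRESHOLD percent.
full_filesystems() {
    awk -v limit="$1" 'NR > 1 {
        use = $5
        sub(/%/, "", use)              # strip the percent sign before comparing
        if (use + 0 >= limit) print $6, $5
    }'
}
```

Typical use:

df -P | full_filesystems 90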

3. Check the logs

Logs provide direct evidence.

System logs:

journalctl -xe

Service-specific logs:

journalctl -u nginx

Kernel logs:

dmesg | more

Look for:

  • errors and timeouts
  • killed processes
  • hardware failures

Example:

Out of memory: Killed process 17232 (redis)

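This points straight at the root cause: the system ran out of memory and the kernel OOM killer terminated Redis. Keep in mind that the killed process is the victim the kernel picked, which is not always the process that caused the memory pressure.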
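
Messages like this one can be pulled out of a long kernel log automatically. The sketch below assumes the "Out of memory: Killed process PID (name)" wording shown in the example. The exact text differs between kernel versions, so the pattern may need adjusting, and the function name oom_victims is hypothetical:

```shell
# oom_victims -> reads kernel log lines on stdin and prints "pid name"
# for each process reported as killed by the OOM killer.
oom_victims() {
    sed -n 's/.*Out of memory: Killed process \([0-9][0-9]*\) (\([^)]*\)).*/\1 \2/p'
}
```

Typical use:

dmesg | oom_victims
journalctl -k | oom_victims

For the example line above, the output is "17232 redis".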

4. Check Processes

List processes by CPU or Memory:

ps aux --sort=-%cpu | head
ps aux --sort=-%mem | head

Check the process tree to see parent processes, children, and threads:

pstree -p

Check if expected services are running:

systemctl status docker

5. Networking

Many production incidents are networking-related.

Check interfaces

ip addr show

Check routing

ip route show

Example:

default via 10.10.0.1 dev eth0
10.10.0.0/24 dev eth0 proto kernel scope link src 10.10.0.10

Test connectivity

ping 8.8.8.8

Test DNS:

dig google.com

Test TCP connectivity:

nc -vz lwn.net 443

Check if the proper ports are open

Verify listening services:

ss -tulnp
netstat -nutl

Example:

tcp LISTEN 0 128 0.0.0.0:22
tcp LISTEN 0 128 0.0.0.0:443

If the expected port is missing, the service is not running, or it is listening on a different port or address than you expect.
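
The port check is easy to script for monitoring or runbooks. This is a sketch, assuming the column layout of ss -tln or ss -tuln, where the local address is the first field containing a colon. The function name is_listening is my own:

```shell
# is_listening PORT -> reads `ss -tln` (or `ss -tuln`) output on stdin.
# Exit status is 0 if some socket is bound to PORT, 1 otherwise.
is_listening() {
    awk -v port="$1" 'NR > 1 {
        for (i = 1; i <= NF; i++) {
            if ($i ~ /:/) {                    # first field with ":" is the local address
                n = split($i, parts, ":")      # the port is the last ":"-separated piece
                if (parts[n] == port) found = 1
                break
            }
        }
    }
    END { exit !found }'
}
```

Typical use:

ss -tln | is_listening 443 && echo "443 is open" || echo "443 is missing"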

6. Check Disk I/O

High disk latency causes severe performance degradation. This happens frequently on blockchain nodes, where disk I/O is often the bottleneck.

iostat -xz 1
sar -d 1

Look for:

  • high await
  • high utilization (%util)

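If %util stays above 90%, the disk is likely saturated. On devices that serve requests in parallel, such as SSDs and RAID arrays, %util can overstate saturation, so read it together with await.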
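
The threshold check can be applied to iostat output with a small filter. A sketch under a few assumptions: the device table starts with a header line beginning with "Device" and ends at a blank line, and numbers use a dot as the decimal separator (hence LC_ALL=C in the usage line). The %util column is located by name because its position differs between sysstat versions. The function name saturated_disks is hypothetical:

```shell
# saturated_disks THRESHOLD -> reads `iostat -x` output on stdin and
# prints "device %util" for each device at or above THRESHOLD.
saturated_disks() {
    awk -v limit="$1" '
        NF == 0   { col = 0; next }    # a blank line ends a device table
        /^Device/ { for (i = 1; i <= NF; i++) if ($i == "%util") col = i; next }
        col && NF >= col && $col + 0 >= limit { print $1, $col }
    '
}
```

Typical use:

LC_ALL=C iostat -xz 1 3 | saturated_disks 90

The first report from iostat shows averages since boot, so a device only counts as saturated if it keeps showing up in the later reports.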

Check File Descriptors

Linux limits open files.

Check usage:

lsof | wc -l

Check limits:

ulimit -n

Check system-wide:

cat /proc/sys/fs/file-nr

Exhaustion causes failures like:

Too many open files

Also check the limits configured in /etc/security/limits.conf. For services started by systemd, the limit comes from the unit's LimitNOFILE= setting instead.
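
Because limits apply per process, it is often more useful to compare one process against its own limit than to count every open file on the system. A minimal sketch, assuming the Linux /proc layout (/proc/PID/fd and /proc/PID/limits). The function name fd_usage and the optional second argument are made up for this example, and reading another user's process usually requires root:

```shell
# fd_usage PID [PROC_ROOT] -> prints "open=<n> soft_limit=<n>" for one
# process. PROC_ROOT defaults to /proc and exists only for testing.
fd_usage() {
    proc="${2:-/proc}/$1"
    # one entry in fd/ per open file descriptor
    open=$(ls "$proc/fd" | wc -l | tr -d ' ')
    # line format: "Max open files  <soft>  <hard>  files"
    limit=$(awk '/^Max open files/ { print $4 }' "$proc/limits")
    echo "open=$open soft_limit=$limit"
}
```

Typical use, with the PID of the affected service:

fd_usage 17232

When the open count sits close to the soft limit, "Too many open files" errors are the expected result.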

7. Check Recent Changes

Most incidents are caused by changes.

Check:

  • deployments and package upgrades (never upgrade production systems before staging ones)
  • configuration changes, both system-wide (/etc/*) and application-related

Package history:

Debian/Ubuntu:

grep -E " (install|upgrade) " /var/log/dpkg.log

RHEL/CentOS:

yum history
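
Configuration changes can be spotted the same way, by listing files that were modified recently. A sketch, assuming a find that supports -xdev and -mtime (both are POSIX). The function name recent_changes is hypothetical:

```shell
# recent_changes DIR DAYS -> lists regular files under DIR that were
# modified within the last DAYS days, sorted by path.
recent_changes() {
    find "$1" -xdev -type f -mtime "-$2" 2>/dev/null | sort
}
```

Typical use:

recent_changes /etc 2

Modification times only show that a file was touched, not what changed inside it, so this is a pointer to where to look rather than proof.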

The Senior Engineer Mindset

Senior engineers troubleshoot using a systematic flow:

  1. Verify operating system and platform health
  2. Check logs and dashboards
  3. Check that the allocated resources are sufficient
  4. Check networking
  5. Check dependencies
  6. Identify root cause
  7. Apply minimal corrective action

Linux troubleshooting is a deterministic process, not trial and error. Production systems almost always leave evidence through logs, metrics, and system state.

Mastering troubleshooting requires:

  • understanding system internals
  • structured investigation and pattern recognition
  • the most important one: experience

These skills are essential for SRE, DevOps, and Platform Engineers. :-)