Linux Troubleshooting: A Practical Guide
I’ve been using Linux since I was a teenager. My first challenge was getting a US Robotics 36K modem to work with no internet access, relying only on the Linux HOWTOs and minicom. That experience made me love not only Linux but also troubleshooting.
Linux troubleshooting is not about memorizing commands. It is about understanding how the system behaves, identifying abnormal signals, and isolating problems methodically. Strong troubleshooting skills are a large part of what separates senior engineers from junior ones, because production systems fail from time to time, and in different ways.
1. The Troubleshooting Mindset
Before running commands at random, establish a mental model:
- What exactly is broken?
- When did it start?
- What changed recently?
- Is the problem reproducible?
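Below is an example of bad troubleshooting, because a blind restart destroys the evidence of what went wrong:
systemctl restart nginx
Good troubleshooting looks at the service state and its logs first (the error log path shown is the usual package default):
systemctl status nginx
tail -n 50 /var/log/nginx/error.log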
2. Check Operating System Health
CPU and Load
uptime
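Example output is shown below. If the load average stays above the number of CPU cores (see nproc), the system is likely overloaded. On Linux the load average also counts tasks blocked on I/O, so a high value can point to disk problems as well as CPU pressure.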
18:32:10 up 7 days, 7:45, 3 users, load average: 8.21, 7.90, 7.50
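The comparison between load and core count can be scripted. The following is a minimal sketch, not a definitive check: load_status is a hypothetical helper name, and the threshold of 1.0 load per core is an assumption that you may want to tune.

```shell
# Sketch: compare the 1-minute load average with the number of CPU cores.
# load_status is a hypothetical helper, not a standard tool.
load_status() {
    # $1 = 1-minute load average, $2 = number of CPU cores
    awk -v load="$1" -v cores="$2" 'BEGIN {
        ratio = load / cores
        status = (ratio > 1.0) ? "OVERLOADED" : "OK"
        printf "%s load=%.2f cores=%d ratio=%.2f\n", status, load, cores, ratio
    }'
}

# Typical use on a live system:
#   load_status "$(cut -d " " -f1 /proc/loadavg)" "$(nproc)"
```

With the load of 8.21 from the example above and an assumed 4 cores, the helper reports a ratio of about 2, meaning roughly twice as many runnable or blocked tasks as cores.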
Check CPU usage:
top
htop
Look for:
- Processes consuming excessive CPU or memory
Memory
free -m
Example:
               total        used        free      shared  buff/cache   available
Mem:           15708        5153        6585         594        5171       10555
Swap:          15255        1904       13351
Warning signs:
- Low available memory
- High swap usage
Check swap:
swapon --show
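To turn the warning signs into numbers, the output of free can be reduced to two percentages. This is a sketch under stated assumptions: mem_report is a hypothetical helper, and it assumes the modern procps layout in which "available" is the seventh column of the Mem: line.

```shell
# Sketch: summarize `free` output as percentages.
# mem_report is a hypothetical helper; it reads `free` output on stdin and
# assumes the "available" value is column 7 of the "Mem:" line.
mem_report() {
    awk '
        $1 == "Mem:"            { printf "available=%d%%\n", $7 * 100 / $2 }
        $1 == "Swap:" && $2 > 0 { printf "swap_used=%d%%\n", $3 * 100 / $2 }
    '
}

# Typical use on a live system:
#   free -m | mem_report
```

Percentages are truncated to whole numbers. What counts as "low" available memory depends on the workload, so no threshold is hard-coded here.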
Disk Space
df -h
Example:
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        50G   49G  200M  99% /
A full filesystem causes:
- Application crashes and logging failures
- Database corruption risk
Find the largest directories on the root filesystem:
du -xh -d1 / 2>/dev/null | sort -rh | head
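Spotting nearly full filesystems can also be automated. The sketch below uses a hypothetical helper named full_filesystems; the default threshold of 90% is an assumption, and it expects the single-line-per-filesystem format produced by df -P.

```shell
# Sketch: list mount points at or above a usage threshold.
# full_filesystems is a hypothetical helper; it reads `df -P` output on stdin.
# Mount points that contain spaces are not handled.
full_filesystems() {
    awk -v limit="${1:-90}" '
        NR > 1 {                     # skip the header line
            use = $5                 # capacity column, e.g. "99%"
            sub(/%/, "", use)
            if (use + 0 >= limit + 0) print $6, use "%"
        }
    '
}

# Typical use on a live system:
#   df -P | full_filesystems 90
```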
3. Check the Logs
Logs provide direct evidence.
System logs:
journalctl -xe
Service-specific logs:
journalctl -u nginx
Kernel logs:
dmesg -T | less
Look for:
- errors and timeouts
- killed processes
- hardware failures
Example:
Out of memory: Killed process 17232 (redis)
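This immediately explains why the process died: the system ran out of memory and the kernel's OOM killer terminated Redis. Note that Redis is the victim here and not necessarily the process that consumed the memory.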
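Searching the kernel log for OOM kills can be reduced to a one-line filter. This is a sketch: oom_victims is a hypothetical helper, and it assumes the message wording shown in the example above, which varies between kernel versions.

```shell
# Sketch: extract "PID NAME" for every process killed by the OOM killer.
# oom_victims is a hypothetical helper; it reads kernel log text on stdin and
# assumes the "Out of memory: Killed process PID (NAME)" message format.
oom_victims() {
    sed -n 's/.*Out of memory: Killed process \([0-9][0-9]*\) (\([^)]*\)).*/\1 \2/p'
}

# Typical use on a live system:
#   dmesg | oom_victims
#   journalctl -k | oom_victims
```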
4. Check Processes
List processes by CPU or Memory:
ps aux --sort=-%cpu | head
ps aux --sort=-%mem | head
Check the process tree to see parent processes and threads:
pstree -p
Check if expected services are running:
systemctl status docker
5. Networking
Many production incidents are networking-related.
Check interfaces
ip addr show
Check routing
ip route show
Example:
default via 10.10.0.1 dev eth0
10.10.0.0/24 dev eth0 proto kernel scope link src 10.10.0.10
Test connectivity
ping -c 4 8.8.8.8
Test DNS:
dig google.com
Test TCP connectivity:
nc -vz lwn.net 443
Check if the proper ports are open
Verify listening services (ss is the current tool; netstat comes from the legacy net-tools package and may not be installed):
ss -tulnp
netstat -nutl
Example:
tcp LISTEN 0 128 0.0.0.0:22
tcp LISTEN 0 128 0.0.0.0:443
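If the expected port is missing, the service is not running, failed to bind, or is listening on a different port. Also check the address: a service bound to 127.0.0.1 is not reachable from other hosts.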
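Checking for a specific listening port is easy to script. In the sketch below, port_is_listening is a hypothetical helper; it looks for the LISTEN state in ss output and takes the local address from three columns later, which is an assumption about the column layout of ss.

```shell
# Sketch: succeed if a TCP socket is listening on the given port.
# port_is_listening is a hypothetical helper; it reads `ss -tln` (or
# `ss -tuln`) output on stdin.
port_is_listening() {
    awk -v port="$1" '
        {
            for (i = 1; i <= NF; i++) {
                if ($i == "LISTEN") {
                    # Local Address:Port is three columns after the state
                    n = split($(i + 3), parts, ":")
                    if (parts[n] == port) found = 1
                }
            }
        }
        END { exit (found ? 0 : 1) }
    '
}

# Typical use on a live system:
#   ss -tln | port_is_listening 443 && echo "port 443 is listening"
```

Taking the last colon-separated piece also handles IPv6 addresses such as [::]:443.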
6. Check Disk I/O
High disk latency causes severe performance degradation. This happens frequently on blockchain nodes, where disk I/O can become the bottleneck.
iostat -xz 1
sar -d 1
Look for:
- high await
- high utilization (%util)
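A %util value that stays above 90% suggests that the disk is close to saturation. On SSDs and NVMe devices, which serve requests in parallel, %util can overstate saturation, so weigh it together with await.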
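Since the column layout of iostat differs between sysstat versions, a filter is safer when it locates %util by its header name. The following sketch does that; busy_disks is a hypothetical helper and the default threshold of 90 is an assumption.

```shell
# Sketch: print devices whose %util is at or above a threshold.
# busy_disks is a hypothetical helper; it reads `iostat -x` output on stdin
# and finds the %util column from the "Device" header line.
busy_disks() {
    awk -v limit="${1:-90}" '
        /^avg-cpu/ { col = 0; next }     # CPU summary block, not device data
        /^Device/  {
            for (i = 1; i <= NF; i++) if ($i == "%util") col = i
            next
        }
        col && NF >= col && $col + 0 >= limit + 0 { print $1, $col }
    '
}

# Typical use on a live system (two reports, one second apart):
#   iostat -xz 1 2 | busy_disks 90
```

The first report from iostat shows averages since boot, so the later reports are the ones that reflect current activity.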
Check File Descriptors
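Linux limits the number of open file descriptors, both per process and system-wide.
Estimate current usage (lsof also lists entries that are not file descriptors, such as memory-mapped files, so treat the count as a rough upper bound):
lsof | wc -l
Check the per-process limit for the current shell:
ulimit -n
Check system-wide usage (allocated handles, unused handles, maximum):
cat /proc/sys/fs/file-nr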
Exhaustion causes failures like:
Too many open files
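Persistent per-user limits are configured in /etc/security/limits.conf. Services started by systemd generally do not use that file; their limit is set with LimitNOFILE= in the unit file.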
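The three numbers in /proc/sys/fs/file-nr are easier to read as a percentage. This is a minimal sketch with a hypothetical helper named fd_usage; it takes the file contents as its argument.

```shell
# Sketch: report system-wide file handle usage.
# fd_usage is a hypothetical helper; pass it the contents of
# /proc/sys/fs/file-nr (allocated, allocated-but-unused, maximum).
fd_usage() {
    echo "$1" | awk '{
        in_use = $1 - $2
        printf "in_use=%d max=%s pct=%.1f\n", in_use, $3, in_use * 100 / $3
    }'
}

# Typical use on a live system:
#   fd_usage "$(cat /proc/sys/fs/file-nr)"
```

On many modern systems the maximum is so large that the percentage stays near zero; in that case the per-process limit is the one that usually triggers "Too many open files".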
7. Check Recent Changes
Most incidents are caused by changes.
Check:
- deployments and package upgrades (never upgrade production systems before staging ones)
- configuration changes, both system-wide (/etc/*) and application-specific
Package history:
Debian/Ubuntu:
grep -E " (install|upgrade) " /var/log/dpkg.log
RHEL/CentOS (dnf history on newer releases):
yum history
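On Debian-based systems, the package log can be narrowed to the window in which the incident started. The sketch below uses a hypothetical helper named changes_since and assumes the usual dpkg.log line layout of date, time, action, and package.

```shell
# Sketch: show package installs, upgrades, and removals since a given date.
# changes_since is a hypothetical helper; it reads dpkg.log lines on stdin.
changes_since() {
    # $1 = ISO date (YYYY-MM-DD); ISO dates sort correctly as plain strings
    awk -v since="$1" '
        ($3 == "install" || $3 == "upgrade" || $3 == "remove") &&
        ($1 "") >= (since "") { print $1, $2, $3, $4 }
    '
}

# Typical use on a live system:
#   changes_since 2024-05-01 < /var/log/dpkg.log
```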
The Senior Engineer Mindset
Senior engineers troubleshoot using a systematic flow:
- Verify operating system and platform health
- Check logs and dashboards
- Check that CPU, memory, and disk resources are sufficient
- Check networking
- Check dependencies
- Identify root cause
- Apply minimal corrective action
Linux troubleshooting is a deterministic process, not trial and error. Production systems always leave evidence through logs, metrics, and system state.
Mastering troubleshooting requires:
- understanding system internals
- structured investigation and pattern recognition
- the most important one: experience
These skills are essential for SRE, DevOps, and Platform Engineers. :-)