Linux Performance Troubleshooting: A Systematic Approach
Learn the USE method for systematic Linux performance analysis. Master vmstat, iostat, sar, ss, strace, and perf with three real-world troubleshooting scenarios for CPU, memory, and disk bottlenecks.

Stop Guessing, Start Measuring
A production server is slow. Someone says "it's probably the database." Someone else suggests "maybe we need more RAM." Without a systematic approach, Linux performance troubleshooting devolves into guessing, and guessing is expensive -- both in time and in wrong fixes. You add RAM to a CPU-bound problem. You upgrade disks when the real bottleneck is network latency.
The USE method gives you a framework. The tools -- vmstat, iostat, sar, ss, strace, perf -- give you data. This guide covers both, then walks through three concrete scenarios: CPU-bound, memory-saturated, and disk I/O bottleneck.
What Is the USE Method?
Definition: The USE method is a performance analysis methodology created by Brendan Gregg. For every resource (CPU, memory, disk, network), check three things: Utilization (how busy is it), Saturation (is work queuing up), and Errors (are operations failing). This structured approach prevents guessing and ensures no resource is overlooked.
How to apply the USE method for Linux performance analysis
- List the resources -- CPU, memory, disk I/O, network interfaces, and any specialized hardware
- For each resource, check Utilization -- what percentage of capacity is in use? High utilization isn't always bad, but it narrows the search
- For each resource, check Saturation -- is work waiting in a queue? This is often the real bottleneck. A CPU at 100% utilization with no run queue is fine; 100% with a deep run queue means processes are starved
- For each resource, check Errors -- are disk writes failing? Are network packets being dropped? Errors often cause retries that manifest as performance problems
- Correlate findings -- high disk saturation might cause high CPU iowait. Follow the chain of evidence
| Resource | Utilization | Saturation | Errors |
|---|---|---|---|
| CPU | vmstat (us, sy), mpstat -P ALL | vmstat (r column), load average | dmesg, mcelog (machine check exceptions) |
| Memory | free -m, /proc/meminfo | vmstat (si/so for swap), dmesg OOM | dmesg (ECC errors) |
| Disk I/O | iostat -xz (%util) | iostat (avgqu-sz), iotop | smartctl, dmesg (I/O errors) |
| Network | sar -n DEV, ip -s link | ss -s (socket queues), tc -s qdisc | ip -s link (errors, drops) |
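The checklist above can be scripted as a first pass. The following is a minimal sketch that reads /proc and standard tools directly; the load and swap probes are illustrative starting points, not complete USE coverage:

```shell
#!/bin/sh
# First-pass USE probes, one per resource. A sketch, not exhaustive.

cpus=$(nproc)

# CPU saturation: compare the 1-minute load average to the CPU count.
load1=$(awk '{print $1}' /proc/loadavg)
echo "CPU: load1=${load1} cpus=${cpus}"
awk -v l="$load1" -v c="$cpus" \
  'BEGIN { if (l + 0 > c) print "  -> run queue likely exceeds CPU count" }'

# Memory saturation: is any swap in use at all?
swap_kb=$(awk '/^SwapTotal/ {t=$2} /^SwapFree/ {f=$2} END {print t - f}' /proc/meminfo)
echo "Memory: swap_used_kB=${swap_kb}"

# Network errors: print the RX/TX counter rows for inspection.
ip -s link | grep -A1 -E 'RX:|TX:'
```

From here, any probe that fires points you at the tool sections below for a closer look.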
Essential Tools
vmstat: CPU and Memory at a Glance
# Print stats every 1 second, 10 times
vmstat 1 10
# Output:
# procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
# r b swpd free buff cache si so bi bo in cs us sy id wa st
# 2 0 0 524288 65536 1048576 0 0 12 48 150 300 15 3 80 2 0
Key columns:
- r -- processes waiting for CPU (run queue). If this consistently exceeds CPU count, you're CPU-saturated
- b -- processes in uninterruptible sleep (usually waiting for I/O)
- si/so -- swap in/out. Sustained non-zero values mean the system is memory-constrained
- us -- user CPU time. High means your application is busy
- sy -- system (kernel) CPU time. High means lots of system calls or context switches
- wa -- iowait. CPU idle because it's waiting for I/O. High means disk bottleneck
- id -- idle. What's left over
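The r-column rule is easy to automate. Here's a sketch that flags any sample where the run queue exceeds the CPU count (it assumes vmstat's default layout: two header rows, then r as the first column):

```shell
# Flag CPU-saturated vmstat samples: run queue (r) > CPU count.
vmstat 1 5 | awk -v cpus="$(nproc)" '
  NR > 2 && $1 ~ /^[0-9]+$/ && $1 > cpus {
    print "CPU-saturated sample: r=" $1 " (cpus=" cpus ")"
  }'
```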
iostat: Disk I/O Performance
# Extended stats for all devices, every 1 second
iostat -xz 1
# Key columns in output:
# Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
# sda 2.0 15.0 45.0 120.0 1800 4800 80.0 0.7 4.2 3.1 4.6 1.8 85.0
Key columns:
- %util -- device utilization. 100% means the device is saturated (for rotational disks; SSDs can handle more than one request at a time, so 100% doesn't always mean saturated)
- await -- average time (ms) for I/O requests. If this spikes, the device is struggling
- avgqu-sz -- average queue length (renamed aqu-sz in newer sysstat versions). Deep queues mean saturation
- r/s, w/s -- reads and writes per second (IOPS)
Watch out: For SSDs and NVMe drives, %util at 100% doesn't necessarily mean the device is saturated. These devices handle multiple parallel requests. Look at await latency instead -- if it's climbing, the device is genuinely overloaded.
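Putting that advice into practice, here's a sketch that alerts on await rather than %util. Column positions differ across sysstat versions (await vs. r_await/w_await, svctm removed in newer releases), so it finds the latency column by header name; the 20 ms threshold is an arbitrary illustration:

```shell
# Flag devices whose average I/O latency exceeds 20 ms.
iostat -xz 1 3 | awk '
  /^Device/ { for (i = 1; i <= NF; i++) if ($i == "await" || $i == "r_await") col = i }
  col && $1 != "Device" && NF >= col && $col + 0 > 20 {
    print $1 " latency high: " $col " ms"
  }'
```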
sar: Historical Data
# CPU utilization history (from sysstat collector)
sar -u
# Memory utilization
sar -r
# Disk I/O
sar -d
# Network throughput
sar -n DEV
# Specific time range
sar -u -s 14:00:00 -e 15:00:00
# From a specific day's data file
sar -u -f /var/log/sysstat/sa15
Pro tip: Install sysstat on every server before you need it. It collects system stats every 10 minutes by default (via a cron job or timer). When an incident happened at 3 AM and you're investigating at 9 AM, sar is the only tool that has historical data. Configure it to collect every 1 minute for production systems.
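To act on that tip, the collection interval lives in a cron entry or a systemd timer depending on the distro. A sketch of both, assuming default package layouts:

```shell
# Debian/Ubuntu: /etc/cron.d/sysstat runs the collector every 10 min:
#   5-55/10 * * * * root command -v debian-sa1 > /dev/null && debian-sa1 1 1
# Change the schedule field to every minute:
#   * * * * * root command -v debian-sa1 > /dev/null && debian-sa1 1 1

# systemd-based installs: override the collection timer instead.
sudo systemctl edit sysstat-collect.timer
# In the override, replace the schedule:
#   [Timer]
#   OnCalendar=
#   OnCalendar=*:00/1
sudo systemctl restart sysstat-collect.timer
```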
ss: Socket Statistics
# Summary of all socket types
ss -s
# All TCP connections with process info
ss -tnp
# Listening sockets
ss -tlnp
# Connections in TIME_WAIT state (common bottleneck)
ss -Ht state time-wait | wc -l   # -H drops the header so the count is exact
# Connections to a specific port
ss -tn dst :5432
# Show send/receive buffer sizes
ss -tnm
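For a trend rather than a point-in-time count, sample the TIME_WAIT total on a loop (the -H flag drops ss's header line so wc -l counts only sockets):

```shell
# Print the TIME_WAIT count every 10 seconds, three times.
# A steadily growing count means connection churn is outpacing
# the TIME_WAIT expiry timer.
for i in 1 2 3; do
  printf '%s TIME_WAIT=%s\n' "$(date +%T)" \
    "$(ss -Hnt state time-wait | wc -l)"
  sleep 10
done
```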
strace: System Call Tracing
# Trace a running process
strace -p 1234
# Trace with timestamps and follow forks
strace -p 1234 -f -T
# Count system calls (summary)
strace -p 1234 -c
# Trace only file operations
strace -p 1234 -e trace=file
# Trace only network operations
strace -p 1234 -e trace=network
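The -c summary is often the fastest way to classify a mystery process. Here's a sketch that attaches for ten seconds and extracts the top syscalls by time share (strace writes the table to stderr and prints it on detach; PID 1234 is a placeholder):

```shell
# Attach with -c for 10 seconds; timeout's SIGTERM makes strace
# detach cleanly and emit the summary table.
timeout 10 strace -c -f -p 1234 2> syscall-summary.txt
# Pull the data rows (numeric first column, skip the "total" line)
# and rank by the %-time column:
awk 'NF >= 5 && $1 ~ /^[0-9.]+$/ && $NF != "total" { print $1, $NF }' \
  syscall-summary.txt | sort -rn | head -3
```

A dominant futex or poll line suggests lock or event-loop waiting; dominant read/write suggests raw I/O volume.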
perf: CPU Profiling
# Record CPU samples for 30 seconds
perf record -g -p 1234 -- sleep 30
# View the report
perf report
# One-liner: top functions by CPU time
perf top -p 1234
# Count hardware events
perf stat -p 1234 -- sleep 10
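One useful number to derive from perf stat is instructions per cycle (IPC). Low IPC on a busy CPU suggests stalls (cache misses, memory waits) rather than pure compute. A sketch that computes it from the counters (perf stat writes to stderr; the field names assume the default event list):

```shell
perf stat -p 1234 -- sleep 10 2> perf-stat.txt
# Strip thousands separators from the counts, then divide:
awk '
  /[0-9] +cycles/       { gsub(",", "", $1); cyc  = $1 }
  /[0-9] +instructions/ { gsub(",", "", $1); inst = $1 }
  END { if (cyc > 0) printf "IPC: %.2f\n", inst / cyc }
' perf-stat.txt
```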
Scenario 1: CPU-Bound System
Symptoms: High load average, slow response times, vmstat shows high us (user) or sy (system) CPU.
# Step 1: Confirm CPU is the bottleneck
vmstat 1 5
# Look for: us+sy near 100%, r > number of CPUs
# Step 2: Find which processes are consuming CPU
top -b -n 1 | head -20
# Or: ps aux --sort=-%cpu | head -10
# Step 3: Profile the top consumer
perf record -g -p PID -- sleep 30
perf report
# Step 4: Check if it's a single-threaded bottleneck
mpstat -P ALL 1 5
# If one core is at 100% and others are idle, it's single-threaded
Common causes: Inefficient algorithm, regex backtracking, JSON serialization of large payloads, compression, tight loops without yielding.
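When mpstat points at a single hot core, the next question is which thread owns it. A sketch using ps's thread listing (PID 1234 is a placeholder; perf's -t flag then profiles just that thread):

```shell
# List threads of PID 1234 by CPU share; under numeric sort the
# header row falls to the bottom.
ps -L -o tid,pcpu,comm -p 1234 | sort -k2 -rn | head -5
# Then profile only the hot thread by its TID:
perf record -g -t TID -- sleep 30
perf report
```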
Scenario 2: Memory-Saturated System
Symptoms: OOM kills in dmesg, swap usage climbing, processes getting killed randomly.
# Step 1: Check memory state
free -m
# total used free shared buff/cache available
# Mem: 16384 14000 200 128 2184 2100
# Swap: 8192 4000 4192
# Step 2: If swap is being used, confirm with vmstat
vmstat 1 5
# si/so > 0 means active swapping
# Step 3: Find the memory consumers
ps aux --sort=-%mem | head -10
# Step 4: Check for memory leaks over time
# Record RSS of a suspect process every minute
while true; do ps -o rss= -p 1234; sleep 60; done
# Step 5: Check OOM killer history
dmesg | grep -i "out of memory"
dmesg | grep -i "oom-kill"
Common causes: Memory leaks, oversized caches, too many worker processes, insufficient memory limits in container cgroups.
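To make the leak check in step 4 easier to graph, log timestamps alongside RSS and stop automatically when the process exits. A sketch (PID 1234 is a placeholder):

```shell
# Append "epoch_seconds rss_kB" once a minute while the process lives.
while kill -0 1234 2>/dev/null; do
  printf '%s %s\n' "$(date +%s)" "$(ps -o rss= -p 1234 | tr -d ' ')" >> rss.log
  sleep 60
done
# Steady growth in column 2 with no plateau is the classic leak shape.
```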
Scenario 3: Disk I/O Bottleneck
Symptoms: High iowait in vmstat, slow file operations, database query latency spikes.
# Step 1: Confirm disk is the bottleneck
vmstat 1 5
# Look for: wa (iowait) > 20%, b (blocked processes) > 0
# Step 2: Identify which device
iostat -xz 1 5
# Look for: %util near 100%, high await
# Step 3: Find which processes are doing the I/O
iotop -o
# Shows per-process I/O in real time
# Step 4: Check for filesystem issues
dmesg | grep -i "error"
smartctl -a /dev/sda # Check disk health
# Step 5: Profile the I/O pattern
perf record -e block:block_rq_issue -a -- sleep 10
perf report
Common causes: Database without proper indexes, logging too much, filesystem full (causes journal writes), swap thrashing, RAID rebuild in progress.
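If iotop isn't installed (it often isn't by default), pidstat from the sysstat package covers the same ground, and /proc exposes per-process counters directly:

```shell
# Per-process disk throughput, one-second samples:
pidstat -d 1 5
# Columns include kB_rd/s, kB_wr/s, and iodelay (time blocked on I/O).

# Cumulative counters for one process (PID 1234 as a placeholder):
cat /proc/1234/io
# read_bytes/write_bytes count traffic that actually reached the
# storage layer; rchar/wchar also include page-cache hits and pipes.
```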
Monitoring and Observability Costs
Catching performance issues before they become incidents requires monitoring. Here's what the tools cost:
| Tool | Type | Cost | Best For |
|---|---|---|---|
| Prometheus + Grafana | Self-hosted | Free (OSS) | Custom metrics, node-exporter for USE data |
| Datadog Infrastructure | SaaS | $15/host/month | Out-of-box dashboards, anomaly detection |
| New Relic | SaaS | Free tier, $0.30/GB after | APM plus infrastructure in one |
| Netdata | Self-hosted | Free (OSS) | Per-second granularity, zero config |
| Grafana Cloud | SaaS | Free tier (10k metrics) | Hosted Prometheus + Grafana |
Frequently Asked Questions
What is the USE method in Linux performance analysis?
The USE method checks three things for every system resource: Utilization (percent busy), Saturation (work queuing), and Errors (failed operations). Created by Brendan Gregg, it provides a systematic checklist that prevents guessing. For each resource -- CPU, memory, disk, network -- you measure all three metrics before drawing conclusions.
What does high iowait mean in vmstat or top?
High iowait means CPU cores are idle specifically because they're waiting for I/O operations (usually disk) to complete. It indicates a disk bottleneck, not a CPU problem. The fix is to reduce I/O (add indexes, reduce logging, cache more) or improve I/O performance (faster disks, RAID, SSD). Adding more CPU won't help.
How do I find which process is using the most disk I/O?
Use iotop -o to see per-process I/O in real time (the -o flag shows only processes doing I/O). If iotop isn't installed, use pidstat -d 1 from the sysstat package. You can also check /proc/PID/io for cumulative I/O stats of a specific process.
What is the difference between load average and CPU utilization?
CPU utilization measures what percentage of CPU time is actively used. Load average counts the number of processes that are either running on CPU or waiting to run (and on Linux, also waiting for I/O). A system with 4 CPUs and a load average of 4.0 is at capacity. A load of 8.0 means processes are queuing.
When should I use strace vs perf for debugging?
Use strace when you suspect the problem is in system calls -- file access, network calls, process management. Use perf when you suspect the problem is CPU-bound -- tight loops, expensive computations, cache misses. strace adds significant overhead and should not be used on high-traffic production processes. perf has minimal overhead.
How do I check if my server is swapping?
Run vmstat 1 5 and check the si (swap in) and so (swap out) columns. Any sustained non-zero values mean active swapping. Also check free -m for total swap usage and sar -W for historical swap activity. Swapping kills performance because disk is orders of magnitude slower than RAM.
What tools should I install on every Linux server proactively?
At minimum: sysstat (for sar historical data), htop (better top), iotop (per-process I/O), and strace. On production systems, add perf (CPU profiling), tcpdump (packet capture), and a monitoring agent (Prometheus node-exporter or Datadog agent). Install these before incidents, not during them.
Conclusion
Performance troubleshooting is a skill, not a talent. The USE method gives you the framework: for every resource, check utilization, saturation, and errors. The tools -- vmstat for the quick overview, iostat for disk, ss for networking, strace for system calls, perf for CPU profiling -- give you the data. Stop guessing. Measure first, then fix the actual bottleneck. And install sysstat now, because the incident always happens at 3 AM and you'll want historical data when you start investigating at 9.
Written by
Abhishek Patel
Infrastructure engineer with 10+ years building production systems on AWS, GCP, and bare metal. Writes practical guides on cloud architecture, containers, networking, and Linux for developers who want to understand how things actually work under the hood.
Related Articles
Linux File Permissions Explained: chmod, chown, and ACLs
Build the mental model for Linux file permissions from scratch. Learn chmod octal and symbolic notation, chown, umask, setuid/setgid/sticky bits, and POSIX ACLs with real-world scenarios.
12 min read
Bash Scripting Best Practices for DevOps Engineers
Write reliable bash scripts with set -euo pipefail, proper quoting, [[ ]] tests, idempotent patterns, cleanup traps, ShellCheck, and knowing when to switch to Python.
10 min read
The Linux Networking Stack: From Socket to NIC
Trace a packet through the entire Linux networking stack: socket buffers, the TCP state machine, IP routing, netfilter/iptables, traffic control, and NIC drivers with practical diagnostic tools.
10 min read