
Linux Performance Troubleshooting: A Systematic Approach

Learn the USE method for systematic Linux performance analysis. Master vmstat, iostat, sar, ss, strace, and perf with three real-world troubleshooting scenarios for CPU, memory, and disk bottlenecks.

Abhishek Patel · 10 min read


Stop Guessing, Start Measuring

A production server is slow. Someone says "it's probably the database." Someone else suggests "maybe we need more RAM." Without a systematic approach, Linux performance troubleshooting devolves into guessing, and guessing is expensive -- both in time and in wrong fixes. You add RAM to a CPU-bound problem. You upgrade disks when the real bottleneck is network latency.

The USE method gives you a framework. The tools -- vmstat, iostat, sar, ss, strace, perf -- give you data. This guide covers both, then walks through three concrete scenarios: CPU-bound, memory-saturated, and disk I/O bottleneck.

What Is the USE Method?

Definition: The USE method is a performance analysis methodology created by Brendan Gregg. For every resource (CPU, memory, disk, network), check three things: Utilization (how busy is it), Saturation (is work queuing up), and Errors (are operations failing). This structured approach prevents guessing and ensures no resource is overlooked.

How to apply the USE method for Linux performance analysis

  1. List the resources -- CPU, memory, disk I/O, network interfaces, and any specialized hardware
  2. For each resource, check Utilization -- what percentage of capacity is in use? High utilization isn't always bad, but it narrows the search
  3. For each resource, check Saturation -- is work waiting in a queue? This is often the real bottleneck. A CPU at 100% utilization with no run queue is fine; 100% with a deep run queue means processes are starved
  4. For each resource, check Errors -- are disk writes failing? Are network packets being dropped? Errors often cause retries that manifest as performance problems
  5. Correlate findings -- high disk saturation might cause high CPU iowait. Follow the chain of evidence
| Resource | Utilization | Saturation | Errors |
|----------|-------------|------------|--------|
| CPU | vmstat (us, sy), mpstat -P ALL | vmstat (r column), load average | dmesg, mcelog (machine check exceptions) |
| Memory | free -m, /proc/meminfo | vmstat (si/so for swap), dmesg OOM | dmesg (ECC errors) |
| Disk I/O | iostat -xz (%util) | iostat (avgqu-sz), iotop | smartctl, dmesg (I/O errors) |
| Network | sar -n DEV, ip -s link | ss -s (socket queues), tc -s qdisc | ip -s link (errors, drops) |
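The whole table can be swept in about a minute. The script below is a minimal sketch of that sweep: it runs each check from the table if the tool is installed and skips anything missing (the `run` helper and section labels are this sketch's own invention, not a standard tool).

```shell
#!/bin/sh
# One-pass USE sweep over the resource table. Each check runs only
# if the tool is installed; anything missing is reported and skipped.
run() {
    if command -v "$1" > /dev/null 2>&1; then
        "$@"
    else
        echo "skipped: $1 not installed"
    fi
}

echo "== CPU: utilization + saturation =="
run vmstat 1 2
echo "== Memory =="
run free -m
echo "== Disk I/O =="
run iostat -xz 1 2
echo "== Network =="
run sar -n DEV 1 1
echo "== Errors (kernel log tail) =="
run dmesg | tail -n 20
```

Save it as a first-responder script; the point is not the output formatting but that no resource gets skipped when you're under incident pressure.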

Essential Tools

vmstat: CPU and Memory at a Glance

# Print stats every 1 second, 10 times
vmstat 1 10

# Output:
# procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
#  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
#  2  0      0 524288  65536 1048576    0    0    12    48  150  300 15  3 80  2  0

Key columns:

  • r -- processes waiting for CPU (run queue). If this consistently exceeds CPU count, you're CPU-saturated
  • b -- processes in uninterruptible sleep (usually waiting for I/O)
  • si/so -- swap in/out. Sustained non-zero values mean the system is memory-constrained
  • us -- user CPU time. High means your application is busy
  • sy -- system (kernel) CPU time. High means lots of system calls or context switches
  • wa -- iowait. CPU idle because it's waiting for I/O. High means disk bottleneck
  • id -- idle. What's left over
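The CPU columns vmstat prints are derived from the counters in /proc/stat. As a sketch of what that derivation looks like, this computes overall CPU busy% from two samples taken a second apart, using nothing beyond a POSIX shell and awk:

```shell
#!/bin/sh
# Overall CPU busy% from two /proc/stat samples one second apart.
# Fields on the aggregate "cpu" line: user nice system idle iowait
# irq softirq steal ... -- idle time is idle + iowait ($5 + $6).
s1=$(head -n 1 /proc/stat); sleep 1; s2=$(head -n 1 /proc/stat)
busy_pct=$(printf '%s\n%s\n' "$s1" "$s2" | awk '
    { idle[NR] = $5 + $6                        # idle + iowait
      for (i = 2; i <= NF; i++) total[NR] += $i }
    END {
        dt = total[2] - total[1]; di = idle[2] - idle[1]
        printf "%.0f", (dt > 0) ? 100 * (dt - di) / dt : 0
    }')
echo "cpu busy: ${busy_pct}%"
```

This is what vmstat's us+sy+wa breakdown summarizes; the raw counters are cumulative jiffies, so any percentage is always a delta between two samples.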

iostat: Disk I/O Performance

# Extended stats for all devices, every 1 second
iostat -xz 1

# Key columns in output:
# Device  r/s   w/s   rkB/s  wkB/s  avgqu-sz  await  r_await  w_await  %util
# sda     45.0  120.0 1800   4800   1.2       4.2    3.1      4.6      85.0

Key columns:

  • %util -- device utilization. 100% means the device is saturated (for rotational disks; SSDs can handle more than one request at a time, so 100% doesn't always mean saturated)
  • await -- average time (ms) for I/O requests. If this spikes, the device is struggling
  • avgqu-sz -- average queue length (renamed aqu-sz in newer sysstat releases). Deep queues mean saturation
  • r/s, w/s -- reads and writes per second (IOPS)

Watch out: For SSDs and NVMe drives, %util at 100% doesn't necessarily mean the device is saturated. These devices handle multiple parallel requests. Look at await latency instead -- if it's climbing, the device is genuinely overloaded.
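One way to check queuing directly, independent of %util: the kernel exposes the in-flight request count as field 9 of /sys/block/<dev>/stat. A persistently deep in-flight count means real queuing regardless of device type. In this sketch, "sda" is only a fallback placeholder; pass your actual device name as the first argument.

```shell
#!/bin/sh
# In-flight I/O count from sysfs: field 9 of /sys/block/<dev>/stat
# is the number of requests issued to the device but not completed.
# "sda" is a placeholder default; pass your device as $1.
dev=${1:-sda}
statfile="/sys/block/$dev/stat"
if [ -r "$statfile" ]; then
    inflight=$(awk '{print $9}' "$statfile")
    echo "$dev in-flight requests: $inflight"
else
    echo "no readable stat file for device: $dev"
fi
```

Sampling this in a loop during a latency spike tells you whether requests are actually backing up at the device, which is the question %util can't answer for SSDs.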

sar: Historical Data

# CPU utilization history (from sysstat collector)
sar -u

# Memory utilization
sar -r

# Disk I/O
sar -d

# Network throughput
sar -n DEV

# Specific time range
sar -u -s 14:00:00 -e 15:00:00

# From a specific day's data file
sar -u -f /var/log/sysstat/sa15

Pro tip: Install sysstat on every server before you need it. It collects system stats every 10 minutes by default (via a cron job or timer). When an incident happened at 3 AM and you're investigating at 9 AM, sar is the only tool that has historical data. Configure it to collect every 1 minute for production systems.
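Where the collection interval lives varies by distro -- this is an assumption to verify on your system. On Debian/Ubuntu it is typically a cron entry in /etc/cron.d/sysstat (systemd-based setups use the sysstat-collect.timer unit instead). A 1-minute cadence on the cron variant looks roughly like:

```shell
# /etc/cron.d/sysstat (Debian/Ubuntu path -- check your distro;
# systemd setups override sysstat-collect.timer instead).
# The shipped default runs every 10 minutes; this runs every minute:
* * * * * root command -v debian-sa1 > /dev/null && debian-sa1 1 1
```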

ss: Socket Statistics

# Summary of all socket types
ss -s

# All TCP connections with process info
ss -tnp

# Listening sockets
ss -tlnp

# Connections in TIME_WAIT state (common bottleneck)
ss -t state time-wait | wc -l

# Connections to a specific port
ss -tn dst :5432

# Show send/receive buffer sizes
ss -tnm

strace: System Call Tracing

# Trace a running process
strace -p 1234

# Trace with timestamps and follow forks
strace -p 1234 -f -T

# Count system calls (summary)
strace -p 1234 -c

# Trace only file operations
strace -p 1234 -e trace=file

# Trace only network operations
strace -p 1234 -e trace=network

perf: CPU Profiling

# Record CPU samples for 30 seconds
perf record -g -p 1234 -- sleep 30

# View the report
perf report

# One-liner: top functions by CPU time
perf top -p 1234

# Count hardware events
perf stat -p 1234 -- sleep 10

Scenario 1: CPU-Bound System

Symptoms: High load average, slow response times, vmstat shows high us (user) or sy (system) CPU.

# Step 1: Confirm CPU is the bottleneck
vmstat 1 5
# Look for: us+sy near 100%, r > number of CPUs

# Step 2: Find which processes are consuming CPU
top -b -n 1 | head -20
# Or: ps aux --sort=-%cpu | head -10

# Step 3: Profile the top consumer
perf record -g -p PID -- sleep 30
perf report

# Step 4: Check if it's a single-threaded bottleneck
mpstat -P ALL 1 5
# If one core is at 100% and others are idle, it's single-threaded

Common causes: Inefficient algorithm, regex backtracking, JSON serialization of large payloads, compression, tight loops without yielding.

Scenario 2: Memory-Saturated System

Symptoms: OOM kills in dmesg, swap usage climbing, processes getting killed randomly.

# Step 1: Check memory state
free -m
#               total    used    free  shared  buff/cache  available
# Mem:          16384   14000     200     128        2184       2100
# Swap:          8192    4000    4192

# Step 2: If swap is being used, confirm with vmstat
vmstat 1 5
# si/so > 0 means active swapping

# Step 3: Find the memory consumers
ps aux --sort=-%mem | head -10

# Step 4: Check for memory leaks over time
# Record RSS of a suspect process every minute
while true; do ps -o rss= -p 1234; sleep 60; done

# Step 5: Check OOM killer history
dmesg | grep -i "out of memory"
dmesg | grep -i "oom-kill"

Common causes: Memory leaks, oversized caches, too many worker processes, insufficient memory limits in container cgroups.
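For a single headroom number to alert on, /proc/meminfo's MemAvailable is a better signal than "free" memory, because it is the kernel's estimate of what can be used without swapping (reclaimable page cache counts, hard-pinned memory doesn't). A minimal check:

```shell
#!/bin/sh
# Memory headroom from /proc/meminfo. MemAvailable estimates how
# much memory can be allocated without pushing the system into swap.
avail_kb=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
pct=$((avail_kb * 100 / total_kb))
echo "MemAvailable: ${pct}% of RAM"
if [ "$pct" -lt 10 ]; then
    echo "warning: under 10% headroom, OOM risk"
fi
```

The 10% threshold here is an arbitrary illustration; pick one that matches your workload's burst behavior.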

Scenario 3: Disk I/O Bottleneck

Symptoms: High iowait in vmstat, slow file operations, database query latency spikes.

# Step 1: Confirm disk is the bottleneck
vmstat 1 5
# Look for: wa (iowait) > 20%, b (blocked processes) > 0

# Step 2: Identify which device
iostat -xz 1 5
# Look for: %util near 100%, high await

# Step 3: Find which processes are doing the I/O
iotop -o
# Shows per-process I/O in real time

# Step 4: Check for filesystem issues
dmesg | grep -i "error"
smartctl -a /dev/sda  # Check disk health

# Step 5: Profile the I/O pattern
perf record -e block:block_rq_issue -a -- sleep 10
perf report

Common causes: Database without proper indexes, logging too much, filesystem full (causes journal writes), swap thrashing, RAID rebuild in progress.

Monitoring and Observability Costs

Catching performance issues before they become incidents requires monitoring. Here's what the tools cost:

| Tool | Type | Cost | Best For |
|------|------|------|----------|
| Prometheus + Grafana | Self-hosted | Free (OSS) | Custom metrics, node-exporter for USE data |
| Datadog Infrastructure | SaaS | $15/host/month | Out-of-box dashboards, anomaly detection |
| New Relic | SaaS | Free tier, $0.30/GB after | APM plus infrastructure in one |
| Netdata | Self-hosted | Free (OSS) | Per-second granularity, zero config |
| Grafana Cloud | SaaS | Free tier (10k metrics) | Hosted Prometheus + Grafana |

Frequently Asked Questions

What is the USE method in Linux performance analysis?

The USE method checks three things for every system resource: Utilization (percent busy), Saturation (work queuing), and Errors (failed operations). Created by Brendan Gregg, it provides a systematic checklist that prevents guessing. For each resource -- CPU, memory, disk, network -- you measure all three metrics before drawing conclusions.

What does high iowait mean in vmstat or top?

High iowait means CPU cores are idle specifically because they're waiting for I/O operations (usually disk) to complete. It indicates a disk bottleneck, not a CPU problem. The fix is to reduce I/O (add indexes, reduce logging, cache more) or improve I/O performance (faster disks, RAID, SSD). Adding more CPU won't help.

How do I find which process is using the most disk I/O?

Use iotop -o to see per-process I/O in real time (the -o flag shows only processes doing I/O). If iotop isn't installed, use pidstat -d 1 from the sysstat package. You can also check /proc/PID/io for cumulative I/O stats of a specific process.
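Reading /proc/PID/io directly is easy to script. A minimal sketch, using "self" so the example is self-contained (substitute a real PID in practice):

```shell
#!/bin/sh
# Cumulative storage I/O for one process from /proc/<pid>/io.
# read_bytes/write_bytes count bytes that actually hit the storage
# layer, not reads satisfied from page cache.
pid=${1:-self}
rb=$(awk '/^read_bytes:/ {print $2}' "/proc/$pid/io")
wb=$(awk '/^write_bytes:/ {print $2}' "/proc/$pid/io")
echo "pid $pid -- read: $rb bytes, written: $wb bytes"
```

Sample it twice and subtract to get a rate, which is roughly what pidstat -d does for you.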

What is the difference between load average and CPU utilization?

CPU utilization measures what percentage of CPU time is actively used. Load average counts the number of processes that are either running on CPU or waiting to run (and on Linux, also waiting for I/O). A system with 4 CPUs and a load average of 4.0 is at capacity. A load of 8.0 means processes are queuing.
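The rule of thumb above reduces to one division: load per CPU above 1.0 means runnable work is queuing, whatever the raw number looks like on a big box. A sketch reading /proc/loadavg directly:

```shell
#!/bin/sh
# Normalize the three load averages by CPU count.
# /proc/loadavg fields: 1min 5min 15min running/total last_pid
cpus=$(nproc)
set -- $(cat /proc/loadavg)
awk -v l1="$1" -v l5="$2" -v l15="$3" -v c="$cpus" 'BEGIN {
    printf "load per CPU: %.2f (1m) %.2f (5m) %.2f (15m)\n", l1/c, l5/c, l15/c
}'
```

Remember the Linux caveat from the answer above: these figures include tasks in uninterruptible I/O sleep, so a high per-CPU load can also point at a disk bottleneck rather than CPU.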

When should I use strace vs perf for debugging?

Use strace when you suspect the problem is in system calls -- file access, network calls, process management. Use perf when you suspect the problem is CPU-bound -- tight loops, expensive computations, cache misses. strace adds significant overhead and should not be used on high-traffic production processes. perf has minimal overhead.

How do I check if my server is swapping?

Run vmstat 1 5 and check the si (swap in) and so (swap out) columns. Any sustained non-zero values mean active swapping. Also check free -m for total swap usage and sar -W for historical swap activity. Swapping kills performance because disk is orders of magnitude slower than RAM.

What tools should I install on every Linux server proactively?

At minimum: sysstat (for sar historical data), htop (better top), iotop (per-process I/O), and strace. On production systems, add perf (CPU profiling), tcpdump (packet capture), and a monitoring agent (Prometheus node-exporter or Datadog agent). Install these before incidents, not during them.

Conclusion

Performance troubleshooting is a skill, not a talent. The USE method gives you the framework: for every resource, check utilization, saturation, and errors. The tools -- vmstat for the quick overview, iostat for disk, ss for networking, strace for system calls, perf for CPU profiling -- give you the data. Stop guessing. Measure first, then fix the actual bottleneck. And install sysstat now, because the incident always happens at 3 AM and you'll want historical data when you start investigating at 9.


Written by

Abhishek Patel

Infrastructure engineer with 10+ years building production systems on AWS, GCP, and bare metal. Writes practical guides on cloud architecture, containers, networking, and Linux for developers who want to understand how things actually work under the hood.
