
Linux Performance Troubleshooting: A Systematic Approach

Learn the USE method for systematic Linux performance analysis. Master vmstat, iostat, sar, ss, strace, and perf with three real-world troubleshooting scenarios for CPU, memory, and disk bottlenecks.

Abhishek Patel · 10 min read


Stop Guessing, Start Measuring

A production server is slow. Someone says "it's probably the database." Someone else suggests "maybe we need more RAM." Without a systematic approach, Linux performance troubleshooting devolves into guessing, and guessing is expensive -- both in time and in wrong fixes. You add RAM to a CPU-bound problem. You upgrade disks when the real bottleneck is network latency.

The USE method gives you a framework. The tools -- vmstat, iostat, sar, ss, strace, perf -- give you data. This guide covers both, then walks through three concrete scenarios: CPU-bound, memory-saturated, and disk I/O bottleneck.

What Is the USE Method?

Definition: The USE method is a performance analysis methodology created by Brendan Gregg. For every resource (CPU, memory, disk, network), check three things: Utilization (how busy is it), Saturation (is work queuing up), and Errors (are operations failing). This structured approach prevents guessing and ensures no resource is overlooked.

How to apply the USE method for Linux performance analysis

  1. List the resources -- CPU, memory, disk I/O, network interfaces, and any specialized hardware
  2. For each resource, check Utilization -- what percentage of capacity is in use? High utilization isn't always bad, but it narrows the search
  3. For each resource, check Saturation -- is work waiting in a queue? This is often the real bottleneck. A CPU at 100% utilization with no run queue is fine; 100% with a deep run queue means processes are starved
  4. For each resource, check Errors -- are disk writes failing? Are network packets being dropped? Errors often cause retries that manifest as performance problems
  5. Correlate findings -- high disk saturation might cause high CPU iowait. Follow the chain of evidence
| Resource | Utilization | Saturation | Errors |
|----------|-------------|------------|--------|
| CPU | vmstat (us, sy), mpstat -P ALL | vmstat (r column), load average | dmesg, mcelog (machine check exceptions) |
| Memory | free -m, /proc/meminfo | vmstat (si/so for swap), dmesg OOM | dmesg (ECC errors) |
| Disk I/O | iostat -xz (%util) | iostat (avgqu-sz), iotop | smartctl, dmesg (I/O errors) |
| Network | sar -n DEV, ip -s link | ss -s (socket queues), tc -s qdisc | ip -s link (errors, drops) |
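The whole table can be swept in about a minute. The script below is a minimal sketch of that sweep: it runs each check from the table if the tool is installed and skips anything missing (the `run` helper and section labels are this sketch's own invention, not a standard tool).

```shell
#!/bin/sh
# One-pass USE sweep over the resource table. Each check runs only
# if the tool is installed; anything missing is reported and skipped.
run() {
    if command -v "$1" > /dev/null 2>&1; then
        "$@"
    else
        echo "skipped: $1 not installed"
    fi
}

echo "== CPU: utilization + saturation =="
run vmstat 1 2
echo "== Memory =="
run free -m
echo "== Disk I/O =="
run iostat -xz 1 2
echo "== Network =="
run sar -n DEV 1 1
echo "== Errors (kernel log tail) =="
run dmesg | tail -n 20
```

Save it as a first-responder script; the point is not the output formatting but that no resource gets skipped when you're under incident pressure.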

Essential Tools

vmstat: CPU and Memory at a Glance

# Print stats every 1 second, 10 times
vmstat 1 10

# Output:
# procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
#  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
#  2  0      0 524288  65536 1048576    0    0    12    48  150  300 15  3 80  2  0

Key columns:

  • r -- processes waiting for CPU (run queue). If this consistently exceeds CPU count, you're CPU-saturated
  • b -- processes in uninterruptible sleep (usually waiting for I/O)
  • si/so -- swap in/out. Sustained non-zero values mean the system is memory-constrained
  • us -- user CPU time. High means your application is busy
  • sy -- system (kernel) CPU time. High means lots of system calls or context switches
  • wa -- iowait. CPU idle because it's waiting for I/O. High means disk bottleneck
  • id -- idle. What's left over
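The CPU columns vmstat prints are derived from the counters in /proc/stat. As a sketch of what that derivation looks like, this computes overall CPU busy% from two samples taken a second apart, using nothing beyond a POSIX shell and awk:

```shell
#!/bin/sh
# Overall CPU busy% from two /proc/stat samples one second apart.
# Fields on the aggregate "cpu" line: user nice system idle iowait
# irq softirq steal ... -- idle time is idle + iowait ($5 + $6).
s1=$(head -n 1 /proc/stat); sleep 1; s2=$(head -n 1 /proc/stat)
busy_pct=$(printf '%s\n%s\n' "$s1" "$s2" | awk '
    { idle[NR] = $5 + $6                        # idle + iowait
      for (i = 2; i <= NF; i++) total[NR] += $i }
    END {
        dt = total[2] - total[1]; di = idle[2] - idle[1]
        printf "%.0f", (dt > 0) ? 100 * (dt - di) / dt : 0
    }')
echo "cpu busy: ${busy_pct}%"
```

This is what vmstat's us+sy+wa breakdown summarizes; the raw counters are cumulative jiffies, so any percentage is always a delta between two samples.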

iostat: Disk I/O Performance

# Extended stats for all devices, every 1 second
iostat -xz 1

# Key columns in output:
# Device  r/s   w/s   rkB/s  wkB/s  avgqu-sz  await  r_await  w_await  %util
# sda     45.0  120.0 1800   4800   1.2       4.2    3.1      4.6      85.0

Key columns:

  • %util -- device utilization. 100% means the device is saturated (for rotational disks; SSDs can handle more than one request at a time, so 100% doesn't always mean saturated)
  • await -- average time (ms) for I/O requests. If this spikes, the device is struggling
  • avgqu-sz -- average queue length (renamed aqu-sz in newer sysstat releases). Deep queues mean saturation
  • r/s, w/s -- reads and writes per second (IOPS)

Watch out: For SSDs and NVMe drives, %util at 100% doesn't necessarily mean the device is saturated. These devices handle multiple parallel requests. Look at await latency instead -- if it's climbing, the device is genuinely overloaded.
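One way to check queuing directly, independent of %util: the kernel exposes the in-flight request count as field 9 of /sys/block/<dev>/stat. A persistently deep in-flight count means real queuing regardless of device type. In this sketch, "sda" is only a fallback placeholder; pass your actual device name as the first argument.

```shell
#!/bin/sh
# In-flight I/O count from sysfs: field 9 of /sys/block/<dev>/stat
# is the number of requests issued to the device but not completed.
# "sda" is a placeholder default; pass your device as $1.
dev=${1:-sda}
statfile="/sys/block/$dev/stat"
if [ -r "$statfile" ]; then
    inflight=$(awk '{print $9}' "$statfile")
    echo "$dev in-flight requests: $inflight"
else
    echo "no readable stat file for device: $dev"
fi
```

Sampling this in a loop during a latency spike tells you whether requests are actually backing up at the device, which is the question %util can't answer for SSDs.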

sar: Historical Data

# CPU utilization history (from sysstat collector)
sar -u

# Memory utilization
sar -r

# Disk I/O
sar -d

# Network throughput
sar -n DEV

# Specific time range
sar -u -s 14:00:00 -e 15:00:00

# From a specific day's data file
sar -u -f /var/log/sysstat/sa15

Pro tip: Install sysstat on every server before you need it. It collects system stats every 10 minutes by default (via a cron job or timer). When an incident happened at 3 AM and you're investigating at 9 AM, sar is the only tool that has historical data. Configure it to collect every 1 minute for production systems.
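Where the collection interval lives varies by distro -- this is an assumption to verify on your system. On Debian/Ubuntu it is typically a cron entry in /etc/cron.d/sysstat (systemd-based setups use the sysstat-collect.timer unit instead). A 1-minute cadence on the cron variant looks roughly like:

```shell
# /etc/cron.d/sysstat (Debian/Ubuntu path -- check your distro;
# systemd setups override sysstat-collect.timer instead).
# The shipped default runs every 10 minutes; this runs every minute:
* * * * * root command -v debian-sa1 > /dev/null && debian-sa1 1 1
```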

ss: Socket Statistics

# Summary of all socket types
ss -s

# All TCP connections with process info
ss -tnp

# Listening sockets
ss -tlnp

# Connections in TIME_WAIT state (common bottleneck)
ss -t state time-wait | wc -l

# Connections to a specific port
ss -tn dst :5432

# Show send/receive buffer sizes
ss -tnm

strace: System Call Tracing

# Trace a running process
strace -p 1234

# Trace with timestamps and follow forks
strace -p 1234 -f -T

# Count system calls (summary)
strace -p 1234 -c

# Trace only file operations
strace -p 1234 -e trace=file

# Trace only network operations
strace -p 1234 -e trace=network

perf: CPU Profiling

# Record CPU samples for 30 seconds
perf record -g -p 1234 -- sleep 30

# View the report
perf report

# One-liner: top functions by CPU time
perf top -p 1234

# Count hardware events
perf stat -p 1234 -- sleep 10

Scenario 1: CPU-Bound System

Symptoms: High load average, slow response times, vmstat shows high us (user) or sy (system) CPU.

# Step 1: Confirm CPU is the bottleneck
vmstat 1 5
# Look for: us+sy near 100%, r > number of CPUs

# Step 2: Find which processes are consuming CPU
top -b -n 1 | head -20
# Or: ps aux --sort=-%cpu | head -10

# Step 3: Profile the top consumer
perf record -g -p PID -- sleep 30
perf report

# Step 4: Check if it's a single-threaded bottleneck
mpstat -P ALL 1 5
# If one core is at 100% and others are idle, it's single-threaded

Common causes: Inefficient algorithm, regex backtracking, JSON serialization of large payloads, compression, tight loops without yielding.

Scenario 2: Memory-Saturated System

Symptoms: OOM kills in dmesg, swap usage climbing, processes getting killed randomly.

# Step 1: Check memory state
free -m
#               total    used    free  shared  buff/cache  available
# Mem:          16384   14000     200     128        2184       2100
# Swap:          8192    4000    4192

# Step 2: If swap is being used, confirm with vmstat
vmstat 1 5
# si/so > 0 means active swapping

# Step 3: Find the memory consumers
ps aux --sort=-%mem | head -10

# Step 4: Check for memory leaks over time
# Record RSS of a suspect process every minute
while true; do ps -o rss= -p 1234; sleep 60; done

# Step 5: Check OOM killer history
dmesg | grep -i "out of memory"
dmesg | grep -i "oom-kill"

Common causes: Memory leaks, oversized caches, too many worker processes, insufficient memory limits in container cgroups.
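For a single headroom number to alert on, /proc/meminfo's MemAvailable is a better signal than "free" memory, because it is the kernel's estimate of what can be used without swapping (reclaimable page cache counts, hard-pinned memory doesn't). A minimal check:

```shell
#!/bin/sh
# Memory headroom from /proc/meminfo. MemAvailable estimates how
# much memory can be allocated without pushing the system into swap.
avail_kb=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
pct=$((avail_kb * 100 / total_kb))
echo "MemAvailable: ${pct}% of RAM"
if [ "$pct" -lt 10 ]; then
    echo "warning: under 10% headroom, OOM risk"
fi
```

The 10% threshold here is an arbitrary illustration; pick one that matches your workload's burst behavior.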

Scenario 3: Disk I/O Bottleneck

Symptoms: High iowait in vmstat, slow file operations, database query latency spikes.

# Step 1: Confirm disk is the bottleneck
vmstat 1 5
# Look for: wa (iowait) > 20%, b (blocked processes) > 0

# Step 2: Identify which device
iostat -xz 1 5
# Look for: %util near 100%, high await

# Step 3: Find which processes are doing the I/O
iotop -o
# Shows per-process I/O in real time

# Step 4: Check for filesystem issues
dmesg | grep -i "error"
smartctl -a /dev/sda  # Check disk health

# Step 5: Profile the I/O pattern
perf record -e block:block_rq_issue -a -- sleep 10
perf report

Common causes: Database without proper indexes, logging too much, filesystem full (causes journal writes), swap thrashing, RAID rebuild in progress.

Monitoring and Observability Costs

Catching performance issues before they become incidents requires monitoring. Here's what the tools cost:

| Tool | Type | Cost | Best For |
|------|------|------|----------|
| Prometheus + Grafana | Self-hosted | Free (OSS) | Custom metrics, node-exporter for USE data |
| Datadog Infrastructure | SaaS | $15/host/month | Out-of-box dashboards, anomaly detection |
| New Relic | SaaS | Free tier, $0.30/GB after | APM plus infrastructure in one |
| Netdata | Self-hosted | Free (OSS) | Per-second granularity, zero config |
| Grafana Cloud | SaaS | Free tier (10k metrics) | Hosted Prometheus + Grafana |

Frequently Asked Questions

What is the USE method in Linux performance analysis?

The USE method checks three things for every system resource: Utilization (percent busy), Saturation (work queuing), and Errors (failed operations). Created by Brendan Gregg, it provides a systematic checklist that prevents guessing. For each resource -- CPU, memory, disk, network -- you measure all three metrics before drawing conclusions.

What does high iowait mean in vmstat or top?

High iowait means CPU cores are idle specifically because they're waiting for I/O operations (usually disk) to complete. It indicates a disk bottleneck, not a CPU problem. The fix is to reduce I/O (add indexes, reduce logging, cache more) or improve I/O performance (faster disks, RAID, SSD). Adding more CPU won't help.

How do I find which process is using the most disk I/O?

Use iotop -o to see per-process I/O in real time (the -o flag shows only processes doing I/O). If iotop isn't installed, use pidstat -d 1 from the sysstat package. You can also check /proc/PID/io for cumulative I/O stats of a specific process.
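Reading /proc/PID/io directly is easy to script. A minimal sketch, using "self" so the example is self-contained (substitute a real PID in practice):

```shell
#!/bin/sh
# Cumulative storage I/O for one process from /proc/<pid>/io.
# read_bytes/write_bytes count bytes that actually hit the storage
# layer, not reads satisfied from page cache.
pid=${1:-self}
rb=$(awk '/^read_bytes:/ {print $2}' "/proc/$pid/io")
wb=$(awk '/^write_bytes:/ {print $2}' "/proc/$pid/io")
echo "pid $pid -- read: $rb bytes, written: $wb bytes"
```

Sample it twice and subtract to get a rate, which is roughly what pidstat -d does for you.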

What is the difference between load average and CPU utilization?

CPU utilization measures what percentage of CPU time is actively used. Load average counts the number of processes that are either running on CPU or waiting to run (and on Linux, also waiting for I/O). A system with 4 CPUs and a load average of 4.0 is at capacity. A load of 8.0 means processes are queuing.
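The rule of thumb above reduces to one division: load per CPU above 1.0 means runnable work is queuing, whatever the raw number looks like on a big box. A sketch reading /proc/loadavg directly:

```shell
#!/bin/sh
# Normalize the three load averages by CPU count.
# /proc/loadavg fields: 1min 5min 15min running/total last_pid
cpus=$(nproc)
set -- $(cat /proc/loadavg)
awk -v l1="$1" -v l5="$2" -v l15="$3" -v c="$cpus" 'BEGIN {
    printf "load per CPU: %.2f (1m) %.2f (5m) %.2f (15m)\n", l1/c, l5/c, l15/c
}'
```

Remember the Linux caveat from the answer above: these figures include tasks in uninterruptible I/O sleep, so a high per-CPU load can also point at a disk bottleneck rather than CPU.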

When should I use strace vs perf for debugging?

Use strace when you suspect the problem is in system calls -- file access, network calls, process management. Use perf when you suspect the problem is CPU-bound -- tight loops, expensive computations, cache misses. strace adds significant overhead and should not be used on high-traffic production processes. perf has minimal overhead.

How do I check if my server is swapping?

Run vmstat 1 5 and check the si (swap in) and so (swap out) columns. Any sustained non-zero values mean active swapping. Also check free -m for total swap usage and sar -W for historical swap activity. Swapping kills performance because disk is orders of magnitude slower than RAM.

What tools should I install on every Linux server proactively?

At minimum: sysstat (for sar historical data), htop (better top), iotop (per-process I/O), and strace. On production systems, add perf (CPU profiling), tcpdump (packet capture), and a monitoring agent (Prometheus node-exporter or Datadog agent). Install these before incidents, not during them.

Conclusion

Performance troubleshooting is a skill, not a talent. The USE method gives you the framework: for every resource, check utilization, saturation, and errors. The tools -- vmstat for the quick overview, iostat for disk, ss for networking, strace for system calls, perf for CPU profiling -- give you the data. Stop guessing. Measure first, then fix the actual bottleneck. And install sysstat now, because the incident always happens at 3 AM and you'll want historical data when you start investigating at 9.


Written by

Abhishek Patel

Infrastructure engineer with 10+ years building production systems on AWS, GCP, and bare metal. Writes practical guides on cloud architecture, containers, networking, and Linux for developers who want to understand how things actually work under the hood.
