Can You Run LLMs Without a GPU? CPU Benchmarks & Reality Check
A deep dive into running large language models on CPUs. Includes performance benchmarks, limitations, and optimization strategies.

The GPU Shortage Made Me Try Something Different
GPU prices are absurd. A single NVIDIA A100 costs $10,000+, and cloud GPU instances burn through budgets faster than most teams expect. So the question keeps coming up: can you run LLMs on CPUs and get usable results? I spent three weeks benchmarking different models on consumer and server CPUs to find out. The short answer is yes -- with significant caveats. The long answer involves quantization, memory bandwidth bottlenecks, and knowing exactly which models are worth running on your hardware.
This isn't a theoretical discussion. I'll share actual tokens-per-second numbers, memory requirements, and the optimization techniques that make CPU inference viable for real workloads.
What Does Running an LLM on CPU Mean?
Definition: CPU inference for large language models means executing the model's forward pass entirely on the CPU using system RAM instead of GPU VRAM. This requires loading model weights into main memory and performing matrix multiplications using CPU instructions (AVX2, AVX-512, or ARM NEON) rather than GPU CUDA/tensor cores.
GPUs dominate LLM inference because they have thousands of cores optimized for parallel matrix operations. A high-end GPU like the A100 offers 2TB/s of memory bandwidth. A typical desktop CPU manages well under 100GB/s, and even a 12-channel server CPU tops out around 460GB/s. That bandwidth gap is the fundamental bottleneck for CPU inference -- LLM token generation is almost entirely memory-bandwidth-bound.
CPU vs GPU: Why the Performance Gap Exists
LLM inference has two phases: prefill (processing the input prompt) and decode (generating tokens one at a time). Prefill is compute-bound and parallelizes well on GPUs. Decode is memory-bandwidth-bound because each token requires reading the entire model weights from memory. This is why CPUs can be surprisingly competitive for decode -- they just need enough memory bandwidth.
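A back-of-the-envelope estimate shows why bandwidth is the ceiling: during decode, each generated token must stream the full set of model weights through the memory bus once, so peak tokens/sec is roughly bandwidth divided by model size. A minimal sketch (model size and bandwidth figures match the tables in this article):

```python
def max_decode_tps(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Theoretical ceiling on decode tokens/sec for a memory-bandwidth-bound
    workload: every token reads all weights from RAM once. Ignores the KV
    cache, compute time, and imperfect bandwidth utilization."""
    return bandwidth_gb_s / model_size_gb

# A 7B model at Q4_K_M occupies roughly 4.4GB.
print(max_decode_tps(83, 4.4))   # dual-channel DDR5 desktop: ~18.9
print(max_decode_tps(460, 4.4))  # 12-channel DDR5 server: ~104.5
```

The desktop ceiling (~19 t/s) lines up closely with measured results; big server chips land well under their theoretical ceiling because a single inference process rarely saturates 12 memory channels.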
| Hardware | Memory Bandwidth | Compute (FP16) | RAM/VRAM | Cost |
|---|---|---|---|---|
| NVIDIA A100 80GB | 2,039 GB/s | 312 TFLOPS | 80GB HBM2e | $10,000+ |
| NVIDIA RTX 4090 | 1,008 GB/s | 165 TFLOPS | 24GB GDDR6X | $1,600 |
| AMD EPYC 9654 (96-core) | 460 GB/s (12-ch DDR5) | ~15 TFLOPS (AVX-512) | Up to 6TB | $11,000 |
| Intel Xeon w9-3595X (60-core) | 307 GB/s (8-ch DDR5) | ~10 TFLOPS (AMX) | Up to 4TB | $7,500 |
| Apple M4 Max | 546 GB/s | ~25 TFLOPS | 128GB unified | $3,200 (MacBook) |
| AMD Ryzen 9 7950X | 83 GB/s (DDR5-5200) | ~5 TFLOPS | Up to 128GB | $550 |
Watch out: Apple Silicon's unified memory architecture blurs the CPU/GPU line. The M4 Max achieves 546 GB/s memory bandwidth shared between CPU and GPU cores. Running llama.cpp on an M4 Max with the Metal backend uses the GPU cores, not the CPU. Pure CPU-only inference on Apple Silicon is significantly slower.
Benchmark Results: Real Tokens Per Second
All benchmarks use llama.cpp (the standard for CPU inference) with Q4_K_M quantization unless noted. Tests measure decode speed (tokens per second for generation) with a 512-token prompt and 256-token output.
7B Parameter Models (Llama 3 8B, Mistral 7B)
| Hardware | Quantization | RAM Used | Tokens/sec | Time to First Token |
|---|---|---|---|---|
| RTX 4090 (GPU baseline) | Q4_K_M | 5.4GB VRAM | 105 t/s | 0.08s |
| Ryzen 9 7950X (16-core) | Q4_K_M | 5.4GB RAM | 18 t/s | 1.2s |
| Intel i7-13700K (16-core) | Q4_K_M | 5.4GB RAM | 14 t/s | 1.5s |
| AMD EPYC 9654 (96-core) | Q4_K_M | 5.4GB RAM | 32 t/s | 0.6s |
| Apple M4 Max (CPU only) | Q4_K_M | 5.4GB | 28 t/s | 0.7s |
| Raspberry Pi 5 (8GB) | Q4_K_M | 5.4GB | 2.1 t/s | 8.5s |
13B-14B Parameter Models
| Hardware | Quantization | RAM Used | Tokens/sec |
|---|---|---|---|
| RTX 4090 (GPU) | Q4_K_M | 9.2GB VRAM | 68 t/s |
| Ryzen 9 7950X | Q4_K_M | 9.2GB RAM | 10.5 t/s |
| EPYC 9654 | Q4_K_M | 9.2GB RAM | 22 t/s |
| Apple M4 Max (CPU) | Q4_K_M | 9.2GB | 19 t/s |
70B Parameter Models
| Hardware | Quantization | RAM Used | Tokens/sec |
|---|---|---|---|
| 2x RTX 4090 (GPU) | Q4_K_M | 42GB VRAM | 28 t/s |
| EPYC 9654 (96-core) | Q4_K_M | 42GB RAM | 6.5 t/s |
| Apple M4 Max 128GB (CPU) | Q4_K_M | 42GB | 8.2 t/s |
| Ryzen 9 7950X (64GB DDR5) | Q4_K_M | 42GB RAM | 3.1 t/s |
Pro tip: For CPU inference, memory bandwidth matters more than core count beyond 8-16 cores. A 16-core chip with fast DDR5-6000 memory often beats a 64-core server chip with slower DDR4. Always check your memory configuration -- dual-channel vs quad-channel makes a 2x difference.
Quantization: The Key to CPU Inference
Quantization reduces model precision from 16-bit floating point to lower bit widths, shrinking model size and speeding up inference. Without quantization, CPU inference is impractical for anything above 3B parameters.
How Quantization Works
- Start with the full-precision model -- typically FP16 or BF16, where a 7B model uses ~14GB
- Group weights into blocks -- usually 32 or 64 weights per block
- Compute a scale factor per block -- this preserves the dynamic range
- Round each weight to the nearest low-bit value -- 4-bit (Q4) means 16 possible values per weight
- Store the quantized weights and scale factors -- a Q4 model is roughly 25-30% the size of FP16
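The steps above can be sketched in a few lines. This is a toy symmetric 4-bit scheme for illustration only, not llama.cpp's actual Q4_K_M format (which adds a hierarchy of super-block scales and minimum values):

```python
import random

def quantize_q4_block(weights, block_size=32):
    """Toy symmetric 4-bit block quantization: one scale per block,
    each weight rounded to the signed 4-bit range -8..7."""
    q, scales = [], []
    for i in range(0, len(weights), block_size):
        block = weights[i:i + block_size]
        scale = max(abs(w) for w in block) / 7 or 1.0  # preserves dynamic range
        scales.append(scale)
        q.append([max(-8, min(7, round(w / scale))) for w in block])
    return q, scales

def dequantize(q, scales):
    return [v * s for block, s in zip(q, scales) for v in block]

weights = [random.gauss(0, 1) for _ in range(64)]
q, scales = quantize_q4_block(weights)
max_err = max(abs(a - b) for a, b in zip(dequantize(q, scales), weights))
print(f"max abs reconstruction error: {max_err:.3f}")
```

Each weight now costs 4 bits plus its share of one scale per 32-weight block, which is where the roughly 4.5-5 bits/weight figures in the table below come from.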
Quality impact by quantization level:
| Quantization | Bits/Weight | 7B Model Size | Quality vs FP16 | Best For |
|---|---|---|---|---|
| Q8_0 | 8.5 | 7.7GB | 99.5% -- nearly lossless | When RAM allows |
| Q6_K | 6.6 | 5.9GB | 99% -- minimal loss | Good quality/size balance |
| Q5_K_M | 5.7 | 5.1GB | 98% -- slight degradation | Recommended default |
| Q4_K_M | 4.8 | 4.4GB | 96% -- noticeable on complex tasks | Speed-optimized |
| Q3_K_M | 3.9 | 3.5GB | 90% -- visible quality loss | Very low RAM only |
| Q2_K | 3.4 | 3.0GB | 80% -- significant degradation | Not recommended |
Optimization Strategies for CPU Inference
Step 1: Maximize Memory Bandwidth
Use all available memory channels. A Ryzen system with single-channel DDR5 gets half the tokens/sec of the same chip with dual-channel. Server CPUs with 8 or 12 memory channels have a massive advantage. DDR5-6000 offers roughly 25% more theoretical bandwidth than DDR5-4800 (6000 vs 4800 MT/s).
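Peak DDR bandwidth is easy to estimate: channels multiplied by transfer rate multiplied by 8 bytes per 64-bit channel. A quick sketch:

```python
def ddr_bandwidth_gb_s(channels: int, mega_transfers_s: int) -> float:
    """Peak theoretical DDR bandwidth: each channel moves 8 bytes
    (64 bits) per transfer. Real-world throughput runs lower."""
    return channels * mega_transfers_s * 8 / 1000

print(ddr_bandwidth_gb_s(1, 6000))   # single-channel DDR5-6000: 48.0
print(ddr_bandwidth_gb_s(2, 6000))   # dual-channel DDR5-6000: 96.0
print(ddr_bandwidth_gb_s(12, 4800))  # 12-channel server DDR5-4800: 460.8
```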
Step 2: Use the Right Instruction Set
Ensure llama.cpp is compiled with AVX2 (minimum) or AVX-512 support. AVX-512 provides 20-30% speedup over AVX2 on Intel Xeon and AMD EPYC processors. Check with lscpu | grep avx on Linux. On ARM, NEON and SVE2 instructions are used automatically.
Step 3: Pin Threads to Physical Cores
Hyperthreading/SMT doesn't help LLM inference -- it can actually hurt performance. Set thread count to physical core count: ./llama-cli -t 16 for a 16-core CPU. On NUMA systems, pin to a single socket with numactl --cpunodebind=0.
Step 4: Use Memory-Mapped Loading
llama.cpp uses mmap by default, which lets the OS manage model loading efficiently. Don't disable this unless you have a specific reason. It allows the OS to page parts of the model in and out if you're memory-constrained.
Step 5: Consider Speculative Decoding
Use a small draft model (1-3B parameters) to propose tokens, then verify them in batch with the larger model. This can improve effective tokens/sec by 1.5-2x on CPU, since batch verification parallelizes better than sequential decoding.
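The expected gain can be modeled with a short calculation. The acceptance rate and draft-cost ratio below are assumed figures for a small draft paired with a 7B+ target, not measurements:

```python
def speculative_speedup(accept_rate: float, draft_len: int,
                        draft_cost: float = 0.15) -> float:
    """Idealized speedup from speculative decoding.
    accept_rate: chance the target model agrees with each draft token.
    draft_cost: one draft forward pass relative to one target pass."""
    # Expected tokens per verification step: the accepted prefix plus one
    # token the target model always contributes from its own distribution.
    expected_tokens = sum(accept_rate ** i for i in range(draft_len + 1))
    step_cost = draft_len * draft_cost + 1.0  # drafts + one batched verify
    return expected_tokens / step_cost

print(round(speculative_speedup(0.8, 4), 2))  # ~2.1x at an 80% accept rate
```

Note the technique only pays off when the draft model agrees with the target often; with a low acceptance rate the draft passes are wasted work and throughput drops below 1x.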
Cost Comparison: CPU vs GPU vs Cloud
| Setup | Hardware Cost | Monthly Opex (power/rental) | Tokens/sec (7B Q4) | Cost per 1M Tokens (3-yr amortization, 24/7) |
|---|---|---|---|---|
| Ryzen 9 + 64GB DDR5 | $1,200 | $25 | 18 t/s | $1.25 |
| RTX 4090 + system | $2,800 | $45 | 105 t/s | $0.45 |
| EPYC server (96-core) | $18,000 | $80 | 32 t/s | $7.00 |
| A100 cloud (on-demand) | $0 | $2,200 (instance) | 150 t/s | $5.66 |
| GPT-4o API | $0 | $0 | N/A | $2.50 (input) |
Pro tip: CPU inference becomes cost-effective when your usage is consistent and moderate -- say 10,000-100,000 tokens per day. Below that, API calls are cheaper. Above that, a GPU pays for itself quickly. The sweet spot for CPU-only is development, testing, and low-traffic internal tools.
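One way to estimate the cost per million tokens of owned hardware -- a sketch assuming a 3-year linear write-off and continuous 24/7 generation, which is the best case for amortization:

```python
def cost_per_1m_tokens(hw_cost: float, monthly_opex: float,
                       tokens_per_sec: float,
                       lifetime_months: int = 36) -> float:
    """Amortized $ per 1M generated tokens at 24/7 utilization.
    Hardware is written off linearly over lifetime_months."""
    tokens_per_month = tokens_per_sec * 86400 * 30
    monthly_cost = hw_cost / lifetime_months + monthly_opex
    return monthly_cost / tokens_per_month * 1_000_000

print(round(cost_per_1m_tokens(1200, 25, 18), 2))   # Ryzen 9 desktop: 1.25
print(round(cost_per_1m_tokens(2800, 45, 105), 2))  # RTX 4090 system: 0.45
```

At lower utilization the amortized hardware cost spreads across fewer tokens, so light workloads push the per-token price up sharply -- which is exactly why APIs win at low volume.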
Practical Use Cases for CPU Inference
- Local development and testing -- run a 7B model on your development machine without needing a GPU. 18 tokens/sec is perfectly usable for testing prompts and pipelines.
- Edge deployment -- embedded systems, IoT gateways, or air-gapped environments where GPUs aren't available. A Raspberry Pi 5 can run a 3B model at conversational speeds.
- Batch processing overnight -- if latency doesn't matter, a CPU can churn through thousands of requests. 18 t/s on a 7B model processes ~1.5M tokens per day.
- Privacy-sensitive applications -- healthcare, legal, and financial use cases where data can't leave the premises. CPU servers are available in any data center.
- Fallback infrastructure -- use CPU inference as a degraded-mode fallback when GPU instances are unavailable or during scaling events.
Frequently Asked Questions
Can you run ChatGPT-level models on CPU?
Not the full GPT-4 class models -- those have hundreds of billions of parameters and require massive GPU clusters. However, open-source models like Llama 3 8B and Mistral 7B run on consumer CPUs and produce quality comparable to GPT-3.5 for many tasks. With quantization, a 7B model needs just 4-5GB of RAM and generates 14-18 tokens per second on modern desktop CPUs, which is fast enough for interactive use.
How much RAM do I need for CPU inference?
For a Q4-quantized model, multiply the parameter count by 0.6-0.7 to get the approximate RAM requirement in GB. A 7B model needs about 5GB, a 13B model needs 9GB, and a 70B model needs 42GB. You'll also need 1-2GB overhead for the KV cache and application. For a 7B model, 16GB of system RAM is comfortable. For 70B, you'll need at least 64GB.
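That rule of thumb as a small helper -- weights plus a fixed overhead allowance; an approximation for planning, not an exact sizing:

```python
def ram_needed_gb(params_billions: float, bytes_per_param: float = 0.65,
                  overhead_gb: float = 1.5) -> float:
    """Rough RAM estimate for a Q4-quantized model: ~0.6-0.7 bytes per
    parameter after quantization, plus KV cache and runtime overhead."""
    return params_billions * bytes_per_param + overhead_gb

for size in (7, 13, 70):
    print(f"{size}B -> ~{ram_needed_gb(size):.0f}GB")
```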
Is llama.cpp the only option for CPU inference?
llama.cpp is the most popular and optimized option, but alternatives exist. Ollama wraps llama.cpp in a user-friendly interface. vLLM supports CPU inference with OpenVINO backend. CTranslate2 offers optimized CPU inference for specific model architectures. For Python-native workflows, ctransformers provides bindings. llama.cpp remains the performance leader for pure CPU inference across the widest range of models.
Does CPU inference quality differ from GPU inference?
At the same precision (FP16), CPU and GPU inference produce identical outputs -- the math is the same. The quality difference comes from quantization, not the hardware. A Q4-quantized model on CPU produces the same output as a Q4 model on GPU. If you run FP16 on both (assuming enough RAM), the outputs are bit-for-bit identical. The trade-off is purely speed, not quality.
Can I split inference between CPU and GPU?
Yes. llama.cpp supports offloading specific layers to the GPU while keeping others on CPU. If you have a GPU with 8GB VRAM and a 13B model that needs 9GB, you can offload 80% of layers to GPU and keep the rest on CPU. This gives you most of the GPU speed benefit. Use the -ngl flag to control how many layers are offloaded to GPU.
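A rough way to size the split, assuming weights divide evenly across layers -- a hypothetical helper for illustration; real llama.cpp layer sizes vary, and the KV cache also consumes VRAM:

```python
def layers_to_offload(vram_gb: float, model_gb: float, n_layers: int,
                      reserve_gb: float = 1.0) -> int:
    """How many transformer layers fit in VRAM, leaving headroom
    for the KV cache and runtime."""
    per_layer_gb = model_gb / n_layers
    return min(n_layers, int((vram_gb - reserve_gb) / per_layer_gb))

# 13B model at Q4 (~9.2GB, 40 layers) on an 8GB card:
n = layers_to_offload(8, 9.2, 40)
print(n, f"({n / 40:.0%} of layers)")  # pass this count via -ngl
```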
What about Intel's AMX and Gaudi accelerators?
Intel's Advanced Matrix Extensions (AMX) in 4th and 5th gen Xeon processors accelerate INT8/BF16 matrix operations, improving inference speed by 2-3x over AVX-512 for quantized models. Intel Gaudi 2 accelerators are dedicated AI chips competing with NVIDIA GPUs at lower price points. Both are viable options, but software ecosystem maturity (drivers, framework support) still lags behind NVIDIA CUDA significantly.
The Verdict
CPU inference is real, practical, and getting better every quarter as quantization techniques improve and CPUs gain dedicated AI instructions. For 7B-13B models, a modern desktop CPU delivers interactive speeds that are perfectly usable for development, internal tools, and moderate-traffic applications. For 70B+ models, you'll want server-grade hardware with maximum memory bandwidth.
Don't dismiss CPU inference as a toy. But don't expect it to replace GPUs for high-throughput production workloads either. The right approach is matching your hardware to your actual requirements -- and for a surprising number of use cases, your existing CPU is enough to get started today.
Written by
Abhishek Patel
Infrastructure engineer with 10+ years building production systems on AWS, GCP, and bare metal. Writes practical guides on cloud architecture, containers, networking, and Linux for developers who want to understand how things actually work under the hood.