Can You Run LLMs Without a GPU? CPU Benchmarks & Reality Check
A deep dive into running large language models on CPUs. Includes performance benchmarks, limitations, and optimization strategies.

The GPU Shortage Made Me Try Something Different
GPU prices are absurd. A single NVIDIA A100 costs $10,000+, and cloud GPU instances burn through budgets faster than most teams expect. So the question keeps coming up: can you run LLMs on CPUs and get usable results? I spent three weeks benchmarking different models on consumer and server CPUs to find out. The short answer is yes -- with significant caveats. The long answer involves quantization, memory bandwidth bottlenecks, and knowing exactly which models are worth running on your hardware.
This isn't a theoretical discussion. I'll share actual tokens-per-second numbers, memory requirements, and the optimization techniques that make CPU inference viable for real workloads.
What Does Running an LLM on CPU Mean?
Definition: CPU inference for large language models means executing the model's forward pass entirely on the CPU using system RAM instead of GPU VRAM. This requires loading model weights into main memory and performing matrix multiplications using CPU instructions (AVX2, AVX-512, or ARM NEON) rather than GPU CUDA/tensor cores.
GPUs dominate LLM inference because they have thousands of cores optimized for parallel matrix operations. A high-end GPU like the A100 offers 2TB/s of memory bandwidth. A typical desktop CPU manages well under 100GB/s, and even a 12-channel server CPU tops out around 460GB/s. That bandwidth gap is the fundamental bottleneck for CPU inference -- LLM token generation is almost entirely memory-bandwidth-bound.
CPU vs GPU: Why the Performance Gap Exists
LLM inference has two phases: prefill (processing the input prompt) and decode (generating tokens one at a time). Prefill is compute-bound and parallelizes well on GPUs. Decode is memory-bandwidth-bound because each token requires reading the entire model weights from memory. This is why CPUs can be surprisingly competitive for decode -- they just need enough memory bandwidth.
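A back-of-the-envelope estimate shows why bandwidth is the ceiling: during decode, each generated token must stream the full set of model weights through the memory bus once, so peak tokens/sec is roughly bandwidth divided by model size. A minimal sketch (model size and bandwidth figures match the tables in this article):

```python
def max_decode_tps(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Theoretical ceiling on decode tokens/sec for a memory-bandwidth-bound
    workload: every token reads all weights from RAM once. Ignores the KV
    cache, compute time, and imperfect bandwidth utilization."""
    return bandwidth_gb_s / model_size_gb

# A 7B model at Q4_K_M occupies roughly 4.4GB.
print(max_decode_tps(83, 4.4))   # dual-channel DDR5 desktop: ~18.9
print(max_decode_tps(460, 4.4))  # 12-channel DDR5 server: ~104.5
```

The desktop ceiling (~19 t/s) lines up closely with measured results; big server chips land well under their theoretical ceiling because a single inference process rarely saturates 12 memory channels.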
| Hardware | Memory Bandwidth | Compute (FP16) | RAM/VRAM | Cost |
|---|---|---|---|---|
| NVIDIA A100 80GB | 2,039 GB/s | 312 TFLOPS | 80GB HBM2e | $10,000+ |
| NVIDIA RTX 4090 | 1,008 GB/s | 165 TFLOPS | 24GB GDDR6X | $1,600 |
| AMD EPYC 9654 (96-core) | 460 GB/s (12-ch DDR5) | ~15 TFLOPS (AVX-512) | Up to 6TB | $11,000 |
| Intel Xeon w9-3595X (60-core) | 307 GB/s (8-ch DDR5) | ~10 TFLOPS (AMX) | Up to 4TB | $7,500 |
| Apple M4 Max | 546 GB/s | ~25 TFLOPS | 128GB unified | $3,200 (MacBook) |
| AMD Ryzen 9 7950X | 83 GB/s (DDR5-5200) | ~5 TFLOPS | Up to 128GB | $550 |
Watch out: Apple Silicon's unified memory architecture blurs the CPU/GPU line. The M4 Max achieves 546 GB/s memory bandwidth shared between CPU and GPU cores. Running llama.cpp on an M4 Max with the Metal backend uses the GPU cores, not the CPU. Pure CPU-only inference on Apple Silicon is significantly slower.
Benchmark Results: Real Tokens Per Second
All benchmarks use llama.cpp (the standard for CPU inference) with Q4_K_M quantization unless noted. Tests measure decode speed (tokens per second for generation) with a 512-token prompt and 256-token output.
7B Parameter Models (Llama 3 8B, Mistral 7B)
| Hardware | Quantization | RAM Used | Tokens/sec | Time to First Token |
|---|---|---|---|---|
| RTX 4090 (GPU baseline) | Q4_K_M | 5.4GB VRAM | 105 t/s | 0.08s |
| Ryzen 9 7950X (16-core) | Q4_K_M | 5.4GB RAM | 18 t/s | 1.2s |
| Intel i7-13700K (16-core) | Q4_K_M | 5.4GB RAM | 14 t/s | 1.5s |
| AMD EPYC 9654 (96-core) | Q4_K_M | 5.4GB RAM | 32 t/s | 0.6s |
| Apple M4 Max (CPU only) | Q4_K_M | 5.4GB | 28 t/s | 0.7s |
| Raspberry Pi 5 (8GB) | Q4_K_M | 5.4GB | 2.1 t/s | 8.5s |
13B-14B Parameter Models
| Hardware | Quantization | RAM Used | Tokens/sec |
|---|---|---|---|
| RTX 4090 (GPU) | Q4_K_M | 9.2GB VRAM | 68 t/s |
| Ryzen 9 7950X | Q4_K_M | 9.2GB RAM | 10.5 t/s |
| EPYC 9654 | Q4_K_M | 9.2GB RAM | 22 t/s |
| Apple M4 Max (CPU) | Q4_K_M | 9.2GB | 19 t/s |
70B Parameter Models
| Hardware | Quantization | RAM Used | Tokens/sec |
|---|---|---|---|
| 2x RTX 4090 (GPU) | Q4_K_M | 42GB VRAM | 28 t/s |
| EPYC 9654 (96-core) | Q4_K_M | 42GB RAM | 6.5 t/s |
| Apple M4 Max 128GB (CPU) | Q4_K_M | 42GB | 8.2 t/s |
| Ryzen 9 7950X (64GB DDR5) | Q4_K_M | 42GB RAM | 3.1 t/s |
Pro tip: For CPU inference, memory bandwidth matters more than core count beyond 8-16 cores. A 16-core chip with fast DDR5-6000 memory often beats a 64-core server chip with slower DDR4. Always check your memory configuration -- dual-channel vs quad-channel makes a 2x difference.
Quantization: The Key to CPU Inference
Quantization reduces model precision from 16-bit floating point to lower bit widths, shrinking model size and speeding up inference. Without quantization, CPU inference is impractical for anything above 3B parameters.
How Quantization Works
- Start with the full-precision model -- typically FP16 or BF16, where a 7B model uses ~14GB
- Group weights into blocks -- usually 32 or 64 weights per block
- Compute a scale factor per block -- this preserves the dynamic range
- Round each weight to the nearest low-bit value -- 4-bit (Q4) means 16 possible values per weight
- Store the quantized weights and scale factors -- a Q4 model is roughly 25-30% the size of FP16
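The steps above can be sketched in a few lines. This is a toy symmetric 4-bit scheme for illustration only, not llama.cpp's actual Q4_K_M format (which adds a hierarchy of super-block scales and minimum values):

```python
import random

def quantize_q4_block(weights, block_size=32):
    """Toy symmetric 4-bit block quantization: one scale per block,
    each weight rounded to the signed 4-bit range -8..7."""
    q, scales = [], []
    for i in range(0, len(weights), block_size):
        block = weights[i:i + block_size]
        scale = max(abs(w) for w in block) / 7 or 1.0  # preserves dynamic range
        scales.append(scale)
        q.append([max(-8, min(7, round(w / scale))) for w in block])
    return q, scales

def dequantize(q, scales):
    return [v * s for block, s in zip(q, scales) for v in block]

weights = [random.gauss(0, 1) for _ in range(64)]
q, scales = quantize_q4_block(weights)
max_err = max(abs(a - b) for a, b in zip(dequantize(q, scales), weights))
print(f"max abs reconstruction error: {max_err:.3f}")
```

Each weight now costs 4 bits plus its share of one scale per 32-weight block, which is where the roughly 4.5-5 bits/weight figures in the table below come from.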
Quality impact by quantization level:
| Quantization | Bits/Weight | 7B Model Size | Quality vs FP16 | Best For |
|---|---|---|---|---|
| Q8_0 | 8.5 | 7.7GB | 99.5% -- nearly lossless | When RAM allows |
| Q6_K | 6.6 | 5.9GB | 99% -- minimal loss | Good quality/size balance |
| Q5_K_M | 5.7 | 5.1GB | 98% -- slight degradation | Recommended default |
| Q4_K_M | 4.8 | 4.4GB | 96% -- noticeable on complex tasks | Speed-optimized |
| Q3_K_M | 3.9 | 3.5GB | 90% -- visible quality loss | Very low RAM only |
| Q2_K | 3.4 | 3.0GB | 80% -- significant degradation | Not recommended |
Optimization Strategies for CPU Inference
Step 1: Maximize Memory Bandwidth
Use all available memory channels. A Ryzen system with single-channel DDR5 gets half the tokens/sec of the same chip with dual-channel. Server CPUs with 8 or 12 memory channels have a massive advantage. DDR5-6000 offers roughly 25% more theoretical bandwidth than DDR5-4800 (6000 vs 4800 MT/s).
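Peak DDR bandwidth is easy to estimate: channels multiplied by transfer rate multiplied by 8 bytes per 64-bit channel. A quick sketch:

```python
def ddr_bandwidth_gb_s(channels: int, mega_transfers_s: int) -> float:
    """Peak theoretical DDR bandwidth: each channel moves 8 bytes
    (64 bits) per transfer. Real-world throughput runs lower."""
    return channels * mega_transfers_s * 8 / 1000

print(ddr_bandwidth_gb_s(1, 6000))   # single-channel DDR5-6000: 48.0
print(ddr_bandwidth_gb_s(2, 6000))   # dual-channel DDR5-6000: 96.0
print(ddr_bandwidth_gb_s(12, 4800))  # 12-channel server DDR5-4800: 460.8
```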
Step 2: Use the Right Instruction Set
Ensure llama.cpp is compiled with AVX2 (minimum) or AVX-512 support. AVX-512 provides 20-30% speedup over AVX2 on Intel Xeon and AMD EPYC processors. Check with lscpu | grep avx on Linux. On ARM, NEON and SVE2 instructions are used automatically.
Step 3: Pin Threads to Physical Cores
Hyperthreading/SMT doesn't help LLM inference -- it can actually hurt performance. Set thread count to physical core count: ./llama-cli -t 16 for a 16-core CPU. On NUMA systems, pin to a single socket with numactl --cpunodebind=0.
Step 4: Use Memory-Mapped Loading
llama.cpp uses mmap by default, which lets the OS manage model loading efficiently. Don't disable this unless you have a specific reason. It allows the OS to page parts of the model in and out if you're memory-constrained.
Step 5: Consider Speculative Decoding
Use a small draft model (1-3B parameters) to propose tokens, then verify them in batch with the larger model. This can improve effective tokens/sec by 1.5-2x on CPU, since batch verification parallelizes better than sequential decoding.
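The expected gain can be modeled with a short calculation. The acceptance rate and draft-cost ratio below are assumed figures for a small draft paired with a 7B+ target, not measurements:

```python
def speculative_speedup(accept_rate: float, draft_len: int,
                        draft_cost: float = 0.15) -> float:
    """Idealized speedup from speculative decoding.
    accept_rate: chance the target model agrees with each draft token.
    draft_cost: one draft forward pass relative to one target pass."""
    # Expected tokens per verification step: the accepted prefix plus one
    # token the target model always contributes from its own distribution.
    expected_tokens = sum(accept_rate ** i for i in range(draft_len + 1))
    step_cost = draft_len * draft_cost + 1.0  # drafts + one batched verify
    return expected_tokens / step_cost

print(round(speculative_speedup(0.8, 4), 2))  # ~2.1x at an 80% accept rate
```

Note the technique only pays off when the draft model agrees with the target often; with a low acceptance rate the draft passes are wasted work and throughput drops below 1x.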
Cost Comparison: CPU vs GPU vs Cloud
| Setup | Hardware Cost | Monthly Opex (power/rental) | Tokens/sec (7B Q4) | Cost per 1M Tokens (3-yr amortization, 24/7) |
|---|---|---|---|---|
| Ryzen 9 + 64GB DDR5 | $1,200 | $25 | 18 t/s | $1.25 |
| RTX 4090 + system | $2,800 | $45 | 105 t/s | $0.45 |
| EPYC server (96-core) | $18,000 | $80 | 32 t/s | $7.00 |
| A100 cloud (on-demand) | $0 | $2,200 (instance) | 150 t/s | $5.66 |
| GPT-4o API | $0 | $0 | N/A | $2.50 (input) |
Pro tip: CPU inference becomes cost-effective when your usage is consistent and moderate -- say 10,000-100,000 tokens per day. Below that, API calls are cheaper. Above that, a GPU pays for itself quickly. The sweet spot for CPU-only is development, testing, and low-traffic internal tools.
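One way to estimate the cost per million tokens of owned hardware -- a sketch assuming a 3-year linear write-off and continuous 24/7 generation, which is the best case for amortization:

```python
def cost_per_1m_tokens(hw_cost: float, monthly_opex: float,
                       tokens_per_sec: float,
                       lifetime_months: int = 36) -> float:
    """Amortized $ per 1M generated tokens at 24/7 utilization.
    Hardware is written off linearly over lifetime_months."""
    tokens_per_month = tokens_per_sec * 86400 * 30
    monthly_cost = hw_cost / lifetime_months + monthly_opex
    return monthly_cost / tokens_per_month * 1_000_000

print(round(cost_per_1m_tokens(1200, 25, 18), 2))   # Ryzen 9 desktop: 1.25
print(round(cost_per_1m_tokens(2800, 45, 105), 2))  # RTX 4090 system: 0.45
```

At lower utilization the amortized hardware cost spreads across fewer tokens, so light workloads push the per-token price up sharply -- which is exactly why APIs win at low volume.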
Practical Use Cases for CPU Inference
- Local development and testing -- run a 7B model on your development machine without needing a GPU. 18 tokens/sec is perfectly usable for testing prompts and pipelines.
- Edge deployment -- embedded systems, IoT gateways, or air-gapped environments where GPUs aren't available. A Raspberry Pi 5 can run a 3B model at conversational speeds.
- Batch processing overnight -- if latency doesn't matter, a CPU can churn through thousands of requests. 18 t/s on a 7B model processes ~1.5M tokens per day.
- Privacy-sensitive applications -- healthcare, legal, and financial use cases where data can't leave the premises. CPU servers are available in any data center.
- Fallback infrastructure -- use CPU inference as a degraded-mode fallback when GPU instances are unavailable or during scaling events.
Frequently Asked Questions
Can you run ChatGPT-level models on CPU?
Not the full GPT-4 class models -- those have hundreds of billions of parameters and require massive GPU clusters. However, open-source models like Llama 3 8B and Mistral 7B run on consumer CPUs and produce quality comparable to GPT-3.5 for many tasks. With quantization, a 7B model needs just 4-5GB of RAM and generates 14-18 tokens per second on modern desktop CPUs, which is fast enough for interactive use.
How much RAM do I need for CPU inference?
For a Q4-quantized model, multiply the parameter count by 0.6-0.7 to get the approximate RAM requirement in GB. A 7B model needs about 5GB, a 13B model needs 9GB, and a 70B model needs 42GB. You'll also need 1-2GB overhead for the KV cache and application. For a 7B model, 16GB of system RAM is comfortable. For 70B, you'll need at least 64GB.
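That rule of thumb as a small helper -- weights plus a fixed overhead allowance; an approximation for planning, not an exact sizing:

```python
def ram_needed_gb(params_billions: float, bytes_per_param: float = 0.65,
                  overhead_gb: float = 1.5) -> float:
    """Rough RAM estimate for a Q4-quantized model: ~0.6-0.7 bytes per
    parameter after quantization, plus KV cache and runtime overhead."""
    return params_billions * bytes_per_param + overhead_gb

for size in (7, 13, 70):
    print(f"{size}B -> ~{ram_needed_gb(size):.0f}GB")
```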
Is llama.cpp the only option for CPU inference?
llama.cpp is the most popular and optimized option, but alternatives exist. Ollama wraps llama.cpp in a user-friendly interface. vLLM supports CPU inference with OpenVINO backend. CTranslate2 offers optimized CPU inference for specific model architectures. For Python-native workflows, ctransformers provides bindings. llama.cpp remains the performance leader for pure CPU inference across the widest range of models.
Does CPU inference quality differ from GPU inference?
At the same precision (FP16), CPU and GPU inference produce identical outputs -- the math is the same. The quality difference comes from quantization, not the hardware. A Q4-quantized model on CPU produces the same output as a Q4 model on GPU. If you run FP16 on both (assuming enough RAM), the outputs are bit-for-bit identical. The trade-off is purely speed, not quality.
Can I split inference between CPU and GPU?
Yes. llama.cpp supports offloading specific layers to the GPU while keeping others on CPU. If you have a GPU with 8GB VRAM and a 13B model that needs 9GB, you can offload 80% of layers to GPU and keep the rest on CPU. This gives you most of the GPU speed benefit. Use the -ngl flag to control how many layers are offloaded to GPU.
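A rough way to size the split, assuming weights divide evenly across layers -- a hypothetical helper for illustration; real llama.cpp layer sizes vary, and the KV cache also consumes VRAM:

```python
def layers_to_offload(vram_gb: float, model_gb: float, n_layers: int,
                      reserve_gb: float = 1.0) -> int:
    """How many transformer layers fit in VRAM, leaving headroom
    for the KV cache and runtime."""
    per_layer_gb = model_gb / n_layers
    return min(n_layers, int((vram_gb - reserve_gb) / per_layer_gb))

# 13B model at Q4 (~9.2GB, 40 layers) on an 8GB card:
n = layers_to_offload(8, 9.2, 40)
print(n, f"({n / 40:.0%} of layers)")  # pass this count via -ngl
```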
What about Intel's AMX and Gaudi accelerators?
Intel's Advanced Matrix Extensions (AMX) in 4th and 5th gen Xeon processors accelerate INT8/BF16 matrix operations, improving inference speed by 2-3x over AVX-512 for quantized models. Intel Gaudi 2 accelerators are dedicated AI chips competing with NVIDIA GPUs at lower price points. Both are viable options, but software ecosystem maturity (drivers, framework support) still lags behind NVIDIA CUDA significantly.
The Verdict
CPU inference is real, practical, and getting better every quarter as quantization techniques improve and CPUs gain dedicated AI instructions. For 7B-13B models, a modern desktop CPU delivers interactive speeds that are perfectly usable for development, internal tools, and moderate-traffic applications. For 70B+ models, you'll want server-grade hardware with maximum memory bandwidth.
Don't dismiss CPU inference as a toy. But don't expect it to replace GPUs for high-throughput production workloads either. The right approach is matching your hardware to your actual requirements -- and for a surprising number of use cases, your existing CPU is enough to get started today.
Written by
Abhishek Patel
Infrastructure engineer with 10+ years building production systems on AWS, GCP, and bare metal. Writes practical guides on cloud architecture, containers, networking, and Linux for developers who want to understand how things actually work under the hood.