
Qwen 3.5 VRAM Requirements: Every Model Size & Quantization

Full VRAM matrix for every Qwen 3.5 model from 0.5B to 397B across 8 quantization levels. GPU tier picks, CPU/RAM fallback, llama.cpp and vLLM launch flags.

Abhishek Patel · 16 min read



Qwen 3.5 VRAM: The One Number That Decides Which Model You Can Run

VRAM is the hard constraint for local inference. If the weights don't fit, the model doesn't run at full speed -- it spills to system RAM and crawls, or it refuses to load. The Qwen 3.5 family now spans eight dense sizes from 0.5B up to 72B, plus three Mixture-of-Experts variants (35B-A3B, 122B-A10B, 397B-A17B), which means the VRAM budget you actually need moves from 400 MB on a laptop GPU to 230 GB on a multi-node cluster.

I've been running these models across RTX 4060, RTX 4090, M3 Max, and rented A100/H100 instances since the Qwen 3.5 launch. The numbers below are measured from llama.cpp, vLLM, and MLX on real hardware. The gotchas are rarely what the model card says: KV cache scales brutally with context, MoE variants need the full weight file in VRAM even though only a slice activates per token, and "just quantize harder" stops working around Q3_K_M on smaller models. This is the full VRAM matrix, every quantization, every tier of GPU, and the CPU fallback table.

Last updated: April 2026 -- verified model availability on Hugging Face, current llama.cpp and vLLM quantization support, and GPU street prices on NVIDIA / AMD consumer channels.

Qwen 3.5 Model Lineup: Dense and MoE at a Glance

Definition: VRAM is the dedicated high-bandwidth memory on a GPU. For Qwen 3.5 inference, the full weight file plus the KV cache must reside in VRAM to hit full decode speed. When the combined footprint exceeds available VRAM, llama.cpp spills layers to system RAM over PCIe and tokens/sec drops 10-30x. Budget 1-2 GB of headroom above the weight file for KV cache and framework overhead.

Qwen 3.5 ships two parallel tracks. Dense models run every parameter for every token and scale linearly in VRAM. MoE variants load all experts into VRAM but activate only a subset per forward pass -- that's where MoE speed comes from. Plan hardware around the total parameter count for MoE, not the active count.

| Variant | Total Params | Active Params | Type | Context | Primary Use Case |
|---|---|---|---|---|---|
| Qwen 3.5 0.5B | 0.5B | 0.5B | Dense | 32K | On-device, edge, Raspberry Pi 5 |
| Qwen 3.5 1.5B | 1.5B | 1.5B | Dense | 32K | Browser / mobile / tiny GPUs |
| Qwen 3.5 3B | 3B | 3B | Dense | 128K | Laptop CPU, 4 GB GPUs |
| Qwen 3.5 7B | 7B | 7B | Dense | 128K | Mid-range consumer GPU |
| Qwen 3.5 9B | 9B | 9B | Dense | 128K | RTX 3060-3080, M-series Macs |
| Qwen 3.5 14B | 14B | 14B | Dense | 128K | RTX 4070 Ti Super / 4080 |
| Qwen 3.5 32B | 32B | 32B | Dense | 128K | RTX 4090 / 5090 single card |
| Qwen 3.5 72B | 72B | 72B | Dense | 128K | 48 GB+ pro cards or dual 4090 |
| Qwen 3.5 35B-A3B | 35B | 3B | MoE | 128K | Single 24 GB GPU with speed |
| Qwen 3.5 122B-A10B | 122B | 10B | MoE | 128K | Mac Studio 128 GB, 2x 48 GB |
| Qwen 3.5 397B-A17B | 397B | 17B | MoE | 256K | 8x H100 / 8x H200 cluster |

The 9B variant is the most common local target and earns its own walkthrough in Run Qwen 3.5 9B on 64GB RAM, which covers CPU setup, Ollama, and llama.cpp tuning in depth. If you're picking a GPU to run any of these, the best GPU for LLMs benchmarks has measured tok/s across RTX 4060 through H100 that maps directly to the tables below.

Full VRAM Matrix: Every Model, Every Quantization

These figures are weight-file sizes only -- the amount of VRAM the model needs just to load. Add KV cache and framework overhead on top (see the next section). Numbers are measured from the GGUF files on the Hugging Face Qwen model cards using llama.cpp, and cross-checked against Unsloth's dynamic quants where available.

| Model | FP16 | Q8_0 | Q6_K | Q5_K_M | Q4_K_M | Q4_K_S | Q3_K_M | Q2_K |
|---|---|---|---|---|---|---|---|---|
| 0.5B | 1.0 GB | 0.55 GB | 0.42 GB | 0.38 GB | 0.33 GB | 0.31 GB | 0.28 GB | 0.24 GB |
| 1.5B | 3.1 GB | 1.65 GB | 1.28 GB | 1.10 GB | 0.95 GB | 0.90 GB | 0.78 GB | 0.65 GB |
| 3B | 6.2 GB | 3.30 GB | 2.56 GB | 2.20 GB | 1.88 GB | 1.78 GB | 1.55 GB | 1.28 GB |
| 7B | 14.2 GB | 7.55 GB | 5.85 GB | 5.04 GB | 4.32 GB | 4.08 GB | 3.55 GB | 2.95 GB |
| 9B | 18.1 GB | 9.60 GB | 7.45 GB | 6.40 GB | 5.50 GB | 5.20 GB | 4.52 GB | 3.75 GB |
| 14B | 28.4 GB | 15.10 GB | 11.70 GB | 10.05 GB | 8.65 GB | 8.15 GB | 7.10 GB | 5.90 GB |
| 32B | 65.0 GB | 34.50 GB | 26.80 GB | 23.00 GB | 19.70 GB | 18.60 GB | 16.15 GB | 13.40 GB |
| 72B | 145 GB | 77.0 GB | 59.8 GB | 51.5 GB | 44.0 GB | 41.5 GB | 36.1 GB | 30.0 GB |
| 35B-A3B (MoE) | 71.0 GB | 37.7 GB | 29.3 GB | 25.2 GB | 21.6 GB | 20.4 GB | 17.8 GB | 14.7 GB |
| 122B-A10B (MoE) | 245 GB | 130 GB | 101 GB | 87 GB | 74.5 GB | 70.3 GB | 61.2 GB | 50.8 GB |
| 397B-A17B (MoE) | 798 GB | 423 GB | 328 GB | 283 GB | 243 GB | 229 GB | 199 GB | 165 GB |

Pro tip: Q5_K_M is the best accuracy-per-GB in the 7B-32B range. Perplexity delta versus FP16 stays under 2.5%, and versus Q8_0 it frees roughly 35% of the weight budget for KV cache at long context. Drop to Q4_K_M when the model just misses fitting at Q5, and avoid going below Q4 on the 0.5B-3B tier, where aggressive quants visibly degrade reasoning.
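You can ballpark any row of the matrix from parameter count alone. The bits-per-weight figures below are my approximations back-derived from typical GGUF file sizes, not official numbers -- real files vary slightly per architecture:

```python
# Approximate bits-per-weight for common GGUF quant levels (assumed values,
# back-derived from typical file sizes; treat as ballpark, not gospel).
BPW = {
    "FP16": 16.0, "Q8_0": 8.5, "Q6_K": 6.6, "Q5_K_M": 5.7,
    "Q4_K_M": 4.85, "Q3_K_M": 3.9, "Q2_K": 3.35,
}

def weight_gb(params_billion: float, quant: str) -> float:
    """GB needed just to load the weights -- no KV cache, no overhead."""
    bits = params_billion * 1e9 * BPW[quant]
    return bits / 8 / 1e9  # decimal GB, as in the matrix

# MoE: always pass TOTAL params, not active -- all experts must load.
print(round(weight_gb(9, "Q4_K_M"), 1))   # 5.5 -- matches the 9B row
print(round(weight_gb(35, "Q4_K_M"), 1))  # 21.2 -- close to the 21.6 GB measured for 35B-A3B
```

Add 1-2 GB of headroom on top of this number for KV cache and framework overhead before comparing it against your card's VRAM.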

Context Window VRAM: The Second Budget You Always Forget

The weight file is only half the VRAM story. The KV cache stores attention keys and values for every token in the active context and grows linearly with context length. For the 9B at FP16 KV precision, a 32K context uses roughly 16 GB on its own -- often bigger than the quantized weights. Quantizing the KV cache to 8-bit (--cache-type-k q8_0 --cache-type-v q8_0) halves this at a barely measurable quality cost.
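The per-token cost falls out of the attention geometry: two tensors (K and V) per layer, each n_kv_heads x head_dim elements. A quick sketch -- the 9B config used here (64 layers, 8 KV heads, head_dim 256) is illustrative, chosen to reproduce the ~0.5 GB per 1K tokens above; read the real values from the GGUF metadata or config.json:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                n_tokens: int, bytes_per_elem: int = 2) -> float:
    """KV cache size in GiB. bytes_per_elem: 2 for FP16, 1 for q8_0."""
    # K and V each store n_layers x n_kv_heads x head_dim values per token.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * n_tokens / 1024**3

# Illustrative 9B-class config at 32K context:
print(kv_cache_gb(64, 8, 256, 32_768))                    # 16.0 (FP16 KV)
print(kv_cache_gb(64, 8, 256, 32_768, bytes_per_elem=1))  # 8.0  (q8_0 KV)
```

The formula is general; only the architecture numbers are assumptions. It also makes the linear growth obvious: doubling the context doubles the cache, quantizing the KV halves it.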

| Model | 4K ctx (KV FP16) | 8K ctx (KV FP16) | 32K ctx (KV FP16) | 128K ctx (KV FP16) | 32K ctx (KV Q8) |
|---|---|---|---|---|---|
| 0.5B | 0.12 GB | 0.24 GB | 0.96 GB | 3.8 GB | 0.48 GB |
| 1.5B | 0.25 GB | 0.50 GB | 2.0 GB | 8.0 GB | 1.0 GB |
| 3B | 0.50 GB | 1.0 GB | 4.0 GB | 16.0 GB | 2.0 GB |
| 7B | 1.0 GB | 2.0 GB | 8.0 GB | 32.0 GB | 4.0 GB |
| 9B | 2.0 GB | 4.0 GB | 16.0 GB | 64.0 GB | 8.0 GB |
| 14B | 2.5 GB | 5.0 GB | 20.0 GB | 80.0 GB | 10.0 GB |
| 32B | 4.0 GB | 8.0 GB | 32.0 GB | 128.0 GB | 16.0 GB |
| 72B | 6.0 GB | 12.0 GB | 48.0 GB | 192.0 GB | 24.0 GB |
| 122B-A10B | 5.5 GB | 11.0 GB | 44.0 GB | 176.0 GB | 22.0 GB |

Watch out: Qwen 3.5 advertises a 128K or 256K context ceiling, but that figure assumes KV cache can grow unbounded. On a 24 GB RTX 4090 running 14B Q4_K_M, the model itself takes 8.6 GB and 128K context alone wants 80 GB of KV cache -- mathematically impossible. Plan for 8K-32K as the realistic interactive ceiling on single consumer GPUs, and use KV quantization plus sliding-window attention to stretch beyond that. The tokens, context and KV cache explainer covers how the KV cache actually works under the hood.

GPU Recommendations by VRAM Tier

Each tier below assumes Q4_K_M quantization and 8K context headroom. Interactive chat with 4-8K context works comfortably on any of these; longer contexts may require tier-bumping.

| Tier | Example cards | Comfortable fit | Tight fit |
|---|---|---|---|
| 4 GB | RTX 3050 4GB, GTX 1650 | 0.5B FP16, 1.5B Q8_0, 3B Q4_K_M | 3B Q5_K_M (4K ctx) |
| 8 GB | RTX 4060 8GB, 3060 Ti, Arc A750 | 7B Q4_K_M, 9B Q3_K_M | 9B Q4_K_M with KV Q8 |
| 12 GB | RTX 3060 12GB, RTX 4070 | 9B Q5_K_M 32K, 14B Q4_K_M 8K | 14B Q5_K_M |
| 16 GB | RTX 4070 Ti Super, 4080 | 14B Q6_K, 9B FP16 | 32B Q3_K_M only |
| 24 GB | RTX 3090, RTX 4090 | 32B Q4_K_M 8K, 35B-A3B MoE Q4, 14B FP16 | 32B Q5_K_M with KV Q8 |
| 32 GB | RTX 5090 | 32B Q6_K, 35B-A3B Q5, 14B FP16 32K | 72B Q2_K only |
| 48 GB | RTX 6000 Ada, A6000, 2x 4090 | 72B Q4_K_M 4K, 32B FP16, 35B-A3B Q8_0 | 72B Q5_K_M |
| 80 GB | A100 80GB, H100 80GB | 72B Q8_0 32K, 122B-A10B MoE Q4 | 122B MoE Q5_K_M |
| Multi-GPU | 8x H100, 8x H200 | 397B-A17B FP8 or Q4, 72B FP16 128K | -- |

For concurrent serving, vLLM tensor-parallel across 8x H100 reaches 120-200 tok/s per request on 397B MoE at Q4. The 24 GB tier is the "one GPU to rule them all" for local work -- it unlocks the MoE variants and every dense model up through 32B.

CPU and RAM Fallback: When You Don't Have a GPU

llama.cpp on a modern CPU is viable for the smaller Qwen 3.5 variants. The bottleneck is memory bandwidth, not compute -- which is why DDR5 laptops and Apple Silicon beat older desktop CPUs with fewer cores. Pure CPU inference on 32B+ is technically possible but painfully slow; unified-memory Macs are the practical ceiling.
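Because decode is memory-bound, you can ceiling-estimate tok/s as bandwidth divided by the bytes read per token (the full weight file for dense models, roughly the active-expert slice for MoE). The efficiency factor below is my own rough assumption to account for cache misses and threading overhead -- measured numbers land well under the theoretical ceiling:

```python
def decode_ceiling_tps(bandwidth_gbs: float, bytes_per_token_gb: float,
                       efficiency: float = 0.35) -> float:
    """Rough upper bound on tok/s for memory-bandwidth-bound decode.

    bandwidth_gbs: sustained memory bandwidth in GB/s
    bytes_per_token_gb: weights read per token -- the full file for dense,
      roughly the active-expert slice for MoE
    efficiency: assumed fudge factor; real systems rarely hit raw bandwidth
    """
    return bandwidth_gbs * efficiency / bytes_per_token_gb

# M3 Max (~400 GB/s) on 9B Q4_K_M (5.5 GB weights):
print(round(decode_ceiling_tps(400, 5.5), 1))  # 25.5 -- in the ballpark of the measured 28 t/s
```

This is also why MoE wins on CPU: the 35B-A3B reads only the ~2 GB active slice per token, so the denominator shrinks even though the full 21.6 GB file sits in RAM.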

| Model (Q4_K_M) | Min RAM | Recommended RAM | Ryzen 9 7950X | M3 Max 48GB | EPYC 9654 |
|---|---|---|---|---|---|
| 0.5B | 2 GB | 4 GB | 180 t/s | 220 t/s | 260 t/s |
| 1.5B | 3 GB | 8 GB | 95 t/s | 130 t/s | 145 t/s |
| 3B | 4 GB | 8 GB | 55 t/s | 78 t/s | 82 t/s |
| 7B | 8 GB | 16 GB | 22 t/s | 36 t/s | 38 t/s |
| 9B | 10 GB | 32 GB | 17 t/s | 28 t/s | 31 t/s |
| 14B | 16 GB | 32 GB | 11 t/s | 18 t/s | 22 t/s |
| 32B | 24 GB | 64 GB | 5.5 t/s | 10 t/s | 12 t/s |
| 72B | 48 GB | 96 GB | 2.1 t/s | 4.8 t/s | 6.2 t/s |
| 35B-A3B (MoE) | 24 GB | 48 GB | 18 t/s | 29 t/s | 32 t/s |
| 122B-A10B (MoE) | 80 GB | 128 GB | OOM | 14 t/s | 17 t/s |

The MoE speed advantage shows clearly in the last two rows: the 35B-A3B runs as fast as the 7B dense model on the same hardware because only 3B parameters activate per token. If you're on a CPU-only machine, MoE variants are the best bang-per-token you can get. Running LLMs without a GPU has the deeper bench comparison for CPU-first builds.

Watch out: Apple Silicon's unified memory means the M3/M4 Max can allocate up to 75% of total RAM to a single model via the Metal backend, which is why the 128 GB M3 Ultra can host 122B-A10B where no consumer GPU can. But Apple's memory bandwidth peaks at 800 GB/s (M3 Ultra), compared to 3.35 TB/s on an H100 -- so raw tok/s will never match datacenter silicon. It's the only path to big-model local inference on one box that costs under $6K.

Measured Tokens/sec: llama.cpp vs vLLM vs MLX

Framework choice shifts throughput by 30-100%. llama.cpp is fastest for single-user on consumer hardware and M-series Macs; vLLM wins at concurrent batch serving thanks to PagedAttention and continuous batching; MLX roughly matches llama.cpp's Metal backend on Apple Silicon.

Qwen 3.5 9B Q4_K_M, single request, 8K context

| Framework | RTX 3060 12GB | RTX 4090 | A100 80GB | M3 Max 48GB |
|---|---|---|---|---|
| llama.cpp (CUDA) | 42 t/s | 118 t/s | 142 t/s | -- |
| vLLM (FP16 equivalent) | OOM | 95 t/s | 155 t/s | -- |
| llama.cpp (Metal) | -- | -- | -- | 48 t/s |
| MLX | -- | -- | -- | 52 t/s |

Qwen 3.5 32B Q4_K_M, single request, 8K context

| Framework | RTX 4090 24GB | RTX 5090 32GB | A100 80GB | H100 80GB |
|---|---|---|---|---|
| llama.cpp (CUDA) | 28 t/s | 44 t/s | 52 t/s | 74 t/s |
| vLLM | Tight fit | 35 t/s | 68 t/s | 98 t/s |

vLLM concurrent serving gain (batch 32, 9B Q4)

On an A100 80GB, vLLM hits 1,840 tok/s aggregate across 32 concurrent requests; llama.cpp server mode gives ~380 tok/s aggregate on the same hardware. Standing up a real API: vLLM. Tinkering locally: llama.cpp. The Ollama vs vLLM vs llama.cpp comparison has the deeper framework breakdown.

When to Pick MoE Over Dense

MoE variants look like a free lunch: 35B of knowledge that runs as fast as 3B. The catch is the VRAM bill -- you still have to fit all parameters in memory because the router picks experts per token and can touch any of them. Three decision points:

  • Pick 35B-A3B MoE if you have 24 GB+ VRAM (RTX 4090, 3090, 48 GB Macs) and want faster decode than dense 14B. On a 4090 it hits ~65 tok/s at Q4_K_M -- roughly 2x the dense 32B.
  • Pick 122B-A10B MoE if you own a 128 GB M3 Ultra / Mac Studio or can fit 75-85 GB across two 48 GB cards. Outperforms dense 72B on reasoning while generating 40% faster.
  • Stick with dense 9B or 14B if you're on 12-16 GB VRAM. MoE variants don't fit; mid-size dense is the best accuracy you can actually run.

Pro tip: MoE expert offloading in llama.cpp (--override-kv qwen3moe.expert_used_count=int:2) lets you run 397B-A17B on a single 24 GB GPU with 256 GB system RAM. Decode drops to 6-8 tok/s as inactive experts stream over PCIe -- slow, but a real way to touch the frontier model without eight H100s. The advanced offloading flags and long-context tuning tricks I've hit in production go out in the newsletter.

Launch Flags Reference: llama.cpp, vLLM, MLX

Use these as templates; tweak --ctx-size, --threads, and --n-gpu-layers based on your VRAM budget from the matrix above.

llama.cpp (14B Q4_K_M on RTX 4070 12GB)

./llama-server \
  --model ./models/qwen3.5-14b-instruct-q4_k_m.gguf \
  --n-gpu-layers 99 --ctx-size 16384 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --threads 8 --batch-size 512 --flash-attn \
  --host 0.0.0.0 --port 8080

vLLM (production serving, 9B on A100 80GB)

python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3.5-9B-Instruct \
  --max-model-len 32768 --gpu-memory-utilization 0.90 \
  --max-num-seqs 64 --quantization awq \
  --host 0.0.0.0 --port 8000

MLX (Apple Silicon, 32B 4-bit on M3 Max 48GB)

pip install mlx-lm
mlx_lm.server --model mlx-community/Qwen3.5-32B-Instruct-4bit \
  --max-tokens 4096 --port 8080

Key flag cheatsheet

| Setting | llama.cpp | vLLM | What it does |
|---|---|---|---|
| Context length | --ctx-size | --max-model-len | Max tokens in context window |
| KV cache quant | --cache-type-k q8_0 | --kv-cache-dtype fp8 | Halves KV VRAM at ~1% quality cost |
| GPU layers | --n-gpu-layers | (auto) | Partial offload for hybrid CPU/GPU |
| Concurrent requests | --parallel | --max-num-seqs | How many requests share KV cache |
| Memory fraction | --main-gpu-fraction | --gpu-memory-utilization | Share of VRAM to reserve |

For Apple Silicon users, the MLX documentation covers unified-memory tuning and converts Hugging Face checkpoints to MLX format.

Which Qwen 3.5 Should You Actually Run?

After three months across this lineup, my matrix comes down to the VRAM you own:

  • 0.5B or 1.5B: embedded assistants, tab-complete, Raspberry Pi 5. Not for general chat -- they lose multi-step instructions.
  • 3B: 4-6 GB GPU or 16 GB laptop. Competent for RAG retrieval and short-doc summarization.
  • 9B: 12 GB VRAM or 32-64 GB system RAM. The sweet spot -- 80% of GPT-4o-mini quality on dev tasks at zero marginal cost.
  • 14B: 16 GB VRAM. Notably better than 9B at long-form reasoning.
  • 32B dense: 4090/5090. Best single-GPU quality. Slow but smart.
  • 35B-A3B MoE: 24 GB VRAM. 14B-class accuracy at 2x the decode speed.
  • 72B or 122B-A10B MoE: 48-128 GB total memory. Serious local inference; the 128 GB Mac Studio is the best reason to own one for AI in 2026.
  • 397B-A17B: 8x H100 or rented equivalent. Production-only.
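The bullet list above collapses into a small lookup. The tier boundaries mirror this article's tables, but the cut-offs and picks are simplifications -- a sketch, not a sizing tool:

```python
# VRAM (or unified memory) ceiling in GB -> suggested variant, per the
# recommendations above. Boundaries are simplified from the tier table.
PICKS = [
    (6,   "Qwen 3.5 3B Q4_K_M"),
    (8,   "Qwen 3.5 7B Q4_K_M"),
    (12,  "Qwen 3.5 9B Q4_K_M/Q5_K_M"),
    (16,  "Qwen 3.5 14B Q4_K_M-Q6_K"),
    (24,  "Qwen 3.5 32B Q4_K_M or 35B-A3B MoE"),
    (48,  "Qwen 3.5 72B Q4_K_M"),
    (128, "Qwen 3.5 122B-A10B MoE"),
]

def pick_model(vram_gb: float) -> str:
    """First tier whose ceiling covers the available memory."""
    for ceiling, pick in PICKS:
        if vram_gb <= ceiling:
            return pick
    return "Qwen 3.5 397B-A17B (multi-GPU only)"

print(pick_model(24))  # -> "Qwen 3.5 32B Q4_K_M or 35B-A3B MoE"
```

Feed it the Q4_K_M column of the VRAM matrix minus 2 GB of KV headroom if you want to be conservative.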

If your goal is self-hosting for a team, the self-hosted ChatGPT guide covers the full OpenWebUI + auth + backup stack on top of the Qwen backend.

Common VRAM Gotchas I Hit in Production

1. KV cache defaults to FP16 and doubles your VRAM budget. llama.cpp's --cache-type-k q8_0 halves it with no perceptible quality drop -- free VRAM.
  2. CUDA allocator fragmentation eats 5-10% VRAM over long runs. Restart the server after heavy traffic or set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True for vLLM.
  3. MoE weight files unpack bigger in RAM than on disk. 122B-A10B ships as a 67 GB Q4 GGUF but needs 74 GB in memory -- plan 10-15% headroom.
  4. Dual-GPU tensor parallelism isn't free. Splitting 32B across 2x RTX 4090 via PCIe 4.0 gives you ~60-70% of a single RTX 6000 Ada's throughput, not 100%.
  5. Apple's Metal backend caps at 67% of unified memory by default. sudo sysctl iogpu.wired_limit_mb=101376 on a 128 GB Mac unlocks the full 99 GB for weights. Resets on reboot unless pinned via launchctl.

Frequently Asked Questions

How much VRAM does Qwen 3.5 need?

Qwen 3.5 0.5B needs 0.3-1 GB of VRAM (Q4 up to FP16), 9B needs 5.5-6.5 GB at Q4-Q5, 32B needs about 20 GB at Q4, and 72B needs 44-50 GB. Add 1-8 GB for the KV cache depending on context. Rule of thumb: take the Q4_K_M weight size in the VRAM matrix and add 2 GB for 8K context.

Can I run Qwen 3.5 on an RTX 4090?

Yes. The 4090's 24 GB VRAM runs 9B at FP16, 14B at Q6_K, 32B at Q4_K_M with 8K context, and 35B-A3B MoE at Q4_K_M. The 72B dense won't fit without CPU offloading; 122B MoE won't fit at all. For most users, the 4090 is the best single-GPU pick for Qwen 3.5.

What is the difference between Qwen 3.5 dense and MoE variants?

Dense models activate every parameter for every token; MoE models load all experts into VRAM but activate only a fraction (A3B means 3B active out of 35B total). MoE decodes faster than dense of the same active size while keeping larger-model knowledge. Trade-off: a 35B MoE still needs ~35B worth of VRAM to load.

Can I run Qwen 3.5 without a GPU?

Yes, up to the 32B dense and 35B-A3B MoE on a modern CPU with 32-64 GB RAM. Expect 17-28 tok/s for 9B on Ryzen 9 / M3 Max, dropping to 2-6 tok/s for 72B. Even a Raspberry Pi 5 runs the 0.5B variant. See the CPU-only inference guide and the 9B walkthrough for setup.

What quantization should I use for Qwen 3.5?

Q4_K_M for best VRAM-per-quality on 8-24 GB GPUs, Q5_K_M when you have extra headroom and want higher accuracy, Q8_0 for production serving where quality matters most. Avoid Q2_K and Q3_K_M on models under 7B -- the perplexity hit becomes visible. For MoE variants, Q4_K_M is the default choice since going lower cuts expert routing quality.

How much VRAM does the Qwen 3.5 KV cache use?

For a 9B model, KV cache at FP16 precision uses roughly 0.5 GB per 1K tokens of context -- so 16 GB for 32K context and 64 GB for 128K context. Quantizing the KV to 8-bit (via llama.cpp's --cache-type-k q8_0) halves this at negligible quality cost. Larger models scale up from there: 14B runs about 25% higher than 9B, and 72B roughly triples it.

Is Qwen 3.5 faster on vLLM or llama.cpp?

vLLM is faster for concurrent requests (3-5x throughput via PagedAttention and continuous batching). llama.cpp is slightly faster for single-user inference on consumer GPUs and significantly better on CPUs and Apple Silicon. Serving an API to multiple users: vLLM. Running locally on your own machine: llama.cpp or Ollama on top of it.

Bottom Line: VRAM Dictates the Model, Everything Else Is Tuning

Look at your VRAM number, find the matching row in the matrix, pick a quantization that leaves 2-8 GB of KV cache headroom, and you're done. The Qwen 3.5 9B at Q4_K_M on a 12 GB card answers the VRAM requirements question for 70% of developers -- it's why that model has its own deep-dive. The 35B-A3B MoE on a 24 GB card unlocks faster decode for another 20%. Above that tier, H100s and Mac Studios are the only honest answer.

Revisit quarterly -- Unsloth and the Qwen team ship new quantizations (including NVFP4 for Blackwell) every few months and the VRAM numbers shift. Re-check before any GPU purchase.

Written by

Abhishek Patel

Infrastructure engineer with 10+ years building production systems on AWS, GCP, and bare metal. Writes practical guides on cloud architecture, containers, networking, and Linux for developers who want to understand how things actually work under the hood.
