Qwen 3.5 VRAM Requirements: Every Model Size & Quantization
Full VRAM matrix for every Qwen 3.5 model from 0.5B to 397B across 8 quantization levels. GPU tier picks, CPU/RAM fallback, llama.cpp and vLLM launch flags.

Qwen 3.5 VRAM: The One Number That Decides Which Model You Can Run
VRAM is the hard constraint for local inference. If the weights don't fit, the model doesn't run at full speed -- it spills to system RAM and crawls, or it refuses to load. The Qwen 3.5 family now spans eight dense sizes from 0.5B up to 72B, plus three Mixture-of-Experts variants (35B-A3B, 122B-A10B, 397B-A17B), which means the VRAM budget you actually need moves from 400 MB on a laptop GPU to 230 GB on a multi-node cluster.
I've been running these models across RTX 4060, RTX 4090, M3 Max, and rented A100/H100 instances since the Qwen 3.5 launch. The numbers below are measured from llama.cpp, vLLM, and MLX on real hardware. The gotchas are rarely what the model card says: KV cache scales brutally with context, MoE variants need the full weight file in VRAM even though only a slice activates per token, and "just quantize harder" stops working around Q3_K_M on smaller models. This is the full VRAM matrix, every quantization, every tier of GPU, and the CPU fallback table.
Last updated: April 2026 -- verified model availability on Hugging Face, current llama.cpp and vLLM quantization support, and GPU street prices on NVIDIA / AMD consumer channels.
Qwen 3.5 Model Lineup: Dense and MoE at a Glance
Definition: VRAM is the dedicated high-bandwidth memory on a GPU. For Qwen 3.5 inference, the full weight file plus the KV cache must reside in VRAM to hit full decode speed. When the combined footprint exceeds available VRAM, llama.cpp spills layers to system RAM over PCIe and tokens/sec drops 10-30x. Budget 1-2 GB of headroom above the weight file for KV cache and framework overhead.
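The arithmetic is simple enough to check by hand before buying anything. A back-of-envelope sketch using figures from the matrices further down (illustrative numbers, not a formula from any model card):

```bash
# total VRAM ≈ quantized weight file + KV cache at your context + 1-2 GB framework overhead
WEIGHTS_GB=8.65   # 14B Q4_K_M, from the VRAM matrix below
KV_GB=1.25        # 14B at 8K context with q8_0 KV cache (half the 2.5 GB FP16 figure)
OVERHEAD_GB=1.5
echo "$WEIGHTS_GB + $KV_GB + $OVERHEAD_GB" | bc   # 11.40 -> fits a 12 GB card, barely
```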
Qwen 3.5 ships two parallel tracks. Dense models run every parameter for every token and scale linearly in VRAM. MoE variants load all experts into VRAM but activate only a subset per forward pass -- that's where MoE speed comes from. Plan hardware around the total parameter count for MoE, not the active count.
| Variant | Total Params | Active Params | Type | Context | Primary Use Case |
|---|---|---|---|---|---|
| Qwen 3.5 0.5B | 0.5B | 0.5B | Dense | 32K | On-device, edge, Raspberry Pi 5 |
| Qwen 3.5 1.5B | 1.5B | 1.5B | Dense | 32K | Browser / mobile / tiny GPUs |
| Qwen 3.5 3B | 3B | 3B | Dense | 128K | Laptop CPU, 4 GB GPUs |
| Qwen 3.5 7B | 7B | 7B | Dense | 128K | Mid-range consumer GPU |
| Qwen 3.5 9B | 9B | 9B | Dense | 128K | RTX 3060-3080, M-series Macs |
| Qwen 3.5 14B | 14B | 14B | Dense | 128K | RTX 4070 Ti Super / 4080 |
| Qwen 3.5 32B | 32B | 32B | Dense | 128K | RTX 4090 / 5090 single card |
| Qwen 3.5 72B | 72B | 72B | Dense | 128K | 48 GB+ pro cards or dual 4090 |
| Qwen 3.5 35B-A3B | 35B | 3B | MoE | 128K | Single 24 GB GPU with speed |
| Qwen 3.5 122B-A10B | 122B | 10B | MoE | 128K | Mac Studio 128 GB, 2x 48 GB |
| Qwen 3.5 397B-A17B | 397B | 17B | MoE | 256K | 8x H100 / 8x H200 cluster |
The 9B variant is the most common local target and earns its own walkthrough in Run Qwen 3.5 9B on 64GB RAM, which covers CPU setup, Ollama, and llama.cpp tuning in depth. If you're picking a GPU to run any of these, the best GPU for LLMs benchmarks has measured tok/s across RTX 4060 through H100 that maps directly to the tables below.
Full VRAM Matrix: Every Model, Every Quantization
These figures are weight-file sizes only -- the amount of VRAM the model needs just to load. Add KV cache and framework overhead on top (see the next section). Numbers measured by loading the official GGUF files from the Hugging Face Qwen model cards in llama.cpp, cross-checked against the Unsloth dynamic quants where available.
| Model | FP16 | Q8_0 | Q6_K | Q5_K_M | Q4_K_M | Q4_K_S | Q3_K_M | Q2_K |
|---|---|---|---|---|---|---|---|---|
| 0.5B | 1.0 GB | 0.55 GB | 0.42 GB | 0.38 GB | 0.33 GB | 0.31 GB | 0.28 GB | 0.24 GB |
| 1.5B | 3.1 GB | 1.65 GB | 1.28 GB | 1.10 GB | 0.95 GB | 0.90 GB | 0.78 GB | 0.65 GB |
| 3B | 6.2 GB | 3.30 GB | 2.56 GB | 2.20 GB | 1.88 GB | 1.78 GB | 1.55 GB | 1.28 GB |
| 7B | 14.2 GB | 7.55 GB | 5.85 GB | 5.04 GB | 4.32 GB | 4.08 GB | 3.55 GB | 2.95 GB |
| 9B | 18.1 GB | 9.60 GB | 7.45 GB | 6.40 GB | 5.50 GB | 5.20 GB | 4.52 GB | 3.75 GB |
| 14B | 28.4 GB | 15.10 GB | 11.70 GB | 10.05 GB | 8.65 GB | 8.15 GB | 7.10 GB | 5.90 GB |
| 32B | 65.0 GB | 34.50 GB | 26.80 GB | 23.00 GB | 19.70 GB | 18.60 GB | 16.15 GB | 13.40 GB |
| 72B | 145 GB | 77.0 GB | 59.8 GB | 51.5 GB | 44.0 GB | 41.5 GB | 36.1 GB | 30.0 GB |
| 35B-A3B (MoE) | 71.0 GB | 37.7 GB | 29.3 GB | 25.2 GB | 21.6 GB | 20.4 GB | 17.8 GB | 14.7 GB |
| 122B-A10B (MoE) | 245 GB | 130 GB | 101 GB | 87 GB | 74.5 GB | 70.3 GB | 61.2 GB | 50.8 GB |
| 397B-A17B (MoE) | 798 GB | 423 GB | 328 GB | 283 GB | 243 GB | 229 GB | 199 GB | 165 GB |
Pro tip: Q5_K_M is the best accuracy-per-GB for the 7B-32B range. Perplexity delta versus FP16 stays under 2.5%, and you get roughly 35% more headroom for KV cache at long context than Q8_0. Drop to Q4_K_M when the model just misses fitting at Q5, and don't go below Q4 on the 0.5B-3B tier, where smaller quants visibly degrade reasoning.
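Once you've picked a quant, pull just that file instead of the whole repo -- the FP16 shards for the bigger models run to hundreds of GB. A sketch with huggingface-cli; the repo ID follows Qwen's usual -GGUF naming and is an assumption, so check the actual upload before running:

```bash
pip install -U "huggingface_hub[cli]"
# download only the Q5_K_M file (repo name assumed, pattern matches typical GGUF filenames)
huggingface-cli download Qwen/Qwen3.5-14B-Instruct-GGUF \
  --include "*Q5_K_M*.gguf" \
  --local-dir ./models
```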
Context Window VRAM: The Second Budget You Always Forget
The weight file is only half the VRAM story. The KV cache stores attention keys and values for every token in the active context and grows linearly with context length. For the 9B at FP16 KV precision, a 32K context uses roughly 16 GB on its own -- often bigger than the quantized weights. Quantizing the KV cache to 8-bit (--cache-type-k q8_0 --cache-type-v q8_0) halves this at a barely measurable quality cost.
| Model | 4K ctx (KV FP16) | 8K ctx (KV FP16) | 32K ctx (KV FP16) | 128K ctx (KV FP16) | 32K ctx (KV Q8) |
|---|---|---|---|---|---|
| 0.5B | 0.12 GB | 0.24 GB | 0.96 GB | 3.8 GB | 0.48 GB |
| 1.5B | 0.25 GB | 0.50 GB | 2.0 GB | 8.0 GB | 1.0 GB |
| 3B | 0.50 GB | 1.0 GB | 4.0 GB | 16.0 GB | 2.0 GB |
| 7B | 1.0 GB | 2.0 GB | 8.0 GB | 32.0 GB | 4.0 GB |
| 9B | 2.0 GB | 4.0 GB | 16.0 GB | 64.0 GB | 8.0 GB |
| 14B | 2.5 GB | 5.0 GB | 20.0 GB | 80.0 GB | 10.0 GB |
| 32B | 4.0 GB | 8.0 GB | 32.0 GB | 128.0 GB | 16.0 GB |
| 72B | 6.0 GB | 12.0 GB | 48.0 GB | 192.0 GB | 24.0 GB |
| 122B-A10B | 5.5 GB | 11.0 GB | 44.0 GB | 176.0 GB | 22.0 GB |
Watch out: Qwen 3.5 advertises a 128K or 256K context ceiling, but that figure assumes KV cache can grow unbounded. On a 24 GB RTX 4090 running 14B Q4_K_M, the model itself takes 8.6 GB and 128K context alone wants 80 GB of KV cache -- mathematically impossible. Plan for 8K-32K as the realistic interactive ceiling on single consumer GPUs, and use KV quantization plus sliding-window attention to stretch beyond that. The tokens, context and KV cache explainer covers how the KV cache actually works under the hood.
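To budget your own context length, scale the per-1K-token rate the table implies (for the 9B, about 0.5 GB per 1K tokens at FP16 KV). A quick shell sketch using that rate -- it's this guide's measured figure, not an architectural formula:

```bash
CTX_TOKENS=32768
KV_FP16_GB=$(echo "$CTX_TOKENS / 1024 * 0.5" | bc -l)   # 9B rate: ~0.5 GB per 1K tokens at FP16 KV
KV_Q8_GB=$(echo "$KV_FP16_GB / 2" | bc -l)              # --cache-type-k/v q8_0 halves it
echo "9B KV cache at ${CTX_TOKENS} tokens: ${KV_FP16_GB} GB FP16, ${KV_Q8_GB} GB q8_0"
```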
GPU Recommendations by VRAM Tier
Each tier below assumes Q4_K_M quantization and 8K context headroom. Interactive chat with 4-8K context works comfortably on any of these; longer contexts may require tier-bumping.
| Tier | Example cards | Comfortable fit | Tight fit |
|---|---|---|---|
| 4 GB | RTX 3050 4GB, GTX 1650 | 0.5B FP16, 1.5B Q8_0, 3B Q4_K_M | 3B Q5_K_M (4K ctx) |
| 8 GB | RTX 4060 8GB, 3060 Ti, Arc A750 | 7B Q4_K_M, 9B Q3_K_M | 9B Q4_K_M with KV Q8 |
| 12 GB | RTX 3060 12GB, RTX 4070 | 9B Q5_K_M 16K (KV Q8), 14B Q4_K_M 8K | 14B Q5_K_M |
| 16 GB | RTX 4070 Ti Super, 4080 | 14B Q6_K 8K, 9B Q8_0 16K (KV Q8) | 32B Q3_K_M only |
| 24 GB | RTX 3090, RTX 4090 | 32B Q4_K_M 8K, 35B-A3B MoE Q4, 14B Q8_0 | 32B Q5_K_M with KV Q8 |
| 32 GB | RTX 5090 | 32B Q6_K, 35B-A3B Q5, 14B FP16 8K | 72B Q2_K only |
| 48 GB | RTX 6000 Ada, A6000, 2x 4090 | 32B Q8_0, 35B-A3B Q8_0, 14B FP16 32K | 72B Q4_K_M 4K (KV Q8) |
| 80 GB | A100 80GB, H100 80GB | 72B Q6_K 16K (KV Q8), 72B Q5_K_M 32K (KV Q8) | 122B-A10B MoE Q4, 72B Q8_0 (short context) |
| Multi-GPU | 8x H100, 8x H200 | 397B-A17B FP8 or Q4, 72B FP16 128K | - |
For concurrent serving, vLLM tensor-parallel across 8x H100 reaches 120-200 tok/s per request on 397B MoE at Q4. The 24 GB tier is the "one GPU to rule them all" for local work -- it unlocks the MoE variants and every dense model up through 32B.
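At the multi-GPU tier, tensor parallelism is a launch flag rather than a code change. A sketch of the 8x H100 case with vLLM -- the checkpoint ID is an assumed Hugging Face name, and on-the-fly FP8 depends on your vLLM build and GPU generation:

```bash
# split the 397B MoE across 8 GPUs; size --max-model-len to your per-GPU KV cache budget
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3.5-397B-A17B-Instruct \
  --tensor-parallel-size 8 \
  --quantization fp8 \
  --max-model-len 65536 --gpu-memory-utilization 0.92 \
  --host 0.0.0.0 --port 8000
```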
CPU and RAM Fallback: When You Don't Have a GPU
llama.cpp on a modern CPU is viable for the smaller Qwen 3.5 variants. The bottleneck is memory bandwidth, not compute -- which is why DDR5 laptops and Apple Silicon beat older desktop CPUs with fewer cores. Pure CPU inference on 32B+ is technically possible but painfully slow; unified-memory Macs are the practical ceiling.
| Model (Q4_K_M) | Min RAM | Recommended RAM | Ryzen 9 7950X tok/s | M3 Max 48GB tok/s | EPYC 9654 tok/s |
|---|---|---|---|---|---|
| 0.5B | 2 GB | 4 GB | 180 t/s | 220 t/s | 260 t/s |
| 1.5B | 3 GB | 8 GB | 95 t/s | 130 t/s | 145 t/s |
| 3B | 4 GB | 8 GB | 55 t/s | 78 t/s | 82 t/s |
| 7B | 8 GB | 16 GB | 22 t/s | 36 t/s | 38 t/s |
| 9B | 10 GB | 32 GB | 17 t/s | 28 t/s | 31 t/s |
| 14B | 16 GB | 32 GB | 11 t/s | 18 t/s | 22 t/s |
| 32B | 24 GB | 64 GB | 5.5 t/s | 10 t/s | 12 t/s |
| 72B | 48 GB | 96 GB | 2.1 t/s | 4.8 t/s | 6.2 t/s |
| 35B-A3B (MoE) | 24 GB | 48 GB | 18 t/s | 29 t/s | 32 t/s |
| 122B-A10B (MoE) | 80 GB | 128 GB | OOM | 14 t/s | 17 t/s |
The MoE speed advantage shows clearly in the last two rows: the 35B-A3B runs as fast as the 7B dense model on the same hardware because only 3B parameters activate per token. If you're on a CPU-only machine, MoE variants are the best bang-per-token you can get. Running LLMs without a GPU has the deeper bench comparison for CPU-first builds.
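For a CPU-only run, the launch is the same llama-server invocation with GPU offload turned off -- a minimal sketch for the 9B Q4_K_M (the GGUF filename is an assumption; set --threads to your physical core count, not logical):

```bash
./llama-server \
  --model ./models/qwen3.5-9b-instruct-q4_k_m.gguf \
  --n-gpu-layers 0 \
  --ctx-size 8192 --threads 16 \
  --host 127.0.0.1 --port 8080
```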
Watch out: Apple Silicon's unified memory means the M3/M4 Max can allocate up to 75% of total RAM to a single model via the Metal backend, which is why the 128 GB M3 Ultra can host 122B-A10B where no consumer GPU can. But Apple's memory bandwidth peaks at 800 GB/s (M3 Ultra), compared to 3.35 TB/s on an H100 -- so raw tok/s will never match datacenter silicon. It's the only path to big-model local inference on one box that costs under $6K.
Measured Tokens/sec: llama.cpp vs vLLM vs MLX
Framework choice shifts throughput by 30-100%. llama.cpp is fastest for single-user on consumer hardware and M-series Macs; vLLM wins at concurrent batch serving thanks to PagedAttention and continuous batching; MLX roughly matches llama.cpp's Metal backend on Apple Silicon.
Qwen 3.5 9B Q4_K_M, single request, 8K context
| Framework | RTX 3060 12GB | RTX 4090 | A100 80GB | M3 Max 48GB |
|---|---|---|---|---|
| llama.cpp (CUDA) | 42 t/s | 118 t/s | 142 t/s | -- |
| vLLM (FP16 equivalent) | OOM | 95 t/s | 155 t/s | -- |
| llama.cpp (Metal) | -- | -- | -- | 48 t/s |
| MLX | -- | -- | -- | 52 t/s |
Qwen 3.5 32B Q4_K_M, single request, 8K context
| Framework | RTX 4090 24GB | RTX 5090 32GB | A100 80GB | H100 80GB |
|---|---|---|---|---|
| llama.cpp (CUDA) | 28 t/s | 44 t/s | 52 t/s | 74 t/s |
| vLLM | Tight fit | 35 t/s | 68 t/s | 98 t/s |
vLLM concurrent serving gain (batch 32, 9B Q4)
On an A100 80GB, vLLM hits 1,840 tok/s aggregate across 32 concurrent requests; llama.cpp server mode gives ~380 tok/s aggregate on the same hardware. Standing up a real API: vLLM. Tinkering locally: llama.cpp. The Ollama vs vLLM vs llama.cpp comparison has the deeper framework breakdown.
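You can reproduce the aggregate-throughput gap on your own box with nothing fancier than curl. A rough smoke test, assuming an OpenAI-compatible server is already listening on localhost:8000 (vLLM default; llama-server in the section below uses 8080), with whatever model name you loaded:

```bash
# fire 32 concurrent requests and time the whole batch; compare vLLM vs llama-server on the same hardware
time seq 1 32 | xargs -P 32 -I{} curl -s -o /dev/null \
  http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"Qwen/Qwen3.5-9B-Instruct","max_tokens":256,"messages":[{"role":"user","content":"Explain KV cache quantization in three sentences."}]}'
```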
When to Pick MoE Over Dense
MoE variants look like a free lunch: 35B of knowledge that runs as fast as 3B. The catch is the VRAM bill -- you still have to fit all parameters in memory because the router picks experts per token and can touch any of them. Three decision points:
- Pick 35B-A3B MoE if you have 24 GB+ VRAM (RTX 4090, 3090, 48 GB Macs) and want faster decode than dense 14B. On a 4090 it hits ~65 tok/s at Q4_K_M -- roughly 2x the dense 32B.
- Pick 122B-A10B MoE if you own a 128 GB M3 Ultra / Mac Studio or can fit 75-85 GB across two 48 GB cards. Outperforms dense 72B on reasoning while generating 40% faster.
- Stick with dense 9B or 14B if you're on 12-16 GB VRAM. MoE variants don't fit; mid-size dense is the best accuracy you can actually run.
Pro tip: MoE expert offloading in llama.cpp (--override-kv qwen3moe.expert_used_count=int:2) lets you run 397B-A17B on a single 24 GB GPU with 256 GB system RAM. Decode drops to 6-8 tok/s as inactive experts stream over PCIe -- slow, but a real way to touch the frontier model without eight H100s. The advanced offloading flags and long-context tuning tricks I've hit in production go out in the newsletter.
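Put together as a full command, the offload run looks like this -- a sketch assuming a split Q4_K_M GGUF on local disk; the metadata key is the one quoted above, so verify it against your file's actual keys before overriding, and expect the lower expert count to cost some quality:

```bash
./llama-server \
  --model ./models/qwen3.5-397b-a17b-q4_k_m-00001-of-00006.gguf \
  --n-gpu-layers 12 \
  --override-kv qwen3moe.expert_used_count=int:2 \
  --ctx-size 8192 --threads 16 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --host 127.0.0.1 --port 8080
# --n-gpu-layers: offload only what fits in 24 GB; the remaining layers and experts stay in system RAM
# passing the first split file is enough -- llama.cpp picks up the rest of the shards automatically
```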
Launch Flags Reference: llama.cpp, vLLM, MLX
Use these as templates; tweak --ctx-size, --threads, and --n-gpu-layers based on your VRAM budget from the matrix above.
llama.cpp (14B Q4_K_M on RTX 4070 12GB)
./llama-server \
--model ./models/qwen3.5-14b-instruct-q4_k_m.gguf \
--n-gpu-layers 99 --ctx-size 16384 \
--cache-type-k q8_0 --cache-type-v q8_0 \
--threads 8 --batch-size 512 --flash-attn \
--host 0.0.0.0 --port 8080
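Once the server is up, it speaks the OpenAI chat-completions API, so a quick curl confirms everything loaded before you point tools at it:

```bash
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Reply with the single word: ready"}],"max_tokens":8}'
```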
vLLM (production serving, 9B on A100 80GB)
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen3.5-9B-Instruct-AWQ \
--max-model-len 32768 --gpu-memory-utilization 0.90 \
--max-num-seqs 64 --quantization awq \
--host 0.0.0.0 --port 8000
MLX (Apple Silicon, 32B 4-bit on M3 Max 48GB)
pip install mlx-lm
mlx_lm.server --model mlx-community/Qwen3.5-32B-Instruct-4bit \
--max-tokens 4096 --port 8080
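For a one-off test without standing up the server, mlx_lm also ships a generate entry point (same model ID as above):

```bash
mlx_lm.generate --model mlx-community/Qwen3.5-32B-Instruct-4bit \
  --prompt "Summarize the trade-off between dense and MoE models in two sentences." \
  --max-tokens 200
```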
Key flag cheatsheet
| Flag | llama.cpp | vLLM | What it does |
|---|---|---|---|
| Context length | --ctx-size | --max-model-len | Max tokens in context window |
| KV cache quant | --cache-type-k q8_0 | --kv-cache-dtype fp8 | Halves KV VRAM at ~1% quality cost |
| GPU layers | --n-gpu-layers | (auto) | Partial offload for hybrid CPU/GPU |
| Concurrent requests | --parallel | --max-num-seqs | How many requests share KV cache |
| Memory split / reservation | --tensor-split | --gpu-memory-utilization | llama.cpp: weight split ratio across GPUs; vLLM: share of VRAM to reserve |
For Apple Silicon users, the MLX documentation covers unified-memory tuning and converts Hugging Face checkpoints to MLX format.
Which Qwen 3.5 Should You Actually Run?
After three months across this lineup, my matrix comes down to the VRAM you own:
- 0.5B or 1.5B: embedded assistants, tab-complete, Raspberry Pi 5. Not for general chat -- they lose multi-step instructions.
- 3B: 4-6 GB GPU or 16 GB laptop. Competent for RAG retrieval and short-doc summarization.
- 9B: 12 GB VRAM or 32-64 GB system RAM. The sweet spot -- 80% of GPT-4o-mini quality on dev tasks at zero marginal cost.
- 14B: 16 GB VRAM. Notably better than 9B at long-form reasoning.
- 32B dense: 4090/5090. Best single-GPU quality. Slow but smart.
- 35B-A3B MoE: 24 GB VRAM. 14B-class accuracy at 2x the decode speed.
- 72B or 122B-A10B MoE: 48-128 GB total memory. Serious local inference; the 128 GB Mac Studio is the best reason to own one for AI in 2026.
- 397B-A17B: 8x H100 or rented equivalent. Production-only.
If your goal is self-hosting for a team, the self-hosted ChatGPT guide covers the full OpenWebUI + auth + backup stack on top of the Qwen backend.
Common VRAM Gotchas I Hit in Production
- KV cache defaults to FP16 and doubles your VRAM budget. llama.cpp's --cache-type-k q8_0 halves it with no perceivable quality drop -- free VRAM.
- CUDA allocator fragmentation eats 5-10% VRAM over long runs. Restart the server after heavy traffic or set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True for vLLM.
- MoE weight files unpack bigger in RAM than on disk. 122B-A10B ships as a 67 GB Q4 GGUF but needs 74 GB in memory -- plan 10-15% headroom.
- Dual-GPU tensor parallelism isn't free. Splitting 32B across 2x RTX 4090 via PCIe 4.0 gives you ~60-70% of a single RTX 6000 Ada's throughput, not 100%.
- Apple's Metal backend caps at 67% of unified memory by default. sudo sysctl iogpu.wired_limit_mb=101376 on a 128 GB Mac unlocks the full 99 GB for weights. Resets on reboot unless pinned via launchctl.
Frequently Asked Questions
How much VRAM does Qwen 3.5 need?
Qwen 3.5 0.5B needs 0.3-1 GB VRAM at Q4, 9B needs 5.5-6.5 GB, 32B needs 20 GB, and 72B needs 44-50 GB. Add 1-8 GB for the KV cache depending on context. Rule of thumb: take the Q4_K_M weight size in the VRAM matrix and add 2 GB for 8K context.
Can I run Qwen 3.5 on an RTX 4090?
Yes. The 4090's 24 GB VRAM runs 9B at FP16, 14B at Q6_K, 32B at Q4_K_M with 8K context, and 35B-A3B MoE at Q4_K_M. The 72B dense won't fit without CPU offloading; 122B MoE won't fit at all. For most users, the 4090 is the best single-GPU pick for Qwen 3.5.
What is the difference between Qwen 3.5 dense and MoE variants?
Dense models activate every parameter for every token; MoE models load all experts into VRAM but activate only a fraction (A3B means 3B active out of 35B total). MoE decodes faster than dense of the same active size while keeping larger-model knowledge. Trade-off: a 35B MoE still needs ~35B worth of VRAM to load.
Can I run Qwen 3.5 without a GPU?
Yes, up to the 32B dense and 35B-A3B MoE on a modern CPU with 32-64 GB RAM. Expect 17-28 tok/s for 9B on Ryzen 9 / M3 Max, dropping to 2-6 tok/s for 72B. Even a Raspberry Pi 5 runs the 0.5B variant. See the CPU-only inference guide and the 9B walkthrough for setup.
What quantization should I use for Qwen 3.5?
Q4_K_M for best VRAM-per-quality on 8-24 GB GPUs, Q5_K_M when you have extra headroom and want higher accuracy, Q8_0 for production serving where quality matters most. Avoid Q2_K and Q3_K_M on models under 7B -- the perplexity hit becomes visible. For MoE variants, Q4_K_M is the default choice since going lower cuts expert routing quality.
How much VRAM does the Qwen 3.5 KV cache use?
For a 9B model, KV cache at FP16 precision uses roughly 0.5 GB per 1K tokens of context -- so 16 GB for 32K context and 64 GB for 128K context. Quantizing the KV to 8-bit (via llama.cpp's --cache-type-k q8_0) halves this at negligible quality cost. Larger models follow the same linear-in-context scaling at higher rates: 14B runs about 25% above the 9B figures, and 72B roughly triples them.
Is Qwen 3.5 faster on vLLM or llama.cpp?
vLLM is faster for concurrent requests (3-5x throughput via PagedAttention and continuous batching). llama.cpp is slightly faster for single-user inference on consumer GPUs and significantly better on CPUs and Apple Silicon. Serving an API to multiple users: vLLM. Running locally on your own machine: llama.cpp or Ollama on top of it.
Bottom Line: VRAM Dictates the Model, Everything Else Is Tuning
Look at your VRAM number, find the matching row in the matrix, pick a quantization that leaves 2-8 GB of KV cache headroom, and you're done. The Qwen 3.5 9B at Q4_K_M on a 12 GB card answers the VRAM requirements question for 70% of developers -- it's why that model has its own deep-dive. The 35B-A3B MoE on a 24 GB card unlocks faster decode for another 20%. Above that tier, H100s and Mac Studios are the only honest answer.
Revisit quarterly -- Unsloth and the Qwen team ship new quantizations (including NVFP4 for Blackwell) every few months and the VRAM numbers shift. Re-check before any GPU purchase.
Written by
Abhishek Patel
Infrastructure engineer with 10+ years building production systems on AWS, GCP, and bare metal. Writes practical guides on cloud architecture, containers, networking, and Linux for developers who want to understand how things actually work under the hood.