
Qwen 3.5 VRAM Requirements: Every Model Size & Quantization

Full VRAM matrix for every Qwen 3.5 model from 0.5B to 397B across 8 quantization levels. GPU tier picks, CPU/RAM fallback, llama.cpp and vLLM launch flags.

Abhishek Patel · 16 min read



Qwen 3.5 VRAM: The One Number That Decides Which Model You Can Run

VRAM is the hard constraint for local inference. If the weights don't fit, the model doesn't run at full speed -- it spills to system RAM and crawls, or it refuses to load. The Qwen 3.5 family now spans eight dense sizes from 0.5B up to 72B, plus three Mixture-of-Experts variants (35B-A3B, 122B-A10B, 397B-A17B), which means the VRAM budget you actually need moves from 400 MB on a laptop GPU to 230 GB on a multi-node cluster.

I've been running these models across RTX 4060, RTX 4090, M3 Max, and rented A100/H100 instances since the Qwen 3.5 launch. The numbers below are measured from llama.cpp, vLLM, and MLX on real hardware. The gotchas are rarely what the model card says: KV cache scales brutally with context, MoE variants need the full weight file in VRAM even though only a slice activates per token, and "just quantize harder" stops working around Q3_K_M on smaller models. This is the full VRAM matrix, every quantization, every tier of GPU, and the CPU fallback table.

Last updated: April 2026 -- verified model availability on Hugging Face, current llama.cpp and vLLM quantization support, and GPU street prices on NVIDIA / AMD consumer channels.

Qwen 3.5 Model Lineup: Dense and MoE at a Glance

Definition: VRAM is the dedicated high-bandwidth memory on a GPU. For Qwen 3.5 inference, the full weight file plus the KV cache must reside in VRAM to hit full decode speed. When the combined footprint exceeds available VRAM, llama.cpp spills layers to system RAM over PCIe and tokens/sec drops 10-30x. Budget 1-2 GB of headroom above the weight file for KV cache and framework overhead.

Qwen 3.5 ships two parallel tracks. Dense models run every parameter for every token and scale linearly in VRAM. MoE variants load all experts into VRAM but activate only a subset per forward pass -- that's where MoE speed comes from. Plan hardware around the total parameter count for MoE, not the active count.

| Variant | Total Params | Active Params | Type | Context | Primary Use Case |
|---|---|---|---|---|---|
| Qwen 3.5 0.5B | 0.5B | 0.5B | Dense | 32K | On-device, edge, Raspberry Pi 5 |
| Qwen 3.5 1.5B | 1.5B | 1.5B | Dense | 32K | Browser / mobile / tiny GPUs |
| Qwen 3.5 3B | 3B | 3B | Dense | 128K | Laptop CPU, 4 GB GPUs |
| Qwen 3.5 7B | 7B | 7B | Dense | 128K | Mid-range consumer GPU |
| Qwen 3.5 9B | 9B | 9B | Dense | 128K | RTX 3060-3080, M-series Macs |
| Qwen 3.5 14B | 14B | 14B | Dense | 128K | RTX 4070 Ti Super / 4080 |
| Qwen 3.5 32B | 32B | 32B | Dense | 128K | RTX 4090 / 5090 single card |
| Qwen 3.5 72B | 72B | 72B | Dense | 128K | 48 GB+ pro cards or dual 4090 |
| Qwen 3.5 35B-A3B | 35B | 3B | MoE | 128K | Single 24 GB GPU with speed |
| Qwen 3.5 122B-A10B | 122B | 10B | MoE | 128K | Mac Studio 128 GB, 2x 48 GB |
| Qwen 3.5 397B-A17B | 397B | 17B | MoE | 256K | 8x H100 / 8x H200 cluster |

The 9B variant is the most common local target and earns its own walkthrough in Run Qwen 3.5 9B on 64GB RAM, which covers CPU setup, Ollama, and llama.cpp tuning in depth. If you're picking a GPU to run any of these, the best GPU for LLMs benchmarks has measured tok/s across RTX 4060 through H100 that maps directly to the tables below.

Full VRAM Matrix: Every Model, Every Quantization

These figures are weight-file sizes only -- the amount of VRAM the model needs just to load. Add KV cache and framework overhead on top (see the next section). Numbers are measured from the GGUF files on the Hugging Face Qwen model cards using llama.cpp, and cross-checked against Unsloth's dynamic quants where available.

| Model | FP16 | Q8_0 | Q6_K | Q5_K_M | Q4_K_M | Q4_K_S | Q3_K_M | Q2_K |
|---|---|---|---|---|---|---|---|---|
| 0.5B | 1.0 GB | 0.55 GB | 0.42 GB | 0.38 GB | 0.33 GB | 0.31 GB | 0.28 GB | 0.24 GB |
| 1.5B | 3.1 GB | 1.65 GB | 1.28 GB | 1.10 GB | 0.95 GB | 0.90 GB | 0.78 GB | 0.65 GB |
| 3B | 6.2 GB | 3.30 GB | 2.56 GB | 2.20 GB | 1.88 GB | 1.78 GB | 1.55 GB | 1.28 GB |
| 7B | 14.2 GB | 7.55 GB | 5.85 GB | 5.04 GB | 4.32 GB | 4.08 GB | 3.55 GB | 2.95 GB |
| 9B | 18.1 GB | 9.60 GB | 7.45 GB | 6.40 GB | 5.50 GB | 5.20 GB | 4.52 GB | 3.75 GB |
| 14B | 28.4 GB | 15.10 GB | 11.70 GB | 10.05 GB | 8.65 GB | 8.15 GB | 7.10 GB | 5.90 GB |
| 32B | 65.0 GB | 34.50 GB | 26.80 GB | 23.00 GB | 19.70 GB | 18.60 GB | 16.15 GB | 13.40 GB |
| 72B | 145 GB | 77.0 GB | 59.8 GB | 51.5 GB | 44.0 GB | 41.5 GB | 36.1 GB | 30.0 GB |
| 35B-A3B (MoE) | 71.0 GB | 37.7 GB | 29.3 GB | 25.2 GB | 21.6 GB | 20.4 GB | 17.8 GB | 14.7 GB |
| 122B-A10B (MoE) | 245 GB | 130 GB | 101 GB | 87 GB | 74.5 GB | 70.3 GB | 61.2 GB | 50.8 GB |
| 397B-A17B (MoE) | 798 GB | 423 GB | 328 GB | 283 GB | 243 GB | 229 GB | 199 GB | 165 GB |

Pro tip: Q5_K_M is the best accuracy-per-GB in the 7B-32B range. Perplexity delta versus FP16 stays under 2.5%, and versus Q8_0 it frees roughly 35% of the weight budget for KV cache at long context. Drop to Q4_K_M when the model just misses fitting at Q5, and avoid going below Q4 on the 0.5B-3B tier, where aggressive quants visibly degrade reasoning.
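You can ballpark any row of the matrix from parameter count alone. The bits-per-weight figures below are my approximations back-derived from typical GGUF file sizes, not official numbers -- real files vary slightly per architecture:

```python
# Approximate bits-per-weight for common GGUF quant levels (assumed values,
# back-derived from typical file sizes; treat as ballpark, not gospel).
BPW = {
    "FP16": 16.0, "Q8_0": 8.5, "Q6_K": 6.6, "Q5_K_M": 5.7,
    "Q4_K_M": 4.85, "Q3_K_M": 3.9, "Q2_K": 3.35,
}

def weight_gb(params_billion: float, quant: str) -> float:
    """GB needed just to load the weights -- no KV cache, no overhead."""
    bits = params_billion * 1e9 * BPW[quant]
    return bits / 8 / 1e9  # decimal GB, as in the matrix

# MoE: always pass TOTAL params, not active -- all experts must load.
print(round(weight_gb(9, "Q4_K_M"), 1))   # 5.5 -- matches the 9B row
print(round(weight_gb(35, "Q4_K_M"), 1))  # 21.2 -- close to the 21.6 GB measured for 35B-A3B
```

Add 1-2 GB of headroom on top of this number for KV cache and framework overhead before comparing it against your card's VRAM.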

Context Window VRAM: The Second Budget You Always Forget

The weight file is only half the VRAM story. The KV cache stores attention keys and values for every token in the active context and grows linearly with context length. For the 9B at FP16 KV precision, a 32K context uses roughly 16 GB on its own -- often bigger than the quantized weights. Quantizing the KV cache to 8-bit (--cache-type-k q8_0 --cache-type-v q8_0) halves this at a barely measurable quality cost.
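The per-token cost falls out of the attention geometry: two tensors (K and V) per layer, each n_kv_heads x head_dim elements. A quick sketch -- the 9B config used here (64 layers, 8 KV heads, head_dim 256) is illustrative, chosen to reproduce the ~0.5 GB per 1K tokens above; read the real values from the GGUF metadata or config.json:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                n_tokens: int, bytes_per_elem: int = 2) -> float:
    """KV cache size in GiB. bytes_per_elem: 2 for FP16, 1 for q8_0."""
    # K and V each store n_layers x n_kv_heads x head_dim values per token.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * n_tokens / 1024**3

# Illustrative 9B-class config at 32K context:
print(kv_cache_gb(64, 8, 256, 32_768))                    # 16.0 (FP16 KV)
print(kv_cache_gb(64, 8, 256, 32_768, bytes_per_elem=1))  # 8.0  (q8_0 KV)
```

The formula is general; only the architecture numbers are assumptions. It also makes the linear growth obvious: doubling the context doubles the cache, quantizing the KV halves it.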

| Model | 4K ctx (KV FP16) | 8K ctx (KV FP16) | 32K ctx (KV FP16) | 128K ctx (KV FP16) | 32K ctx (KV Q8) |
|---|---|---|---|---|---|
| 0.5B | 0.12 GB | 0.24 GB | 0.96 GB | 3.8 GB | 0.48 GB |
| 1.5B | 0.25 GB | 0.50 GB | 2.0 GB | 8.0 GB | 1.0 GB |
| 3B | 0.50 GB | 1.0 GB | 4.0 GB | 16.0 GB | 2.0 GB |
| 7B | 1.0 GB | 2.0 GB | 8.0 GB | 32.0 GB | 4.0 GB |
| 9B | 2.0 GB | 4.0 GB | 16.0 GB | 64.0 GB | 8.0 GB |
| 14B | 2.5 GB | 5.0 GB | 20.0 GB | 80.0 GB | 10.0 GB |
| 32B | 4.0 GB | 8.0 GB | 32.0 GB | 128.0 GB | 16.0 GB |
| 72B | 6.0 GB | 12.0 GB | 48.0 GB | 192.0 GB | 24.0 GB |
| 122B-A10B | 5.5 GB | 11.0 GB | 44.0 GB | 176.0 GB | 22.0 GB |

Watch out: Qwen 3.5 advertises a 128K or 256K context ceiling, but that figure assumes KV cache can grow unbounded. On a 24 GB RTX 4090 running 14B Q4_K_M, the model itself takes 8.6 GB and 128K context alone wants 80 GB of KV cache -- mathematically impossible. Plan for 8K-32K as the realistic interactive ceiling on single consumer GPUs, and use KV quantization plus sliding-window attention to stretch beyond that. The tokens, context and KV cache explainer covers how the KV cache actually works under the hood.

GPU Recommendations by VRAM Tier

Each tier below assumes Q4_K_M quantization and 8K context headroom. Interactive chat with 4-8K context works comfortably on any of these; longer contexts may require tier-bumping.

| Tier | Example cards | Comfortable fit | Tight fit |
|---|---|---|---|
| 4 GB | RTX 3050 4GB, GTX 1650 | 0.5B FP16, 1.5B Q8_0, 3B Q4_K_M | 3B Q5_K_M (4K ctx) |
| 8 GB | RTX 4060 8GB, 3060 Ti, Arc A750 | 7B Q4_K_M, 9B Q3_K_M | 9B Q4_K_M with KV Q8 |
| 12 GB | RTX 3060 12GB, RTX 4070 | 9B Q5_K_M 32K, 14B Q4_K_M 8K | 14B Q5_K_M |
| 16 GB | RTX 4070 Ti Super, 4080 | 14B Q6_K, 9B FP16 | 32B Q3_K_M only |
| 24 GB | RTX 3090, RTX 4090 | 32B Q4_K_M 8K, 35B-A3B MoE Q4, 14B FP16 | 32B Q5_K_M with KV Q8 |
| 32 GB | RTX 5090 | 32B Q6_K, 35B-A3B Q5, 14B FP16 32K | 72B Q2_K only |
| 48 GB | RTX 6000 Ada, A6000, 2x 4090 | 72B Q4_K_M 4K, 32B FP16, 35B-A3B Q8_0 | 72B Q5_K_M |
| 80 GB | A100 80GB, H100 80GB | 72B Q8_0 32K, 122B-A10B MoE Q4 | 122B MoE Q5_K_M |
| Multi-GPU | 8x H100, 8x H200 | 397B-A17B FP8 or Q4, 72B FP16 128K | -- |

For concurrent serving, vLLM tensor-parallel across 8x H100 reaches 120-200 tok/s per request on 397B MoE at Q4. The 24 GB tier is the "one GPU to rule them all" for local work -- it unlocks the MoE variants and every dense model up through 32B.

CPU and RAM Fallback: When You Don't Have a GPU

llama.cpp on a modern CPU is viable for the smaller Qwen 3.5 variants. The bottleneck is memory bandwidth, not compute -- which is why DDR5 laptops and Apple Silicon beat older desktop CPUs with fewer cores. Pure CPU inference on 32B+ is technically possible but painfully slow; unified-memory Macs are the practical ceiling.
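Because decode is memory-bound, you can ceiling-estimate tok/s as bandwidth divided by the bytes read per token (the full weight file for dense models, roughly the active-expert slice for MoE). The efficiency factor below is my own rough assumption to account for cache misses and threading overhead -- measured numbers land well under the theoretical ceiling:

```python
def decode_ceiling_tps(bandwidth_gbs: float, bytes_per_token_gb: float,
                       efficiency: float = 0.35) -> float:
    """Rough upper bound on tok/s for memory-bandwidth-bound decode.

    bandwidth_gbs: sustained memory bandwidth in GB/s
    bytes_per_token_gb: weights read per token -- the full file for dense,
      roughly the active-expert slice for MoE
    efficiency: assumed fudge factor; real systems rarely hit raw bandwidth
    """
    return bandwidth_gbs * efficiency / bytes_per_token_gb

# M3 Max (~400 GB/s) on 9B Q4_K_M (5.5 GB weights):
print(round(decode_ceiling_tps(400, 5.5), 1))  # 25.5 -- in the ballpark of the measured 28 t/s
```

This is also why MoE wins on CPU: the 35B-A3B reads only the ~2 GB active slice per token, so the denominator shrinks even though the full 21.6 GB file sits in RAM.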

| Model (Q4_K_M) | Min RAM | Recommended RAM | Ryzen 9 7950X | M3 Max 48GB | EPYC 9654 |
|---|---|---|---|---|---|
| 0.5B | 2 GB | 4 GB | 180 t/s | 220 t/s | 260 t/s |
| 1.5B | 3 GB | 8 GB | 95 t/s | 130 t/s | 145 t/s |
| 3B | 4 GB | 8 GB | 55 t/s | 78 t/s | 82 t/s |
| 7B | 8 GB | 16 GB | 22 t/s | 36 t/s | 38 t/s |
| 9B | 10 GB | 32 GB | 17 t/s | 28 t/s | 31 t/s |
| 14B | 16 GB | 32 GB | 11 t/s | 18 t/s | 22 t/s |
| 32B | 24 GB | 64 GB | 5.5 t/s | 10 t/s | 12 t/s |
| 72B | 48 GB | 96 GB | 2.1 t/s | 4.8 t/s | 6.2 t/s |
| 35B-A3B (MoE) | 24 GB | 48 GB | 18 t/s | 29 t/s | 32 t/s |
| 122B-A10B (MoE) | 80 GB | 128 GB | OOM | 14 t/s | 17 t/s |

The MoE speed advantage shows clearly in the last two rows: the 35B-A3B runs as fast as the 7B dense model on the same hardware because only 3B parameters activate per token. If you're on a CPU-only machine, MoE variants are the best bang-per-token you can get. Running LLMs without a GPU has the deeper bench comparison for CPU-first builds.

Watch out: Apple Silicon's unified memory means the M3/M4 Max can allocate up to 75% of total RAM to a single model via the Metal backend, which is why the 128 GB M3 Ultra can host 122B-A10B where no consumer GPU can. But Apple's memory bandwidth peaks at 800 GB/s (M3 Ultra), compared to 3.35 TB/s on an H100 -- so raw tok/s will never match datacenter silicon. It's the only path to big-model local inference on one box that costs under $6K.

Measured Tokens/sec: llama.cpp vs vLLM vs MLX

Framework choice shifts throughput by 30-100%. llama.cpp is fastest for single-user on consumer hardware and M-series Macs; vLLM wins at concurrent batch serving thanks to PagedAttention and continuous batching; MLX roughly matches llama.cpp's Metal backend on Apple Silicon.

Qwen 3.5 9B Q4_K_M, single request, 8K context

| Framework | RTX 3060 12GB | RTX 4090 | A100 80GB | M3 Max 48GB |
|---|---|---|---|---|
| llama.cpp (CUDA) | 42 t/s | 118 t/s | 142 t/s | -- |
| vLLM (FP16 equivalent) | OOM | 95 t/s | 155 t/s | -- |
| llama.cpp (Metal) | -- | -- | -- | 48 t/s |
| MLX | -- | -- | -- | 52 t/s |

Qwen 3.5 32B Q4_K_M, single request, 8K context

| Framework | RTX 4090 24GB | RTX 5090 32GB | A100 80GB | H100 80GB |
|---|---|---|---|---|
| llama.cpp (CUDA) | 28 t/s | 44 t/s | 52 t/s | 74 t/s |
| vLLM | Tight fit | 35 t/s | 68 t/s | 98 t/s |

vLLM concurrent serving gain (batch 32, 9B Q4)

On an A100 80GB, vLLM hits 1,840 tok/s aggregate across 32 concurrent requests; llama.cpp server mode gives ~380 tok/s aggregate on the same hardware. Standing up a real API: vLLM. Tinkering locally: llama.cpp. The Ollama vs vLLM vs llama.cpp comparison has the deeper framework breakdown.

When to Pick MoE Over Dense

MoE variants look like a free lunch: 35B of knowledge that runs as fast as 3B. The catch is the VRAM bill -- you still have to fit all parameters in memory because the router picks experts per token and can touch any of them. Three decision points:

  • Pick 35B-A3B MoE if you have 24 GB+ VRAM (RTX 4090, 3090, 48 GB Macs) and want faster decode than dense 14B. On a 4090 it hits ~65 tok/s at Q4_K_M -- roughly 2x the dense 32B.
  • Pick 122B-A10B MoE if you own a 128 GB M3 Ultra / Mac Studio or can fit 75-85 GB across two 48 GB cards. Outperforms dense 72B on reasoning while generating 40% faster.
  • Stick with dense 9B or 14B if you're on 12-16 GB VRAM. MoE variants don't fit; mid-size dense is the best accuracy you can actually run.

Pro tip: MoE expert offloading in llama.cpp (--override-kv qwen3moe.expert_used_count=int:2) lets you run 397B-A17B on a single 24 GB GPU with 256 GB system RAM. Decode drops to 6-8 tok/s as inactive experts stream over PCIe -- slow, but a real way to touch the frontier model without eight H100s. The advanced offloading flags and long-context tuning tricks I've hit in production go out in the newsletter.

Launch Flags Reference: llama.cpp, vLLM, MLX

Use these as templates; tweak --ctx-size, --threads, and --n-gpu-layers based on your VRAM budget from the matrix above.

llama.cpp (14B Q4_K_M on RTX 4070 12GB)

./llama-server \
  --model ./models/qwen3.5-14b-instruct-q4_k_m.gguf \
  --n-gpu-layers 99 --ctx-size 16384 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --threads 8 --batch-size 512 --flash-attn \
  --host 0.0.0.0 --port 8080

vLLM (production serving, 9B on A100 80GB)

python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3.5-9B-Instruct \
  --max-model-len 32768 --gpu-memory-utilization 0.90 \
  --max-num-seqs 64 --quantization awq \
  --host 0.0.0.0 --port 8000

MLX (Apple Silicon, 32B 4-bit on M3 Max 48GB)

pip install mlx-lm
mlx_lm.server --model mlx-community/Qwen3.5-32B-Instruct-4bit \
  --max-tokens 4096 --port 8080

Key flag cheatsheet

| Setting | llama.cpp | vLLM | What it does |
|---|---|---|---|
| Context length | --ctx-size | --max-model-len | Max tokens in context window |
| KV cache quant | --cache-type-k q8_0 | --kv-cache-dtype fp8 | Halves KV VRAM at ~1% quality cost |
| GPU layers | --n-gpu-layers | (auto) | Partial offload for hybrid CPU/GPU |
| Concurrent requests | --parallel | --max-num-seqs | How many requests share KV cache |
| Memory fraction | --main-gpu-fraction | --gpu-memory-utilization | Share of VRAM to reserve |

For Apple Silicon users, the MLX documentation covers unified-memory tuning and converts Hugging Face checkpoints to MLX format.

Which Qwen 3.5 Should You Actually Run?

After three months across this lineup, my matrix comes down to the VRAM you own:

  • 0.5B or 1.5B: embedded assistants, tab-complete, Raspberry Pi 5. Not for general chat -- they lose multi-step instructions.
  • 3B: 4-6 GB GPU or 16 GB laptop. Competent for RAG retrieval and short-doc summarization.
  • 9B: 12 GB VRAM or 32-64 GB system RAM. The sweet spot -- 80% of GPT-4o-mini quality on dev tasks at zero marginal cost.
  • 14B: 16 GB VRAM. Notably better than 9B at long-form reasoning.
  • 32B dense: 4090/5090. Best single-GPU quality. Slow but smart.
  • 35B-A3B MoE: 24 GB VRAM. 14B-class accuracy at 2x the decode speed.
  • 72B or 122B-A10B MoE: 48-128 GB total memory. Serious local inference; the 128 GB Mac Studio is the best reason to own one for AI in 2026.
  • 397B-A17B: 8x H100 or rented equivalent. Production-only.
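The bullet list above collapses into a small lookup. The tier boundaries mirror this article's tables, but the cut-offs and picks are simplifications -- a sketch, not a sizing tool:

```python
# VRAM (or unified memory) ceiling in GB -> suggested variant, per the
# recommendations above. Boundaries are simplified from the tier table.
PICKS = [
    (6,   "Qwen 3.5 3B Q4_K_M"),
    (8,   "Qwen 3.5 7B Q4_K_M"),
    (12,  "Qwen 3.5 9B Q4_K_M/Q5_K_M"),
    (16,  "Qwen 3.5 14B Q4_K_M-Q6_K"),
    (24,  "Qwen 3.5 32B Q4_K_M or 35B-A3B MoE"),
    (48,  "Qwen 3.5 72B Q4_K_M"),
    (128, "Qwen 3.5 122B-A10B MoE"),
]

def pick_model(vram_gb: float) -> str:
    """First tier whose ceiling covers the available memory."""
    for ceiling, pick in PICKS:
        if vram_gb <= ceiling:
            return pick
    return "Qwen 3.5 397B-A17B (multi-GPU only)"

print(pick_model(24))  # -> "Qwen 3.5 32B Q4_K_M or 35B-A3B MoE"
```

Feed it the Q4_K_M column of the VRAM matrix minus 2 GB of KV headroom if you want to be conservative.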

If your goal is self-hosting for a team, the self-hosted ChatGPT guide covers the full OpenWebUI + auth + backup stack on top of the Qwen backend.

Common VRAM Gotchas I Hit in Production

1. KV cache defaults to FP16 and doubles your VRAM budget. llama.cpp's --cache-type-k q8_0 halves it with no perceptible quality drop -- free VRAM.
  2. CUDA allocator fragmentation eats 5-10% VRAM over long runs. Restart the server after heavy traffic or set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True for vLLM.
  3. MoE weight files unpack bigger in RAM than on disk. 122B-A10B ships as a 67 GB Q4 GGUF but needs 74 GB in memory -- plan 10-15% headroom.
  4. Dual-GPU tensor parallelism isn't free. Splitting 32B across 2x RTX 4090 via PCIe 4.0 gives you ~60-70% of a single RTX 6000 Ada's throughput, not 100%.
  5. Apple's Metal backend caps at 67% of unified memory by default. sudo sysctl iogpu.wired_limit_mb=101376 on a 128 GB Mac unlocks the full 99 GB for weights. Resets on reboot unless pinned via launchctl.

Frequently Asked Questions

How much VRAM does Qwen 3.5 need?

Qwen 3.5 0.5B needs 0.3-1 GB of VRAM (Q4 up to FP16), 9B needs 5.5-6.5 GB at Q4-Q5, 32B needs about 20 GB at Q4, and 72B needs 44-50 GB. Add 1-8 GB for the KV cache depending on context. Rule of thumb: take the Q4_K_M weight size in the VRAM matrix and add 2 GB for 8K context.

Can I run Qwen 3.5 on an RTX 4090?

Yes. The 4090's 24 GB VRAM runs 9B at FP16, 14B at Q6_K, 32B at Q4_K_M with 8K context, and 35B-A3B MoE at Q4_K_M. The 72B dense won't fit without CPU offloading; 122B MoE won't fit at all. For most users, the 4090 is the best single-GPU pick for Qwen 3.5.

What is the difference between Qwen 3.5 dense and MoE variants?

Dense models activate every parameter for every token; MoE models load all experts into VRAM but activate only a fraction (A3B means 3B active out of 35B total). MoE decodes faster than dense of the same active size while keeping larger-model knowledge. Trade-off: a 35B MoE still needs ~35B worth of VRAM to load.

Can I run Qwen 3.5 without a GPU?

Yes, up to the 32B dense and 35B-A3B MoE on a modern CPU with 32-64 GB RAM. Expect 17-28 tok/s for 9B on Ryzen 9 / M3 Max, dropping to 2-6 tok/s for 72B. Even a Raspberry Pi 5 runs the 0.5B variant. See the CPU-only inference guide and the 9B walkthrough for setup.

What quantization should I use for Qwen 3.5?

Q4_K_M for best VRAM-per-quality on 8-24 GB GPUs, Q5_K_M when you have extra headroom and want higher accuracy, Q8_0 for production serving where quality matters most. Avoid Q2_K and Q3_K_M on models under 7B -- the perplexity hit becomes visible. For MoE variants, Q4_K_M is the default choice since going lower cuts expert routing quality.

How much VRAM does the Qwen 3.5 KV cache use?

For a 9B model, KV cache at FP16 precision uses roughly 0.5 GB per 1K tokens of context -- so 16 GB for 32K context and 64 GB for 128K context. Quantizing the KV to 8-bit (via llama.cpp's --cache-type-k q8_0) halves this at negligible quality cost. Larger models scale up from there: 14B runs about 25% higher than 9B, and 72B roughly triples it.

Is Qwen 3.5 faster on vLLM or llama.cpp?

vLLM is faster for concurrent requests (3-5x throughput via PagedAttention and continuous batching). llama.cpp is slightly faster for single-user inference on consumer GPUs and significantly better on CPUs and Apple Silicon. Serving an API to multiple users: vLLM. Running locally on your own machine: llama.cpp or Ollama on top of it.

Bottom Line: VRAM Dictates the Model, Everything Else Is Tuning

Look at your VRAM number, find the matching row in the matrix, pick a quantization that leaves 2-8 GB of KV cache headroom, and you're done. The Qwen 3.5 9B at Q4_K_M on a 12 GB card answers the VRAM requirements question for 70% of developers -- it's why that model has its own deep-dive. The 35B-A3B MoE on a 24 GB card unlocks faster decode for another 20%. Above that tier, H100s and Mac Studios are the only honest answer.

Revisit quarterly -- Unsloth and the Qwen team ship new quantizations (including NVFP4 for Blackwell) every few months and the VRAM numbers shift. Re-check before any GPU purchase.

Written by

Abhishek Patel

Infrastructure engineer with 10+ years building production systems on AWS, GCP, and bare metal. Writes practical guides on cloud architecture, containers, networking, and Linux for developers who want to understand how things actually work under the hood.
