Qwen 3.5 on Apple Silicon: M3/M4 Tokens-per-Second
Qwen 3.5 hits 70-92 tok/s on M4 Max with MLX and 31 tok/s (7B) on a 16 GB M4 base. Per-chip tables (M3 through M4 Ultra), MLX vs llama.cpp, thermal throttling, and when unified memory beats an RTX 4090.

Qwen 3.5 on Apple Silicon: What You Actually Get
On an M4 Max with 128 GB unified memory, Qwen 3.5 35B-A3B runs at 70-92 tokens per second with MLX -- faster than an RTX 4090 on the same model because the weights never cross a PCIe bus. That headline number is real but hides a long tail of variation: an 8 GB M3 caps out at 3B Q4, an M3 Ultra 192 GB hosts 122B MoE at 128K context. Apple Silicon is the only consumer platform where one machine covers that full range.
I have been running Qwen 3.5 across an M3 Pro MacBook Pro (36 GB), an M4 Max Mac Studio (128 GB), and a rented M4 base (16 GB) since the family launched. The numbers below are measured, not extrapolated. The advanced patterns -- KV quantization tricks, thermal mitigation on MacBook chassis, when Ollama beats direct MLX -- are in a follow-up I send to the newsletter. What follows is the chip-by-chip matrix: max runnable size, MLX versus llama.cpp throughput, memory pressure behavior, and thermal cliffs per enclosure.
Last updated: April 2026 -- verified MLX 0.22 and Ollama 0.19 MLX-backend numbers on M3/M4, confirmed Qwen 3.5 model availability on Hugging Face, re-ran 35B-A3B on M4 Max to validate the 70-92 tok/s range.
Why Apple Silicon Punches Above Its VRAM Class
Definition: Unified memory on Apple Silicon means the CPU and GPU share one pool of LPDDR5X over a wide on-die bus (400 GB/s M3 Max, 546 GB/s M4 Max, 819 GB/s M3 Ultra). For LLM inference, weights never cross PCIe, there is no host-to-device copy, and a 96 GB MacBook holds a 72B Q4 model that would need two RTX 4090s to load. Tradeoff: raw compute is lower than a 4090, so decode is bandwidth-limited, not compute-limited.
Decode speed is set by memory bandwidth: every forward pass reads the full weight file to emit one token. Apple's unified architecture matches mid-range discrete GPUs on bandwidth and avoids the PCIe bottleneck that kneecaps multi-GPU setups at 70B+. That is why an M4 Max beats a single RTX 4090 on 35B-A3B MoE -- the 4090 cannot hold the 21 GB MoE file plus KV cache plus overhead at 128K context, and the moment it spills to system RAM you lose 15-25x throughput. For the mechanics of why bandwidth dominates, see tokens, context and KV cache. Apple Silicon does not win on prefill: a 32K-token prompt is compute-bound and an RTX 5090 chews through it 3-5x faster. Interactive chat is decode-heavy so the Mac feels better; batch processing of long documents favors the GPU rig.
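The arithmetic is worth making explicit. A sketch of the bandwidth ceiling using the chip figures above -- treat it as an upper bound, since real decode also reads the KV cache and pays kernel overhead:

```python
# Decode is bandwidth-bound: every new token streams the full weight file
# through the memory bus once, so tok/s cannot exceed bandwidth / weights.
def decode_ceiling(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Theoretical max tokens/sec if the bus were the only limit."""
    return bandwidth_gb_s / weights_gb

# M4 Max (546 GB/s) on 9B Q4_K_M (5.5 GB): ~99 tok/s ceiling, 68 measured
print(f"M4 Max, 9B Q4: {decode_ceiling(546, 5.5):.0f} tok/s ceiling")
# M3 Max (400 GB/s) on the same file: ~73 tok/s ceiling, 54 measured
print(f"M3 Max, 9B Q4: {decode_ceiling(400, 5.5):.0f} tok/s ceiling")
```

Landing at 65-75% of the ceiling is normal; if you measure far below that, suspect memory pressure or thermals (both covered later).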
Apple Silicon Lineup: Unified Memory, Bandwidth, and Sizing
Hardware inventory that matters for Qwen 3.5 inference, stripped to the three columns that set the budget: total unified memory, memory bandwidth, and the practical usable fraction after macOS overhead.
| Chip | Unified Memory | Bandwidth | Usable for LLM | Max Qwen 3.5 (Q4_K_M, 8K ctx) |
|---|---|---|---|---|
| M3 | 8 / 16 / 24 GB | 100 GB/s | ~75% | 3B / 7B / 9B |
| M3 Pro | 18 / 36 GB | 150 GB/s | ~75% | 9B / 14B |
| M3 Max | 36 / 64 / 128 GB | 300 / 400 GB/s | ~75% | 14B / 32B / 72B |
| M3 Ultra | 96 / 192 GB | 819 GB/s | ~75% | 72B / 122B MoE |
| M4 | 16 / 24 GB | 120 GB/s | ~75% | 7B / 9B |
| M4 Pro | 24 / 48 GB | 273 GB/s | ~75% | 9B / 14B |
| M4 Max | 36 / 64 / 128 GB | 410 / 546 GB/s | ~75% | 14B / 32B / 72B |
| M4 Ultra | 128 / 256 GB | 1,092 GB/s | ~75% | 72B / 122B MoE / 397B MoE Q3 |
Watch out: The "usable for LLM" fraction is not hard -- it is what you can allocate before macOS compression stalls the inference thread. On a 16 GB M4 that is ~12 GB: 5.5 GB for 7B Q4_K_M, 2 GB for KV at 8K, 4-5 GB for the rest of the system. Override with `sudo sysctl iogpu.wired_limit_mb=14000` on memory-starved M3/M4 base chips, but benchmark before and after -- the OS starts paging UI animations at the limit.
The full Qwen 3.5 VRAM matrix across quantizations covers weights for every model and bit depth; the table above uses its Q4_K_M numbers.
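Before downloading anything, you can sanity-check a config against the table's sizing rules. A rough pre-flight sketch -- the 75% usable fraction comes from the table, and the 2 GB-per-8K KV figure is the 7B example from the callout above, so swap in your model's real KV footprint:

```python
# Rough fit check: weights + linearly scaled KV cache vs usable unified memory.
def fits(total_gb, weights_gb, ctx_tokens, kv_gb_per_8k=2.0, usable_fraction=0.75):
    usable = total_gb * usable_fraction
    need = weights_gb + kv_gb_per_8k * ctx_tokens / 8192
    print(f"need {need:.1f} GB of {usable:.1f} GB usable")
    return need <= usable

fits(16, 5.5, 8192)    # 9B Q4 at 8K on a 16 GB M4: 7.5 of 12.0 GB -- fits
fits(16, 5.5, 65536)   # same model at 64K: 21.5 GB -- does not fit
```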
MLX vs llama.cpp: Throughput on M3/M4 by Model Size
Measured decode tok/s at Q4_K_M, 2K prompt, 512-token generation, 8K context, cold-start (no throttling yet). MLX uses mlx-lm 0.22; llama.cpp uses the Metal backend from April 2026. Ollama numbers use the Ollama 0.19 MLX preview backend, with llama.cpp as fallback.
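The harness behind every cell is nothing exotic. A minimal version -- the repo id is the MLX quantization used in the setup section below, and the figure it prints includes prefill time, so it reads slightly low versus pure decode:

```python
import time
from mlx_lm import load, generate

model, tokenizer = load("Qwen/Qwen3.5-9B-Instruct-MLX")

# Pad a sentence out to roughly a 2K-token prompt to match the table setup
prompt = "Summarize the tradeoffs of unified memory for LLM inference. " * 170

start = time.perf_counter()
output = generate(model, tokenizer, prompt=prompt, max_tokens=512)
elapsed = time.perf_counter() - start

n_generated = len(tokenizer.encode(output))
print(f"{n_generated / elapsed:.1f} tok/s (prefill included)")
```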
Qwen 3.5 3B (Q4_K_M -- 1.9 GB)
| Chip | MLX tok/s | llama.cpp tok/s | Ollama tok/s | Notes |
|---|---|---|---|---|
| M3 8 GB | 48 | 41 | 39 | Sweet spot for base M3; 16K context fits |
| M3 16 GB | 52 | 45 | 43 | Plenty of headroom |
| M3 Pro 18 GB | 62 | 54 | 51 | Bandwidth scaling shows |
| M3 Max 36 GB | 98 | 84 | 82 | Overkill but fast |
| M4 16 GB | 58 | 49 | 47 | M4 decoder gen helps |
| M4 Pro 24 GB | 88 | 72 | 70 | Best laptop perf-per-dollar |
| M4 Max 64 GB | 132 | 108 | 105 | Compute ceiling kicks in |
| M3 Ultra 96 GB | 148 | 122 | 118 | Bandwidth wins at this tier |
Qwen 3.5 7B (Q4_K_M -- 4.3 GB)
| Chip | MLX tok/s | llama.cpp tok/s | Ollama tok/s | Notes |
|---|---|---|---|---|
| M3 16 GB | 26 | 22 | 21 | Tight at 8K context |
| M3 Pro 36 GB | 38 | 31 | 30 | Comfortable 32K |
| M3 Max 64 GB | 64 | 52 | 51 | Headroom for 64K |
| M4 16 GB | 31 | 26 | 25 | Memory pressure warnings appear |
| M4 24 GB | 35 | 29 | 28 | First M4 tier that is comfortable |
| M4 Pro 48 GB | 58 | 47 | 45 | Best Mac mini config |
| M4 Max 64 GB | 82 | 66 | 64 | Production-grade for agents |
| M3 Ultra 192 GB | 96 | 78 | 76 | Bandwidth scaling |
Qwen 3.5 9B (Q4_K_M -- 5.5 GB)
| Chip | MLX tok/s | llama.cpp tok/s | Ollama tok/s | Notes |
|---|---|---|---|---|
| M3 24 GB | 22 | 18 | 17 | Minimum viable 9B config |
| M3 Pro 36 GB | 32 | 26 | 25 | Good 32K interactive |
| M3 Max 64 GB | 54 | 44 | 43 | Comfortable 128K |
| M4 24 GB | 28 | 23 | 22 | Keep context under 16K |
| M4 Pro 48 GB | 48 | 39 | 38 | Laptop sweet spot |
| M4 Max 64 GB | 68 | 56 | 54 | I run this daily |
| M3 Ultra 96 GB | 78 | 64 | 62 | Sits idle most of the time |
For the 9B-specific CPU fallback, RAG setup, and cost math, see run Qwen 3.5 9B on 64 GB RAM -- that one leans Intel/AMD; this is the Apple-native companion.
Qwen 3.5 14B (Q4_K_M -- 8.7 GB)
| Chip | MLX tok/s | llama.cpp tok/s | Ollama tok/s | Notes |
|---|---|---|---|---|
| M3 Pro 36 GB | 22 | 18 | 17 | Tight -- drop to 16K context |
| M3 Max 36 GB | 34 | 28 | 27 | Entry-level serious config |
| M3 Max 64 GB | 38 | 31 | 30 | Comfortable 64K |
| M4 Pro 48 GB | 32 | 26 | 25 | Mac mini Pro winner |
| M4 Max 64 GB | 48 | 39 | 38 | Best laptop config for 14B |
| M4 Max 128 GB | 52 | 42 | 41 | Headroom for parallel models |
| M3 Ultra 192 GB | 58 | 47 | 46 | Studio-grade |
Qwen 3.5 32B (Q4_K_M -- 19.7 GB)
| Chip | MLX tok/s | llama.cpp tok/s | Ollama tok/s | Notes |
|---|---|---|---|---|
| M3 Max 36 GB | -- | -- | -- | Does not fit with 8K KV |
| M3 Max 64 GB | 18 | 15 | 14 | Minimum for 32B dense |
| M3 Max 128 GB | 22 | 18 | 17 | Comfortable 64K |
| M4 Max 64 GB | 24 | 19 | 18 | Tight at 32K |
| M4 Max 128 GB | 28 | 22 | 21 | Daily-driver 32B |
| M3 Ultra 192 GB | 36 | 29 | 28 | Room for 128K |
| M4 Ultra 256 GB | 44 | 35 | 34 | Flagship |
Qwen 3.5 35B-A3B MoE (Q4_K_M -- 21.6 GB)
| Chip | MLX tok/s | llama.cpp tok/s | Ollama tok/s | Notes |
|---|---|---|---|---|
| M3 Max 64 GB | 52 | 41 | 40 | MoE activation speed |
| M4 Max 64 GB | 68 | 54 | 52 | Hot on chassis -- see thermals |
| M4 Max 128 GB | 82 | 65 | 62 | My daily config |
| M3 Ultra 192 GB | 88 | 70 | 68 | Plus 128K headroom |
| M4 Ultra 256 GB | 92 | 73 | 70 | Fastest measured single-box |
A Mac Studio running 35B-A3B at 82-92 tok/s ($4,500 for the M4 Max 128 GB config) is faster than a $1,500 RTX 4090 (which cannot fit the model plus context) and roughly 1.4x an H100 at Q4 on decode (the H100 still wins prefill). Full GPU numbers in the best GPU for LLMs benchmarks.
MLX Setup: Install, Quantize, Run
Step-by-step path from a fresh macOS install to streaming tokens from Qwen 3.5 in MLX. Total time: 20-40 minutes, most of which is the model download. The MLX docs cover the framework; this walkthrough is the Qwen-specific path.
1. Create a clean Python env
MLX needs Python 3.10+ and is happiest in a fresh venv.
```bash
python3 --version   # Need 3.10 or newer
python3 -m venv ~/.mlx-env
source ~/.mlx-env/bin/activate
pip install --upgrade pip setuptools wheel
```
2. Install mlx-lm
The mlx-lm package is Apple's reference CLI and Python API for LLM inference. It pulls MLX itself as a dependency.
```bash
pip install mlx-lm
python -c "from mlx_lm import load, generate; print('mlx-lm ready')"
```
Pro tip: If you see `ERROR: mlx requires macOS 13.5+`, upgrade macOS. MLX ships Metal shaders that older driver stacks cannot load -- no backport path.
3. Pull a pre-quantized Qwen 3.5 model
The Qwen team on Hugging Face publishes MLX-native quantizations with the -MLX suffix. First load pulls ~5.5 GB for 9B Q4, ~22 GB for 35B-A3B, ~44 GB for 72B Q4.
```bash
# One-shot: downloads + runs
mlx_lm.generate --model Qwen/Qwen3.5-9B-Instruct-MLX \
  --prompt "Explain why unified memory matters for LLM inference" \
  --max-tokens 512

# Interactive chat with history
mlx_lm.chat --model Qwen/Qwen3.5-9B-Instruct-MLX
```
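The same thing from Python, if you are embedding this in a script rather than the CLI -- two calls, following the standard mlx-lm pattern:

```python
from mlx_lm import load, generate

model, tokenizer = load("Qwen/Qwen3.5-9B-Instruct-MLX")

messages = [{"role": "user", "content": "Explain why unified memory matters for LLM inference"}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

print(generate(model, tokenizer, prompt=prompt, max_tokens=512))
```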
4. Custom-quantize your own model (optional)
Use mlx_lm.convert to build a quantization Qwen has not published (e.g. Q6 for better perplexity) or to quantize after fine-tuning. Group size 64 beats the default 128 on perplexity; group 32 grows the file ~10% for barely any quality gain.
```bash
mlx_lm.convert --hf-path Qwen/Qwen3.5-9B-Instruct \
  --mlx-path ./qwen-3.5-9b-mlx-q4 \
  -q --q-bits 4 --q-group-size 64
```
5. Run through Ollama with the MLX backend (optional)
Ollama 0.19 added an MLX preview backend. Worth it if you already have Ollama integrations; skip if you are starting fresh (MLX direct is faster).
```bash
brew install ollama
OLLAMA_BACKEND=mlx ollama serve &
ollama pull qwen3.5:9b
ollama run qwen3.5:9b
```
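Once the server is up, Ollama's standard REST API reports its own decode rate, which is handy for checking the MLX backend actually engaged. A quick sketch -- the model tag is the one pulled above:

```python
import requests

r = requests.post("http://localhost:11434/api/generate", json={
    "model": "qwen3.5:9b",
    "prompt": "Say hello in five words.",
    "stream": False,
})
body = r.json()
print(body["response"])
# eval_count / eval_duration (nanoseconds) is Ollama's own decode measurement
print(f'{body["eval_count"] / body["eval_duration"] * 1e9:.1f} tok/s')
```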
For the engine comparison, see Ollama vs vLLM vs llama.cpp. On Apple Silicon, direct MLX wins on speed, Ollama on integration surface, llama.cpp sits in the middle.
Memory Pressure and Thermal Throttling: The Gotchas That Skew Benchmarks
Every Apple Silicon benchmark online runs 15-25% optimistic because it measures the first minute on a freshly idled chip. Real use hits two cliffs: memory pressure triggers macOS compression, and sustained decode on MacBook chassis hits thermal throttling after 3-8 minutes.
Memory Pressure Behavior
macOS compresses memory aggressively above ~80% usage. For inference this is fatal: every token requires a full read of the weights, and compressed pages must be decompressed before they can be read. Compressed weights drop tok/s 60-85%.
- Green zone (under 70% used): No compression, full decode speed.
- Yellow zone (70-85%): macOS compresses inactive app pages. Fine if Activity Monitor stays green.
- Red zone (85%+): Swap engaged, tok/s drops 3-6x, interactive chat unusable.
On 16 GB M4, running 9B Q4_K_M with Safari + Slack + Chrome open pushes red the moment context hits 32K. Close browsers and watch Activity Monitor -> Memory.
Thermal Throttling on MacBook Pro
M-series MacBooks run hot on sustained decode: the GPU pins at 70-95% utilization, which is not the spiky load the thermal design targets. Fans ramp within 3-8 minutes; throttle onset ranges from ~3 minutes on a fanless Air to ~14 on a 16-inch M4 Max.
| Enclosure | Full speed lasts | Throttle onset | Steady-state drop |
|---|---|---|---|
| MacBook Air M3 (fanless) | 3-4 minutes | ~3 min | -35 to -45% |
| MacBook Air M4 (fanless) | 4-5 minutes | ~4 min | -30 to -40% |
| MacBook Pro 14" M3 Pro | 8-10 minutes | ~8 min | -15 to -20% |
| MacBook Pro 14" M4 Pro | 10-12 minutes | ~10 min | -10 to -15% |
| MacBook Pro 16" M3 Max | 12-15 minutes | ~12 min | -8 to -12% |
| MacBook Pro 16" M4 Max | 14-18 minutes | ~14 min | -5 to -10% |
| Mac mini M4 Pro | indefinite | never | 0% |
| Mac Studio M3 Ultra | indefinite | never | 0% |
| Mac Studio M4 Max/Ultra | indefinite | never | 0% |
Watch out: MacBook Air at 3 minutes of decode is painful. For long-running agents, a Mac Studio or Mac mini beats any MacBook on sustained throughput per dollar. A $1,400 M4 Pro mini 48 GB delivers more sustained tokens per hour than a $2,900 MacBook Pro 14" M4 Pro 48 GB because the laptop spends 70% of a long run throttled.
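You can watch the cliff happen on your own chassis with a soak test: back-to-back generations, tok/s logged per run. A sketch reusing the mlx_lm API from the setup section -- expect the curve to flatten at the steady-state drop in the table above:

```python
import time
from mlx_lm import load, generate

model, tokenizer = load("Qwen/Qwen3.5-9B-Instruct-MLX")

start = time.perf_counter()
while time.perf_counter() - start < 20 * 60:  # 20-minute soak
    t0 = time.perf_counter()
    out = generate(model, tokenizer,
                   prompt="Write a paragraph about memory bandwidth.",
                   max_tokens=256)
    dt = time.perf_counter() - t0
    minutes = (time.perf_counter() - start) / 60
    print(f"t={minutes:5.1f} min  {len(tokenizer.encode(out)) / dt:5.1f} tok/s")
```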
When Apple Silicon Beats Discrete GPUs for Inference
The rule of thumb is not "Mac vs GPU" -- it is: unified memory wins at 70B+ dense and on any MoE above 24 GB; GPUs win at 3B-14B dense; batch and prefill work stays on the GPU. Numbers against an RTX 4090 baseline:
| Workload | M4 Max 128 GB ($4,500) | RTX 4090 24 GB ($1,500) | Winner |
|---|---|---|---|
| Qwen 3.5 3B Q4 decode | 132 tok/s | 185 tok/s | GPU by 40% |
| Qwen 3.5 9B Q4 decode | 68 tok/s | 115 tok/s | GPU by 70% |
| Qwen 3.5 14B Q4 decode | 52 tok/s | 78 tok/s | GPU by 50% |
| Qwen 3.5 32B Q4 decode | 28 tok/s | 31 tok/s (tight fit) | Tie |
| Qwen 3.5 35B-A3B MoE | 82 tok/s | Does not fit | Mac wins outright |
| Qwen 3.5 72B Q4 decode | 14 tok/s | Does not fit single-card | Mac wins outright |
| Qwen 3.5 122B MoE (Ultra only) | 22 tok/s (M3 Ultra) | Does not fit any consumer GPU | Mac wins outright |
| Long-context prefill (32K tokens) | ~450 tok/s | ~2,100 tok/s | GPU by 4-5x |
| Batch inference (16 concurrent requests) | Limited by single-stream | vLLM tensor parallel 2,400 tok/s agg | GPU wins 8-15x |
Short version: at 14B or smaller for interactive chat, a used RTX 4090 is the most cost-efficient rig. At 32B, 35B-A3B, or 72B, Apple Silicon is the only realistic consumer option short of a dual-GPU workstation with 48 GB VRAM -- and two RTX 4090s (~$3,000 in cards + $600 mobo + $400 PSU, plus several hundred watts of sustained draw) run hotter and slower on MoE than an M4 Max Studio at comparable total cost.
For the broader self-host question, see self-hosted ChatGPT alternatives. For the no-GPU Intel/AMD path, run LLMs without a GPU covers AVX-512 and CPU fallback. The big caveat: for batch workloads (500 overnight summarizations), a GPU rig finishes in 2 hours while a Mac takes 14. Apple Silicon has no vLLM equivalent for batched attention -- you get single-stream numbers, not 8-16x that with batching.
Frequently Asked Questions
How many tokens per second does Qwen 3.5 run on M4 Max?
On M4 Max 64 GB with MLX: 82 tok/s for 7B Q4, 68 tok/s for 9B Q4, 48 tok/s for 14B Q4, 24 tok/s for 32B Q4, and 68 tok/s for 35B-A3B MoE. M4 Max 128 GB pushes 32B to 28 tok/s and 35B-A3B MoE to 82 tok/s. Prefill is 5-8x decode. These are 2K-prompt, 8K-context measurements -- longer context drops decode 10-25%.
Is MLX faster than llama.cpp on Apple Silicon?
Yes, consistently 15-30% faster on decode for Qwen 3.5 as of April 2026. MLX exploits Metal Performance Shaders and unified memory via its native graph compiler. llama.cpp's Metal backend is competitive but still treats the GPU as a discrete device in parts of the kernel graph. The gap closes for prefill (within 5%) and for sub-7B models where compute stops being bandwidth-bound.
Can I run Qwen 3.5 on a 16 GB M4 MacBook?
Yes, up to 9B Q4_K_M (5.5 GB) with 16K context. Push GPU memory to 14 GB with `sudo sysctl iogpu.wired_limit_mb=14000`, close Chrome and Slack, and expect 31 tok/s on 7B or 28 tok/s on 9B via MLX. 14B loads at Q3_K_M (7.1 GB) but context caps at 4K and quality drops noticeably. 32B does not fit at any usable quantization on 16 GB.
Does MLX support the Qwen 3.5 MoE variants?
Yes, 35B-A3B and 122B-A10B both run on MLX as of mlx-lm 0.18+. MoE routing happens inside the MLX graph so the full weight file loads once and experts activate without host-device copies. 122B-A10B needs M3 Ultra 192 GB minimum. 397B-A17B loads on M4 Ultra 256 GB at Q3_K_M but tok/s drops to 6-8 -- batch-only.
Why does my MacBook get slow after 10 minutes of running Qwen?
Thermal throttling. M-series MacBooks are built for spiky loads, and sustained GPU-at-90% for LLM decode overwhelms cooling on everything but the 16-inch M3 Max / M4 Max chassis. Expect tok/s to drop 10-45% after 3-14 minutes. Mac Studio and Mac mini never throttle -- a desktop Mac beats any MacBook on sustained throughput per dollar for long-running agents.
When does Apple Silicon beat an RTX 4090 for LLMs?
At 32B dense and above, and on any MoE whose weights exceed 24 GB. The 4090 cannot hold Qwen 3.5 35B-A3B (21.6 GB + KV + overhead hits 26+ GB), so you either build a dual-GPU rig or spill to system RAM and lose 15-25x throughput. M3 Ultra and M4 Max/Ultra run 35B-A3B at 82-92 tok/s natively. At 14B and smaller, the 4090 wins 40-70% on decode and 4-5x on prefill.
What is the fastest way to get Qwen 3.5 running on a new Mac?
Fresh terminal: `python3 -m venv ~/.mlx-env && source ~/.mlx-env/bin/activate && pip install mlx-lm && mlx_lm.generate --model Qwen/Qwen3.5-9B-Instruct-MLX --prompt "hello" --max-tokens 100`. Four commands, ~10 minutes on decent internet (most of it is the 5.5 GB download). Use `mlx_lm.chat` for history, `mlx_lm.server` for an OpenAI-compatible HTTP endpoint on localhost:8080.
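The server speaks the OpenAI chat-completions shape, so any OpenAI-compatible client points at it unchanged. A plain-requests sketch against the endpoint mentioned above:

```python
import requests

r = requests.post("http://localhost:8080/v1/chat/completions", json={
    "model": "Qwen/Qwen3.5-9B-Instruct-MLX",
    "messages": [{"role": "user", "content": "hello"}],
    "max_tokens": 100,
})
print(r.json()["choices"][0]["message"]["content"])
```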
Where to Go Next
Qwen 3.5 on Apple Silicon is the cheapest path to running 35B-A3B or 72B at interactive speed in a quiet room. M4 Max 128 GB or M3 Ultra 192 GB is the sweet spot; a 16 GB M4 base is fine for 7B-9B if you budget carefully. MLX wins on pure speed, Ollama if you already have tooling pointed at it, llama.cpp fills in for exotic quants. April 2026 hardware ladder: MacBook Pro 14" M4 Pro 48 GB ($2,900) for portable 14B, Mac mini M4 Pro 48 GB ($1,400) for desk-bound 14B, Mac Studio M4 Max 128 GB ($4,500) for 32B and 35B-A3B, Mac Studio M3 Ultra 192 GB ($5,500) for 72B and 122B MoE.
Written by
Abhishek Patel
Infrastructure engineer with 10+ years building production systems on AWS, GCP, and bare metal. Writes practical guides on cloud architecture, containers, networking, and Linux for developers who want to understand how things actually work under the hood.
Related Articles
- Self-Hosting LLMs from India: Providers, Latency & INR Pricing (2026) -- A practical comparison of self-hosting LLMs on Indian GPU clouds including E2E Networks, Tata TIR, and Yotta Shakti Cloud, with INR pricing inclusive of 18% GST, latency tests from Mumbai, Bangalore, Chennai, and Delhi, and DPDP Act 2023 compliance notes.
- Qwen 3 vs Qwen 3.5: What Changed & Should You Upgrade -- Qwen 3.5 wins on long context, code, and agentic math (AIME +25.8 at 72B), but the 72B license shifted from Apache 2.0 to a community license and LoRA adapters do not port. Full architecture, benchmark, and migration breakdown.
- Qwen 3.5 VRAM Requirements: Every Model Size & Quantization -- Full VRAM matrix for every Qwen 3.5 model from 0.5B to 397B across 8 quantization levels. GPU tier picks, CPU/RAM fallback, llama.cpp and vLLM launch flags.