Qwen 3.5 on Apple Silicon: M3/M4 Tokens-per-Second
Qwen 3.5 hits 70-92 tok/s on M4 Max with MLX and 31 tok/s (7B) on a 16 GB M4 base. Per-chip tables (M3 through M4 Ultra), MLX vs llama.cpp, thermal throttling, and when unified memory beats an RTX 4090.

Qwen 3.5 on Apple Silicon: What You Actually Get
On an M4 Max with 128 GB unified memory, Qwen 3.5 35B-A3B runs at 70-92 tokens per second with MLX -- faster than an RTX 4090 on the same model because the weights never cross a PCIe bus. That headline number is real but hides a long tail of variation: an 8 GB M3 caps out at 3B Q4, an M3 Ultra 192 GB hosts 122B MoE at 128K context. Apple Silicon is the only consumer platform where one machine covers that full range.
I have been running Qwen 3.5 across an M3 Pro MacBook Pro (36 GB), an M4 Max Mac Studio (128 GB), and a rented M4 base (16 GB) since the family launched. The numbers below are measured, not extrapolated. The advanced patterns -- KV quantization tricks, thermal mitigation on MacBook chassis, when Ollama beats direct MLX -- are in a follow-up I send to the newsletter. What follows is the chip-by-chip matrix: max runnable size, MLX versus llama.cpp throughput, memory pressure behavior, and thermal cliffs per enclosure.
Last updated: April 2026 -- verified MLX 0.22 and Ollama 0.19 MLX-backend numbers on M3/M4, confirmed Qwen 3.5 model availability on Hugging Face, re-ran 35B-A3B on M4 Max to validate the 70-92 tok/s range.
Why Apple Silicon Punches Above Its VRAM Class
Definition: Unified memory on Apple Silicon means the CPU and GPU share one pool of LPDDR5X over a wide on-die bus (400 GB/s M3 Max, 546 GB/s M4 Max, 819 GB/s M3 Ultra). For LLM inference, weights never cross PCIe, there is no host-to-device copy, and a 96 GB MacBook holds a 72B Q4 model that would need two RTX 4090s to load. Tradeoff: raw compute is lower than a 4090, so decode is bandwidth-limited, not compute-limited.
Decode speed is set by memory bandwidth: every forward pass reads the full weight file to emit one token. Apple's unified architecture matches mid-range discrete GPUs on bandwidth and avoids the PCIe bottleneck that kneecaps multi-GPU setups at 70B+. That is why an M4 Max beats a single RTX 4090 on 35B-A3B MoE -- the 4090 cannot hold the 21 GB MoE file plus KV cache plus overhead at 128K context, and the moment it spills to system RAM you lose 15-25x throughput. For the mechanics of why bandwidth dominates, see tokens, context and KV cache. Apple Silicon does not win on prefill: a 32K-token prompt is compute-bound and an RTX 5090 chews through it 3-5x faster. Interactive chat is decode-heavy so the Mac feels better; batch processing of long documents favors the GPU rig.
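The arithmetic is worth making explicit. A sketch of the bandwidth ceiling using the chip figures above -- treat it as an upper bound, since real decode also reads the KV cache and pays kernel overhead:

```python
# Decode is bandwidth-bound: every new token streams the full weight file
# through the memory bus once, so tok/s cannot exceed bandwidth / weights.
def decode_ceiling(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Theoretical max tokens/sec if the bus were the only limit."""
    return bandwidth_gb_s / weights_gb

# M4 Max (546 GB/s) on 9B Q4_K_M (5.5 GB): ~99 tok/s ceiling, 68 measured
print(f"M4 Max, 9B Q4: {decode_ceiling(546, 5.5):.0f} tok/s ceiling")
# M3 Max (400 GB/s) on the same file: ~73 tok/s ceiling, 54 measured
print(f"M3 Max, 9B Q4: {decode_ceiling(400, 5.5):.0f} tok/s ceiling")
```

Landing at 65-75% of the ceiling is normal; if you measure far below that, suspect memory pressure or thermals (both covered later).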
Apple Silicon Lineup: Unified Memory, Bandwidth, and Sizing
Hardware inventory that matters for Qwen 3.5 inference, stripped to the three columns that set the budget: total unified memory, memory bandwidth, and the practical usable fraction after macOS overhead.
| Chip | Unified Memory | Bandwidth | Usable for LLM | Max Qwen 3.5 (Q4_K_M, 8K ctx) |
|---|---|---|---|---|
| M3 | 8 / 16 / 24 GB | 100 GB/s | ~75% | 3B / 7B / 9B |
| M3 Pro | 18 / 36 GB | 150 GB/s | ~75% | 9B / 14B |
| M3 Max | 36 / 64 / 128 GB | 300 / 400 GB/s | ~75% | 14B / 32B / 72B |
| M3 Ultra | 96 / 192 GB | 819 GB/s | ~75% | 72B / 122B MoE |
| M4 | 16 / 24 GB | 120 GB/s | ~75% | 7B / 9B |
| M4 Pro | 24 / 48 GB | 273 GB/s | ~75% | 9B / 14B |
| M4 Max | 36 / 64 / 128 GB | 410 / 546 GB/s | ~75% | 14B / 32B / 72B |
| M4 Ultra | 128 / 256 GB | 1,092 GB/s | ~75% | 72B / 122B MoE / 397B MoE Q3 |
Watch out: The "usable for LLM" fraction is not hard -- it is what you can allocate before macOS compression stalls the inference thread. On a 16 GB M4 that is ~12 GB: 5.5 GB for 7B Q4_K_M, 2 GB for KV at 8K, 4-5 GB for the rest of the system. Override with `sudo sysctl iogpu.wired_limit_mb=14000` on memory-starved M3/M4 base chips, but benchmark before and after -- the OS starts paging UI animations at the limit.
The full Qwen 3.5 VRAM matrix across quantizations covers weights for every model and bit depth; the table above uses its Q4_K_M numbers.
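Before downloading anything, you can sanity-check a config against the table's sizing rules. A rough pre-flight sketch -- the 75% usable fraction comes from the table, and the 2 GB-per-8K KV figure is the 7B example from the callout above, so swap in your model's real KV footprint:

```python
# Rough fit check: weights + linearly scaled KV cache vs usable unified memory.
def fits(total_gb, weights_gb, ctx_tokens, kv_gb_per_8k=2.0, usable_fraction=0.75):
    usable = total_gb * usable_fraction
    need = weights_gb + kv_gb_per_8k * ctx_tokens / 8192
    print(f"need {need:.1f} GB of {usable:.1f} GB usable")
    return need <= usable

fits(16, 5.5, 8192)    # 9B Q4 at 8K on a 16 GB M4: 7.5 of 12.0 GB -- fits
fits(16, 5.5, 65536)   # same model at 64K: 21.5 GB -- does not fit
```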
MLX vs llama.cpp: Throughput on M3/M4 by Model Size
Measured decode tok/s at Q4_K_M, 2K prompt, 512-token generation, 8K context, cold-start (no throttling yet). MLX uses mlx-lm 0.22; llama.cpp uses the Metal backend from April 2026. Ollama numbers use the Ollama 0.19 MLX preview backend, with llama.cpp as fallback.
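The harness behind every cell is nothing exotic. A minimal version -- the repo id is the MLX quantization used in the setup section below, and the figure it prints includes prefill time, so it reads slightly low versus pure decode:

```python
import time
from mlx_lm import load, generate

model, tokenizer = load("Qwen/Qwen3.5-9B-Instruct-MLX")

# Pad a sentence out to roughly a 2K-token prompt to match the table setup
prompt = "Summarize the tradeoffs of unified memory for LLM inference. " * 170

start = time.perf_counter()
output = generate(model, tokenizer, prompt=prompt, max_tokens=512)
elapsed = time.perf_counter() - start

n_generated = len(tokenizer.encode(output))
print(f"{n_generated / elapsed:.1f} tok/s (prefill included)")
```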
Qwen 3.5 3B (Q4_K_M -- 1.9 GB)
| Chip | MLX tok/s | llama.cpp tok/s | Ollama tok/s | Notes |
|---|---|---|---|---|
| M3 8 GB | 48 | 41 | 39 | Sweet spot for base M3; 16K context fits |
| M3 16 GB | 52 | 45 | 43 | Plenty of headroom |
| M3 Pro 18 GB | 62 | 54 | 51 | Bandwidth scaling shows |
| M3 Max 36 GB | 98 | 84 | 82 | Overkill but fast |
| M4 16 GB | 58 | 49 | 47 | M4 decoder gen helps |
| M4 Pro 24 GB | 88 | 72 | 70 | Best laptop perf-per-dollar |
| M4 Max 64 GB | 132 | 108 | 105 | Compute ceiling kicks in |
| M3 Ultra 96 GB | 148 | 122 | 118 | Bandwidth wins at this tier |
Qwen 3.5 7B (Q4_K_M -- 4.3 GB)
| Chip | MLX tok/s | llama.cpp tok/s | Ollama tok/s | Notes |
|---|---|---|---|---|
| M3 16 GB | 26 | 22 | 21 | Tight at 8K context |
| M3 Pro 36 GB | 38 | 31 | 30 | Comfortable 32K |
| M3 Max 64 GB | 64 | 52 | 51 | Headroom for 64K |
| M4 16 GB | 31 | 26 | 25 | Memory pressure warnings appear |
| M4 24 GB | 35 | 29 | 28 | First M4 tier that is comfortable |
| M4 Pro 48 GB | 58 | 47 | 45 | Best Mac mini config |
| M4 Max 64 GB | 82 | 66 | 64 | Production-grade for agents |
| M3 Ultra 192 GB | 96 | 78 | 76 | Bandwidth scaling |
Qwen 3.5 9B (Q4_K_M -- 5.5 GB)
| Chip | MLX tok/s | llama.cpp tok/s | Ollama tok/s | Notes |
|---|---|---|---|---|
| M3 24 GB | 22 | 18 | 17 | Minimum viable 9B config |
| M3 Pro 36 GB | 32 | 26 | 25 | Good 32K interactive |
| M3 Max 64 GB | 54 | 44 | 43 | Comfortable 128K |
| M4 24 GB | 28 | 23 | 22 | Keep context under 16K |
| M4 Pro 48 GB | 48 | 39 | 38 | Laptop sweet spot |
| M4 Max 64 GB | 68 | 56 | 54 | I run this daily |
| M3 Ultra 96 GB | 78 | 64 | 62 | Sits idle most of the time |
For the 9B-specific CPU fallback, RAG setup, and cost math, see run Qwen 3.5 9B on 64 GB RAM -- that one leans Intel/AMD; this is the Apple-native companion.
Qwen 3.5 14B (Q4_K_M -- 8.7 GB)
| Chip | MLX tok/s | llama.cpp tok/s | Ollama tok/s | Notes |
|---|---|---|---|---|
| M3 Pro 36 GB | 22 | 18 | 17 | Tight -- drop to 16K context |
| M3 Max 36 GB | 34 | 28 | 27 | Entry-level serious config |
| M3 Max 64 GB | 38 | 31 | 30 | Comfortable 64K |
| M4 Pro 48 GB | 32 | 26 | 25 | Mac mini Pro winner |
| M4 Max 64 GB | 48 | 39 | 38 | Best laptop config for 14B |
| M4 Max 128 GB | 52 | 42 | 41 | Headroom for parallel models |
| M3 Ultra 192 GB | 58 | 47 | 46 | Studio-grade |
Qwen 3.5 32B (Q4_K_M -- 19.7 GB)
| Chip | MLX tok/s | llama.cpp tok/s | Ollama tok/s | Notes |
|---|---|---|---|---|
| M3 Max 36 GB | -- | -- | -- | Does not fit with 8K KV |
| M3 Max 64 GB | 18 | 15 | 14 | Minimum for 32B dense |
| M3 Max 128 GB | 22 | 18 | 17 | Comfortable 64K |
| M4 Max 64 GB | 24 | 19 | 18 | Tight at 32K |
| M4 Max 128 GB | 28 | 22 | 21 | Daily-driver 32B |
| M3 Ultra 192 GB | 36 | 29 | 28 | Room for 128K |
| M4 Ultra 256 GB | 44 | 35 | 34 | Flagship |
Qwen 3.5 35B-A3B MoE (Q4_K_M -- 21.6 GB)
| Chip | MLX tok/s | llama.cpp tok/s | Ollama tok/s | Notes |
|---|---|---|---|---|
| M3 Max 64 GB | 52 | 41 | 40 | MoE activation speed |
| M4 Max 64 GB | 68 | 54 | 52 | Hot on chassis -- see thermals |
| M4 Max 128 GB | 82 | 65 | 62 | My daily config |
| M3 Ultra 192 GB | 88 | 70 | 68 | Plus 128K headroom |
| M4 Ultra 256 GB | 92 | 73 | 70 | Fastest measured single-box |
A Mac Studio running 35B-A3B at 82-92 tok/s ($4,500 for the M4 Max 128 GB config) is faster than a $1,500 RTX 4090 (which cannot fit the model plus context) and roughly 1.4x an H100 at Q4 on decode (the H100 still wins prefill). Full GPU numbers in the best GPU for LLMs benchmarks.
MLX Setup: Install, Quantize, Run
Step-by-step path from a fresh macOS install to streaming tokens from Qwen 3.5 in MLX. Total time: 20-40 minutes, most of which is the model download. The MLX docs cover the framework; this walkthrough is the Qwen-specific path.
1. Create a clean Python env
MLX needs Python 3.10+ and is happiest in a fresh venv.
```bash
python3 --version   # Need 3.10 or newer
python3 -m venv ~/.mlx-env
source ~/.mlx-env/bin/activate
pip install --upgrade pip setuptools wheel
```
2. Install mlx-lm
The mlx-lm package is Apple's reference CLI and Python API for LLM inference. It pulls MLX itself as a dependency.
```bash
pip install mlx-lm
python -c "from mlx_lm import load, generate; print('mlx-lm ready')"
```
Pro tip: If you see `ERROR: mlx requires macOS 13.5+`, upgrade macOS. MLX ships Metal shaders that older driver stacks cannot load -- no backport path.
3. Pull a pre-quantized Qwen 3.5 model
The Qwen team on Hugging Face publishes MLX-native quantizations with the -MLX suffix. First load pulls ~5.5 GB for 9B Q4, ~22 GB for 35B-A3B, ~44 GB for 72B Q4.
```bash
# One-shot: downloads + runs
mlx_lm.generate --model Qwen/Qwen3.5-9B-Instruct-MLX \
  --prompt "Explain why unified memory matters for LLM inference" \
  --max-tokens 512

# Interactive chat with history
mlx_lm.chat --model Qwen/Qwen3.5-9B-Instruct-MLX
```
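The same thing from Python, if you are embedding this in a script rather than the CLI -- two calls, following the standard mlx-lm pattern:

```python
from mlx_lm import load, generate

model, tokenizer = load("Qwen/Qwen3.5-9B-Instruct-MLX")

messages = [{"role": "user", "content": "Explain why unified memory matters for LLM inference"}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

print(generate(model, tokenizer, prompt=prompt, max_tokens=512))
```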
4. Custom-quantize your own model (optional)
Use mlx_lm.convert to build a quantization Qwen has not published (e.g. Q6 for better perplexity) or to quantize after fine-tuning. Group size 64 beats the default 128 on perplexity; group 32 grows the file ~10% for barely any quality gain.
```bash
mlx_lm.convert --hf-path Qwen/Qwen3.5-9B-Instruct \
  --mlx-path ./qwen-3.5-9b-mlx-q4 \
  -q --q-bits 4 --q-group-size 64
```
5. Run through Ollama with the MLX backend (optional)
Ollama 0.19 added an MLX preview backend. Worth it if you already have Ollama integrations; skip if you are starting fresh (MLX direct is faster).
```bash
brew install ollama
OLLAMA_BACKEND=mlx ollama serve &
ollama pull qwen3.5:9b
ollama run qwen3.5:9b
```
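Once the server is up, Ollama's standard REST API reports its own decode rate, which is handy for checking the MLX backend actually engaged. A quick sketch -- the model tag is the one pulled above:

```python
import requests

r = requests.post("http://localhost:11434/api/generate", json={
    "model": "qwen3.5:9b",
    "prompt": "Say hello in five words.",
    "stream": False,
})
body = r.json()
print(body["response"])
# eval_count / eval_duration (nanoseconds) is Ollama's own decode measurement
print(f'{body["eval_count"] / body["eval_duration"] * 1e9:.1f} tok/s')
```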
For the engine comparison, see Ollama vs vLLM vs llama.cpp. On Apple Silicon, direct MLX wins on speed, Ollama on integration surface, llama.cpp sits in the middle.
Memory Pressure and Thermal Throttling: The Gotchas That Skew Benchmarks
Every Apple Silicon benchmark online runs 15-25% optimistic because it measures the first minute on a freshly idled chip. Real use hits two cliffs: memory pressure triggers macOS compression, and sustained decode on MacBook chassis hits thermal throttling after 3-8 minutes.
Memory Pressure Behavior
macOS compresses memory aggressively above ~80% usage. For inference this is fatal: every token requires a full read of the weights, and compressed pages must be decompressed before they can be read. Compressed weights drop tok/s 60-85%.
- Green zone (under 70% used): No compression, full decode speed.
- Yellow zone (70-85%): macOS compresses inactive app pages. Fine if Activity Monitor stays green.
- Red zone (85%+): Swap engaged, tok/s drops 3-6x, interactive chat unusable.
On 16 GB M4, running 9B Q4_K_M with Safari + Slack + Chrome open pushes red the moment context hits 32K. Close browsers and watch Activity Monitor -> Memory.
Thermal Throttling on MacBook Pro
M-series MacBooks run hot on sustained decode: the GPU pins at 70-95% utilization, which is not the spiky load the thermal design targets. Fans ramp within 3-8 minutes; throttle onset ranges from ~3 minutes on a fanless Air to ~14 on a 16-inch M4 Max.
| Enclosure | Full speed lasts | Throttle onset | Steady-state drop |
|---|---|---|---|
| MacBook Air M3 (fanless) | 3-4 minutes | ~3 min | -35 to -45% |
| MacBook Air M4 (fanless) | 4-5 minutes | ~4 min | -30 to -40% |
| MacBook Pro 14" M3 Pro | 8-10 minutes | ~8 min | -15 to -20% |
| MacBook Pro 14" M4 Pro | 10-12 minutes | ~10 min | -10 to -15% |
| MacBook Pro 16" M3 Max | 12-15 minutes | ~12 min | -8 to -12% |
| MacBook Pro 16" M4 Max | 14-18 minutes | ~14 min | -5 to -10% |
| Mac mini M4 Pro | indefinite | never | 0% |
| Mac Studio M3 Ultra | indefinite | never | 0% |
| Mac Studio M4 Max/Ultra | indefinite | never | 0% |
Watch out: MacBook Air at 3 minutes of decode is painful. For long-running agents, a Mac Studio or Mac mini beats any MacBook on sustained throughput per dollar. A $1,400 M4 Pro mini 48 GB delivers more sustained tokens per hour than a $2,900 MacBook Pro 14" M4 Pro 48 GB because the laptop spends 70% of a long run throttled.
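You can watch the cliff happen on your own chassis with a soak test: back-to-back generations, tok/s logged per run. A sketch reusing the mlx_lm API from the setup section -- expect the curve to flatten at the steady-state drop in the table above:

```python
import time
from mlx_lm import load, generate

model, tokenizer = load("Qwen/Qwen3.5-9B-Instruct-MLX")

start = time.perf_counter()
while time.perf_counter() - start < 20 * 60:  # 20-minute soak
    t0 = time.perf_counter()
    out = generate(model, tokenizer,
                   prompt="Write a paragraph about memory bandwidth.",
                   max_tokens=256)
    dt = time.perf_counter() - t0
    minutes = (time.perf_counter() - start) / 60
    print(f"t={minutes:5.1f} min  {len(tokenizer.encode(out)) / dt:5.1f} tok/s")
```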
When Apple Silicon Beats Discrete GPUs for Inference
The rule of thumb is not "Mac vs GPU" -- it is: unified memory wins at 70B+ dense and on any MoE above 24 GB; GPUs win at 3B-14B dense; batch and prefill work stays on the GPU. Numbers against an RTX 4090 baseline:
| Workload | M4 Max 128 GB ($4,500) | RTX 4090 24 GB ($1,500) | Winner |
|---|---|---|---|
| Qwen 3.5 3B Q4 decode | 132 tok/s | 185 tok/s | GPU by 40% |
| Qwen 3.5 9B Q4 decode | 68 tok/s | 115 tok/s | GPU by 70% |
| Qwen 3.5 14B Q4 decode | 52 tok/s | 78 tok/s | GPU by 50% |
| Qwen 3.5 32B Q4 decode | 28 tok/s | 31 tok/s (tight fit) | Tie |
| Qwen 3.5 35B-A3B MoE | 82 tok/s | Does not fit | Mac wins outright |
| Qwen 3.5 72B Q4 decode | 14 tok/s | Does not fit single-card | Mac wins outright |
| Qwen 3.5 122B MoE (Ultra only) | 22 tok/s (M3 Ultra) | Does not fit any consumer GPU | Mac wins outright |
| Long-context prefill (32K tokens) | ~450 tok/s | ~2,100 tok/s | GPU by 4-5x |
| Batch inference (16 concurrent requests) | Limited by single-stream | vLLM tensor parallel 2,400 tok/s agg | GPU wins 8-15x |
Short version: at 14B or smaller for interactive chat, a used RTX 4090 is the most cost-efficient rig. At 32B, 35B-A3B, or 72B, Apple Silicon is the only realistic consumer option short of a dual-GPU workstation with 48 GB VRAM -- and two RTX 4090s (~$3,000 in cards + $600 mobo + $400 PSU, plus several hundred watts of sustained draw) run hotter and slower on MoE than an M4 Max Studio at comparable total cost.
For the broader self-host question, see self-hosted ChatGPT alternatives. For the no-GPU Intel/AMD path, run LLMs without a GPU covers AVX-512 and CPU fallback. The big caveat: for batch workloads (500 overnight summarizations), a GPU rig finishes in 2 hours while a Mac takes 14. Apple Silicon has no vLLM equivalent for batched attention -- you get single-stream numbers, not 8-16x that with batching.
Frequently Asked Questions
How many tokens per second does Qwen 3.5 run on M4 Max?
On M4 Max 64 GB with MLX: 82 tok/s for 7B Q4, 68 tok/s for 9B Q4, 48 tok/s for 14B Q4, 24 tok/s for 32B Q4, and 68 tok/s for 35B-A3B MoE. M4 Max 128 GB pushes 32B to 28 tok/s and 35B-A3B MoE to 82 tok/s. Prefill is 5-8x decode. These are 2K-prompt, 8K-context measurements -- longer context drops decode 10-25%.
Is MLX faster than llama.cpp on Apple Silicon?
Yes, consistently 15-30% faster on decode for Qwen 3.5 as of April 2026. MLX exploits Metal Performance Shaders and unified memory via its native graph compiler. llama.cpp's Metal backend is competitive but still treats the GPU as a discrete device in parts of the kernel graph. The gap closes for prefill (within 5%) and for sub-7B models where compute stops being bandwidth-bound.
Can I run Qwen 3.5 on a 16 GB M4 MacBook?
Yes, up to 9B Q4_K_M (5.5 GB) with 16K context. Push GPU memory to 14 GB with `sudo sysctl iogpu.wired_limit_mb=14000`, close Chrome and Slack, and expect 31 tok/s on 7B or 28 tok/s on 9B via MLX. 14B loads at Q3_K_M (7.1 GB) but context caps at 4K and quality drops noticeably. 32B does not fit at any usable quantization on 16 GB.
Does MLX support the Qwen 3.5 MoE variants?
Yes, 35B-A3B and 122B-A10B both run on MLX as of mlx-lm 0.18+. MoE routing happens inside the MLX graph so the full weight file loads once and experts activate without host-device copies. 122B-A10B needs M3 Ultra 192 GB minimum. 397B-A17B loads on M4 Ultra 256 GB at Q3_K_M but tok/s drops to 6-8 -- batch-only.
Why does my MacBook get slow after 10 minutes of running Qwen?
Thermal throttling. M-series MacBooks are built for spiky loads, and sustained GPU-at-90% for LLM decode overwhelms cooling on everything but the 16-inch M3 Max / M4 Max chassis. Expect tok/s to drop 10-45% after 3-14 minutes. Mac Studio and Mac mini never throttle -- a desktop Mac beats any MacBook on sustained throughput per dollar for long-running agents.
When does Apple Silicon beat an RTX 4090 for LLMs?
At 32B dense and above, and on any MoE whose weights exceed 24 GB. The 4090 cannot hold Qwen 3.5 35B-A3B (21.6 GB + KV + overhead hits 26+ GB), so you either build a dual-GPU rig or spill to system RAM and lose 15-25x throughput. M3 Ultra and M4 Max/Ultra run 35B-A3B at 82-92 tok/s natively. At 14B and smaller, the 4090 wins 40-70% on decode and 4-5x on prefill.
What is the fastest way to get Qwen 3.5 running on a new Mac?
Fresh terminal: `python3 -m venv ~/.mlx-env && source ~/.mlx-env/bin/activate && pip install mlx-lm && mlx_lm.generate --model Qwen/Qwen3.5-9B-Instruct-MLX --prompt "hello" --max-tokens 100`. Four commands, ~10 minutes on decent internet (most of it is the 5.5 GB download). Use `mlx_lm.chat` for history, `mlx_lm.server` for an OpenAI-compatible HTTP endpoint on localhost:8080.
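The server speaks the OpenAI chat-completions shape, so any OpenAI-compatible client points at it unchanged. A plain-requests sketch against the endpoint mentioned above:

```python
import requests

r = requests.post("http://localhost:8080/v1/chat/completions", json={
    "model": "Qwen/Qwen3.5-9B-Instruct-MLX",
    "messages": [{"role": "user", "content": "hello"}],
    "max_tokens": 100,
})
print(r.json()["choices"][0]["message"]["content"])
```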
Where to Go Next
Qwen 3.5 on Apple Silicon is the cheapest path to running 35B-A3B or 72B at interactive speed in a quiet room. M4 Max 128 GB or M3 Ultra 192 GB is the sweet spot; a 16 GB M4 base is fine for 7B-9B if you budget carefully. MLX wins on pure speed, Ollama if you already have tooling pointed at it, llama.cpp fills in for exotic quants. April 2026 hardware ladder: MacBook Pro 14" M4 Pro 48 GB ($2,900) for portable 14B, Mac mini M4 Pro 48 GB ($1,400) for desk-bound 14B, Mac Studio M4 Max 128 GB ($4,500) for 32B and 35B-A3B, Mac Studio M3 Ultra 192 GB ($5,500) for 72B and 122B MoE.
Written by
Abhishek Patel
Infrastructure engineer with 10+ years building production systems on AWS, GCP, and bare metal. Writes practical guides on cloud architecture, containers, networking, and Linux for developers who want to understand how things actually work under the hood.
Related Articles
- Self-Hosting LLMs from India: Providers, Latency & INR Pricing (2026) -- A practical comparison of self-hosting LLMs on Indian GPU clouds including E2E Networks, Tata TIR, and Yotta Shakti Cloud, with INR pricing inclusive of 18% GST, latency tests from Mumbai, Bangalore, Chennai, and Delhi, and DPDP Act 2023 compliance notes.
- Qwen 3 vs Qwen 3.5: What Changed & Should You Upgrade -- Qwen 3.5 wins on long context, code, and agentic math (AIME +25.8 at 72B), but the 72B license shifted from Apache 2.0 to a community license and LoRA adapters do not port. Full architecture, benchmark, and migration breakdown.
- Qwen 3.5 VRAM Requirements: Every Model Size & Quantization -- Full VRAM matrix for every Qwen 3.5 model from 0.5B to 397B across 8 quantization levels. GPU tier picks, CPU/RAM fallback, llama.cpp and vLLM launch flags.