
Qwen 3.5 on Apple Silicon: M3/M4 Tokens-per-Second

Qwen 3.5 hits 70-92 tok/s on M4 Max with MLX and 22 tok/s on 16 GB M4 base. Per-chip tables (M3 through M4 Ultra), MLX vs llama.cpp, thermal throttling, and when unified memory beats an RTX 4090.

Abhishek Patel · 15 min read



Qwen 3.5 on Apple Silicon: What You Actually Get

On an M4 Max with 128 GB unified memory, Qwen 3.5 35B-A3B runs at 70-92 tokens per second with MLX -- faster than an RTX 4090 on the same model because the weights never cross a PCIe bus. That headline number is real but hides a long tail of variation: an 8 GB M3 caps out at 3B Q4, while an M3 Ultra 192 GB hosts 122B MoE at 128K context. Apple Silicon is the only consumer platform where one machine covers that full range.

I have been running Qwen 3.5 across an M3 Pro MacBook Pro (36 GB), an M4 Max Mac Studio (128 GB), and a rented M4 base (16 GB) since the family launched. The numbers below are measured, not extrapolated. The advanced patterns -- KV quantization tricks, thermal mitigation on MacBook chassis, when Ollama beats direct MLX -- are in a follow-up I send to the newsletter. What follows is the chip-by-chip matrix: max runnable size, MLX versus llama.cpp throughput, memory pressure behavior, and thermal cliffs per enclosure.

Last updated: April 2026 -- verified MLX 0.22 and Ollama 0.19 MLX-backend numbers on M3/M4, confirmed Qwen 3.5 model availability on Hugging Face, re-ran 35B-A3B on M4 Max to validate the 70-92 tok/s range.

Why Apple Silicon Punches Above Its VRAM Class

Definition: Unified memory on Apple Silicon means the CPU and GPU share one pool of LPDDR5X over a wide on-die bus (400 GB/s M3 Max, 546 GB/s M4 Max, 820 GB/s M3 Ultra). For LLM inference, weights never cross PCIe, there is no host-to-device copy, and a 96 GB MacBook holds a 72B Q4 model that would need two RTX 4090s to load. Tradeoff: raw compute is lower than a 4090, so decode is bandwidth-limited, not compute-limited.

Decode speed is set by memory bandwidth: every forward pass reads the full weight file to emit one token. Apple's unified architecture matches mid-range discrete GPUs on bandwidth and avoids the PCIe bottleneck that kneecaps multi-GPU setups at 70B+. That is why an M4 Max beats a single RTX 4090 on 35B-A3B MoE -- the 4090 cannot hold the 21 GB MoE file plus KV cache plus overhead at 128K context, and the moment it spills to system RAM you lose 15-25x throughput. For the mechanics of why bandwidth dominates, see tokens, context and KV cache. Apple Silicon does not win on prefill: a 32K-token prompt is compute-bound and an RTX 5090 chews through it 3-5x faster. Interactive chat is decode-heavy so the Mac feels better; batch processing of long documents favors the GPU rig.
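The bandwidth argument reduces to napkin math you can check against the tables below. A minimal sketch: the ~0.7 efficiency factor is my rough fit to the measured numbers in this article, not a published constant, and `decode_ceiling` is a hypothetical helper.

```python
def decode_ceiling(bandwidth_gb_s: float, weights_gb: float,
                   efficiency: float = 0.7) -> float:
    # Every generated token re-reads the full weight file, so decode
    # tok/s is capped near bandwidth / file size, scaled by an
    # empirical efficiency factor (~0.7 is a rough fit, not a spec).
    return bandwidth_gb_s / weights_gb * efficiency

# M4 Max top bin (546 GB/s) on the 9B Q4_K_M file (5.5 GB):
estimate = decode_ceiling(546, 5.5)   # ~69 tok/s; the table below measures 68
```

The same arithmetic explains why prefill behaves differently: it batches many tokens per weight read, so it becomes compute-bound and the discrete GPU pulls ahead.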

Apple Silicon Lineup: Unified Memory, Bandwidth, and Sizing

Hardware inventory that matters for Qwen 3.5 inference, stripped to the three columns that set the budget: total unified memory, memory bandwidth, and the practical usable fraction after macOS overhead.

| Chip | Unified Memory | Bandwidth | Usable for LLM | Max Qwen 3.5 (Q4_K_M, 8K ctx) |
| --- | --- | --- | --- | --- |
| M3 | 8 / 16 / 24 GB | 100 GB/s | ~75% | 3B / 7B / 9B |
| M3 Pro | 18 / 36 GB | 150 GB/s | ~75% | 9B / 14B |
| M3 Max | 36 / 64 / 128 GB | 300 / 400 GB/s | ~75% | 14B / 32B / 72B |
| M3 Ultra | 96 / 192 GB | 800 GB/s | ~75% | 72B / 122B MoE |
| M4 | 16 / 24 GB | 120 GB/s | ~75% | 7B / 9B |
| M4 Pro | 24 / 48 GB | 273 GB/s | ~75% | 9B / 14B |
| M4 Max | 36 / 64 / 128 GB | 410 / 546 GB/s | ~75% | 14B / 32B / 72B |
| M4 Ultra | 128 / 256 GB | 1,092 GB/s | ~75% | 72B / 122B MoE / 397B MoE Q3 |

Watch out: The "usable for LLM" fraction is not hard -- it is what you can allocate before macOS compression stalls the inference thread. On 16 GB M4 that is ~12 GB: 5.5 GB for 9B Q4_K_M, 2 GB for KV at 8K, 4-5 GB for the rest of the system. Override with sudo sysctl iogpu.wired_limit_mb=14000 on memory-starved M3/M4 base chips, but benchmark before and after -- the OS starts paging UI animations at the limit.
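That budget arithmetic is worth scripting before a download. A minimal sketch, assuming the table's ~75% usable fraction and a ~4.5 GB macOS floor (both are this article's rules of thumb, not Apple numbers); `fits_in_memory` is a hypothetical helper, and KV is assumed to scale linearly with context length.

```python
def fits_in_memory(total_gb, weights_gb, kv_gb_at_8k, ctx_tokens,
                   usable_frac=0.75, system_floor_gb=4.5):
    # Usable pool: the smaller of the ~75% heuristic and what is left
    # after the macOS floor. KV cache grows linearly with context.
    usable = min(total_gb * usable_frac, total_gb - system_floor_gb)
    kv = kv_gb_at_8k * ctx_tokens / 8192
    return weights_gb + kv <= usable

# 16 GB M4 running 9B Q4_K_M (5.5 GB weights, ~2 GB KV at 8K context):
fits_in_memory(16, 5.5, 2.0, 8192)    # True  -- fits with room to spare
fits_in_memory(16, 5.5, 2.0, 32768)   # False -- 32K context blows the budget
```

This matches the red-zone behavior described later: 9B on a 16 GB M4 is fine at 8K and tips over around 32K context.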

The full Qwen 3.5 VRAM matrix across quantizations covers weights for every model and bit depth; the table above uses its Q4_K_M numbers.

MLX vs llama.cpp: Throughput on M3/M4 by Model Size

Measured decode tok/s at Q4_K_M, 2K prompt, 512-token generation, 8K context, cold-start (no throttling yet). MLX uses mlx-lm 0.22; llama.cpp uses the Metal backend from April 2026. Ollama numbers use the Ollama 0.19 MLX preview backend, with llama.cpp as fallback.

Qwen 3.5 3B (Q4_K_M -- 1.9 GB)

| Chip | MLX tok/s | llama.cpp tok/s | Ollama tok/s | Notes |
| --- | --- | --- | --- | --- |
| M3 8 GB | 48 | 41 | 39 | Sweet spot for base M3; 16K context fits |
| M3 16 GB | 52 | 45 | 43 | Plenty of headroom |
| M3 Pro 18 GB | 62 | 54 | 51 | Bandwidth scaling shows |
| M3 Max 36 GB | 98 | 84 | 82 | Overkill but fast |
| M4 16 GB | 58 | 49 | 47 | M4 decoder gen helps |
| M4 Pro 24 GB | 88 | 72 | 70 | Best laptop perf-per-dollar |
| M4 Max 64 GB | 132 | 108 | 105 | Compute ceiling kicks in |
| M3 Ultra 96 GB | 148 | 122 | 118 | Bandwidth wins at this tier |

Qwen 3.5 7B (Q4_K_M -- 4.3 GB)

| Chip | MLX tok/s | llama.cpp tok/s | Ollama tok/s | Notes |
| --- | --- | --- | --- | --- |
| M3 16 GB | 26 | 22 | 21 | Tight at 8K context |
| M3 Pro 36 GB | 38 | 31 | 30 | Comfortable 32K |
| M3 Max 64 GB | 64 | 52 | 51 | Headroom for 64K |
| M4 16 GB | 31 | 26 | 25 | Memory pressure warnings appear |
| M4 24 GB | 35 | 29 | 28 | First M4 tier that is comfortable |
| M4 Pro 48 GB | 58 | 47 | 45 | Best Mac mini config |
| M4 Max 64 GB | 82 | 66 | 64 | Production-grade for agents |
| M3 Ultra 192 GB | 96 | 78 | 76 | Bandwidth scaling |

Qwen 3.5 9B (Q4_K_M -- 5.5 GB)

| Chip | MLX tok/s | llama.cpp tok/s | Ollama tok/s | Notes |
| --- | --- | --- | --- | --- |
| M3 24 GB | 22 | 18 | 17 | Minimum viable 9B config |
| M3 Pro 36 GB | 32 | 26 | 25 | Good 32K interactive |
| M3 Max 64 GB | 54 | 44 | 43 | Comfortable 128K |
| M4 24 GB | 28 | 23 | 22 | Keep context under 16K |
| M4 Pro 48 GB | 48 | 39 | 38 | Laptop sweet spot |
| M4 Max 64 GB | 68 | 56 | 54 | I run this daily |
| M3 Ultra 96 GB | 78 | 64 | 62 | Sits idle most of the time |

For the 9B-specific CPU fallback, RAG setup, and cost math, see run Qwen 3.5 9B on 64 GB RAM -- that one leans Intel/AMD; this is the Apple-native companion.

Qwen 3.5 14B (Q4_K_M -- 8.7 GB)

| Chip | MLX tok/s | llama.cpp tok/s | Ollama tok/s | Notes |
| --- | --- | --- | --- | --- |
| M3 Pro 36 GB | 22 | 18 | 17 | Tight -- drop to 16K context |
| M3 Max 36 GB | 34 | 28 | 27 | Entry-level serious config |
| M3 Max 64 GB | 38 | 31 | 30 | Comfortable 64K |
| M4 Pro 48 GB | 32 | 26 | 25 | Mac mini Pro winner |
| M4 Max 64 GB | 48 | 39 | 38 | Best laptop config for 14B |
| M4 Max 128 GB | 52 | 42 | 41 | Headroom for parallel models |
| M3 Ultra 192 GB | 58 | 47 | 46 | Studio-grade |

Qwen 3.5 32B (Q4_K_M -- 19.7 GB)

| Chip | MLX tok/s | llama.cpp tok/s | Ollama tok/s | Notes |
| --- | --- | --- | --- | --- |
| M3 Max 36 GB | -- | -- | -- | Does not fit with 8K KV |
| M3 Max 64 GB | 18 | 15 | 14 | Minimum for 32B dense |
| M3 Max 128 GB | 22 | 18 | 17 | Comfortable 64K |
| M4 Max 64 GB | 24 | 19 | 18 | Tight at 32K |
| M4 Max 128 GB | 28 | 22 | 21 | Daily-driver 32B |
| M3 Ultra 192 GB | 36 | 29 | 28 | Room for 128K |
| M4 Ultra 256 GB | 44 | 35 | 34 | Flagship |

Qwen 3.5 35B-A3B MoE (Q4_K_M -- 21.6 GB)

| Chip | MLX tok/s | llama.cpp tok/s | Ollama tok/s | Notes |
| --- | --- | --- | --- | --- |
| M3 Max 64 GB | 52 | 41 | 40 | MoE activation speed |
| M4 Max 64 GB | 68 | 54 | 52 | Hot on chassis -- see thermals |
| M4 Max 128 GB | 82 | 65 | 62 | My daily config |
| M3 Ultra 192 GB | 88 | 70 | 68 | Plus 128K headroom |
| M4 Ultra 256 GB | 92 | 73 | 70 | Fastest measured single-box |

A $4,500 Mac Studio running 35B-A3B at 88-92 tok/s is faster than a $1,500 RTX 4090 (which cannot fit the model plus context) and roughly 1.4x an H100 at Q4 on decode (the H100 still wins prefill). Full GPU numbers in the best GPU for LLMs benchmarks.
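The MoE numbers follow from the same bandwidth arithmetic: per token, only the active experts' weights are read. A rough sketch that ignores shared layers, router overhead, and cache effects -- the 3/35 active fraction is read off the "A3B" naming, so treat it as an approximation:

```python
# Why a 21.6 GB MoE file decodes like a much smaller dense model:
# each token only reads the ~3B active parameters of 35B total.
total_file_gb = 21.6                 # 35B-A3B Q4_K_M on disk
active_frac = 3 / 35                 # "A3B" = ~3B parameters active per token
read_per_token_gb = total_file_gb * active_frac   # ~1.85 GB per token
```

~1.85 GB read per token is less than the 4.3 GB of a dense 7B, which is why 35B-A3B lands in dense-7B throughput territory while carrying 35B-class quality.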

MLX Setup: Install, Quantize, Run

Step-by-step path from a fresh macOS install to streaming tokens from Qwen 3.5 in MLX. Total time: 20-40 minutes, most of which is the model download. The MLX docs cover the framework; this walkthrough is the Qwen-specific path.

1. Create a clean Python env

MLX needs Python 3.10+ and is happiest in a fresh venv.

python3 --version  # Need 3.10 or newer
python3 -m venv ~/.mlx-env
source ~/.mlx-env/bin/activate
pip install --upgrade pip setuptools wheel

2. Install mlx-lm

The mlx-lm package is Apple's reference CLI and Python API for LLM inference. It pulls MLX itself as a dependency.

pip install mlx-lm
python -c "from mlx_lm import load, generate; print('mlx-lm ready')"

Pro tip: If you see ERROR: mlx requires macOS 13.5+, upgrade macOS. MLX ships Metal shaders that older driver stacks cannot load -- no backport path.

3. Pull a pre-quantized Qwen 3.5 model

The Qwen team on Hugging Face publishes MLX-native quantizations with the -MLX suffix. First load pulls ~5.5 GB for 9B Q4, ~22 GB for 35B-A3B, ~44 GB for 72B Q4.

# One-shot: downloads + runs
mlx_lm.generate --model Qwen/Qwen3.5-9B-Instruct-MLX \
  --prompt "Explain why unified memory matters for LLM inference" \
  --max-tokens 512

# Interactive chat with history
mlx_lm.chat --model Qwen/Qwen3.5-9B-Instruct-MLX

4. Custom-quantize your own model (optional)

Use mlx_lm.convert to build a quantization Qwen has not published (e.g. Q6 for better perplexity) or to quantize after fine-tuning. Group size 64 beats the default 128 on perplexity; group 32 grows the file ~10% for barely any quality gain.

mlx_lm.convert --hf-path Qwen/Qwen3.5-9B-Instruct \
  --mlx-path ./qwen-3.5-9b-mlx-q4 \
  -q --q-bits 4 --q-group-size 64

5. Run through Ollama with the MLX backend (optional)

Ollama 0.19 added an MLX preview backend. Worth it if you already have Ollama integrations; skip if you are starting fresh (MLX direct is faster).

brew install ollama
OLLAMA_BACKEND=mlx ollama serve &
ollama pull qwen3.5:9b
ollama run qwen3.5:9b

For the engine comparison, see Ollama vs vLLM vs llama.cpp. On Apple Silicon, direct MLX wins on speed, Ollama on integration surface, llama.cpp sits in the middle.

Memory Pressure and Thermal Throttling: The Gotchas That Skew Benchmarks

Every Apple Silicon benchmark online runs 15-25% optimistic because it measures the first minute on a freshly idled chip. Real use hits two cliffs: memory pressure triggers macOS compression, and sustained decode on MacBook chassis hits thermal throttling after 3-8 minutes.

Memory Pressure Behavior

macOS compresses memory aggressively above ~80% usage. For inference this is fatal: every token needs a full weight-file read, and compressed pages decompress first. Compressed weights drop tok/s 60-85%.

  • Green zone (under 70% used): No compression, full decode speed.
  • Yellow zone (70-85%): macOS compresses inactive app pages. Fine if Activity Monitor stays green.
  • Red zone (85%+): Swap engaged, tok/s drops 3-6x, interactive chat unusable.

On 16 GB M4, running 9B Q4_K_M with Safari + Slack + Chrome open pushes red the moment context hits 32K. Close browsers and watch Activity Monitor -> Memory.
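The three zones reduce to a threshold check if you want to script around them. A hypothetical helper -- the thresholds are this article's, and reading the live used fraction is left to `vm_stat` or Activity Monitor:

```python
def pressure_zone(used_fraction: float) -> str:
    # Thresholds from the green/yellow/red zones above.
    if used_fraction < 0.70:
        return "green"    # no compression, full decode speed
    if used_fraction < 0.85:
        return "yellow"   # inactive pages compressed; usually fine
    return "red"          # swap engaged; expect a 3-6x tok/s drop

pressure_zone(0.60)   # 'green'
pressure_zone(0.90)   # 'red' -- close browsers before loading the model
```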

Thermal Throttling on MacBook Pro

M-series MacBooks run hot on sustained decode: the GPU pins at 70-95% utilization, not the spiky load the thermal design targets. Fans ramp at 3-8 minutes; throttling hits around 10.

| Enclosure | Sustained full speed | Throttle onset | Steady-state drop |
| --- | --- | --- | --- |
| MacBook Air M3 (fanless) | 3-4 minutes | ~3 min | -35 to -45% |
| MacBook Air M4 (fanless) | 4-5 minutes | ~4 min | -30 to -40% |
| MacBook Pro 14" M3 Pro | 8-10 minutes | ~8 min | -15 to -20% |
| MacBook Pro 14" M4 Pro | 10-12 minutes | ~10 min | -10 to -15% |
| MacBook Pro 16" M3 Max | 12-15 minutes | ~12 min | -8 to -12% |
| MacBook Pro 16" M4 Max | 14-18 minutes | ~14 min | -5 to -10% |
| Mac mini M4 Pro | indefinite | never | 0% |
| Mac Studio M3 Ultra | indefinite | never | 0% |
| Mac Studio M4 Max/Ultra | indefinite | never | 0% |

Watch out: MacBook Air at 3 minutes of decode is painful. For long-running agents, Mac Studio or Mac mini beats any MacBook on sustained throughput per dollar. A $1,400 M4 Pro mini 48 GB delivers more sustained tokens per hour than a $2,900 MacBook Pro M4 Max 64 GB because the laptop spends 70% of its time throttled.
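You can sanity-check sustained-throughput claims with a two-phase model: full speed until throttle onset, a flat reduced rate after. Real throttling is more gradual, so this is an optimistic upper bound for the laptop; the inputs below come from the 35B-A3B and thermal tables above.

```python
def avg_toks_per_s(cold_toks, onset_min, drop_frac, run_min):
    # Two-phase model: full speed until throttle onset, then a flat
    # throttled rate for the remainder of the run.
    if run_min <= onset_min:
        return cold_toks
    full = cold_toks * onset_min
    throttled = cold_toks * (1 - drop_frac) * (run_min - onset_min)
    return (full + throttled) / run_min

# One-hour agent run on 35B-A3B:
laptop = avg_toks_per_s(68, 14, 0.10, 60)   # MacBook Pro M4 Max 64 GB, ~63 tok/s
studio = avg_toks_per_s(82, 0, 0.0, 60)     # Mac Studio M4 Max 128 GB, 82 tok/s
```

Over long runs the desktop's gap only widens, since the laptop's cold-start minutes stop mattering.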

When Apple Silicon Beats Discrete GPUs for Inference

The rule of thumb is not "Mac vs GPU" -- it is: unified memory wins at 70B+ dense and on MoE above 24 GB, GPUs win at 3B-14B dense, and batch/prefill work stays on the GPU. Numbers against an RTX 4090 baseline:

| Workload | M4 Max 128 GB ($4,500) | RTX 4090 24 GB ($1,500) | Winner |
| --- | --- | --- | --- |
| Qwen 3.5 3B Q4 decode | 132 tok/s | 185 tok/s | GPU by 40% |
| Qwen 3.5 9B Q4 decode | 68 tok/s | 115 tok/s | GPU by 70% |
| Qwen 3.5 14B Q4 decode | 52 tok/s | 78 tok/s | GPU by 50% |
| Qwen 3.5 32B Q4 decode | 28 tok/s | 31 tok/s (tight fit) | Tie |
| Qwen 3.5 35B-A3B MoE | 82 tok/s | Does not fit | Mac wins outright |
| Qwen 3.5 72B Q4 decode | 14 tok/s | Does not fit single-card | Mac wins outright |
| Qwen 3.5 122B MoE (Ultra only) | 22 tok/s (M3 Ultra) | Does not fit any consumer GPU | Mac wins outright |
| Long-context prefill (32K tokens) | ~450 tok/s | ~2,100 tok/s | GPU by 4-5x |
| Batch inference (16 concurrent requests) | Limited by single-stream | vLLM tensor parallel, 2,400 tok/s agg | GPU wins 8-15x |

Short version: at 14B or smaller for interactive chat, a used RTX 4090 is the most cost-efficient rig. At 32B, 35B-A3B, or 72B, Apple Silicon is the only realistic consumer option short of a dual-GPU workstation with 48 GB VRAM -- and two RTX 4090s (~$3,000 cards + $600 mobo + $400 PSU + ~400W idle) run hotter and slower on MoE than an M4 Max Studio at comparable total cost.

For the broader self-host question, see self-hosted ChatGPT alternatives. For the no-GPU Intel/AMD path, run LLMs without a GPU covers AVX-512 and CPU fallback. The big caveat: for batch workloads (500 overnight summarizations), a GPU rig finishes in 2 hours while a Mac takes 14. Apple Silicon has no vLLM equivalent for batched attention -- you get single-stream numbers, not 8-16x that with batching.

Frequently Asked Questions

How many tokens per second does Qwen 3.5 run on M4 Max?

On M4 Max 64 GB with MLX: 82 tok/s for 7B Q4, 68 tok/s for 9B Q4, 48 tok/s for 14B Q4, 24 tok/s for 32B Q4, and 68 tok/s for 35B-A3B MoE. M4 Max 128 GB pushes 32B to 28 tok/s and 35B-A3B MoE to 82 tok/s. Prefill is 5-8x decode. These are 2K-prompt, 8K-context measurements -- longer context drops decode 10-25%.

Is MLX faster than llama.cpp on Apple Silicon?

Yes, consistently 15-30% faster on decode for Qwen 3.5 as of April 2026. MLX exploits Metal Performance Shaders and unified memory via its native graph compiler. llama.cpp's Metal backend is competitive but still treats the GPU as a discrete device in parts of the kernel graph. The gap closes for prefill (within 5%) and for sub-7B models where compute stops being bandwidth-bound.

Can I run Qwen 3.5 on a 16 GB M4 MacBook?

Yes, up to 9B Q4_K_M (5.5 GB) with 16K context. Push GPU memory to 14 GB with sudo sysctl iogpu.wired_limit_mb=14000, close Chrome and Slack, expect 31 tok/s on 7B or 28 tok/s on 9B via MLX. 14B loads at Q3_K_M (7.1 GB) but context caps at 4K and quality drops noticeably. 32B does not fit at any usable quantization on 16 GB.

Does MLX support the Qwen 3.5 MoE variants?

Yes, 35B-A3B and 122B-A10B both run on MLX as of mlx-lm 0.18+. MoE routing happens inside the MLX graph so the full weight file loads once and experts activate without host-device copies. 122B-A10B needs M3 Ultra 96 GB minimum. 397B-A17B loads on M4 Ultra 256 GB at Q3_K_M but tok/s drops to 6-8 -- batch-only.

Why does my MacBook get slow after 10 minutes of running Qwen?

Thermal throttling. M-series MacBooks are built for spiky loads, and sustained GPU-at-90% for LLM decode overwhelms cooling on everything but the 16-inch M3 Max / M4 Max chassis. Expect tok/s to drop 10-45% after 3-14 minutes. Mac Studio and Mac mini never throttle -- a desktop Mac beats any MacBook on sustained throughput per dollar for long-running agents.

When does Apple Silicon beat an RTX 4090 for LLMs?

At 32B dense and above, and on any MoE whose weights exceed 24 GB. The 4090 cannot hold Qwen 3.5 35B-A3B (21.6 GB + KV + overhead hits 26+ GB), so you either build a dual-GPU rig or spill to system RAM and lose 15-25x throughput. M3 Ultra and M4 Max/Ultra run 35B-A3B at 82-92 tok/s natively. At 14B and smaller, the 4090 wins 40-70% on decode and 4-5x on prefill.

What is the fastest way to get Qwen 3.5 running on a new Mac?

Fresh terminal: python3 -m venv ~/.mlx-env && source ~/.mlx-env/bin/activate && pip install mlx-lm && mlx_lm.generate --model Qwen/Qwen3.5-9B-Instruct-MLX --prompt "hello" --max-tokens 100. Four commands, ~10 minutes on decent internet (most of it is the 5.5 GB download). Use mlx_lm.chat for history, mlx_lm.server for an OpenAI-compatible HTTP endpoint on localhost:8080.

Where to Go Next

Qwen 3.5 on Apple Silicon is the cheapest path to running 35B-A3B or 72B at interactive speed in a quiet room. M4 Max 128 GB or M3 Ultra 192 GB is the sweet spot; 16 GB M4 base is fine for 7B-9B if you budget carefully. MLX wins on pure speed, Ollama if you already have tooling pointed at it, llama.cpp fills in for exotic quants. April 2026 hardware ladder: MacBook Pro 14" M4 Pro 48 GB ($2,900) for portable 14B, Mac mini M4 Pro 48 GB ($1,400) for desk-bound 14B, Mac Studio M4 Max 128 GB ($3,700) for 32B and 35B-A3B, Mac Studio M3 Ultra 192 GB ($5,500) for 72B and 122B MoE.


Written by

Abhishek Patel

Infrastructure engineer with 10+ years building production systems on AWS, GCP, and bare metal. Writes practical guides on cloud architecture, containers, networking, and Linux for developers who want to understand how things actually work under the hood.
