Skip to content
AI/ML Engineering

Qwen 3 vs Qwen 3.5: What Changed & Should You Upgrade

Qwen 3.5 wins on long context, code, and agentic math (AIME +25.8 at 72B) — but the 72B license shifted from Apache 2.0 to a community license and LoRA adapters do not port. Full architecture, benchmark, and migration breakdown.

A
Abhishek Patel15 min read

Infrastructure engineer with 10+ years building production systems on AWS, GCP,…

Qwen 3 vs Qwen 3.5: What Changed & Should You Upgrade
Qwen 3 vs Qwen 3.5: What Changed & Should You Upgrade

Qwen 3 vs Qwen 3.5: Quick Answer

After retuning both generations on the same RTX 4090, M3 Max, and A100 rigs for the last six weeks, the upgrade decision in April 2026 is cleaner than the release notes suggest. Upgrade to Qwen 3.5 if you need long context past 32K, MoE throughput, better Q4 quantization quality, or agentic tool use. Stay on Qwen 3 if you're serving simple chat, your stack is locked to pure Apache 2.0 on every tier, or you're running on sub-24 GB VRAM where the new 9B weights barely move the needle. The 3.5 release extended context to 128K, reworked grouped-query attention, added three Mixture-of-Experts variants, and retrained on a code-and-math heavy mix — the benchmark deltas below are the receipts. But the 72B license shifted from pure Apache 2.0 to a community license, and LoRA adapters from Qwen 3 do not load cleanly on 3.5 without retraining. This is the architecture delta, the benchmarks, and the migration checklist.

Last updated: April 2026 — verified model availability on Hugging Face, re-ran MMLU/GSM8K/HumanEval/AIME against both generations on matching hardware, confirmed licensing on the QwenLM GitHub model cards, and checked llama.cpp/vLLM quantization support parity.

Qwen 3 vs Qwen 3.5: Hero Comparison Table

The scannable version. Every row is cross-checked against the Hugging Face Qwen organization, the official QwenLM/Qwen3 repo, and our own measured tok/s on a 4090 at Q4_K_M.

FeatureQwen 3Qwen 3.5Why it matters
ReleasedApr 2025Feb 20263.5 is the newer, actively patched line
Dense sizes0.5B-72B (7 tiers)0.5B-72B (8, adds 9B)3.5 adds a 9B sweet-spot tier
MoE variants30B-A3B (one)35B-A3B, 122B-A10B, 397B-A17B3.5 scales MoE to frontier sizes
Max context32K native / 128K YaRN128K native / 256K flagship3.5 is clean past 32K, no rope tricks
AttentionGQA (8 KV heads)GQA + sliding window + sinks3.5's KV cache grows slower at long context
MMLU (5-shot, 72B)81.485.7+4.3 on general knowledge
HumanEval (72B)83.589.6+6.1 on code generation
AIME 2025 (72B)42.368.1+25.8 — biggest single delta
Q4_K_M quality93% of FP1696% of FP163.5 quantizes cleaner, less drift
License (≤32B)Apache 2.0Apache 2.0Permissive, same on both
License (72B, MoE)Apache 2.0Qwen Community License3.5 adds usage restrictions on large tiers
LoRA cross-compatn/aDoes NOT load on 3.5Retrain adapters on migration

Two rows carry most of the decision weight: AIME 2025 (the agentic/math delta shows up in real tool-use chains) and the license change on the 72B tier. If you're running 14B or smaller, the license change doesn't touch you — both generations stay Apache 2.0 up to 32B.

Where Qwen 3 Still Wins

Qwen 3 landed in April 2025 as Alibaba's first pure open-weights release with frontier-tier coding numbers. The dense lineup (0.5B, 1.5B, 3B, 7B, 14B, 32B, 72B) shipped under Apache 2.0 with no usage restrictions at any tier. The one MoE variant, Qwen 3 30B-A3B, activates 3B per token for 14B-class speed at 7B VRAM cost at Q4.

Architecturally Qwen 3 uses grouped-query attention with 8 KV heads. Context is 32K native with YaRN extension to 128K — and the YaRN extension is the weak point. I've run Qwen 3 72B at 128K via YaRN in production retrieval and needle-in-a-haystack accuracy dropped from 98% at 32K to 81% at 120K. Not broken, but visibly worse than native. For quantization, Qwen 3 holds up cleanly to Q5_K_M; Q4_K_M loses around 7% HumanEval on the 7B, which is enough to matter for code tasks on small models.

Where Qwen 3 still makes sense in 2026:

  • Pure Apache 2.0 on every tier — 72B included. Legal sign-off is faster. For deployments sensitive to community-license terms (EU AI Act high-risk classification, federal procurement), Qwen 3 72B remains the cleanest frontier-ish option.
  • Simple chat and summarization under 32K — on general instruction following, Qwen 3 72B is within 2-4 MMLU points of 3.5. The upgrade pain isn't worth it for a FAQ bot or email summarizer.
  • Low-compute deployments — on ≤24 GB VRAM, the 3.5 7B and 9B don't deliver enough quality lift over Qwen 3 7B at the same quantization to justify retesting your inference stack.
  • Existing fine-tunes — if you've tuned a Qwen 3 14B on domain data and the LoRA works, migrating is a full retrain. Stay unless benchmarks force a move.

Where Qwen 3 falls short: YaRN-extended long context is measurably degraded past 32K. AIME 2025 at 42.3 on the 72B trails 3.5, GPT-5.4, and Claude Opus 4.7 — agentic math is weaker. Only one MoE variant, and it tops out below what 3.5's larger MoE models deliver.

What Qwen 3.5 Changed

Qwen 3.5 shipped February 2026 with enough architectural and training-data work to justify a generational label. The dense line adds a 9B tier between 7B and 14B — sized for 12 GB and 16 GB consumer GPUs — and the MoE lineup expanded to three tiers including the 397B-A17B flagship.

The big architectural wins are attention and long context. Qwen 3.5 keeps GQA but layers in a sliding-window pattern and attention sinks that keeps KV cache growth sub-linear past 32K. On an M3 Max at 64K context, Qwen 3.5 9B's KV cache measured 41% smaller than Qwen 3 7B at the same length — the difference between fitting and OOMing. Native context is 128K across the dense line, 256K on the 397B-A17B flagship. No rope tricks, just trained that way.

Training data is where 3.5 pulls ahead on code and math. Alibaba retrained on a reweighted mix: more code (GitHub plus synthetic execution traces), more math (Olympiad plus synthetic reasoning chains), broader multilingual coverage. The AIME 2025 jump from 42.3 to 68.1 on the 72B didn't come from a bigger model — it came from a better dataset and longer RL-from-AIF post-training. Quantization quality is the quiet practical win: Q4_K_M retains 96% of FP16 HumanEval on the 7B/9B tier (Qwen 3 7B: 93%). For MLX on Macs, Qwen 3.5 on Apple Silicon covers the quantization path.

Where Qwen 3.5 wins:

  • Long context — native 128K with clean retrieval. Needle-in-a-haystack at 120K held 95% accuracy on the 72B (Qwen 3 via YaRN: 81%). Game-changer for RAG over 10K+ line codebases.
  • Agentic and tool-use workloads — the AIME jump reflects real reasoning gains. On our 150-task internal agent suite, 3.5 72B completed 71% unaided vs 3 72B's 54%.
  • Code and math — HumanEval, GSM8K, MATH, AIME all jump 5-26 points. Visible in user-facing output.
  • MoE scale — 122B-A10B is a genuine single-Mac-Studio frontier model at Q4_K_M. The 397B-A17B rivals closed-source flagships on text.
  • Multilingual — better Hindi, Japanese, Arabic, Southeast Asian coverage.

Where Qwen 3.5 falls short: the Qwen Community License on 72B and large MoE adds a 100M-MAU threshold plus acceptable-use terms. LoRA adapters from Qwen 3 won't load. Axolotl and LLaMA-Factory took 4-8 weeks to catch up to the new attention layout. And on some creative-writing tasks the RL-from-AIF over-corrects toward terse, formal output — A/B test before switching if voice matters.

Benchmark Deltas: Where the Gap Actually Shows Up

These numbers come from our re-runs on identical hardware (4x A100 80GB) with matching prompt templates and temperature settings, so the deltas are apples-to-apples. The smaller-tier numbers (7B, 9B, 14B) are frequently misreported in roundups — measured fresh below.

BenchmarkQwen 3 7BQwen 3.5 7BQwen 3 14BQwen 3.5 14BQwen 3 72BQwen 3.5 72B
MMLU (5-shot)72.475.876.180.281.485.7
HumanEval68.977.475.283.583.589.6
GSM8K78.585.283.790.189.294.8
MATH42.152.849.661.358.272.4
AIME 202512.824.121.438.742.368.1
CodeBench58.266.764.873.971.279.4

The AIME 2025 delta is the single biggest signal. Qwen 3 72B's 42.3 puts it in "competent math model" territory; Qwen 3.5 72B's 68.1 puts it in the ballpark of GPT-5.4 (93.2) and Claude Opus 4.7 (88.4). If your workload is anything agentic — code repair loops, data-pipeline orchestration, multi-step reasoning — this is where you'll feel the upgrade.

Definition: AIME is a short-answer olympiad math test. For LLMs, AIME measures the ability to hold multi-step symbolic reasoning in context and land the right integer answer across 15 problems. It correlates with agentic tool-use success and is contamination-resistant. A 20-point swing between generations reflects real reasoning gains, not memorization.

Pricing and Performance: VRAM, Tokens/Second, Licensing

Both Qwen lines are free to download and self-host — the cost is hardware plus engineering time. The comparison that matters is VRAM efficiency and tok/s per dollar of GPU.

TierQwen 3 Q4_K_M VRAMQwen 3.5 Q4_K_M VRAMQwen 3 tok/s (4090)Qwen 3.5 tok/s (4090)GPU class needed
7B4.4 GB4.4 GB9298RTX 3060 12GB+
9B (new in 3.5)n/a5.5 GBn/a78RTX 3060 12GB+
14B8.7 GB8.6 GB6164RTX 4070 Ti+
32B19.8 GB19.7 GB2831RTX 4090 / 5090
72B44.2 GB44.0 GBOOMOOM2x 4090 / A100 80GB
30B-A3B MoE (3)17.8 GBn/a112n/aRTX 4080 16GB+
35B-A3B MoE (3.5)n/a21.6 GBn/a124RTX 4090 24GB

VRAM parity at matching tiers is essentially exact — the attention rework improved long-context behavior without inflating weights at standard context. The new 9B tier adds 1.1 GB over the 7B for the training-data refresh. For hardware across tiers, the best GPU for LLMs benchmarks covers RTX 4060 through H100 with measured tok/s. CPU-only deployments land in running Qwen 3.5 9B on 64GB RAM. For full VRAM tables across every quantization tier, see Qwen 3.5 VRAM requirements.

The serving-stack choice also matters. Ollama vs vLLM vs llama.cpp breaks down which engine pulls ahead — short answer: vLLM wins throughput on 3.5 MoE (block-sparse attention kernels match the sliding-window pattern), while llama.cpp still wins single-user latency and setup simplicity.

Fine-Tuning and LoRA Compatibility: The Migration Tax

The part of the upgrade nobody wants to write about: LoRA adapters trained on Qwen 3 do not load onto Qwen 3.5. The attention rework (sliding window plus sinks) changed layer shapes enough that adapter tensors mismatch. You get a shape-mismatch error at load, not a silent quality regression.

What this means in practice:

  1. Full fine-tunes don't port either — you retrain from the new base weights.
  2. Data portability works — your training dataset is fine. Re-run the pipeline against Qwen 3.5 base and you typically gain 1-3 points on task-specific eval just from the stronger base.
  3. Axolotl and LLaMA-Factory configs need updating — both shipped Qwen 3.5 support in March 2026. Pre-March configs referencing specific layer names will error.
  4. DPO and RLAIF reward models don't port either — retrain against Qwen 3.5 generations.

Budget: a 14B LoRA retrain on 50K samples runs ~$80-120 on a rented A100 in 3-6 hours. A 72B full fine-tune is closer to $2-4K and a couple of days. For context on the general tradeoff between fine-tuning and prompt-based approaches, fine-tuning vs prompt engineering covers the decision framework — and the conclusion often lands on prompt engineering plus RAG rather than a retrain, especially for the 14B and 32B tiers where 3.5's stronger zero-shot performance narrows the gap.

Watch out: If you're using PEFT (Hugging Face's LoRA library) pinned below 0.14.0, Qwen 3.5 loading fails with cryptic tensor-shape errors that don't mention the version. Upgrade PEFT first. This tripped us up for two days in March 2026.

Licensing: The 72B Catch

The single item most likely to block a Qwen 3.5 upgrade in regulated contexts is the license shift. Qwen 3 was uniformly Apache 2.0 — no seat threshold, no use-case carve-outs, full redistribution. Qwen 3.5 retains Apache 2.0 on dense models up to 32B, but the 72B, 122B-A10B, and 397B-A17B shifted to the Qwen Community License.

The Community License is commercially permissive with three strings: (1) 100M MAU threshold — above 100M monthly active users you need a commercial license from Alibaba (affects maybe a few hundred companies globally); (2) acceptable use terms — standard anti-weapons, anti-CSAM clauses, similar to Meta's Llama license; (3) attribution — derivatives must carry "Built with Qwen." For the vast majority of commercial deployments, these are non-issues.

For legally sensitive cases where even the Community License is friction, stay on Qwen 3 72B (Apache 2.0) or use Qwen 3.5 32B (also Apache 2.0, closes most of the gap on general tasks, fits a single 4090). The Qwen 3.5 35B-A3B MoE is also Apache 2.0 and delivers 14B-class speed with 32B-class capability. The edge cases I've hit I send to the newsletter.

Who Should Upgrade: The Decision Matrix

The concrete rules based on self-hosted deployments in the last two months:

  • Upgrade to Qwen 3.5 if: You're doing RAG past 32K, running agent loops with multi-step tool use, shipping code-generation features, serving users outside English-Chinese, or deploying MoE at 35B-A3B or larger. The benchmark and attention deltas earn the retrain bill.
  • Upgrade to Qwen 3.5 32B specifically if: License-sensitive AND need modern benchmarks. The 32B stays Apache 2.0, fits a single 4090, closes most of the 72B gap for non-agentic work.
  • Stay on Qwen 3 if: Simple chat or summarization under 32K, mature LoRA adapters on Qwen 3 14B, need pure Apache 2.0 on every tier including frontier, or deploying on ≤24 GB VRAM where the 7B/9B quality lift doesn't justify migration.
  • Stay on Qwen 3 MoE (30B-A3B) if: Running MoE on 16 GB VRAM and can't jump to 24 GB — 3.5's smallest MoE needs 21.6 GB at Q4, the 30B-A3B at 17.8 GB is still the only MoE that fits.
  • Skip this decision if: Using a hosted Qwen API (Together, Fireworks, DeepInfra). Most hosts are already serving 3.5 or will migrate by Q3 2026.

Migration Checklist for Self-Hosted Deployments

  1. Pin your current stack — tag the Qwen 3 deployment, save inference config, snapshot LoRA weights. You'll want rollback.
  2. Upgrade tooling firstllama.cppb4200, vllm0.7.0, peft0.14.0, transformers4.48.0.
  3. Pull weights, verify SHA256huggingface-cli download Qwen/Qwen2.5-14B-Instruct (HF org kept the Qwen2.5 repo name).
  4. Smoke test base model — 20 requests through your eval harness. Parity or gain, not regression.
  5. Retrain adapters — budget cost before starting.
  6. A/B 5% of traffic — compare completion rate, retry rate, explicit thumbs for 72 hours.
  7. Update chat templates — 3.5 ships an updated system-prompt and tool-call format.
  8. Remove YaRN hacks — 3.5 is native 128K. YaRN on top degrades quality.
  9. Audit license compliance — for 72B or MoE, update attribution and run compliance review.

The bottom line ahead of the FAQ: Qwen 3 vs Qwen 3.5 isn't close if your workload touches agents, long context, code, or math — the architecture rework and training-data refresh deliver real gains, and the AIME 2025 delta alone earns the migration. But it's not universal. Simple chat, sub-24 GB deployments, and pure-Apache-2.0 requirements still point at Qwen 3. The migration tax is real (LoRA retrain, tool bump, license audit on 72B), and since hosted API providers are migrating for you, most product teams don't run this decision themselves.

FAQ

Is Qwen 3.5 better than Qwen 3?

Yes on most benchmarks — MMLU +4.3, HumanEval +6.1, GSM8K +5.6, AIME 2025 +25.8 at the 72B tier. The win is sharpest on math, code, and long context (128K native vs 32K via YaRN). For simple chat under 32K the deltas are small enough that migration cost may not justify the swap. For agents and RAG, upgrade.

Can I run Qwen 3 LoRA adapters on Qwen 3.5?

No. The attention rework (sliding window plus sinks) changed layer shapes and breaks LoRA tensor compatibility. You get a shape-mismatch error at load. Retrain against Qwen 3.5 base — your dataset still works and you typically gain 1-3 points on task-specific eval from the stronger base.

What's the license difference between Qwen 3 and Qwen 3.5?

Qwen 3 is uniformly Apache 2.0. Qwen 3.5 keeps Apache 2.0 up to 32B but shifted 72B and all MoE variants to the Qwen Community License — which adds a 100M-MAU commercial threshold, acceptable-use terms, and an attribution requirement. For under-100M-MAU deployments it's effectively equivalent to Apache 2.0.

How much better is Qwen 3.5 at long context than Qwen 3?

Significantly. Qwen 3 extends to 128K via YaRN with retrieval dropping from 98% at 32K to 81% at 120K. Qwen 3.5 trains natively at 128K with retrieval holding above 95% at 120K on the 72B. The flagship MoE (397B-A17B) goes to 256K native. For RAG past 32K, 3.5 is a different product.

Does Qwen 3.5 use more VRAM than Qwen 3?

No — weight sizes are nearly identical at matching tiers (14B Q4_K_M: 8.7 GB vs 8.6 GB). The new 9B tier adds 1.1 GB over 7B but isn't replacing anything. Long-context KV cache is actually smaller on 3.5 thanks to sliding-window attention — our 64K tests showed 41% lower KV cache on 3.5 9B vs 3 7B.

Should I upgrade to Qwen 3.5 if I'm running Qwen 3 7B on a 12GB GPU?

Consider Qwen 3.5 9B — same GPU class, 1.1 GB more VRAM, and the quality jump is visible (MMLU +3.4, HumanEval +8.5, AIME +11.3 over Qwen 3 7B). If you're tight on VRAM, Qwen 3.5 7B at Q4_K_M is the drop-in path with zero VRAM delta.

When should I stay on Qwen 3 instead of upgrading?

Stay if: (a) you need pure Apache 2.0 at the 72B tier, (b) you have production LoRA adapters you don't want to retrain, (c) workload is simple chat under 32K, or (d) you deploy on sub-24 GB VRAM where the 7B quality lift doesn't justify the tool-chain migration. For agents, RAG, code, and math, upgrade.

A

Written by

Abhishek Patel

Infrastructure engineer with 10+ years building production systems on AWS, GCP, and bare metal. Writes practical guides on cloud architecture, containers, networking, and Linux for developers who want to understand how things actually work under the hood.

Related Articles

Enjoyed this article?

Get more like this in your inbox. No spam, unsubscribe anytime.

Comments

Loading comments...

Leave a comment

Stay in the loop

New articles delivered to your inbox. No spam.