Qwen 3.5 vs DeepSeek V4 vs GLM-5.1: Local Coding Showdown
Three frontier open-weight models compared for coding in April 2026. Qwen wins on consumer GPUs, GLM-5.1 leads SWE-Bench Pro, DeepSeek V4 has 1M context.

Qwen 3.5 vs DeepSeek V4 vs GLM-5.1: Quick Verdict
Three open-weight frontier-class models all shipped between Q1 and Q2 2026, and the right pick for local coding depends almost entirely on your hardware. Qwen 3.5 wins on consumer hardware — its 32B dense and 35B-A3B MoE are the only frontier-tier options that fit on a single 24 GB GPU. DeepSeek V4 wins on multi-card workstations or rented A100/H100 with its 1M-token context window and Engram conditional memory. GLM-5.1 wins on cost-efficient API serving, beating Claude Opus 4.6 on SWE-Bench Pro at a fraction of the price. None of them universally dominates; they live at different points on the price/quality/hardware curve.
| Model | Total params | Active params | Min local VRAM (Q4) | SWE-Bench Pro | Best for |
|---|---|---|---|---|---|
| Qwen 3.5 32B | 32B | 32B | ~20 GB | 53.2% | Single 24 GB GPU, broad ecosystem |
| Qwen 3.5 35B-A3B | 35B | 3B | ~22 GB | 52.5% | 24 GB GPU, faster decode than 32B |
| DeepSeek V4 | ~1T | ~32-37B | ~243 GB | 57.8% | 1M context, multi-card or cloud |
| GLM-5.1 | ~410B | ~32B | ~110 GB (FP4) | 58.4% | Cost-efficient API, SWE-Bench leader |
| Claude Opus 4.6 (ref) | — | — | API-only | 57.3% | Subjective code quality, agentic loops |
Last updated: April 2026 — verified SWE-Bench Pro scores from LM Council April 2026 leaderboard, official model cards on Hugging Face, and direct vendor pricing pages. Benchmark numbers shift weekly; treat the ordering as directional, not exact.
How These Models Differ Architecturally
Before the benchmarks, it helps to understand what's actually under the hood. The three models use very different architectures, and the differences predict where each wins.
Qwen 3.5 (32B dense + 35B-A3B MoE + larger MoE variants)
Alibaba's Qwen 3.5 family ships in 11 variants from 0.5B to 397B-A17B. The 32B dense and 35B-A3B MoE are the two frontier-class picks you can run locally. Apache 2.0 license: the broadest commercial-use terms in the open frontier. Mature ecosystem: every major framework (llama.cpp, vLLM, MLX, Ollama, SGLang) has supported Qwen 3.5 from day one. The Qwen 3.5 VRAM requirements matrix covers the per-variant memory footprint and the GGUF quantization guide covers which file to download. Context window: 128K native (256K on the 397B-A17B with YaRN extension).
DeepSeek V4 (~1T total, ~32-37B active)
DeepSeek's flagship from March 2026. Mixture-of-experts at unprecedented scale — roughly 1 trillion total parameters with 32-37B activating per token via a sparse routing layer. The headline feature is Engram conditional memory: a learned compression module that lets the model maintain coherence across 1M tokens of context without the perplexity collapse that plagues raw-attention extension methods. Native multimodal generation (text + image). MIT-licensed weights. Cost: you need ~243 GB of memory at Q4 quantization to load it locally — realistically a 4x H100 / 8x H200 setup or a 192 GB Mac Studio + heavy CPU offloading.
GLM-5.1 (~410B MoE)
Zhipu AI's open release from late March 2026. Mixed-license: partial weights available under research terms, full weights for paid commercial use. Architectural focus: aggressive coding fine-tuning. The benchmark numbers reflect that: GLM-5.1 leads on SWE-Bench Pro, SWE-Bench Verified, and HumanEval (DeepSeek V4 edges it on LiveCodeBench) despite having less than half of DeepSeek V4's total parameters and a similar active-parameter count. Context: 200K native, a less headline-grabbing feature than Engram but production-usable. The big practical advantage is API pricing: through Zhipu's own API, GLM-5.1 is listed at $0.50 input / $1.50 output per million tokens, a small fraction of Claude Opus pricing, with comparable code-generation quality.
Definition: A Mixture-of-Experts (MoE) model has many "expert" sub-networks but only activates a few per token via a router. The total parameter count determines memory footprint (you need to load all experts); the active parameter count determines compute cost per token. MoE models decode faster than dense models of equivalent total size — this is why DeepSeek V4 at ~32B active runs at speeds comparable to Qwen 3.5 32B dense despite having 30x more total parameters.
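The memory-versus-compute split in the definition above can be made concrete with a back-of-the-envelope calculation. The sketch below is a rough approximation, not a measurement: it assumes weight memory is total parameters times bits per weight, and it uses the common rule of thumb of about 2 FLOPs per active parameter per generated token. The function names and the 4.5 bits-per-weight figure for a Q4-style quantization are illustrative assumptions; the parameter counts are the approximate ones quoted in this article.

```python
def weight_memory_gb(total_params_b: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB. Every expert must be loaded, so use TOTAL params."""
    return total_params_b * bits_per_weight / 8  # billions of params * bytes per param = GB


def gflops_per_token(active_params_b: float) -> float:
    """Approximate decode compute per token. Only ACTIVE params do work (~2 FLOPs each)."""
    return 2 * active_params_b


# (total params, active params) in billions, as quoted in this article.
models = {
    "Qwen 3.5 32B dense": (32, 32),
    "Qwen 3.5 35B-A3B": (35, 3),
    "DeepSeek V4": (1000, 32),
}

for name, (total_b, active_b) in models.items():
    mem = weight_memory_gb(total_b, bits_per_weight=4.5)  # 4.5 bits is an assumed Q4-style average
    compute = gflops_per_token(active_b)
    print(f"{name}: ~{mem:.0f} GB of weights, ~{compute:.0f} GFLOPs per token")
```

The dense 32B model and the ~1T MoE land on the same per-token compute while differing by about 30x in weight memory, which is the point of the definition. This simple rule does not reproduce every footprint quoted in this article: at 4.5 bits per weight a ~1T-parameter model comes out well above ~243 GB, so that figure presumably reflects a lower effective bit-width or a different accounting.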
Coding Benchmarks: Where Each Model Actually Wins
Coding benchmarks have known biases (SWE-Bench overweights Python/JS bug-fixes, HumanEval is saturated, LiveCodeBench rewards specific patterns) but they're the best public signal we have. Numbers below are from the LM Council April 2026 leaderboard and individual model card releases. Treat differences under 2% as noise; treat differences over 5% as real.
SWE-Bench Pro (real GitHub bug-fix tasks)
| Model | SWE-Bench Pro | SWE-Bench Verified | vs frontier API |
|---|---|---|---|
| GLM-5.1 | 58.4% | 67.2% | Beats Opus 4.6 by 1.1 pts |
| DeepSeek V4 | 57.8% | 66.4% | Within noise of Opus 4.6 |
| Claude Opus 4.6 (API) | 57.3% | 66.0% | Reference |
| GPT-5.4 xHigh (API) | 57.7% | 65.8% | ~Opus 4.6 |
| Qwen 3.5 32B dense | 53.2% | 62.1% | ~4 pts behind frontier API |
| Qwen 3.5 35B-A3B MoE | 52.5% | 61.5% | ~5 pts behind frontier API |
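The noise rule stated earlier (under 2 points is noise, over 5 is real) can be applied mechanically to the SWE-Bench Pro column. This is a minimal sketch: it reads the article's "2%" and "5%" as percentage points, and the "inconclusive" label for the band in between is an assumption, since the article does not name that case.

```python
def classify_gap(score_a: float, score_b: float) -> str:
    """Label a benchmark gap in percentage points using the article's thresholds."""
    gap = abs(score_a - score_b)
    if gap < 2.0:
        return "noise"
    if gap > 5.0:
        return "real"
    return "inconclusive"  # assumed label for the 2-5 point band


# SWE-Bench Pro scores copied from the table above.
swe_bench_pro = {
    "GLM-5.1": 58.4,
    "DeepSeek V4": 57.8,
    "GPT-5.4 xHigh": 57.7,
    "Qwen 3.5 32B dense": 53.2,
    "Qwen 3.5 35B-A3B MoE": 52.5,
}
reference_name, reference_score = "Claude Opus 4.6", 57.3

for model, score in swe_bench_pro.items():
    delta = score - reference_score
    verdict = classify_gap(score, reference_score)
    print(f"{model}: {delta:+.1f} pts vs {reference_name} -> {verdict}")
```

Under these thresholds, GLM-5.1's 1.1-point lead over Opus 4.6 falls in the noise band, and the Qwen 3.5 gaps fall between the two thresholds rather than clearly on the "real" side.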
HumanEval and LiveCodeBench (synthesized coding)
| Model | HumanEval pass@1 | LiveCodeBench v5 | Aider polyglot |
|---|---|---|---|
| GLM-5.1 | 94.5% | 71.3% | 62.8% |
| DeepSeek V4 | 93.8% | 72.1% | 61.5% |
| Claude Opus 4.6 | 94.2% | 70.8% | 65.4% |
| Qwen 3.5 32B | 87.4% | 59.6% | 52.1% |
| Qwen 3.5 35B-A3B | 86.9% | 58.8% | 51.4% |
The headline: GLM-5.1 and DeepSeek V4 close the gap to frontier API models on coding benchmarks, while Qwen 3.5 trails the top score in each column by roughly 5-8 points on SWE-Bench and HumanEval and by 12-14 points on LiveCodeBench and Aider polyglot. That gap is the cost of being able to run on consumer hardware. For most coding work the gap doesn't matter; for complex multi-file refactors and architectural changes it does. The AI coding assistants comparison covers how these benchmark numbers translate to real IDE-integrated workflows.
Local Hardware Requirements: What You Can Actually Run
Benchmarks don't matter if you can't load the model. Realistic hardware for each:
| Hardware tier | Qwen 3.5 32B | Qwen 3.5 35B-A3B | DeepSeek V4 | GLM-5.1 |
|---|---|---|---|---|
| RTX 4090 / 3090 (24 GB) | ✓ Q4_K_M, 8K ctx | ✓ Q4_K_M | ✗ doesn't fit | ✗ doesn't fit |
| RTX 5090 (32 GB) | ✓ Q5_K_M / NVFP4 | ✓ Q5_K_M | ✗ doesn't fit | ✗ doesn't fit |
| 2x 4090 (48 GB) | ✓ Q8_0 | ✓ Q8_0 | ✗ doesn't fit | ✗ doesn't fit |
| RTX 6000 Ada (48 GB) | ✓ Q8_0, 32K ctx | ✓ Q8_0 | ✗ doesn't fit | ✗ doesn't fit |
| M3 Max 128 GB unified | ✓ FP16 | ✓ FP16 | partial (Q3 + heavy offload) | partial (Q3 + offload) |
| M2 Ultra 192 GB unified | ✓ FP16, 128K ctx | ✓ FP16 | ✓ Q4 with offload | ✓ Q4 with offload |
| 4x H100 80 GB (320 GB) | ✓ FP16, batched | ✓ FP16, batched | ✓ Q4, 1M ctx | ✓ Q4_K_M |
| 8x H200 141 GB (1.1 TB) | overkill | overkill | ✓ FP8, full ctx | ✓ Q8_0 / FP16 |
The hardware reality is brutal: only one of these three frontier models is usable on any consumer GPU. If you don't have a multi-card workstation or a cloud GPU budget, your local-frontier option is Qwen 3.5. DeepSeek V4 and GLM-5.1 are practical only through an API or on rented hardware. The best GPU cloud for AI training comparison covers per-hour pricing on RunPod, Vast, Lambda, and the AWS/GCP H100/H200 fleets.
API Pricing: Cost per 1M Tokens (April 2026)
For most teams, the right way to use DeepSeek V4 or GLM-5.1 is via API rather than local hosting. Here's the pricing landscape as of April 2026:
| Model | Provider | Input $/1M tok | Output $/1M tok | Cache hit discount |
|---|---|---|---|---|
| GLM-5.1 | Zhipu AI direct | $0.50 | $1.50 | 50% |
| GLM-5.1 | OpenRouter | $0.55 | $1.65 | — |
| DeepSeek V4 | DeepSeek direct | $0.27 | $1.10 | 75% |
| Qwen 3.5 32B | Together AI | $0.30 | $0.90 | — |
| Qwen 3.5 32B | Fireworks | $0.35 | $1.05 | — |
| Claude Opus 4.7 (ref) | Anthropic | $15 | $75 | 90% off cache reads / 25% premium on cache writes |
| Claude Sonnet 4.6 (ref) | Anthropic | $3 | $15 | 90% off cache reads / 25% premium on cache writes |
| GPT-5.4 (ref) | OpenAI | $2.50 | $10 | 50% |
The headline: by list price, DeepSeek V4 is roughly 68x cheaper than Claude Opus 4.7 per output token ($1.10 vs $75) and GLM-5.1 about 50x cheaper ($1.50 vs $75), with comparable coding benchmark scores. For high-volume agentic coding loops where token spend dominates infrastructure cost, these models can pay for themselves over Claude/GPT in weeks. Trade-off: latency is higher (China-routed APIs typically show 200-500ms higher TTFT than US-hosted Anthropic/OpenAI), and the routing path is less mature in 2026 than for Western frontier APIs. The LLM API pricing comparison has the deeper math across the broader provider landscape.
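To see what the list prices mean for a concrete workload, the sketch below prices one month of agentic coding traffic. The prices are copied from the table above; the workload size (200M input and 20M output tokens) is a made-up example, and cache discounts are deliberately ignored to keep the arithmetic visible.

```python
# (input $/1M tokens, output $/1M tokens), copied from the pricing table above.
PRICES = {
    "GLM-5.1 (Zhipu direct)": (0.50, 1.50),
    "DeepSeek V4 (direct)": (0.27, 1.10),
    "Claude Opus 4.7": (15.00, 75.00),
}


def job_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """List-price cost in dollars, with no cache discount applied."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# Hypothetical monthly workload for an agent fleet.
INPUT_TOKENS = 200_000_000
OUTPUT_TOKENS = 20_000_000

baseline = job_cost("Claude Opus 4.7", INPUT_TOKENS, OUTPUT_TOKENS)
for model in PRICES:
    cost = job_cost(model, INPUT_TOKENS, OUTPUT_TOKENS)
    print(f"{model}: ${cost:,.0f} per month ({baseline / cost:.0f}x cheaper than Opus)")
```

Because agentic loops are usually input-heavy, the blended ratio for a real workload tends to sit closer to the input-price ratio than to the output-price ratio, so it is worth plugging in your own token mix.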
Pick Qwen 3.5 If You're Coding Locally
Three concrete decision rules I'd give a developer evaluating these for hands-on coding work in April 2026:
- Pick Qwen 3.5 32B (or 35B-A3B MoE) if you're running on a single consumer GPU and want frontier-class quality. The 5-point gap to GLM-5.1 / DeepSeek V4 on benchmarks is real but irrelevant for 80% of day-to-day coding. The Apache 2.0 license is the cleanest commercial-use terms in the open frontier. Most mature ecosystem: every tool supports it.
- Pick GLM-5.1 via API if you're cost-sensitive and need top-tier coding quality. It beats Claude Opus 4.6 on SWE-Bench Pro at roughly 30x lower input price and 50x lower output price than the Claude Opus row in the pricing table. Right call for batch-coding loops, mass refactors, and agent fleets running thousands of iterations.
- Pick DeepSeek V4 if you specifically need 1M-token context coherence (multi-file repo understanding, long-document reasoning) or if you're already comfortable running multi-card datacenter hardware locally. The Engram memory architecture is the genuine advance — if your task involves "show me how this 500-file repo handles X" type prompts, V4 outperforms everyone.
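The three rules above can be written down as a small function, which makes the order of precedence explicit. This is only a sketch of the article's advice: the function name, its parameters, and the 20 GB cutoff (taken from the Q4 footprint listed for Qwen 3.5 32B in the quick-verdict table) are illustrative choices.

```python
def pick_model(local_vram_gb: float, needs_1m_context: bool, prefer_local: bool) -> str:
    """Encode the three decision rules above. Thresholds are illustrative."""
    if needs_1m_context:
        # Rule 3: repo-scale, long-context work is DeepSeek V4's distinguishing feature.
        return "DeepSeek V4 (API or rented multi-GPU cluster)"
    if prefer_local and local_vram_gb >= 20:
        # Rule 1: a single 24 GB consumer GPU can hold a Q4 build of Qwen 3.5.
        return "Qwen 3.5 32B or 35B-A3B (local, Q4)"
    # Rule 2: otherwise, the cost-efficient API option.
    return "GLM-5.1 via API"


print(pick_model(local_vram_gb=24, needs_1m_context=False, prefer_local=True))
print(pick_model(local_vram_gb=8, needs_1m_context=False, prefer_local=True))
print(pick_model(local_vram_gb=24, needs_1m_context=True, prefer_local=True))
```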
Honest Limitations Each Model Hides
Marketing pages emphasize wins; here are the things each team plays down.
Qwen 3.5 limitations
- Subjective code quality on senior-engineer review tasks lags Claude Opus measurably (the SWE-Bench gap reflects this).
- Tool-use agentic behavior is less polished than Claude Code's tuning — models occasionally produce plausible-looking but broken function calls.
- Long-context coherence (above 64K tokens) degrades meaningfully even though the model nominally supports 128K.
DeepSeek V4 limitations
- Massive memory footprint locks you out of consumer hardware entirely.
- The "1M context" benchmark wins are partly Engram-specific evaluation tasks; for general long-context reasoning the practical ceiling is closer to 256K-400K before quality degrades.
- API routing has been intermittently flaky in 2026 — Western developers occasionally see 5-10s latency spikes during Chinese business hours.
- Multimodal generation is rough at the edges; text-only is much more polished than image output.
GLM-5.1 limitations
- Smaller context window (200K) than DeepSeek V4's 1M.
- Mixed-license terms: full commercial weights cost real money, and the research-license partial weights have restrictions some companies' legal teams won't accept.
- Less polished agentic behavior than Claude Code-tuned models or even Qwen 3.5's tool-use fine-tuning.
- Geographically routed by default — many Western teams hit higher latency than they're used to from Anthropic/OpenAI.
Watch out: All three models are moving targets. DeepSeek shipped V3 in late 2024 and V4 in early 2026 — that's two major versions in 18 months. Qwen 3.5 is itself a refresh of Qwen 3 from 2025. GLM-5.1 is barely a month old as of writing. Treat any specific benchmark number as a snapshot; what matters is the architectural direction (MoE scaling, conditional memory, coding-specific fine-tunes) which is converging across all three teams. The benchmark deltas you see today will likely shift by 2-3 points in any direction within 90 days.
Practical Setups for Each Pick
Qwen 3.5 32B local (RTX 4090)
```bash
ollama pull qwen3.5:32b-q4_K_M
ollama run qwen3.5:32b-q4_K_M

# Or via llama.cpp directly:
./llama-server \
  --model ./qwen3.5-32b-instruct-q4_k_m.gguf \
  --n-gpu-layers 99 --ctx-size 8192 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --flash-attn --port 8080
```
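The 8K context and the `q8_0` cache types in the command above are there to keep the KV cache small enough to sit next to the weights on a 24 GB card. A rough sizing sketch follows. The layer count, KV-head count, and head dimension are hypothetical placeholders for a 32B-class model with grouped-query attention, not values taken from the Qwen 3.5 model card; substitute the real ones from the model's config before relying on the result.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 ctx_tokens: int, bytes_per_elem: float) -> float:
    """Approximate KV cache size in GiB for one sequence."""
    elems = 2 * n_layers * n_kv_heads * head_dim * ctx_tokens  # 2 = keys + values
    return elems * bytes_per_elem / 2**30


# Hypothetical architecture values (assumptions, not from the model card).
N_LAYERS, N_KV_HEADS, HEAD_DIM = 64, 8, 128

FP16_BYTES = 2.0
Q8_0_BYTES = 34 / 32  # q8_0 stores 32 int8 values plus one fp16 scale per block

for ctx in (8_192, 32_768, 131_072):
    fp16 = kv_cache_gib(N_LAYERS, N_KV_HEADS, HEAD_DIM, ctx, FP16_BYTES)
    q8 = kv_cache_gib(N_LAYERS, N_KV_HEADS, HEAD_DIM, ctx, Q8_0_BYTES)
    print(f"ctx={ctx}: ~{fp16:.1f} GiB at FP16, ~{q8:.1f} GiB at q8_0")
```

With these placeholder values the cache grows linearly with context length, which is why a 24 GB card that is comfortable at 8K context runs out of room long before the nominal 128K window.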
GLM-5.1 via OpenRouter (drop-in OpenAI-compatible)
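```python
from openai import OpenAI

# OpenRouter exposes an OpenAI-compatible endpoint, so only the key and base URL change.
client = OpenAI(
    api_key="sk-or-v1-...",
    base_url="https://openrouter.ai/api/v1",
)

resp = client.chat.completions.create(
    model="zhipu/glm-5.1",
    messages=[{"role": "user", "content": "Refactor this..."}],
)
print(resp.choices[0].message.content)
```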
DeepSeek V4 via DeepSeek API direct
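```python
from pathlib import Path

from openai import OpenAI

client = OpenAI(
    api_key="sk-...",
    base_url="https://api.deepseek.com/v1",
)

# Hypothetical input: one text file holding the concatenated repository.
full_repo_dump = Path("repo_dump.txt").read_text()

# V4 supports 1M context, so long context can be passed directly:
resp = client.chat.completions.create(
    model="deepseek-v4",
    messages=[{"role": "user", "content": full_repo_dump}],
)
print(resp.choices[0].message.content)
```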
For local development with hot-swapping between models, tools like Aider and Continue let you point at different model backends per session, which is useful for A/B testing on your actual codebase.
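If you would rather script the A/B comparison yourself, one option is a small registry of OpenAI-compatible backends. The sketch below only builds the connection settings, so it runs without any SDK installed. The environment variable names and the local model label are hypothetical; the URLs and remote model IDs are the ones used in the examples above, and the local entry assumes the `llama-server` instance from the earlier command is serving its OpenAI-compatible API on port 8080.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    base_url: str
    model: str
    api_key_env: str  # name of the environment variable holding the key (hypothetical names)


BACKENDS = {
    "qwen-local": Backend("http://localhost:8080/v1", "qwen3.5-32b-instruct", "LOCAL_API_KEY"),
    "glm": Backend("https://openrouter.ai/api/v1", "zhipu/glm-5.1", "OPENROUTER_API_KEY"),
    "deepseek": Backend("https://api.deepseek.com/v1", "deepseek-v4", "DEEPSEEK_API_KEY"),
}


def client_kwargs(name: str) -> dict:
    """Return keyword arguments suitable for constructing an OpenAI-compatible client."""
    backend = BACKENDS[name]
    return {
        "base_url": backend.base_url,
        # A local server typically ignores the key, so fall back to a dummy value.
        "api_key": os.environ.get(backend.api_key_env, "not-needed"),
    }


# Usage with the openai package (not imported here so the sketch stays dependency-free):
#   client = OpenAI(**client_kwargs("glm"))
#   resp = client.chat.completions.create(model=BACKENDS["glm"].model, messages=[...])
for name in BACKENDS:
    print(name, "->", client_kwargs(name)["base_url"], BACKENDS[name].model)
```

Sending the same prompt set to each backend and diffing the resulting patches gives a more relevant signal for your codebase than any public leaderboard.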
Frequently Asked Questions
Is DeepSeek V4 better than Claude Opus 4.6?
On SWE-Bench Pro (real GitHub bug-fix tasks), DeepSeek V4 matches Claude Opus 4.6 within noise (57.8% vs 57.3%). On subjective code-quality review by senior engineers, Opus 4.6 still wins. DeepSeek V4 is dramatically cheaper per token and has a 1M-token context advantage; Opus has more polished tool-use and more reliable latency. Pick V4 for cost; Opus for quality on hard architectural tasks.
Can I run DeepSeek V4 locally?
Only on multi-card workstations or 192 GB+ Apple Silicon. The full model needs ~243 GB at Q4 quantization. Realistic local hosts: 4x H100 / 8x H200 datacenter cluster, or a 192 GB Mac Studio with heavy CPU offloading at degraded speed (6-8 tok/s). Single consumer GPU running V4 is not possible.
What is GLM-5.1?
GLM-5.1 is Zhipu AI's flagship open-weight model from late March 2026: a ~410B-parameter MoE with ~32B active per token and aggressive coding fine-tuning. It beats Claude Opus 4.6 on SWE-Bench Pro at a small fraction of the API price ($0.50 input / $1.50 output per million tokens through Zhipu direct). Mixed-license: research weights are free, commercial use requires a paid tier. Available via Zhipu's own API, OpenRouter, and direct weight download for self-hosting.
Which open-source LLM is best for coding in 2026?
For local hosting on consumer GPUs: Qwen 3.5 32B or 35B-A3B MoE. For API-only use where cost matters: GLM-5.1 (best SWE-Bench Pro score among open models). For long-context coding work over multi-file repos: DeepSeek V4 with its 1M-token Engram memory. Avoid generic "best open-source LLM" advice — the right pick depends entirely on your hardware and budget.
What is Engram conditional memory in DeepSeek V4?
Engram is DeepSeek V4's learned compression layer that maintains coherence across 1M tokens of context. Unlike sliding-window attention (which loses old context) or YaRN extension (which causes perplexity collapse beyond training context), Engram trains a small module to summarize-and-retrieve old context dynamically. It's the only frontier-tier 1M-context implementation in 2026 that doesn't degrade noticeably past 256K tokens.
What is SWE-Bench Pro?
SWE-Bench Pro is an extended version of the SWE-Bench Verified coding benchmark. Models are given real GitHub issues from popular repositories and graded on whether they produce a patch that passes the project's existing test suite. SWE-Bench Pro adds harder tasks and broader language coverage versus the original. Top scores in April 2026 cluster around 57-58% — a meaningful jump from late-2024 frontier models in the 30-40% range.
Is Qwen 3.5 32B good enough for production coding?
For 80% of day-to-day coding work — refactors, function-level edits, test generation, code review assists — yes, Qwen 3.5 32B is production-grade. For the hardest 20% (multi-file architectural changes, subtle algorithmic bugs, large-codebase refactors), frontier API models (Claude Opus, GLM-5.1, DeepSeek V4) still outperform by 5-10 percentage points. Many teams use Qwen locally for routine work and reach for API models for hard tasks.
The Right Pick Depends on Your Hardware Budget
One sentence per decision: own a 24 GB GPU and want frontier-class local coding → Qwen 3.5. Need 1M-token context for repo-scale reasoning → DeepSeek V4 (via API or rented cluster). Want SWE-Bench-leading quality at API prices that fit a startup budget → GLM-5.1. The bigger story is that as of April 2026, the gap between open-weight frontier models and proprietary frontier APIs has effectively collapsed on coding tasks — and the price differential is so large that the open models have crossed the threshold of "default commercial choice for cost-sensitive workloads."
Written by
Abhishek Patel
Infrastructure engineer with 10+ years building production systems on AWS, GCP, and bare metal. Writes practical guides on cloud architecture, containers, networking, and Linux for developers who want to understand how things actually work under the hood.