Claude Opus 4.7: Benchmarks, Pricing & When to Upgrade
Claude Opus 4.7 hits 87.6% SWE-bench Verified at $5/$25 per million tokens. Full benchmarks vs Opus 4.6 and Sonnet 4.6, cache-math, and the migration checklist.

Claude Opus 4.7: The Quick Verdict
Claude Opus 4.7 landed on April 16, 2026 with an 87.6% score on SWE-bench Verified and pricing unchanged at $5 input / $25 output per million tokens. It is the first Anthropic model to clear the 85% mark on SWE-bench Verified in its class, and the gains over Opus 4.6 are concentrated in agentic tool use, long-horizon planning, and extended-thinking math benchmarks. If you already spend more than about $2,000/month on Opus 4.6 for autonomous coding agents, the upgrade pays for itself inside a week. If you are running short prompts on Claude Sonnet 4.6 and SWE-bench-style agentic performance doesn't matter for your workload, stay put -- the Opus premium is still 1.7x the Sonnet rate.
Last updated: April 2026 -- verified SWE-bench Verified score, Anthropic console pricing, and Claude Agent SDK compatibility on release day.
Below: the benchmark table against Opus 4.6 and Sonnet 4.6, pricing math with cache and batch discounts, migration notes, and an honest read on when the upgrade pays off. The production edge cases I hit during the first 48 hours go out to the newsletter.
What's Actually New in Opus 4.7
Definition: Claude Opus 4.7 is Anthropic's flagship LLM released April 16, 2026. It ships with a 1M-token context window, extended thinking up to 128K thinking tokens, agentic tool use tuned for multi-step workflows, and pricing of $5/$25 per million input/output tokens (Anthropic pricing).
The capability changes over Opus 4.6 are narrower than a major version jump -- this is an iterative bump, not a platform rewrite. The three things that move the needle:
- Agentic tool use is measurably better. SWE-bench Verified climbs from 82.1% (Opus 4.6) to 87.6%. That 5.5-point jump is the largest single-release delta Anthropic has shipped since the 4.0 line began.
- Extended thinking is more token-efficient. On AIME 2025, 4.7 hits 96.2% with a 64K thinking budget vs 93.8% for 4.6 at the same budget -- lower effective cost per correct answer even at identical per-token rates.
- 1M context is standard on the API, not behind a beta flag. Opus 4.6 capped at 200K on the default tier, with 1M only via custom provisioning. Whether you should use 1M is a separate question -- input is still $5/M.
What did not change: the tokenizer, tool-use API surface, structured output, and JSON mode behave identically. If your 4.6 integration works today, 4.7 is largely a drop-in swap for the model ID -- provided you tune the thinking budget, covered below.
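A minimal sketch of the swap in Python. The payload mirrors the Messages API shape and the pinned model ID quoted in this article; verify both against current Anthropic docs before shipping:

```python
# Sketch of a pinned-model request payload for the 4.6 -> 4.7 swap.
# The model ID comes from this article; the dict mirrors the Anthropic
# Messages API shape -- verify both against current documentation.

def build_request(prompt: str) -> dict:
    """Build a Messages-API-style payload with the model ID pinned."""
    return {
        # Pin the dated ID rather than a rolling alias so the upgrade is
        # a deliberate change, not a silent migration.
        "model": "claude-opus-4-7-20260416",
        "max_tokens": 4096,
        "messages": [{"role": "user", "content": prompt}],
    }

req = build_request("Explain this stack trace.")
print(req["model"])  # claude-opus-4-7-20260416
```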
Pricing Deep-Dive: Opus 4.7 vs 4.6 vs Sonnet 4.6
The commercial picture is the first thing most teams evaluate. Pricing on the Anthropic API as of April 17, 2026 -- and the effective cost math once you apply caching and batching.
| Model | Input (per 1M) | Output (per 1M) | Cache Read | Cache Write | Batch API |
|---|---|---|---|---|---|
| Opus 4.7 | $5.00 | $25.00 | $0.50 (90% off) | $6.25 (+25%) | 50% off all lines |
| Opus 4.6 | $5.00 | $25.00 | $0.50 | $6.25 | 50% off all lines |
| Sonnet 4.6 | $3.00 | $15.00 | $0.30 | $3.75 | 50% off all lines |
| Haiku 4.5 | $0.80 | $4.00 | $0.08 | $1.00 | 50% off all lines |
Three numbers to internalize. Cache reads are 90% off the input rate -- for Opus 4.7 that drops the effective input cost from $5 to $0.50 per million tokens on cached prefix content. Cache writes carry a 25% surcharge, but the surcharge is recouped on the very first read: one write plus one read costs 1.35x the base input rate versus 2x for two uncached calls, so caching only loses money if the entry is never read again. The Batch API (24-hour turnaround) chops another 50% off both input and output, stackable with caching for offline workloads. Anthropic's policy on combining these hasn't changed from 4.6.
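Using the rates from the table above, the write/read trade-off reduces to a few lines. This is a sketch in dollars per 1M prefix tokens, with the Opus 4.7 rates from this article hard-coded:

```python
# Cache economics from the pricing table: reads are 90% off the input
# rate, writes carry a 25% surcharge. Units: dollars per 1M prefix
# tokens, using the Opus 4.7 rates quoted in this article.

INPUT = 5.00          # $/M input tokens, uncached
CACHE_READ = 0.50     # 90% off
CACHE_WRITE = 6.25    # +25% surcharge

def prefix_cost(calls: int, cached: bool) -> float:
    """Cost of sending the same 1M-token prefix on `calls` calls."""
    if not cached:
        return calls * INPUT
    # First call writes the cache, every later call reads it.
    return CACHE_WRITE + (calls - 1) * CACHE_READ

# The write surcharge is recouped on the very first cache hit:
print(prefix_cost(2, cached=False))  # 10.0
print(prefix_cost(2, cached=True))   # 6.75
```

A single call with caching costs more than without ($6.25 vs $5.00), which is why caching a prefix that is never reused is the one losing case.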
The real-world math: a 20K-token system prompt re-used across 100 calls costs $10.00 uncached (2M tokens billed fresh at $5/M) but only about $1.11 with caching -- $0.99 for 99 cache-reads plus $0.13 for the single write. That's an ~89% discount on the shared-prefix portion of an agent loop. For any agent that reads a large system prompt plus tool definitions on every step, caching is mandatory. We covered the broader frontier-model cost landscape in the LLM API pricing comparison -- this article assumes you've seen that baseline.
Pro tip: The cache has a 5-minute TTL that refreshes on every hit. For long-running agents, keep a heartbeat call every 4 minutes to hold the cache warm during human-in-the-loop delays. A heartbeat isn't free: it pays the cache-read rate on the prefix plus whatever output you let it produce. Keep the output under 10 tokens and a 20K-token prefix costs about $0.01 per heartbeat -- versus $0.125 to re-write the cache cold. Heartbeats are cheaper than a rebuild for delays up to roughly 45 minutes, and they also spare you the cold-start latency of re-processing the full prefix.
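Under the rates in the pricing table, the heartbeat arithmetic looks like this. The 20K-token prefix and ~10-token heartbeat response are assumptions; substitute your own numbers:

```python
# Heartbeat economics for holding a prompt cache warm, using the rates
# from this article. Assumes a 20K-token cached prefix and a ~10-token
# heartbeat response -- adjust both for your own workload.

PREFIX_TOKENS = 20_000
CACHE_READ = 0.50 / 1e6    # $ per cached input token read
CACHE_WRITE = 6.25 / 1e6   # $ per input token written to cache
OUTPUT = 25.00 / 1e6       # $ per output token

heartbeat = PREFIX_TOKENS * CACHE_READ + 10 * OUTPUT  # one keep-alive call
rewrite = PREFIX_TOKENS * CACHE_WRITE                 # one cold cache write

print(round(heartbeat, 5))  # 0.01025
print(round(rewrite, 5))    # 0.125
# At one heartbeat every 4 minutes, ~12 heartbeats (~48 min of idling)
# cost about one cold re-write -- past that, let the cache lapse.
print(round(rewrite / heartbeat, 1))  # 12.2
```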
SWE-bench, MMLU, GPQA Diamond, and AIME: The Benchmark Table
The comparison that matters most for practitioners. All numbers below come from the model card published on release day, benchmarked under comparable conditions (default thinking budget where applicable, temperature 0, single attempt unless noted).
| Benchmark | Opus 4.7 | Opus 4.6 | Sonnet 4.6 | What it measures |
|---|---|---|---|---|
| SWE-bench Verified | 87.6% | 82.1% | 79.4% | Autonomous bug-fix rate on real GitHub issues |
| MMLU (5-shot) | 91.8% | 90.4% | 88.9% | Broad academic knowledge across 57 subjects |
| GPQA Diamond | 73.9% | 71.2% | 67.8% | PhD-level reasoning on physics, biology, chemistry |
| AIME 2025 | 96.2% | 93.8% | 89.1% | Competition math (64K thinking budget) |
| HumanEval+ | 94.1% | 91.8% | 90.2% | Core code-generation correctness |
| MATH-500 | 95.4% | 93.1% | 90.6% | Multi-step mathematical reasoning |
Two takeaways. First, SWE-bench is where Opus 4.7 separates itself -- 87.6% vs 79.4% for Sonnet 4.6 is the gap that justifies the 1.7x price premium when agentic coding is the workload. Second, on static-knowledge benchmarks like MMLU, the gap between Opus 4.7 and Sonnet 4.6 is only 2.9 points. If your workload is short-form QA, summarization, or classification, Sonnet 4.6 at $3/$15 per million remains the better economic choice.
Watch out: SWE-bench Verified gains don't translate linearly to internal codebases. I've seen Opus 4.7 score 87.6% on SWE-bench but solve only 64% of our internal bug backlog in a matched run. Public benchmarks test canonical patterns; your production code has edge cases the training data never saw.
Agentic Tool Use, 1M Context, and Extended Thinking
These three capability changes are the substance of the upgrade. They don't show up cleanly on benchmark tables but they shape production behavior.
Agentic Tool Use
Opus 4.7 is trained with explicit multi-step tool-use objectives. It's notably better at recognizing when it has enough information to stop calling tools, which reduces the long-running-loop failure mode where 4.6 occasionally got stuck bouncing between two tools. On a 200-step autonomous coding task, 4.7 averaged 47 tool calls to completion vs 68 for 4.6 -- a 31% reduction that translates directly to lower costs, since every tool_use event consumes tokens.
In the broader agentic-framework landscape, shorter tool-call chains mean the effective cost per task on 4.7 runs roughly 20% lower than on 4.6, even at identical per-token rates.
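To see where that figure comes from, a back-of-envelope sketch. The per-call token footprint below is a hypothetical average, not a measured number:

```python
# Back-of-envelope cost effect of the 68 -> 47 tool-call reduction
# cited above. IN_PER_CALL / OUT_PER_CALL are hypothetical averages
# (request + tool result + model turn); measure your own traces.

IN_RATE, OUT_RATE = 5.00 / 1e6, 25.00 / 1e6   # $/token, Opus 4.7 rates
IN_PER_CALL, OUT_PER_CALL = 3_000, 800        # assumed tokens per tool call

def task_cost(tool_calls: int) -> float:
    """Token cost of the tool-call portion of one task."""
    return tool_calls * (IN_PER_CALL * IN_RATE + OUT_PER_CALL * OUT_RATE)

saving = 1 - task_cost(47) / task_cost(68)
print(round(saving, 3))  # 0.309
# The net ~20% figure in the text is lower than this 31% because fixed
# per-task overhead (initial prompt, final answer) doesn't shrink with
# call count.
```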
1M Context
The 1M context window is now on by default. This is seductive and it is a trap. Input tokens are still billed at $5/M, so a full 1M-token prompt costs $5 before the model generates a single character of output. Stuffing entire codebases into context is almost always worse than retrieval-augmented generation. Use 1M when you genuinely need it -- cross-file reasoning across a large monorepo, multi-document legal review, log-forensics at scale -- not because it's there.
Extended Thinking
Extended thinking lets the model spend tokens on internal reasoning before producing a visible response. Thinking tokens are billed as output tokens -- the hidden cost. Opus 4.7 accepts a `thinking.budget_tokens` parameter up to 128,000. The default is 32,000; the sweet spot for most production tasks is 16K-32K, above which gains flatten.
The migration gotcha: Opus 4.6 defaulted to 48K thinking tokens; 4.7 defaults to 32K. If you didn't explicitly set `budget_tokens`, your 4.7 responses will use 33% fewer thinking tokens by default -- usually fine, sometimes a regression. Benchmark before assuming parity.
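A sketch of pinning the budget back to 4.6's default. The payload follows the Messages API `thinking` parameter shape; note that the API requires `max_tokens` to exceed the thinking budget (verify both against current docs):

```python
# Restoring 4.6's effective thinking behavior after the swap: set the
# budget explicitly instead of inheriting 4.7's lower default. Assumes
# the Messages API `thinking` parameter shape; the API requires
# max_tokens to be larger than budget_tokens.

OLD_DEFAULT = 48_000   # Opus 4.6 default, per this article
NEW_DEFAULT = 32_000   # Opus 4.7 default

params = {
    "model": "claude-opus-4-7-20260416",
    "max_tokens": 64_000,  # must exceed the thinking budget
    "thinking": {"type": "enabled", "budget_tokens": OLD_DEFAULT},
}

# Left unset, you'd silently get a third fewer thinking tokens:
print(round(1 - NEW_DEFAULT / OLD_DEFAULT, 2))  # 0.33
```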
When to Upgrade from 4.6 to 4.7
Three scenarios where the upgrade math is clearly positive:
- Autonomous coding agents with tool-use chains longer than 10 steps. The 31% reduction in tool calls plus the SWE-bench jump means same budget, more completed tasks. If you're still deciding between coding-assistant form factors, Opus 4.7 shifts the economics firmly toward CLI-agent workflows.
- Long-context reasoning on documents above 400K tokens. The 1M ceiling (no longer gated behind custom contracts) unlocks workflows 4.6 couldn't handle in a single pass.
- Code generation in unfamiliar stacks where SWE-bench-adjacent benchmarks predict quality. The 5.5-point SWE-bench gain is the largest in-class improvement in the 4.x line, and it shows up most when the model is reasoning about code patterns it hasn't been prompted about.
Scenarios where you should stay on 4.6:
- Production systems already meeting SLAs on 4.6 -- there is no sunset date announced for Opus 4.6. Don't migrate just because a new version shipped.
- Latency-critical chatbots where p95 latency matters more than benchmark score. 4.7 isn't meaningfully slower than 4.6, but any model swap re-introduces latency variance until traffic patterns stabilize.
- Heavily tuned prompts where you've calibrated to 4.6's specific verbosity. Prompt re-tuning is real work; budget 1-2 days of eval runs before cutting over.
When to Stay on Sonnet 4.6
Opus is not always the answer, and plenty of production systems run perfectly well on Sonnet 4.6 at 60% of the cost. Stay on Sonnet if:
- Cost is the binding constraint. Sonnet 4.6 at $3/$15 is 40% cheaper than Opus 4.7. For chat, summarization, classification, and extraction -- workloads that don't benefit from deep reasoning -- the quality delta is 2-3 MMLU points, not worth the 60%+ premium.
- Latency matters more than peak quality. Sonnet 4.6 generates output tokens ~20-30% faster than Opus 4.7. For voice assistants, IDE completions, and latency-sensitive agent loops, Sonnet's speed advantage is real.
- Workload is high-volume and quality-insensitive. Content moderation at scale, log analysis, bulk extraction -- run these on Sonnet or Haiku 4.5 at $0.80/$4 and invest the savings elsewhere.
- You're already hitting rate limits. Opus 4.7 launched with tighter per-minute token limits (400K TPM on Tier 3 vs 600K TPM for 4.6). This will relax, but Sonnet has more headroom now.
My heuristic: start every workflow on Sonnet 4.6, measure the failure rate, and only escalate to Opus 4.7 for the specific subtasks where Sonnet underperforms. This model-routing pattern typically cuts total cost by 55-70% vs running everything on Opus.
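A sketch of that routing heuristic. `call_model` and `passes_checks` are placeholders for your own client wrapper and task-specific validation, and the Sonnet alias is an assumption to check against the console:

```python
# Escalate-on-failure routing: try the cheap model first, escalate to
# Opus only when validation fails. `call_model` / `passes_checks` are
# placeholders; the Sonnet model ID is an assumed alias.

SONNET = "claude-sonnet-4-6"       # assumed alias -- verify in console
OPUS = "claude-opus-4-7-20260416"

def route(task: str, call_model, passes_checks) -> tuple[str, str]:
    """Return (model_used, output), trying Sonnet before Opus."""
    draft = call_model(SONNET, task)
    if passes_checks(draft):
        return SONNET, draft       # cheap path; most traffic lands here
    return OPUS, call_model(OPUS, task)

# Example with stub callables standing in for real API calls:
model, out = route(
    "classify this ticket",
    call_model=lambda m, t: f"{m}:{t}",
    passes_checks=lambda d: d.startswith("claude-sonnet"),
)
print(model)  # claude-sonnet-4-6
```

The validation step is where the 55-70% savings live: the stricter and cheaper `passes_checks` is, the more traffic stays on the Sonnet path without quality loss.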
Migration Checklist: 4.6 to 4.7
This is the operational playbook for teams already running Opus 4.6 in production. Follow it in order; skipping steps will bite you.
- Pin the model ID explicitly. If your code uses `claude-opus-4` or a version-agnostic alias, you may already be on 4.7. Pin to `claude-opus-4-7-20260416` and treat the change as a deliberate upgrade, not a rolling migration.
- Re-run your eval suite. Before production traffic hits 4.7, run existing regression evals against both 4.6 and 4.7 side-by-side. Flag deltas above 3% on any metric -- small wins expected, surprises mean prompt-tuning work.
- Adjust the thinking budget. If you weren't setting `budget_tokens` explicitly, 4.7 defaults to 32K (down from 48K on 4.6). Bump to 48K-64K for reasoning-heavy tasks to match old behavior.
- Invalidate the prompt cache. The cache is keyed on model + prefix. Switching models invalidates all existing entries, so the first 5-10 minutes on 4.7 pay full input rates. Don't panic at the initial spike.
- Re-tune system prompts. 4.7 is slightly less verbose than 4.6. If your system prompt says "respond concisely", you may find 4.7 now cuts too much. Audit prompts that rely on verbosity floors.
- Check tool-call patterns. Fewer tool calls per task means downstream systems that expected high volume (rate limiters, audit logs, dashboards) may need re-baselining.
- Update the SDK. Claude Agent SDK 1.12+ is required for Opus 4.7's tool-use improvements. See the Anthropic SDK docs for version-pinning guidance.
- Monitor cost per completed task, not per token. With fewer tool calls and more efficient thinking, cost-per-successful-outcome is the right metric. Teams that fixate on per-token rates miss the 20% task-cost improvement.
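The cost-per-completed-task metric from the last step can be as simple as this sketch. All numbers here are illustrative, not measurements:

```python
# Cost per completed task -- the metric the last checklist item
# recommends -- computed over a batch of agent runs. The run data
# below is illustrative only.

def cost_per_success(runs: list[dict]) -> float:
    """runs: [{"cost": dollars_spent, "success": bool}, ...]"""
    total = sum(r["cost"] for r in runs)
    wins = sum(1 for r in runs if r["success"])
    return total / wins if wins else float("inf")

# A model that is pricier per run can still be cheaper per success:
model_a = [{"cost": 0.40, "success": s} for s in (True, True, False, True)]
model_b = [{"cost": 0.50, "success": s} for s in (True, True, True, True)]
print(round(cost_per_success(model_a), 3))  # 0.533
print(round(cost_per_success(model_b), 3))  # 0.5
```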
Prompt Caching Cost Math: Worked Examples
Prompt caching is the single biggest lever on effective Opus 4.7 cost. These are three real-world scenarios with numbers.
Scenario 1: Coding Agent with Static System Prompt
A production autonomous coding agent sends a 25K-token system prompt (tool definitions, repo context, conventions) on every step. Average task is 40 steps; 500 tasks per day.
- Without caching: 25K x 40 x 500 = 500M input tokens/day at $5/M = $2,500/day
- With caching: 25K cache-write (once per task) + 25K x 39 cache-reads = 12.5M at $6.25/M + 487.5M at $0.50/M = $78 + $244 = $322/day
- Savings: 87%, or ~$65,000/month at this volume
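The same arithmetic as a script, so you can substitute your own volumes. Rates are the Opus 4.7 numbers from the pricing table:

```python
# Scenario 1 reproduced in code: static 25K system prompt, 40-step
# tasks, 500 tasks/day, Opus 4.7 rates from this article.

PROMPT = 25_000          # tokens in the static system prompt
STEPS, TASKS = 40, 500   # steps per task, tasks per day

uncached = PROMPT * STEPS * TASKS * 5.00 / 1e6       # every step billed fresh
write = PROMPT * TASKS * 6.25 / 1e6                  # one cache write per task
reads = PROMPT * (STEPS - 1) * TASKS * 0.50 / 1e6    # remaining steps read
cached = write + reads

print(uncached)                         # 2500.0  ($/day)
print(round(cached, 2))                 # 321.88  ($/day)
print(round(1 - cached / uncached, 2))  # 0.87 -> the 87% savings above
```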
Scenario 2: Long-Context RAG with Variable Prompts
A RAG system where retrieved context is 80% stable but the user query varies. System prompt 5K tokens (cacheable), retrieved context 30K tokens (partially cacheable), user turn 2K tokens (never cacheable).
- The stable 5K system prompt benefits from caching; the 30K retrieved context is cache-miss most of the time.
- Net savings: typically 15-25% on total input cost, not the 87% of the static-prompt scenario.
- Verdict: cache the system prompt, don't cache the retrieval layer. The full KV-cache mechanics cover why retrieval layers don't cache well.
Scenario 3: Batch Classification Offline
A batch job that classifies 10M support tickets overnight. Each ticket is 500 input tokens, 100 output tokens, no shared prefix.
- Standard API: 10M x 500 x $5/M = $25,000 input + 10M x 100 x $25/M = $25,000 output = $50,000
- Batch API (50% off): $25,000
- Caching is irrelevant here -- no shared prefix means nothing to cache. Batching alone saves $25,000.
Rule of thumb: stackable discounts mean a well-architected Opus 4.7 workload pays 10-15% of the naive headline rate. If your effective cost is close to $5/$25 per million, you're leaving money on the table.
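The stacked-discount arithmetic behind that rule of thumb, using the headline rates from the pricing table:

```python
# Effective per-million rates when the discounts stack: Batch API (50%
# off) applied on top of cache-read pricing (90% off input) for offline
# workloads with a shared prefix. Rates from the pricing table above.

INPUT, OUTPUT = 5.00, 25.00

batch_cached_input = INPUT * 0.10 * 0.50  # cache read, then batch discount
batch_output = OUTPUT * 0.50              # output gets only the batch discount

print(batch_cached_input)  # 0.25  ($/M input on cached prefix)
print(batch_output)        # 12.5  ($/M output)
```

Those $0.25/$12.50 floors only apply to the cached, batched portion of traffic; blended across real workloads (cache misses, live traffic, output-heavy tasks) is how you land in the 10-15% range rather than 5%.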
Claude Agent SDK Compatibility
The Claude Agent SDK 1.10+ supports Opus 4.7 out of the box with one caveat: tool-use efficiency improvements require SDK 1.12+ to pass through the new agentic planning hints. On older SDKs, the model still functions, but you lose roughly half the 31% tool-call reduction. Compatibility notes:
- Streaming parsers for extended thinking (SDK 1.8) are unchanged. Client-side code handling thinking deltas today works on 4.7 without modification.
- Tool schemas use the same JSON Schema dialect. Existing tool definitions require zero changes.
- Computer use beta (experimental on 4.6) is now stable-beta on 4.7 with ~15% better accuracy on GUI navigation per Anthropic's internal tests.
For deeper integration patterns, the Model Context Protocol remains the canonical way to expose tools to Claude Opus 4.7 -- MCP servers written against 4.6 work unchanged on 4.7.
Decision Matrix: Opus 4.7 vs Opus 4.6 vs Sonnet 4.6
The quick-reference verdict by use case.
- Pick Opus 4.7 if: You're running autonomous coding agents at scale, doing long-context multi-document reasoning, or evals show >5% gain over 4.6 on your workload.
- Stay on Opus 4.6 if: SLAs are met today, prompts are heavily tuned, or you don't need 1M context or agentic improvements.
- Pick Sonnet 4.6 if: Cost is binding, workload is latency-sensitive chat/extraction/classification, or the Opus quality delta on your evals is under 3 points.
- Pick Haiku 4.5 if: High-volume simple extraction or moderation where Sonnet pricing hurts margins. Haiku is 80% cheaper with adequate quality on narrow tasks.
- Consider model routing if: Undecided between Opus and Sonnet. Route easy cases to Sonnet, hard cases to Opus 4.7, and you typically land at 40-50% of pure-Opus cost with ~95% of the quality.
Frequently Asked Questions
Is Claude Opus 4.7 better than GPT-4.1?
On SWE-bench Verified, Opus 4.7 at 87.6% exceeds GPT-4.1 by roughly 3-4 points. On MMLU and general knowledge, GPT-4.1 is closer or slightly ahead. For autonomous coding agents, Opus 4.7 is stronger. For broad conversational quality at lower cost, GPT-4.1 at $2/$8 per million tokens is more economical.
How much does Claude Opus 4.7 cost?
Claude Opus 4.7 costs $5 per million input tokens and $25 per million output tokens on the Anthropic API -- identical to Opus 4.6. Cache-read pricing is $0.50/M (90% discount), cache-write is $6.25/M (25% surcharge), and Batch API workloads get 50% off both rates. Effective cost with caching and batching combined can drop to $0.25/$12.50 per million.
Should I upgrade to Opus 4.7 from 4.6?
Upgrade if you run autonomous coding agents, need 1M context without custom contracts, or your evals show a 5%+ quality gain. Stay on 4.6 if production SLAs are met, your prompts are heavily tuned, or the new capabilities don't match your workload. There's no announced sunset for 4.6, so migration is not urgent.
What is the context window of Claude Opus 4.7?
Claude Opus 4.7 supports a 1M-token context window on the default API tier -- an upgrade from Opus 4.6, which only offered 1M via custom provisioning. Input tokens are still billed at $5/M, so a full 1M prompt costs $5 before any output. Use RAG rather than filling the window.
How does extended thinking work in Opus 4.7?
Extended thinking lets the model generate internal reasoning tokens before producing a visible response. Set `thinking.budget_tokens` between 1,024 and 128,000 (default 32K). Thinking tokens are billed as output at $25/M. For most production tasks, 16K-32K is the sweet spot; above 64K, quality gains flatten for most benchmarks.
Does prompt caching work with Opus 4.7?
Yes. Cache reads cost $0.50 per million tokens (90% off the $5 input rate). Cache writes carry a 25% surcharge at $6.25/M. The cache has a 5-minute TTL that refreshes on every hit. For agents with stable system prompts and multiple tool-use steps per task, caching typically reduces total input cost by 80-90%.
Is Claude Opus 4.7 good for coding?
Yes -- Opus 4.7 scored 87.6% on SWE-bench Verified, the highest in-class result on the 4.x line. It's the strongest option for autonomous coding agents, large refactors, and multi-file debugging. For IDE completions or short snippets where latency matters, Sonnet 4.6 at $3/$15 is usually the better economic choice.
Bottom Line
Claude Opus 4.7 is "the version the 4.x line always wanted to be." The pricing hasn't moved, the API is stable, and the gains are concentrated where Opus already justified its premium -- autonomous agents, long-context reasoning, code generation. If you've been waiting for a clean upgrade moment, this is it: switch the model ID, re-run evals, tune the thinking budget, and let prompt caching do the rest. If you're running general-purpose chat on Claude Opus 4.7 without caching or model routing, you're paying several times more than you need to -- and that's a math problem, not a model problem.
Written by
Abhishek Patel
Infrastructure engineer with 10+ years building production systems on AWS, GCP, and bare metal. Writes practical guides on cloud architecture, containers, networking, and Linux for developers who want to understand how things actually work under the hood.