Claude Opus 4.7 is live. Not “coming soon,” not “leaked,” not “this week.” Live — as of a few hours ago.
After three days of Polymarket odds, The Information exclusives, and stock market jitters, Anthropic shipped its next flagship model on April 16, 2026. Model ID: claude-opus-4-7. Available right now on the Claude platform, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry.
The AI Design Tool that everyone expected to ship alongside it? Didn’t happen. Today is purely about the model upgrade. And the numbers are worth paying attention to.
What Is Claude Opus 4.7?
If you’re new to Claude, here’s the quick version: Claude is Anthropic’s AI assistant — similar to ChatGPT or Google Gemini. Opus is their most powerful model, the one you use for hard problems. Version 4.7 replaces version 4.6, which launched in February 2026.
For developers and power users: Opus 4.7 is an incremental but meaningful upgrade over 4.6. Better at coding, better at vision, new effort controls, and a multi-agent code review command. Not a ground-up rewrite — more like a car that got a tuned engine, a sharper camera, and a new driving mode.
The Benchmarks: What Actually Improved
Numbers first. Here’s how Opus 4.7 compares to its predecessor and the competition:
Opus 4.7 vs Opus 4.6
| Benchmark | Opus 4.7 | Opus 4.6 | Change |
|---|---|---|---|
| SWE-bench Verified (real-world coding) | 87.6% | 80.8% | +6.8 |
| SWE-bench Pro (harder coding) | 64.3% | ~60% | ~+4 |
| CursorBench (IDE coding) | 70% | 58% | +12 |
| GPQA Diamond (graduate reasoning) | 94.2% | ~91% | +3 |
| Terminal-Bench 2.0 (CLI tasks) | 69.4% | — | new |
| OSWorld-Verified (desktop automation) | 78.0% | — | new |
| MCP-Atlas (tool use) | 77.3% | 75.8% | +1.5 |
| MMMLU (multilingual Q&A) | 91.5% | 91.1% | +0.4 |
The headline number: SWE-bench Verified jumped 6.8 points — from 80.8% to 87.6%. That’s the benchmark that measures whether an AI can actually fix real bugs in real open-source repositories. A 6.8-point jump in one release is significant. For context, the jump from Opus 4.5 to 4.6 was about 5 points.
CursorBench — which tests AI coding inside the Cursor IDE — jumped 12 points. That’s the biggest single-benchmark improvement in the release.
Opus 4.7 vs the Competition
| Benchmark | Opus 4.7 | GPT-5.4 | Gemini 3.1 Pro |
|---|---|---|---|
| SWE-bench Verified | 87.6% | ~83% | ~81% |
| GPQA Diamond | 94.2% | 94.4% | 94.3% |
| MMMLU | 91.5% | 90.8% | 91.2% |
On coding, Opus 4.7 leads. On graduate-level reasoning (GPQA), GPT-5.4 has a hair-thin 0.2-point edge. Multilingual is essentially a three-way tie. The takeaway: Opus 4.7 is the best coding model available right now, and competitive on everything else.
One efficiency note that surprised people: low-effort Opus 4.7 performs roughly like medium-effort Opus 4.6. If you’re on Opus 4.6 today and switch to 4.7 at the same effort level, you’re effectively getting a free upgrade. If you drop to a lower effort level, you save tokens while maintaining roughly the same quality.
The developer community noticed fast. Cursor announced Opus 4.7 support within minutes of launch — with 50% off to drive adoption. Replit reported achieving the same quality at lower cost with the new model. And Poe’s platform team confirmed that “coding performance has improved meaningfully in Opus 4.7 compared to Opus 4.6.” The early consensus: this is a real upgrade, not a marketing bump.
New Feature: xhigh Effort Level
Claude has had effort levels for a while — low, medium, high, and max. They control how much “thinking” the model does before responding. More thinking = better reasoning = more tokens consumed.
Opus 4.7 adds xhigh — a new level between high and max. Think of it as “try harder than usual, but don’t go nuclear.”
Here’s the practical decision framework:
| Effort | Best For | Token Cost |
|---|---|---|
| Low | Simple, well-defined tasks. Quick answers. | Lowest |
| Medium | Everyday work — bugs, features, refactoring | Standard |
| High | Complex debugging, multi-file refactors, architecture | Higher |
| xhigh | Hard problems where high wasn’t quite enough | ~2x high |
| Max | Genuinely hard: algorithmic complexity, mysterious bugs, critical design | Highest |
When is xhigh worth the tokens? When you find yourself re-prompting on high because Claude’s first answer wasn’t quite right. If a task takes three tries on high, one try on xhigh might be cheaper overall — fewer retries means fewer total tokens.
For the API: use thinking: {type: "adaptive"} with the effort parameter. Note that manual extended thinking is no longer supported in Opus 4.7 — it’s adaptive-only now. Thinking tokens are billed at output rates ($25/M).
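To make the shape concrete, here is a minimal sketch of a request body using adaptive thinking with the new xhigh effort level. The `thinking` and `effort` field shapes are taken from this article, not verified against the live schema; check the API docs for the exact field names before relying on them.

```python
import json

def build_request(prompt: str, effort: str = "high") -> dict:
    """Construct a Messages API request body (sketch; field shapes assumed)."""
    return {
        "model": "claude-opus-4-7",
        "max_tokens": 4096,
        # Manual extended thinking is gone in 4.7 -- adaptive is the only mode.
        "thinking": {"type": "adaptive"},
        "effort": effort,  # low | medium | high | xhigh | max
        "messages": [{"role": "user", "content": prompt}],
    }

body = build_request("Find the race condition in this queue implementation.",
                     effort="xhigh")
print(json.dumps(body, indent=2))
```

The point of building the body as a plain dict first: you can log or diff it when comparing effort levels across runs, before any tokens are spent.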
New Feature: /ultrareview — Multi-Agent Code Review
This is the feature that developers will talk about most. /ultrareview is a new Claude Code slash command that runs a multi-agent code review on your codebase.
Instead of a single Claude instance scanning your code, /ultrareview spawns multiple specialized agents — one for security, one for logic, one for performance, one for style — and synthesizes their findings into a single report. It’s like having four senior engineers review your PR at once.
Early reactions from the developer community describe it as catching issues that single-pass review consistently misses — particularly subtle logic errors and security patterns that require cross-file reasoning.
We haven’t seen detailed benchmarks on /ultrareview specifically, so consider this a “promising but verify” feature for now.
New Feature: Task Budgets (Public Beta)
Task budgets let you set a spending cap on agentic workloads. If you’ve ever kicked off a Claude Code task and watched it run for 45 minutes consuming tokens with no stop in sight, this is the fix.
Set a budget in dollars or tokens for a task, and Claude will work within that constraint — prioritizing the highest-impact actions and stopping when the budget is consumed. Available in public beta through the API.
This matters for teams. Engineering leads who’ve been nervous about giving developers open-ended Claude Code access now have a cost control mechanism. Set a $5 budget per task, and nobody accidentally burns $50 on a runaway refactoring loop.
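The budget mechanics are easy to reason about even before the beta stabilizes. Here is a client-side sketch of the idea — stop dispatching agent steps once a dollar cap would be exceeded. The real feature is enforced server-side and its API surface may differ; this only illustrates the accounting, using the article’s rates ($5/M input, $25/M output).

```python
INPUT_RATE = 5 / 1_000_000    # $ per input token
OUTPUT_RATE = 25 / 1_000_000  # $ per output (and thinking) token

def run_with_budget(actions, budget_usd: float):
    """actions: iterable of (input_tokens, output_tokens) per agent step.
    Returns (steps_completed, dollars_spent), stopping before overshoot."""
    spent = 0.0
    completed = 0
    for in_tok, out_tok in actions:
        cost = in_tok * INPUT_RATE + out_tok * OUTPUT_RATE
        if spent + cost > budget_usd:
            break  # budget consumed: stop rather than overshoot
        spent += cost
        completed += 1
    return completed, round(spent, 4)

# Five identical steps at $1.50 each against a $5 cap: only three fit.
steps = [(100_000, 40_000)] * 5
print(run_with_budget(steps, budget_usd=5.0))  # → (3, 4.5)
```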
New Feature: 2,576px Native Vision
Opus 4.7 processes images at up to 2,576 pixels on the long edge — more than three times the capacity of previous Claude models (which topped out around 800px effective resolution).
What this means in practice: you can feed Claude a full-resolution screenshot of a web page, a dense architectural diagram, or a photo of a whiteboard, and it can read details that were previously blurred or lost. For UI review work, this is a meaningful upgrade — Claude can now spot pixel-level issues that required downscaling before.
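If you send images through the API, nothing changes in how you attach them — a higher-resolution file simply survives intact. A sketch, using the standard base64 image content block from the Messages API (the model ID comes from this article):

```python
import base64

def image_message(png_bytes: bytes, question: str) -> dict:
    """Build a user message pairing a screenshot with a question about it."""
    return {
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": base64.b64encode(png_bytes).decode("ascii"),
                },
            },
            {"type": "text", "text": question},
        ],
    }

# A 2,560px-wide screenshot now fits under the 2,576px long-edge limit
# without client-side downscaling. (Placeholder bytes stand in for a real PNG.)
msg = image_message(b"\x89PNG...", "Which buttons are misaligned on this page?")
print(msg["content"][0]["source"]["media_type"])  # → image/png
```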
What Didn’t Ship
The AI Design Tool / Builder is not in this release. The Information reported on April 14 that Opus 4.7 would ship alongside a prompt-based design tool for websites, presentations, and product mockups. Anthropic’s actual announcement is Opus-only. The Builder product remains unshipped.
If you’re a designer who was watching for this — keep watching. The tool is still expected, just not today.
Pricing: Unchanged
| Model | Input | Output |
|---|---|---|
| Opus 4.7 | $5/M tokens | $25/M tokens |
| Opus 4.6 | $5/M tokens | $25/M tokens |
Same price, better model. The one caveat: if you use xhigh effort, you’ll consume more thinking tokens (billed at the $25/M output rate). A task that cost $0.10 on high might cost $0.18-0.20 on xhigh. Whether that’s worth it depends on how many retries you save.
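A back-of-envelope check of that caveat, using the article’s rates (thinking bills at the $25/M output rate; the specific token counts below are illustrative, not measured):

```python
def task_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a task at Opus 4.7 rates: $5/M input, $25/M output."""
    return input_tokens * 5 / 1e6 + output_tokens * 25 / 1e6

# 2,000 input tokens and 3,600 output/thinking tokens on high:
print(round(task_cost(2_000, 3_600), 2))  # → 0.1
# Roughly doubling the thinking tokens on xhigh:
print(round(task_cost(2_000, 7_200), 2))  # → 0.19
```

So one clean xhigh run (~$0.19) beats two failed high runs plus a retry (~$0.30), which is the break-even intuition from the effort section.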
Where to use it: Available now on the Claude platform (claude.ai), Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry. If you’re on Claude Pro or Max, you already have it — Opus 4.7 replaces 4.6 automatically.
What It Can’t Do (Honest Limitations)
Let’s keep this grounded:
- It’s not Mythos. Claude Mythos Preview (the cybersecurity model) remains separate and unavailable to the public. Opus 4.7 is an incremental improvement, not a leap to the next generation.
- The reasoning gains are incremental, not transformative. GPQA went up 3 points. MMMLU went up 0.4 points. This is polishing, not breakthrough.
- xhigh costs more and isn’t always better. On simple tasks, it just burns tokens without improving the answer. Use it for genuinely hard problems.
- Context rot still applies. The 1M token window hasn’t changed, and session management still matters. More capable model, same context dynamics. (See our session management guide for the techniques that keep sessions sharp.)
- No design tool today. If you were waiting for the Builder product, keep waiting.
What This Means for You
If you already use Claude (Pro or Max): You have Opus 4.7 right now — it replaced 4.6 automatically. You don’t need to change anything. But try xhigh on your next hard coding problem and see if it saves you retry cycles. And try /ultrareview on a codebase you know well — it’s the fastest way to evaluate whether multi-agent review catches things you’d miss.
If you’re a developer using Claude Code daily: The 12-point CursorBench jump and 6.8-point SWE-bench jump are real. Long-running agentic tasks should feel noticeably more reliable. Set up Task budgets if you haven’t already — it’s the responsible way to run extended auto mode without surprise bills.
If you’re choosing between Claude, ChatGPT, and Gemini: Opus 4.7 is now the best coding model on the market by measurable benchmarks. GPT-5.4 still edges it on pure reasoning (by 0.2 points on GPQA). Gemini 3.1 Pro is competitive everywhere but doesn’t lead anywhere. For coding work, the data says Claude. For everything else, it’s a close three-way race.
If you build products on the Claude API: Same pricing, better model — no migration needed. The xhigh effort level is worth benchmarking on your specific workload. And Task budgets in public beta should be on your evaluation list for production agentic systems.
If you’re new to AI tools: This is a good day to try Claude. The free tier gives you access to Sonnet (not Opus), but Pro ($20/month) gives you full Opus 4.7 access. If you’ve been on the fence, the gap between Claude and the alternatives just got wider on coding tasks.
The Bottom Line
Opus 4.7 is a solid upgrade, not a revolution. The coding improvements are the story — SWE-bench up 6.8 points, CursorBench up 12 points, and low-effort 4.7 matching medium-effort 4.6 for free. The new features (xhigh, /ultrareview, Task budgets, 2,576px vision) are practical additions that solve real problems.
The AI Design Tool didn’t ship. That’s the thing people will remember about today — not because Opus 4.7 is disappointing, but because the expectation was set for something bigger.
But on its own merits? Opus 4.7 is the best model Anthropic has released to the public, and as of today, the strongest coding AI you can actually use.
Sources:
- Anthropic: Introducing Claude Opus 4.7
- Anthropic: Claude Opus Product Page
- AWS Blog: Introducing Claude Opus 4.7 in Amazon Bedrock
- Claude API Docs: What’s New in Opus 4.7
- Claude API Docs: Effort Levels
- OfficeChai: Opus 4.7 Beats GPT-5.4 on Most Benchmarks
- BenchLM: Claude API Pricing April 2026
- Investing.com: Anthropic Launches Opus 4.7
- Yahoo Finance: Anthropic Launches Opus 4.7