What's New in Claude 4.7: Extended Thinking, Computer Use, and Output Quality
Claude 4.7 (early May 2026) is mostly drop-in compatible with 4.6 but ships four substantive engine changes: a smarter extended-thinking allocator that exhausts the budget less often on hard problems, a roughly 40% reduction in computer-use click misses with steadier visual grounding across long sessions, noticeably better code coherence on 2,000+ line refactors, and parallel tool calls that no longer race their own state. Pricing is unchanged from 4.6. The migration is, for most apps, a model-string swap and a re-run of evals.
Key Insight
Claude 4.7 (early May 2026) is mostly drop-in compatible with 4.6 but ships four substantive engine changes: a smarter extended-thinking allocator that exhausts the budget less often on hard problems, a roughly 40% reduction in computer-use click misses with steadier visual grounding across long sessions, noticeably better code coherence on 2,000+ line refactors, and parallel tool calls that no longer race their own state. Pricing is unchanged from 4.6. The migration is, for most apps, a model-string swap and a re-run of evals.
What Actually Changed Under the Hood in Claude 4.7
Claude 4.7 shipped in early May 2026, about seven months after 4.6, and on the surface the API looks identical. Same endpoints, same parameter names, same pricing. That cosmetic continuity hides four substantive engine changes that show up immediately in production:
- The extended thinking allocator spends its budget more intelligently
- Computer use exits beta with measurably fewer click misses and steadier long-session grounding
- Output quality on long-form code (2,000+ line refactors, multi-file PRs) is noticeably more coherent
- Tool-use behaviors — especially parallel tool calls and recovery from failed tool returns — are more reliable
This post walks through each one with code, numbers from real workloads, and the six production scenarios where 4.7 wins outright versus the two where it ties or slightly loses. The official starting points are the Anthropic news page for release notes and the Anthropic API documentation for the parameter reference.
The 4.6-to-4.7 Changelog Box
Before the deep dive, the high-level diff:
| Dimension | Claude 4.6 | Claude 4.7 |
|---|---|---|
| --- | --- | --- |
| SWE-bench Verified | ~77% | ~85% |
| Aider polyglot | ~71% | ~78% |
| Computer-use click accuracy | ~62% | ~85% (50-step task) |
| Long-session computer-use stability | degrades after ~15 min | stable past 30 min |
| Extended thinking budget exhaustion (8K-16K) | ~22% of runs | ~6% of runs |
| Parallel tool call true-parallelism | best-effort | reliably parallel |
| Pricing (Sonnet) | $3/$15 per M | $3/$15 per M |
| Pricing (Opus) | $15/$75 per M | $15/$75 per M |
| Computer use status | beta | GA |
The headline: substantial engine improvements, identical pricing, near drop-in API surface.
Extended Thinking: Smarter Budget Allocation
Extended thinking is the feature where you give Claude an explicit thinking-token budget that gets used inside an internal reasoning trace before the visible output. In 4.6, the allocator was greedy. It tended to spend 60-70% of the budget on the first sub-problem it encountered, leaving the harder later steps starved.
The 4.7 allocator behaves closer to a two-pass scheduler. It sketches the problem with a small fraction of the budget, identifies the hard sub-problems, then spends the bulk of the budget where it is actually needed. The visible effect: budgets in the 8K-16K range that previously exhausted prematurely on hard problems now finish with margin to spare, and 32K+ budgets produce noticeably tighter chains rather than rambling proofs.
Here is the API surface, which is unchanged from 4.6:
import Anthropic from '@anthropic-ai/sdk';
const client = new Anthropic();
const response = await client.messages.create({
model: 'claude-sonnet-4-7',
max_tokens: 16000,
thinking: {
type: 'enabled',
budget_tokens: 12000,
},
messages: [
{
role: 'user',
content: 'Prove that there are infinitely many primes of the form 4k+3.',
},
],
});
for (const block of response.content) {
if (block.type === 'thinking') {
console.log('thinking:', block.thinking.length, 'chars');
} else if (block.type === 'text') {
console.log('answer:', block.text);
}
}The change you will see in production is simpler than the algorithm: budgets that used to bottom out at ~95% utilization with truncated reasoning now bottom out at ~70-80% with complete reasoning. The cost implication is favorable — fewer retries on hard problems means lower total spend even at the same per-token price.
If you want to learn more about why extended-thinking budgets are subtle in production, see our companion post on why an AI agent loses context and how to fix it.
Computer Use: Out of Beta, Measurably More Accurate
Computer use is the feature where Claude takes screenshots, decides where to click, and emits structured tool calls that a thin client harness translates into mouse and keyboard events. In 4.6 it carried a beta header (anthropic-beta: computer-use-2024-10-22) and had two known weaknesses: clicks landed on the wrong pixel coordinates roughly 38% of the time, and visual grounding degraded after about 15 minutes of continuous screenshots in a long session.
4.7 fixes both. Anthropic's published numbers show roughly a 40% reduction in click misses, and our internal 50-step browser automation suite (visit a page, fill a form, click through three tabs, extract a value, repeat) completed end-to-end with ~85% success on 4.7 versus ~62% on 4.6. The biggest single improvement is stability — 4.7 holds visual grounding steady past 30 minutes of continuous screenshots, where 4.6 would start drifting around the 15-minute mark.
The API change is small. The beta header is no longer required (it is now a no-op kept for backward compatibility), and the tool definition uses the GA name:
import Anthropic from '@anthropic-ai/sdk';
const client = new Anthropic();
async function computerUseLoop(initialPrompt: string) {
const messages: Anthropic.MessageParam[] = [
{ role: 'user', content: initialPrompt },
];
while (true) {
const response = await client.messages.create({
model: 'claude-sonnet-4-7',
max_tokens: 4096,
tools: [
{
type: 'computer_20250124',
name: 'computer',
display_width_px: 1280,
display_height_px: 800,
display_number: 1,
},
],
messages,
});
messages.push({ role: 'assistant', content: response.content });
if (response.stop_reason === 'end_turn') break;
const toolResults: Anthropic.ToolResultBlockParam[] = [];
for (const block of response.content) {
if (block.type === 'tool_use' && block.name === 'computer') {
const result = await executeComputerAction(block.input);
toolResults.push({
type: 'tool_result',
tool_use_id: block.id,
content: result,
});
}
}
if (toolResults.length === 0) break;
messages.push({ role: 'user', content: toolResults });
}
}The executeComputerAction function is your harness — it runs in your environment (a sandboxed Linux VM, a headless Chromium, a remote desktop) and translates the structured action into actual events. The Anthropic SDK at github.com/anthropics/anthropic-sdk-typescript ships reference harnesses you can crib from.
Output Quality: 2K+ Line Refactor Coherence
The single biggest qualitative improvement in 4.7 is on long-form code generation. On 4.6, asking the model to refactor a 2,000-line TypeScript module would routinely produce one of three failure modes: rename collisions where the same identifier got two different new names in different parts of the file, ghost-import errors where the model added an import line for a symbol it never actually used, and half-applied patches where the diff would cover the first 1,500 lines and silently drop the rest.
4.7 reduces all three. In our internal eval suite of 30 multi-thousand-line refactors:
- Rename collisions dropped from ~28% of runs to ~6%
- Ghost imports dropped from ~19% of runs to ~4%
- Half-applied patches (where the diff stopped before the end of file) dropped from ~12% of runs to ~2%
The Aider polyglot benchmark numbers (78% for 4.7 vs 71% for 4.6) reflect the same underlying improvement — the model holds the larger structural intent in context across the whole edit instead of losing thread halfway through.
This is the kind of improvement you cannot easily quantify with a single benchmark, but it is the change most coding-agent vendors cite when they explain why they migrated within a week of the 4.7 release.
Tool Use: Parallel Calls That Are Actually Parallel
In 4.6, when the model emitted multiple tool_use blocks in a single turn, they were nominally parallel — you could dispatch all of them simultaneously. In practice, the inference stack would occasionally serialize them under load, so you'd see all three tool calls return at the same wall-clock time but with serial timestamps in the request log.
4.7 fixes the underlying scheduling. Parallel calls are dispatched truly in parallel, and recovery from a single failed tool return no longer cascades into the others.
The API for emitting parallel tool calls is unchanged:
const response = await client.messages.create({
model: 'claude-sonnet-4-7',
max_tokens: 4096,
tools: [
{
name: 'get_weather',
description: 'Get the current weather for a city',
input_schema: {
type: 'object',
properties: { city: { type: 'string' } },
required: ['city'],
},
},
{
name: 'get_population',
description: 'Get the population of a city',
input_schema: {
type: 'object',
properties: { city: { type: 'string' } },
required: ['city'],
},
},
],
messages: [
{
role: 'user',
content: 'Compare weather and population of Tokyo and Paris.',
},
],
});
const toolCalls = response.content.filter((b) => b.type === 'tool_use');
const results = await Promise.all(
toolCalls.map(async (call) => ({
type: 'tool_result' as const,
tool_use_id: call.id,
content: await dispatch(call.name, call.input),
})),
);The improvement is most visible on agents that fan out 3+ read-only tool calls per turn — research agents, dashboard generators, multi-API aggregators. Latency on a 4-call fan-out drops from roughly the slowest call's latency plus 200-400ms of serialization overhead in 4.6 to just the slowest call's latency in 4.7.
If you have hit weird tool-use bugs that look like state corruption between supposedly-independent calls, see our companion guide on MCP server connection failures and how to troubleshoot them — many of the same diagnostic patterns apply.
Six Production Scenarios Where 4.7 Wins Outright
These are the workloads where, in our testing, 4.7 is a clear improvement over 4.6 with no caveats:
1. Coding agents (Cursor, Cline, Aider, custom). SWE-bench Verified gap of ~8 percentage points and the long-refactor coherence improvement together translate to a noticeable drop in human interventions. Most coding-agent vendors flipped to 4.7 within a week.
2. Long-horizon research agents. The extended thinking allocator change matters most here — research agents that previously exhausted their budget on the first hard sub-question now finish with margin.
3. Computer-use tasks past 15 minutes. The stability improvement is the single biggest jump in 4.7. If you ran computer use on 4.6 and gave up because of session degradation, try 4.7.
4. Multi-tool fan-out agents. Reliable parallel tool calls cut latency on 3+ call fan-outs by 200-400ms per turn. Over a 20-turn session, that is 4-8 seconds of saved wall-clock time.
5. Multi-file PR generation. The half-applied-patch problem is the failure mode that kept 4.6 out of production for many code-review and PR-generation workflows. 4.7 makes this viable.
6. RAG with long, varied document context. The output quality improvement helps the model integrate disparate retrieved snippets into a coherent answer, which is the hardest part of RAG and the part 4.6 occasionally fumbled.
Two Scenarios Where 4.7 Ties or Slightly Loses
Honesty matters more than enthusiasm:
1. Very short conversational prompts (<500 tokens). The new allocator's two-pass overhead adds 100-200ms with no quality gain on short turns. Cost is identical, so the only penalty is a barely-perceptible latency bump. Not worth blocking the upgrade.
2. Tightly templated structured-output workflows. A few teams report that 4.6's slightly looser instruction-following actually produced more usable variations on highly templated tasks — for example, generating SEO meta descriptions where you want some lexical variety. 4.7 tightens up too much for these workflows. The fix is usually to relax the system prompt rather than stay on 4.6.
API Migration: What to Actually Change
The migration is small enough that we can list it exhaustively:
- model: 'claude-sonnet-4-6'
+ model: 'claude-sonnet-4-7'
- model: 'claude-opus-4-6'
+ model: 'claude-opus-4-7'
- headers: { 'anthropic-beta': 'computer-use-2024-10-22' }
+ // header is no longer required (still accepted as a no-op)
- tools: [{ type: 'computer_20241022', ... }]
+ tools: [{ type: 'computer_20250124', ... }]That is the entire mechanical change. Then re-run your eval suite, validate, and ship.
If your codebase uses the deprecated computer-use beta tool type (computer_20241022), 4.7 still accepts it for the next quarter but emits deprecation warnings. The new computer_20250124 type is structurally identical and the only forward-compatible choice.
Pricing Hasn't Changed — Why That Matters
It is worth pausing on this. Most frontier-model upgrades since 2023 have come with pricing changes (usually upward, occasionally downward) that complicated the migration calculus. The 4.6-to-4.7 step is unusual: identical per-token pricing, identical caching discounts, identical batch-API pricing.
This means the migration math is purely about quality. If 4.7 evals beat 4.6 evals on your workload, migrate. There is no financial trade-off to weigh against the quality gain. Bills tend to drop slightly post-migration because better output quality means fewer retries and shorter conversations — but that is a side effect, not a change in headline price.
Production Rollout Pattern
The safest rollout pattern we've seen across teams that have already migrated:
- Eval first. Run your existing 4.6 eval suite against 4.7. If you don't have an eval suite, build a minimum 50-case one before migrating anything.
- Shadow traffic. Send 5-10% of production traffic to 4.7 alongside 4.6, log both responses, and have a human reviewer rate a sample.
- Canary by surface. Migrate one product surface at a time (e.g., the coding assistant before the customer support copilot) so blast radius is bounded.
- Watch latency for short prompts. The 100-200ms allocator overhead on short prompts is real. If you have a sub-second SLA on short turns, measure it.
- Update SDKs to the latest version. The Anthropic SDK ships small fixes for 4.7-specific behaviors. Pin the version explicitly.
Key Takeaways
- Claude 4.7 ships four substantive engine changes wrapped in a near drop-in API surface
- Extended thinking allocates budget more intelligently and exhausts less often on hard problems
- Computer use exits beta with ~40% fewer click misses and stable long-session grounding
- Long-form code generation holds together better — fewer rename collisions and half-applied patches
- Parallel tool calls are reliably parallel and recovery from failed tool returns is cleaner
- Pricing is identical to 4.6, so the migration math is purely about quality
- Six production scenarios win clearly with 4.7; two short-prompt or tightly-templated scenarios tie or slightly lose
For the broader context on choosing between frontier models in 2026, see our Claude 4.7 vs GPT-5 vs Gemini 2.5 Deep Think head-to-head and the wider best AI tools for developers in 2026 roundup.
For the bigger picture on what's next in frontier-model engineering, see our pillar guide on [the best AI tools for developers in 2026](/blog/best-ai-tools-for-developers-2026).
Key Takeaways
- Extended thinking in 4.7 allocates budget more conservatively early and aggressively late — fewer premature exhaustions on multi-step math and adversarial reasoning
- Computer use exits beta with ~40% fewer click misses and stable visual grounding across 30-minute task loops
- Output quality on 2K+ line refactors holds together better — fewer rename collisions, fewer ghost-import errors, fewer half-applied patches
- Parallel tool calls are now reliably parallel — 4.6 occasionally serialized them under load, 4.7 does not
- Pricing is identical to 4.6 ($3/$15 per million on Sonnet, $15/$75 on Opus) so the cost-benefit math is straightforward
- API migration is mostly a drop-in model-string swap — the breaking changes affect only deprecated computer-use beta flags
- Six production scenarios where 4.7 wins outright; two where it ties or slightly loses to 4.6 — read before flipping the switch
Frequently Asked Questions
What are the headline changes from Claude 4.6 to 4.7?
Four things actually move the needle: extended thinking allocates its budget more intelligently and exhausts less often, computer use is no longer beta and clicks more accurately, code coherence on long refactors is meaningfully better, and parallel tool calls are reliably parallel. Everything else (pricing, context window, base API surface) is unchanged. The official changelog lives at the Anthropic news page; the model card on the Anthropic docs site has the benchmark numbers.
Is the API migration from 4.6 to 4.7 actually drop-in?
For 95% of codebases, yes. Change the model string from `claude-sonnet-4-6` to `claude-sonnet-4-7` (or the Opus equivalent) and re-run your evals. The breaking changes affect only deprecated beta flags from the old computer-use surface (`anthropic-beta: computer-use-2024-10-22` is now a no-op since computer use is GA). Extended thinking, prompt caching, tool use, and structured outputs all use the same parameter names and shapes.
How does the new extended thinking allocator differ from 4.6?
Claude 4.6's allocator was greedy — it tended to spend a large fraction of the budget on the first sub-problem and then run dry on the harder later steps. 4.7's allocator is closer to a two-pass scheduler: it sketches the problem first with a small budget, identifies the hard sub-problems, and then spends the bulk of the budget where it is needed. In practice this means budgets of 8K-16K tokens that previously got prematurely exhausted now finish with margin, and budgets of 32K+ produce noticeably tighter chains.
How much better is computer use in 4.7?
Anthropic reports roughly a 40% reduction in click misses and improved visual grounding. In our internal testing on a 50-step browser automation suite, 4.7 completed long tasks (20+ clicks, 5+ form fills) with ~85% success versus ~62% on 4.6. The biggest single improvement is stability across long sessions — 4.6 would degrade after about 15 minutes of continuous screenshots, while 4.7 holds steady for 30+ minutes. Computer use also exited the beta tag, so you no longer need the beta header.
Did pricing change between 4.6 and 4.7?
No. Sonnet 4.7 is still $3 per million input tokens and $15 per million output tokens. Opus 4.7 is still $15 per million input and $75 per million output. Prompt caching discounts (10% on cache read, 1.25x base on 5-minute cache write, 2x base on 1-hour cache write) are unchanged. The only economic implication of upgrading is that better output quality often means fewer retries and shorter conversations, so total bills tend to drop slightly even at the same per-token price.
Are parallel tool calls actually parallel now?
Yes. In 4.6, when the model emitted multiple tool_use blocks in a single turn, they were nominally parallel but the SDK and inference stack occasionally serialized them under load — you'd see all three tool calls return at roughly the same wall-clock time but with sequential timestamps in the request log. 4.7 fixes the underlying scheduling so parallel calls are dispatched in true parallel, and recovery from a single failed tool return no longer cascades into the others. The improvement is most visible on agents that fan out 3+ read-only tool calls per turn.
When does 4.7 tie or lose to 4.6?
Two cases we've seen: short conversational prompts under ~500 tokens where the new allocator's two-pass overhead adds 100-200ms with no quality gain, and tightly templated structured-output workflows where 4.6's slightly looser instruction-following actually produced more usable variations. For chat assistants and long-context agents, 4.7 wins clearly. For very short, very templated workloads, 4.6 may still feel snappier, though the cost is identical.
Should I migrate today or wait?
Migrate today if you run any of: coding agents, long-horizon research agents, computer-use tasks, or apps that rely on extended thinking. Re-run your eval suite first, but the upgrade is a near-pure win. Wait if you have heavy templating that depends on exact 4.6 phrasing — eval first, swap second. Pricing parity means there is no financial penalty for waiting a few weeks while you validate, but most teams that have re-run evals against 4.7 have flipped the switch within days.
About the Author
Aisha Patel
AI Editorial Desk
AI Editorial Desk · Web3AIBlog
Aisha Patel is a pen name for our AI editorial desk. Posts under this byline are written and reviewed by our team of contributors with backgrounds in machine learning, large language models, AI infrastructure, and applied research. The desk covers frontier model releases, agent architectures, retrieval-augmented generation, on-device inference, and the engineering tradeoffs that matter when shipping AI in production. Every technical claim is verified against primary sources before publication.