Anthropic unveils Claude 4, the "Best code model in the world"

Claude 4 Sonnet and Claude 4 opus excels in code generation and on Software Engineering tasks.

Anthropic returns to the AI ​​race for the code. The San Francisco start-up presents its new reference model this Thursday, May 22: Claude 4. The model arrives in two different versions: opus for complex tasks and sonnet for daily use. Anthropic says it: its model is the best in the world today for development tasks.

Overview: Claude 4 Model Family

The Claude 4 family includes three models:

  • Claude 4 Opus: Most powerful, highest intelligence.

  • Claude 4 Sonnet: Balanced between performance and speed.

Both Opus and Sonnet are significantly more capable than previous Claude   models.

Claude 4 opus can work independently “several hours”

Like O3 of Openai, Claude 4 Opus can use external tools (web search, code execution, MCP connector) before responding to the user. The model is designed for complex tasks, especially around development. Thanks to his reasoning, Claude 4 opus can act independently for “several hours”. It is therefore ideally designed as an agent more than a simple model.

For his part, Claude 4 Sonnet remains closer to use in mode chatbot But also excels in code and sometimes exceeds opus (especially in Software Engineering). The outperform model largely the capacities of 3.7 Sonnet, previous Sota model of Anthropic. In particular, the model manages to follow the instructions provided to it more finely and has clearer reasoning. It also excels in generation of code and generates a much clearer code than with 3.7.

Claude 4, excellent in agencies

On the benchmarks side, Claude 4 opus and Sonnet really excellent on software engineering tasks, in addition to the generation of code. SONNET is establishing new records on Swe-Bench Verified (model capacity to solve real software engineering problems) with 80.2 % against 72 % for the new OPENAI Codex-1 model or 63.2 % for Gemini 2.5 Pro.

© Anthropic

The model is also distinguished by its reasoning capacity, with 83.8% on complex reasoning tasks (GPQA Diamonds), against 66.3% for GPT-4.1 and 83% for Gemini 2.5 Pro. Finally, on the aging development part, Claude 4 opus stands out with 50%on Terminal-Bench (capacity to execute as a range of Shell commands) by significantly surpassing Gemini 2.5 Pro (25.3%) and Openai O3 (30.2%).

 

 


 Claude 4 Opus vs. Claude 4 Sonnet: Key Differences

Feature Claude 4 Opus Claude 4 Sonnet
Performance Level Flagship, highest reasoning ability Mid-tier, faster, lower cost
Use Case Complex tasks, long-context RAG, multi-step reasoning Everyday AI tasks, real-time apps
Speed Slower than Sonnet Faster response time
Cost (API) Higher Lower
Context Window 200K tokens 200K tokens
Tool Use / Vision Multimodal (with image input), high tool use capabilities Also supports tool use and vision
Benchmarks Comparable or superior to GPT-4-turbo Comparable to GPT-3.5/GPT-4
Availability Claude Pro (paid tier) Free on claude.ai and via API

 Benchmark Performance

Benchmark Claude 4 Opus Claude 4 Sonnet
MMLU (Massive Multitask Language Understanding) 86.8% ~79-81%
GPQA (Graduate-Level QA) 83.4% ~75-78%
HumanEval (Code Gen) 84.9% ~75-78%

Claude 4 Opus surpasses GPT-4 in many reasoning and math-heavy benchmarks.


 Use Case Suitability

Use Case Opus Recommended Sonnet Recommended
Scientific research and law ⚠️ (less depth)
Real-time chat assistants ⚠️ (slower)
Code generation (complex projects) ✅ (moderate)
Customer service bots ⚠️
Knowledge extraction & retrieval ✅ (faster RAG)
Long-form writing with deep logic ⚠️

 When to Use Each

  • Choose Claude 4 Opus when:

    • You need top-tier reasoning, planning, or synthesis.

    • You’re building a research assistant, code auditor, or legal analyst.

    • Speed is less important than quality and nuance.

  • Choose Claude 4 Sonnet when:

    • You need fast, affordable responses for customer interactions or content generation.

    • You’re deploying real-time applications or chatbots at scale.


🔍 Key Benchmark Comparison

Model SWE-bench (%) GPQA (%) Context Length Strengths Limitations
Claude 4 Opus 72.5 83.4 200K tokens Long-duration coding, reasoning, tool use Higher latency, premium pricing
OpenAI o3 71.7 87.7 128K tokens Chain-of-thought reasoning, math/science tasks Higher compute cost, slower responses
Gemini 2.5 Pro 63.8 ~79.7 1M tokens Large codebase handling, multimodal capabilities Lower SWE-bench score

Note: SWE-bench assesses software engineering task performance; GPQA evaluates graduate-level question answering.


🧠 Model Highlights

Claude 4 Opus

  • Performance: Achieved a leading 72.5% on SWE-bench, indicating strong coding capabilities.

  • Features: Supports extended reasoning with a 200K token context window.

  • Use Case: Excels in long-running, complex coding tasks. 

OpenAI o3

  • Performance: Scored 71.7% on SWE-bench and 87.7% on GPQA, showcasing strong reasoning abilities.

  • Features: Utilizes chain-of-thought reasoning for complex problem-solving.

  • Use Case: Ideal for tasks requiring deep reasoning and scientific understanding. 

Gemini 2.5 Pro

  • Performance: Scored 63.8% on SWE-bench, indicating solid coding performance.

  • Features: Offers a massive 1 million token context window, beneficial for large projects.

  • Use Case: Suitable for handling extensive codebases and multimodal tasks.

An unchanged pricing, always high

In terms of pricing, Claude 4 opus and Sonnet maintain relatively high prices compared to the market. OPUS is billed at $ 15 for a million tokens at entry and $ 75 output. Claude Sonnet 4 is less expensive, at 3 dollars for a million tokens at the start and $ 15 output.

However, Claude 4 remains an excellent model, especially for developers. Its ability to work continuously for several hours and its capacity in code make it a model of choice, whether for the simple generation of code or in autonomous / semi-autonomous agent mode.

Claude Code in general availability and a muscular API for agentics

Finally, Anthropic takes advantage of the announcement of Claude 4 to build its development tools. Claude Code is now accessible on general availability. The tool integrates today natively access to the depots Githublike Jules de Google or Codex of Openai. Developers can “tag” Claude Code on requests to automatically correct bugs, respond to review comments or simply modify the code.

At the same time, the anthropic API is enriched with four new capacities: an code execution tool, an MCP server connector, an access tool for local files, and the possibility of cache prompt up to an hour. The objective is clear: to give all the keys to the developers to develop agents with the SDK of Anthropic.