Compare models
Select models and benchmarks. Each score is shown in its benchmark’s own metric; scores are never combined into an aggregate, average, or winner.
Models
4 selected: GPT-5 high (OpenAI); Claude Sonnet 4.5 (Anthropic); Gemini 2.5 Pro (Google DeepMind); DeepSeek R1 (DeepSeek)
Benchmarks
2 selected: SWE-bench Verified (Coding); GPQA Diamond (Reasoning)
| Benchmark (metric) | GPT-5 high (OpenAI) | Claude Sonnet 4.5 (Anthropic) | Gemini 2.5 Pro (Google DeepMind) | DeepSeek R1 (DeepSeek) |
|---|---|---|---|---|
| SWE-bench Verified (% resolved) | 73.6% | 71.3% | — | — |
| GPQA Diamond (accuracy) | 86.2% | — | 85.3% | — |
No aggregate score is calculated. Each row uses its benchmark’s own metric. Compare rows independently.
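As a rough illustration of the rule above, the sketch below holds the scores from the table per benchmark and renders each row on its own, with no averaging or ranking across rows. The names `SCORES`, `METRICS`, `format_score`, `format_row`, and `reported_models` are hypothetical and chosen only for this example; they are not part of any tool described here. Cells with no listed score are stored as `None`, on the assumption that a blank cell means "not reported" rather than zero.

```python
from typing import Optional

# Scores copied from the table; None marks a cell with no listed score.
SCORES: dict[str, dict[str, Optional[float]]] = {
    "SWE-bench Verified": {
        "GPT-5 high": 73.6,
        "Claude Sonnet 4.5": 71.3,
        "Gemini 2.5 Pro": None,
        "DeepSeek R1": None,
    },
    "GPQA Diamond": {
        "GPT-5 high": 86.2,
        "Claude Sonnet 4.5": None,
        "Gemini 2.5 Pro": 85.3,
        "DeepSeek R1": None,
    },
}

# Each benchmark keeps its own metric label; the labels are never merged.
METRICS = {"SWE-bench Verified": "% resolved", "GPQA Diamond": "accuracy"}


def format_score(score: Optional[float]) -> str:
    """Show a missing score as 'n/a' instead of treating it as zero."""
    return "n/a" if score is None else f"{score:.1f}%"


def format_row(benchmark: str) -> str:
    """Render a single benchmark row, independent of every other row."""
    cells = [
        f"{model}: {format_score(score)}"
        for model, score in SCORES[benchmark].items()
    ]
    return f"{benchmark} ({METRICS[benchmark]}): " + ", ".join(cells)


def reported_models(benchmark: str) -> list[str]:
    """List the models that have a score on this benchmark."""
    return [m for m, s in SCORES[benchmark].items() if s is not None]


if __name__ == "__main__":
    for name in SCORES:
        print(format_row(name))
```

Keeping missing values as `None` is a deliberate choice in this sketch: it prevents a blank cell from being read as a score of 0, and there is intentionally no function that combines values across benchmarks.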