evals.report
Models (4 selected): GPT-5 high (OpenAI), Claude Sonnet 4.5 (Anthropic), Gemini 2.5 Pro (Google DeepMind), DeepSeek R1 (DeepSeek)
Benchmarks (2 selected): SWE-bench Verified (Coding), GPQA Diamond (Reasoning)
Benchmark (metric)              | GPT-5 high (OpenAI) | Claude Sonnet 4.5 (Anthropic) | Gemini 2.5 Pro (Google DeepMind) | DeepSeek R1 (DeepSeek)
SWE-bench Verified (% resolved) | 73.6%               | 71.3%                         |                                  |
GPQA Diamond (accuracy)         | 86.2%               |                               | 85.3%                            |

No aggregate score is calculated. Each row uses its benchmark’s own metric. Compare rows independently.
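
As a minimal sketch of what comparing rows independently could look like in code, the Python below ranks models within each benchmark and never combines scores across benchmarks. The names `SCORES`, `rank_row`, and `rank_all` are hypothetical and not part of evals.report; the numbers are the ones listed above, with `None` standing in for a cell that shows no score.

```python
from typing import Optional

# Scores as listed above. None marks a cell with no score shown.
# SCORES is a hypothetical name used only for this illustration.
SCORES: dict[str, dict[str, Optional[float]]] = {
    "SWE-bench Verified": {  # metric: % resolved
        "GPT-5 high": 73.6,
        "Claude Sonnet 4.5": 71.3,
        "Gemini 2.5 Pro": None,
        "DeepSeek R1": None,
    },
    "GPQA Diamond": {  # metric: accuracy
        "GPT-5 high": 86.2,
        "Claude Sonnet 4.5": None,
        "Gemini 2.5 Pro": 85.3,
        "DeepSeek R1": None,
    },
}


def rank_row(row: dict[str, Optional[float]]) -> list[tuple[str, float]]:
    """Return (model, score) pairs that have a listed score, highest first."""
    listed = [(model, score) for model, score in row.items() if score is not None]
    return sorted(listed, key=lambda pair: pair[1], reverse=True)


def rank_all(
    scores: dict[str, dict[str, Optional[float]]],
) -> dict[str, list[tuple[str, float]]]:
    """Rank each benchmark on its own metric; nothing is averaged across rows."""
    return {benchmark: rank_row(row) for benchmark, row in scores.items()}


if __name__ == "__main__":
    for benchmark, ranking in rank_all(SCORES).items():
        print(benchmark)
        for model, score in ranking:
            print(f"  {model}: {score}%")
```

Models without a listed score are left out of a row's ranking rather than treated as zero, so a missing cell does not read as a poor result.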