evals.report

Agent systems

Source-reported agent or scaffold entries where the benchmark row is not a single base model.

11 models, 11 results

Progress by benchmark

Single benchmark only
This view shows SWE-bench Verified (% resolved) only. Other benchmarks use different metrics and are not directly comparable.

Progress matrix

Columns, with each benchmark's metric in parentheses:
SWE-bench Verified (% resolved)
GPQA Diamond (accuracy)
LiveCodeBench Pro (Codeforces Elo)
Berkeley Function Calling Leaderboard (accuracy)
LiveBench (score)
Terminal-Bench 2.1 (task success)
SWE-bench Pro (% resolved)
DeepSWE (% resolved)
Humanity's Last Exam (accuracy)
MMMU-Pro (accuracy)
LMArena (source-defined rating)
ARC-AGI-3 (accuracy)
ARC-AGI-2 (accuracy)
FrontierMath (accuracy)
AIME (OTIS Mock) (accuracy)
SimpleQA Verified (accuracy)

Model                            Type   Score
Codex CLI + GPT-5.5              Agent  83.4%
Claude Code + Claude Opus 4.8    Agent  78.9%
Terminus 2 + GPT-5.5             Agent  78.2%
Terminus 2 + Claude Opus 4.8     Agent  74.6%
Terminus 2 + Gemini 3 Pro        Agent  74.4%
Gemini CLI + Gemini 3.1 Pro      Agent  70.7%
Terminus 2 + Gemini 3.1 Pro      Agent  70.3%
Claude Code + Claude Opus 4.7    Agent  69.7%
Gemini CLI + Gemini 3 Pro        Agent  66.3%
Terminus 2 + Claude Opus 4.7     Agent  66.1%
Claude Code + GLM 5.1            Agent  58.7%

Scores are not normalised across benchmarks. Each column uses its own metric. Compare columns independently.
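
Because each entry pairs a scaffold with a base model, one way to read a single column is to group rows by base model and look at how far apart the scaffolds land. The sketch below does that for the eleven scores listed in the matrix. It assumes all eleven values come from the same benchmark column, which the page does not state row by row; the function name scaffold_gaps and the tuple layout are illustrative, not part of evals.report.

```python
from collections import defaultdict

# Entries copied from the matrix: (scaffold, base model, score in %).
# Assumption: all eleven scores belong to the same benchmark column.
ENTRIES = [
    ("Codex CLI", "GPT-5.5", 83.4),
    ("Claude Code", "Claude Opus 4.8", 78.9),
    ("Terminus 2", "GPT-5.5", 78.2),
    ("Terminus 2", "Claude Opus 4.8", 74.6),
    ("Terminus 2", "Gemini 3 Pro", 74.4),
    ("Gemini CLI", "Gemini 3.1 Pro", 70.7),
    ("Terminus 2", "Gemini 3.1 Pro", 70.3),
    ("Claude Code", "Claude Opus 4.7", 69.7),
    ("Gemini CLI", "Gemini 3 Pro", 66.3),
    ("Terminus 2", "Claude Opus 4.7", 66.1),
    ("Claude Code", "GLM 5.1", 58.7),
]


def scaffold_gaps(entries):
    """Map each base model with 2+ scaffolds to (best scaffold, worst scaffold, gap in points)."""
    by_model = defaultdict(list)
    for scaffold, model, score in entries:
        by_model[model].append((score, scaffold))

    gaps = {}
    for model, runs in by_model.items():
        if len(runs) < 2:
            continue  # a single scaffold gives nothing to compare against
        best, worst = max(runs), min(runs)
        # Round to one decimal to match the precision of the reported scores.
        gaps[model] = (best[1], worst[1], round(best[0] - worst[0], 1))
    return gaps


if __name__ == "__main__":
    ranked = sorted(scaffold_gaps(ENTRIES).items(), key=lambda kv: -kv[1][2])
    for model, (best, worst, gap) in ranked:
        print(f"{model}: {best} ahead of {worst} by {gap} points")
```

The comparison stays inside one column on purpose: differences in percentage points are only meaningful between rows that share a metric, so nothing here averages or ranks across benchmarks.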