Agent systems
Source-reported agent or scaffold entries where the benchmark row is not a single base model.
11 models · 11 results
Models 11
Codex CLI + GPT-5.5
Agent · codex cli gpt-5.5
—
1 result
Claude Code + Claude Opus 4.8
Agent · claude code opus 4.8
—
1 result
Terminus 2 + GPT-5.5
Agent · terminus 2 gpt-5.5
—
1 result
Terminus 2 + Claude Opus 4.8
Agent · terminus 2 opus 4.8
—
1 result
Terminus 2 + Gemini 3 Pro
Agent · terminus 2 gemini 3 pro
—
1 result
Gemini CLI + Gemini 3.1 Pro
Agent · gemini cli gemini 3.1 pro
—
1 result
Terminus 2 + Gemini 3.1 Pro
Agent · terminus 2 gemini 3.1 pro
—
1 result
Claude Code + Claude Opus 4.7
Agent · claude code opus 4.7
—
1 result
Gemini CLI + Gemini 3 Pro
Agent · gemini cli gemini 3 pro
—
1 result
Terminus 2 + Claude Opus 4.7
Agent · terminus 2 opus 4.7
—
1 result
Claude Code + GLM 5.1
Agent · claude code glm 5.1
—
1 result
Progress by benchmark
Single benchmark only
This view shows SWE-bench Verified (% resolved) only. Other benchmarks use different metrics and are not directly comparable.
Progress matrix
| Model | SWE-bench Verified (% resolved) | GPQA Diamond (accuracy) | LiveCodeBench Pro (Codeforces Elo) | Berkeley Function Calling Leaderboard (accuracy) | LiveBench (score) | Terminal-Bench 2.1 (task success) | SWE-bench Pro (% resolved) | DeepSWE (% resolved) | Humanity's Last Exam (accuracy) | MMMU-Pro (accuracy) | LMArena (source-defined rating) | ARC-AGI-3 (accuracy) | ARC-AGI-2 (accuracy) | FrontierMath (accuracy) | AIME (OTIS Mock) (accuracy) | SimpleQA Verified (accuracy) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Codex CLI + GPT-5.5 Agent | — | — | — | — | — | 83.4% | — | — | — | — | — | — | — | — | — | — |
| Claude Code + Claude Opus 4.8 Agent | — | — | — | — | — | 78.9% | — | — | — | — | — | — | — | — | — | — |
| Terminus 2 + GPT-5.5 Agent | — | — | — | — | — | 78.2% | — | — | — | — | — | — | — | — | — | — |
| Terminus 2 + Claude Opus 4.8 Agent | — | — | — | — | — | 74.6% | — | — | — | — | — | — | — | — | — | — |
| Terminus 2 + Gemini 3 Pro Agent | — | — | — | — | — | 74.4% | — | — | — | — | — | — | — | — | — | — |
| Gemini CLI + Gemini 3.1 Pro Agent | — | — | — | — | — | 70.7% | — | — | — | — | — | — | — | — | — | — |
| Terminus 2 + Gemini 3.1 Pro Agent | — | — | — | — | — | 70.3% | — | — | — | — | — | — | — | — | — | — |
| Claude Code + Claude Opus 4.7 Agent | — | — | — | — | — | 69.7% | — | — | — | — | — | — | — | — | — | — |
| Gemini CLI + Gemini 3 Pro Agent | — | — | — | — | — | 66.3% | — | — | — | — | — | — | — | — | — | — |
| Terminus 2 + Claude Opus 4.7 Agent | — | — | — | — | — | 66.1% | — | — | — | — | — | — | — | — | — | — |
| Claude Code + GLM 5.1 Agent | — | — | — | — | — | 58.7% | — | — | — | — | — | — | — | — | — | — |
Scores are not normalised across benchmarks. Each column uses its own metric. Compare columns independently.