PostTrainBench
What is PostTrainBench?
PostTrainBench measures AI R&D automation: can a coding agent autonomously post-train (fine-tune) a base LLM to improve it? Each agent receives 4 small base models (Qwen3 1.7B, Qwen3 4B, SmolLM3-3B, Gemma 3 4B), a single H100 GPU, and a 10-hour budget to maximize each model's performance using techniques of its choosing (SFT, RL/GRPO, LoRA/QLoRA, DPO, etc.), working through a CLI agent scaffold (Claude Code, Codex CLI, Gemini CLI, or OpenCode).
The post-trained models are then evaluated with Inspect, respecting each model's generation_config.json, across 7 benchmarks (AIME 2025, Arena Hard, BFCL, GPQA Main, GSM8K, HealthBench, HumanEval). The reported score is the weighted average across all 4 base models and 7 benchmarks. For reference, the officially released instruct versions of the base models average 51.1% (trained without the 10-hour, single-GPU constraint), and the base models without any post-training score 7.5% zero-shot.
evals.report tracks reported PostTrainBench scores with the model, source, status, date, and run caveats attached: official leaderboard scores, vendor-reported launches, and clearly labeled community runs.
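The aggregation described above can be sketched in a few lines of Python. This is only an illustration: the text says the score is a weighted average over 4 base models and 7 benchmarks but does not state the weights, so the per-benchmark `weights` mapping, the `weighted_average` helper, and the placeholder scores below are all assumptions, not the benchmark's actual implementation or data.

```python
from typing import Dict, Tuple

BASE_MODELS = ["Qwen3 1.7B", "Qwen3 4B", "SmolLM3-3B", "Gemma 3 4B"]
BENCHMARKS = [
    "AIME 2025", "Arena Hard", "BFCL", "GPQA Main",
    "GSM8K", "HealthBench", "HumanEval",
]


def weighted_average(
    scores: Dict[Tuple[str, str], float],
    weights: Dict[str, float],
) -> float:
    """Weighted mean over every (base model, benchmark) cell.

    `weights` holds one weight per benchmark. The real weighting scheme
    is not given in the text, so this shape is an assumption.
    """
    total = 0.0
    weight_sum = 0.0
    for model in BASE_MODELS:
        for bench in BENCHMARKS:
            w = weights[bench]
            total += w * scores[(model, bench)]
            weight_sum += w
    if weight_sum == 0:
        raise ValueError("weights must not sum to zero")
    return total / weight_sum


# Placeholder inputs (hypothetical, not real results): uniform weights,
# and each base model scoring the same value on all 7 benchmarks.
uniform_weights = {bench: 1.0 for bench in BENCHMARKS}
placeholder_scores = {
    (model, bench): 10.0 * (i + 1)
    for i, model in enumerate(BASE_MODELS)
    for bench in BENCHMARKS
}

print(weighted_average(placeholder_scores, uniform_weights))  # 25.0
```

With uniform weights the result reduces to the plain mean of the 28 cells; any non-uniform weighting would shift the score toward the benchmarks that carry more weight.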
Top reported PostTrainBench score: Claude Opus 4.8 at 37.23% (weighted average score).
| Model | Lab | Score | Source model (scaffold) | Status | Date |
|---|---|---|---|---|---|
| Claude Opus 4.8 | Anthropic | 37.23% | Opus 4.8 (Claude Code, Max) | Official | May 28, 2026 |
| Claude Opus 4.7 | Anthropic | 28.56% | Opus 4.7 (Claude Code, High) | Official | Apr 16, 2026 |
| GPT-5.5 | OpenAI | 25.02% | GPT 5.5 (Codex CLI, xHigh) | Official | Apr 23, 2026 |
| Claude Opus 4.6 | Anthropic | 24.82% | Opus 4.6 (1M) (Claude Code) | Official | Feb 5, 2026 |
| Gemini 3.1 Pro Preview | Google DeepMind | 21.59% | Gemini 3.1 Pro (OpenCode) | Official | Feb 19, 2026 |
| GPT-5.2 | OpenAI | 21.38% | GPT-5.2 (Codex CLI) | Official | Dec 11, 2025 |
| GPT-5.4 | OpenAI | 20.23% | GPT 5.4 (Codex CLI, High) | Official | Mar 5, 2026 |
| Gemini 3 Pro | Google DeepMind | 18.12% | Gemini 3 Pro (Gemini CLI) | Official | Nov 18, 2025 |
| GPT-5.3-Codex | OpenAI | 17.76% | GPT 5.3 Codex (Codex CLI, High) | Official | Feb 5, 2026 |
| Claude Opus 4.5 | Anthropic | 17.29% | Opus 4.5 (OpenCode) | Official | Nov 24, 2025 |
| GPT-5.2-Codex | OpenAI | 17.22% | GPT 5.2 Codex (Codex CLI) | Official | Dec 18, 2025 |
| Claude Sonnet 4.6 | Anthropic | 16.42% | Sonnet 4.6 (Claude Code) | Official | Feb 17, 2026 |
| GLM-5 | Z.ai | 13.88% | GLM 5 (OpenCode) | Official | Feb 11, 2026 |
| Kimi K2.5 | Moonshot AI | 10.26% | Kimi K2.5 (OpenCode) | Official | Jan 27, 2026 |
| Claude Sonnet 4.5 | Anthropic | 9.94% | Sonnet 4.5 (Claude Code) | Official | Sep 29, 2025 |
| MiniMax M2.5 | MiniMax | 9.50% | MiniMax M2.5 (OpenCode) | Official | Feb 12, 2026 |
| MiniMax M2.1 | MiniMax | 9.33% | MiniMax M2.1 (OpenCode) | Official | Dec 23, 2025 |
| GLM-4.7 | Z.ai | 7.48% | GLM 4.7 (OpenCode) | Official | Dec 22, 2025 |
| Qwen3 Max | Alibaba / Qwen | 7.42% | Qwen3 Max (Claude Code) | Official | Sep 5, 2025 |
| Kimi K2 Thinking | Moonshot AI | 7.25% | Kimi K2 Thinking (OpenCode) | Official | Nov 6, 2025 |
Each row reports the model’s weighted average score on PostTrainBench.
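One optional way to read these scores is against the two reference points quoted earlier: 7.5% for the base models without post-training and 51.1% for the officially released instruct versions. The sketch below computes how much of that gap a given score covers. The `gap_recovered` helper and the normalization itself are illustrative assumptions; PostTrainBench does not report this figure, and the three scores are copied from the top rows of the table.

```python
BASE_SCORE = 7.5       # base models, zero-shot (from the text)
INSTRUCT_SCORE = 51.1  # official instruct versions (from the text)


def gap_recovered(
    agent_score: float,
    base: float = BASE_SCORE,
    instruct: float = INSTRUCT_SCORE,
) -> float:
    """Fraction of the base-to-instruct gap covered by an agent's score.

    0.0 means no better than the base models, 1.0 means matching the
    instruct models. This is an illustrative normalization only.
    """
    return (agent_score - base) / (instruct - base)


# Scores taken from the first three rows of the table.
top_rows = [
    ("Claude Opus 4.8", 37.23),
    ("Claude Opus 4.7", 28.56),
    ("GPT-5.5", 25.02),
]

for name, score in top_rows:
    print(f"{name}: {gap_recovered(score):.1%}")
# Claude Opus 4.8: 68.2%
# Claude Opus 4.7: 48.3%
# GPT-5.5: 40.2%
```

Rows that score below 7.5% (the last three in the table) produce a negative value under this normalization, meaning the agent's post-trained models averaged slightly below the base-model reference.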