evals.report

PostTrainBench


[Chart: agents ranked by weighted average score; higher is better]

What is PostTrainBench?

PostTrainBench measures AI R&D automation: can a coding agent autonomously post-train (fine-tune) a base LLM to improve it? Each agent gets 4 small base models (Qwen3 1.7B, Qwen3 4B, SmolLM3-3B, Gemma 3 4B), a single H100 GPU, and a 10-hour budget to maximize each model's performance using techniques of its choosing (SFT, RL/GRPO, LoRA/QLoRA, DPO, etc.) via its native CLI scaffold (Claude Code, Codex CLI, Gemini CLI, or OpenCode). The post-trained models are then evaluated with Inspect, respecting each model's generation_config.json, across 7 benchmarks (AIME 2025, Arena Hard, BFCL, GPQA Main, GSM8K, HealthBench, HumanEval). The reported score is the weighted average across all 4 base models and 7 benchmarks. For reference, the officially released instruct versions of the base models average 51.1% (without the 10-hour, single-GPU constraint), and the base models without any post-training score 7.5% zero-shot.

evals.report tracks reported PostTrainBench scores with the model, source, status, date, and run caveats attached: official leaderboard scores, vendor-reported launches, and clearly labeled community runs.
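The description says the reported score is a weighted average over 4 base models and 7 benchmarks, but it does not give the weights. The Python sketch below shows one plausible way such an aggregate could be computed, purely as an illustration. It assumes per-benchmark weights and an unweighted mean over base models. The function name `weighted_average`, the weight values, and the score values are hypothetical placeholders, not the benchmark's actual implementation or data.

```python
from typing import Dict, Tuple

# Base models and benchmarks as named in the benchmark description.
BASE_MODELS = ["Qwen3 1.7B", "Qwen3 4B", "SmolLM3-3B", "Gemma 3 4B"]
BENCHMARKS = [
    "AIME 2025", "Arena Hard", "BFCL", "GPQA Main",
    "GSM8K", "HealthBench", "HumanEval",
]


def weighted_average(
    scores: Dict[Tuple[str, str], float],
    benchmark_weights: Dict[str, float],
) -> float:
    """Aggregate per-(base model, benchmark) scores into one number.

    ASSUMPTION: weights are per benchmark and base models count equally.
    The real PostTrainBench weighting is not stated in the source text.
    Scores are percentages, e.g. 37.2 for 37.2%.
    """
    total_weight = sum(benchmark_weights[b] for b in BENCHMARKS)
    if total_weight <= 0:
        raise ValueError("benchmark weights must sum to a positive value")

    per_model = []
    for model in BASE_MODELS:
        # Weighted mean over the 7 benchmarks for this base model.
        weighted_sum = sum(
            benchmark_weights[b] * scores[(model, b)] for b in BENCHMARKS
        )
        per_model.append(weighted_sum / total_weight)

    # Unweighted mean over the 4 base models (assumption).
    return sum(per_model) / len(per_model)


if __name__ == "__main__":
    # Placeholder inputs: uniform weights and a flat 10% score everywhere.
    uniform = {b: 1.0 for b in BENCHMARKS}
    flat = {(m, b): 10.0 for m in BASE_MODELS for b in BENCHMARKS}
    print(weighted_average(flat, uniform))
```

With uniform weights this reduces to a plain mean over all 28 model-benchmark pairs. Non-uniform weights shift the aggregate toward whichever benchmarks carry more weight, so scores are only comparable when computed with the same weighting.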

Top reported PostTrainBench score: Claude Opus 4.8, 37.23% (weighted average score).

| Model | Lab | Score | Source model | Status | Date |
|---|---|---|---|---|---|
| Claude Opus 4.8 | Anthropic | 37.23% | Opus 4.8 (Claude Code, Max) | Official | May 28, 2026 |
| Claude Opus 4.7 | Anthropic | 28.56% | Opus 4.7 (Claude Code, High) | Official | Apr 16, 2026 |
| GPT-5.5 | OpenAI | 25.02% | GPT 5.5 (Codex CLI, xHigh) | Official | Apr 23, 2026 |
| Claude Opus 4.6 | Anthropic | 24.82% | Opus 4.6 (1M) (Claude Code) | Official | Feb 5, 2026 |
| Gemini 3.1 Pro Preview | Google DeepMind | 21.59% | Gemini 3.1 Pro (OpenCode) | Official | Feb 19, 2026 |
| GPT-5.2 | OpenAI | 21.38% | GPT-5.2 (Codex CLI) | Official | Dec 11, 2025 |
| GPT-5.4 | OpenAI | 20.23% | GPT 5.4 (Codex CLI, High) | Official | Mar 5, 2026 |
| Gemini 3 Pro | Google DeepMind | 18.12% | Gemini 3 Pro (Gemini CLI) | Official | Nov 18, 2025 |
| GPT-5.3-Codex | OpenAI | 17.76% | GPT 5.3 Codex (Codex CLI, High) | Official | Feb 5, 2026 |
| Claude Opus 4.5 | Anthropic | 17.29% | Opus 4.5 (OpenCode) | Official | Nov 24, 2025 |
| GPT-5.2-Codex | OpenAI | 17.22% | GPT 5.2 Codex (Codex CLI) | Official | Dec 18, 2025 |
| Claude Sonnet 4.6 | Anthropic | 16.42% | Sonnet 4.6 (Claude Code) | Official | Feb 17, 2026 |
| GLM-5 | Z.ai | 13.88% | GLM 5 (OpenCode) | Official | Feb 11, 2026 |
| Kimi K2.5 | Moonshot AI | 10.26% | Kimi K2.5 (OpenCode) | Official | Jan 27, 2026 |
| Claude Sonnet 4.5 | Anthropic | 9.94% | Sonnet 4.5 (Claude Code) | Official | Sep 29, 2025 |
| MiniMax M2.5 | MiniMax | 9.50% | MiniMax M2.5 (OpenCode) | Official | Feb 12, 2026 |
| MiniMax M2.1 | MiniMax | 9.33% | MiniMax M2.1 (OpenCode) | Official | Dec 23, 2025 |
| GLM-4.7 | Z.ai | 7.48% | GLM 4.7 (OpenCode) | Official | Dec 22, 2025 |
| Qwen3 Max | Alibaba / Qwen | 7.42% | Qwen3 Max (Claude Code) | Official | Sep 5, 2025 |
| Kimi K2 Thinking | Moonshot AI | 7.25% | Kimi K2 Thinking (OpenCode) | Official | Nov 6, 2025 |

Each row reports the model’s weighted average score on PostTrainBench.