PostTrainBench
PostTrainBench measures AI R&D automation: can a coding agent autonomously post-train (fine-tune) a base LLM to improve it? Each agent gets 4 small base models (Qwen3 1.7B, Qwen3 4B, SmolLM3-3B, Gemma 3 4B), a single H100 GPU, and a 10-hour budget to maximize each model's performance using techniques of its choosing (SFT, RL/GRPO, LoRA/QLoRA, DPO, etc.), working through its native CLI scaffold (Claude Code, Codex CLI, Gemini CLI, OpenCode). The post-trained models are then evaluated with Inspect, respecting each model's generation_config.json, across 7 benchmarks (AIME 2025, Arena Hard, BFCL, GPQA Main, GSM8K, HealthBench, HumanEval). The reported score is the weighted average across all 4 base models and 7 benchmarks. For reference, the officially released instruct versions of the base models average 51.1% (without the 10h/1-GPU constraint), and the un-post-trained base models score 7.5% zero-shot.
What this benchmark measures
Rows on this page are sourced from public benchmark artifacts, leaderboard exports, or source-linked model reports. Each row keeps benchmark version, source model name, and available run details attached to the score.
The metric shown here is weighted average score. It should be interpreted within PostTrainBench, not compared as part of a site-wide ranking.
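As a rough illustration of how a weighted average over a grid of (base model, benchmark) scores can be computed, here is a minimal Python sketch. The per-benchmark weights, the model and benchmark names, and all numbers below are hypothetical placeholders; this page does not state the weights PostTrainBench actually uses.

```python
from typing import Dict, Tuple


def weighted_average(
    scores: Dict[Tuple[str, str], float],
    weights: Dict[str, float],
) -> float:
    """Weighted mean of per-(base model, benchmark) scores, in percent.

    `weights` maps a benchmark name to its weight. The values used in the
    example below are assumptions for illustration only.
    """
    total = sum(weights[bench] * score for (_, bench), score in scores.items())
    norm = sum(weights[bench] for (_, bench) in scores)
    return total / norm


# Hypothetical 2 x 2 grid (the real benchmark uses 4 base models x 7 benchmarks).
example_scores = {
    ("model_a", "bench_x"): 40.0,
    ("model_a", "bench_y"): 10.0,
    ("model_b", "bench_x"): 60.0,
    ("model_b", "bench_y"): 30.0,
}
example_weights = {"bench_x": 1.0, "bench_y": 3.0}  # assumed, not official

result = weighted_average(example_scores, example_weights)
print(result)  # 27.5
```

With uniform weights the same function reduces to a plain mean over all grid cells (35.0 for the numbers above), which is why the choice of weights matters when reading the score.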
What to be careful about
PostTrainBench is a research benchmark from ELLIS Institute Tübingen, MPI for Intelligent Systems, the University of Tübingen, and Thoughtful Lab (arXiv:2603.08640). Scores are the weighted average across 4 base models × 7 benchmarks; rows here take each model's best-scoring autonomous run. Reprompted runs, in which an operator manually prompts the agent to continue when it stops early, are not fully autonomous; they are excluded from the headline figure and noted per row. The authors documented contamination and reward-hacking flags across several agents (loading eval sets as training data, embedding eval-format questions as synthetic data). Reproducibility depends on proprietary model APIs and exact scaffold versions. The official-instruct and base-model baselines on the leaderboard are reference points, not agent submissions, and are not included here.
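The row-selection rule described above (take each model's best-scoring autonomous run and leave reprompted runs out of the headline figure) can be sketched in Python as follows. The `Run` record, its field names, and the sample data are hypothetical; they are not the leaderboard's actual export schema.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Run:
    model: str        # agent model name (hypothetical field)
    score: float      # weighted average score, in percent
    reprompted: bool  # True if an operator prompted the agent to continue


def best_autonomous_runs(runs: List[Run]) -> Dict[str, Run]:
    """Return the highest-scoring non-reprompted run for each model."""
    best: Dict[str, Run] = {}
    for run in runs:
        if run.reprompted:
            continue  # reprompted runs are not fully autonomous
        current = best.get(run.model)
        if current is None or run.score > current.score:
            best[run.model] = run
    return best


# Made-up runs for illustration only.
sample_runs = [
    Run("agent_a", 20.0, reprompted=False),
    Run("agent_a", 25.0, reprompted=True),   # higher, but excluded
    Run("agent_a", 22.5, reprompted=False),
    Run("agent_b", 30.0, reprompted=True),   # only run is reprompted
]

headline = best_autonomous_runs(sample_runs)
print(headline["agent_a"].score)  # 22.5
print("agent_b" in headline)      # False
```

Under this rule, a model whose only runs were reprompted has no headline row at all, so a higher reprompted score never displaces an autonomous one.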
Frequently asked
What is PostTrainBench?
PostTrainBench is an agents benchmark that measures AI R&D automation: whether a coding agent can autonomously post-train (fine-tune) a small base LLM on a single H100 GPU within a 10-hour budget. Results are reported as a weighted average score across 4 base models and 7 benchmarks.
What does weighted average score mean on PostTrainBench?
PostTrainBench reports weighted average score (%); higher is better. Scores are shown only within PostTrainBench and are never averaged with other benchmarks.
What is the top reported PostTrainBench score?
Claude Opus 4.8 has the top reported score on PostTrainBench: 37.23% (weighted average score).
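One way to read this figure is against the two reference points quoted in the benchmark description: 7.5% for the un-post-trained base models and 51.1% for the officially released instruct versions. The Python sketch below computes how much of that gap the top score covers. This normalization is an illustrative calculation, not a metric PostTrainBench reports, and the instruct baseline was not produced under the 10h/1-GPU constraint, so the comparison is only indicative.

```python
def gap_recovered(score: float, base: float, reference: float) -> float:
    """Fraction of the base-to-reference gap covered by `score`."""
    return (score - base) / (reference - base)


BASE_ZERO_SHOT = 7.5     # un-post-trained base models, percent
OFFICIAL_INSTRUCT = 51.1  # officially released instruct models, percent
TOP_REPORTED = 37.23      # top reported score on this page, percent

fraction = gap_recovered(TOP_REPORTED, BASE_ZERO_SHOT, OFFICIAL_INSTRUCT)
print(f"{fraction:.1%}")  # 68.2%
```

In other words, (37.23 - 7.5) / (51.1 - 7.5) = 29.73 / 43.6, or roughly two thirds of the distance between the base models and the official instruct releases.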
Why do PostTrainBench scores differ across runs?
Harness, scaffold, reasoning effort, and prompt setup change results, so two runs of the same model can differ. evals.report keeps each score with its run context so the differences stay visible.
Does evals.report rank models across benchmarks?
No. PostTrainBench scores are shown within their own metric; evals.report never combines benchmarks into a composite ranking or a single "best model".