evals.report

PostTrainBench

PostTrainBench measures AI R&D automation: can a coding agent autonomously post-train (fine-tune) a base LLM to improve it? Each agent gets 4 small base models (Qwen3 1.7B, Qwen3 4B, SmolLM3-3B, Gemma 3 4B), a single H100 GPU, and a 10-hour budget to maximize each model's performance. The agent may use any techniques it chooses (SFT, RL/GRPO, LoRA/QLoRA, DPO, etc.) and works through its native CLI scaffold (Claude Code, Codex CLI, Gemini CLI, OpenCode).

The post-trained models are then evaluated with Inspect, respecting each model's generation_config.json, across 7 benchmarks (AIME 2025, Arena Hard, BFCL, GPQA Main, GSM8K, HealthBench, HumanEval). The reported score is the weighted average across all 4 base models and 7 benchmarks.

For reference, the officially released instruct versions of the base models average 51.1% (without the 10h/1-GPU constraint), and the un-post-trained base models score 7.5% zero-shot.
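The aggregation into a single reported score can be sketched roughly as below. This is only an illustration: the page does not state the actual benchmark weights, so the `benchmark_weights` argument and its equal-weight default are assumptions, and the function name `aggregate_score` and the nested-dictionary layout of `scores` are hypothetical choices made for the example.

```python
from typing import Dict, Optional

# Base models and benchmarks as listed in the description above.
MODELS = ["Qwen3 1.7B", "Qwen3 4B", "SmolLM3-3B", "Gemma 3 4B"]
BENCHMARKS = [
    "AIME 2025", "Arena Hard", "BFCL", "GPQA Main",
    "GSM8K", "HealthBench", "HumanEval",
]


def aggregate_score(
    scores: Dict[str, Dict[str, float]],
    benchmark_weights: Optional[Dict[str, float]] = None,
) -> float:
    """Weighted average over benchmarks, then plain mean over base models.

    scores[model][benchmark] is a fraction in [0, 1].
    ASSUMPTION: the real weights are not given in the source; equal
    weights are used here unless the caller supplies others.
    """
    if benchmark_weights is None:
        benchmark_weights = {b: 1.0 for b in BENCHMARKS}

    total_weight = sum(benchmark_weights[b] for b in BENCHMARKS)

    per_model = []
    for model in MODELS:
        weighted_sum = sum(
            benchmark_weights[b] * scores[model][b] for b in BENCHMARKS
        )
        per_model.append(weighted_sum / total_weight)

    # Every base model contributes equally to the final number.
    return sum(per_model) / len(per_model)


# Usage with made-up scores: every model scores 0.5 on every benchmark.
example_scores = {m: {b: 0.5 for b in BENCHMARKS} for m in MODELS}
example_result = aggregate_score(example_scores)
```

Averaging per model first and then across models keeps each base model's contribution equal, whatever benchmark weights are chosen; whether the benchmark itself aggregates in this order is not stated on the page.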

Chart: agents ranked by weighted average score (higher is better).

No run guide for this benchmark yet.