ProgramBench
A cleanroom software-reconstruction benchmark (Meta Superintelligence Labs, Stanford, Harvard) of 200 heterogeneous tasks built from real tools like jq, ripgrep, SQLite, and FFmpeg. Given only a reference executable and its documentation — no source, no decompiling, no internet — the agent must choose a language, design the architecture, and rebuild the program, graded by ~248,000 agent-fuzzed behavioral tests (stdout, stderr, exit codes, file outputs). A task is 'resolved' only if every test passes; fully-resolved is ≤0.5% for all frontier models, so the leaderboard's effective ranking is the almost-resolved rate (tasks nearly reconstructed). Evaluated with the mini-SWE-agent harness.
What is ProgramBench?
A cleanroom software-reconstruction benchmark (Meta Superintelligence Labs, Stanford, Harvard) of 200 heterogeneous tasks built from real tools like jq, ripgrep, SQLite, and FFmpeg. Given only a reference executable and its documentation — no source, no decompiling, no internet — the agent must choose a language, design the architecture, and rebuild the program, graded by ~248,000 agent-fuzzed behavioral tests (stdout, stderr, exit codes, file outputs). A task is 'resolved' only if every test passes; fully-resolved is ≤0.5% for all frontier models, so the leaderboard's effective ranking is the almost-resolved rate (tasks nearly reconstructed). Evaluated with the mini-SWE-agent harness. evals.report tracks reported ProgramBench scores with the model, source, status, date, and run caveats attached — official leaderboard scores, vendor-reported launches, and clearly labeled community runs.
Top reported ProgramBench score: GPT-5.5 — 13.5% (almost-resolved rate).
| Model | Lab | Score↓ | Source model | Status | Date | |
|---|---|---|---|---|---|---|
| GPT-5.5 | OpenAI | 13.5% | GPT-5.5 (xhigh) | Official | Apr 23, 2026 | Details |
| Claude Opus 4.7 | Anthropic | 4.5% | Claude Opus 4.7 (xhigh) | Official | Apr 16, 2026 | Details |
| Claude Opus 4.6 | Anthropic | 2.5% | Claude Opus 4.6 | Official | Feb 5, 2026 | Details |
| Claude Sonnet 4.6 | Anthropic | 1.0% | Claude Sonnet 4.6 | Official | Feb 17, 2026 | Details |
| GPT-5.4 | OpenAI | 0.0% | GPT-5.4 | Official | Mar 5, 2026 | Details |
| Gemini 3.1 Pro Preview | Google DeepMind | 0.0% | Gemini 3.1 Pro | Official | Feb 19, 2026 | Details |
| Gemini 3 Flash | Google DeepMind | 0.0% | Gemini 3 Flash | Official | Dec 17, 2025 | Details |
Each row reports the model’s almost-resolved rate on ProgramBench. Click a row for the full run context.