evals.report

ProgramBench

Coding · almost-resolved rate · Higher is better

What is ProgramBench?

ProgramBench is a cleanroom software-reconstruction benchmark from Meta Superintelligence Labs, Stanford, and Harvard. It comprises 200 heterogeneous tasks built from real tools such as jq, ripgrep, SQLite, and FFmpeg. The agent receives only a reference executable and its documentation, with no source code, no decompiling, and no internet access, and must choose a language, design the architecture, and rebuild the program. Submissions are graded by roughly 248,000 agent-fuzzed behavioral tests covering stdout, stderr, exit codes, and file outputs. A task counts as "resolved" only if every test passes. Because the fully resolved rate is ≤0.5% for all frontier models, the leaderboard effectively ranks by the almost-resolved rate, the share of tasks that are nearly reconstructed. Runs are evaluated with the mini-SWE-agent harness.

evals.report tracks reported ProgramBench scores with the model, source, status, date, and run caveats attached, covering official leaderboard scores, vendor-reported launches, and clearly labeled community runs.
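The grading rule described above can be illustrated with a minimal Python sketch. It is not the benchmark's actual harness: the names `observe`, `behaviour_matches`, `task_status`, and `almost_resolved_rate` are hypothetical, the 0.95 threshold for "almost-resolved" is an assumed placeholder (the text above only says "nearly reconstructed"), and counting fully resolved tasks toward the almost-resolved rate is likewise an assumption. File-output comparison is omitted for brevity.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """Observable behaviour of one program invocation."""
    stdout: bytes
    stderr: bytes
    exit_code: int


def observe(cmd: list[str], stdin: bytes = b"", timeout: float = 10.0) -> Observation:
    """Run a command and capture its stdout, stderr, and exit code."""
    proc = subprocess.run(cmd, input=stdin, capture_output=True, timeout=timeout)
    return Observation(proc.stdout, proc.stderr, proc.returncode)


def behaviour_matches(reference: list[str], candidate: list[str],
                      args: list[str], stdin: bytes = b"") -> bool:
    """One behavioral test: both programs must agree on all observed channels."""
    return observe(reference + args, stdin) == observe(candidate + args, stdin)


def task_status(results: list[bool], almost_threshold: float = 0.95) -> str:
    """Classify a task from its per-test results.

    "resolved" requires every test to pass, as stated in the text.
    ASSUMPTION: almost_threshold is a placeholder, not the official cutoff.
    """
    if not results:
        raise ValueError("a task needs at least one test result")
    if all(results):
        return "resolved"
    if sum(results) / len(results) >= almost_threshold:
        return "almost-resolved"
    return "unresolved"


def almost_resolved_rate(statuses: list[str]) -> float:
    """Share of tasks at or above the threshold.

    ASSUMPTION: fully resolved tasks also count toward this rate.
    """
    if not statuses:
        raise ValueError("need at least one task")
    hits = sum(s in ("resolved", "almost-resolved") for s in statuses)
    return hits / len(statuses)


if __name__ == "__main__":
    # Hypothetical stand-ins: Python one-liners play the reference and the rebuild.
    reference = [sys.executable, "-c", "import sys; print(sys.argv[1].upper())"]
    candidate = [sys.executable, "-c", "import sys; print(sys.argv[1].upper())"]
    results = [behaviour_matches(reference, candidate, [word]) for word in ("jq", "ripgrep")]
    print(task_status(results))
```

The all-or-nothing check in `task_status` is what makes the fully resolved rate so strict: a single mismatched exit code or stray stderr line among thousands of tests is enough to keep a task from counting as resolved.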

Top reported ProgramBench score: GPT-5.5 13.5% (almost-resolved rate).

| Model | Lab | Score | Source model | Status | Date |
| --- | --- | --- | --- | --- | --- |
| GPT-5.5 | OpenAI | 13.5% | GPT-5.5 (xhigh) | Official | Apr 23, 2026 |
| Claude Opus 4.7 | Anthropic | 4.5% | Claude Opus 4.7 (xhigh) | Official | Apr 16, 2026 |
| Claude Opus 4.6 | Anthropic | 2.5% | Claude Opus 4.6 | Official | Feb 5, 2026 |
| Claude Sonnet 4.6 | Anthropic | 1.0% | Claude Sonnet 4.6 | Official | Feb 17, 2026 |
| GPT-5.4 | OpenAI | 0.0% | GPT-5.4 | Official | Mar 5, 2026 |
| Gemini 3.1 Pro Preview | Google DeepMind | 0.0% | Gemini 3.1 Pro | Official | Feb 19, 2026 |
| Gemini 3 Flash | Google DeepMind | 0.0% | Gemini 3 Flash | Official | Dec 17, 2025 |

Each row reports the model’s almost-resolved rate on ProgramBench.