evals.report

ProgramBench

A cleanroom software-reconstruction benchmark (Meta Superintelligence Labs, Stanford, Harvard) of 200 heterogeneous tasks built from real tools such as jq, ripgrep, SQLite, and FFmpeg. The agent receives only a reference executable and its documentation (no source, no decompiling, no internet) and must choose a language, design the architecture, and rebuild the program. Submissions are graded against ~248,000 agent-fuzzed behavioral tests that check stdout, stderr, exit codes, and file outputs. A task is 'resolved' only if every test passes. Because the fully resolved rate is ≤0.5% for all frontier models, the leaderboard effectively ranks by the almost-resolved rate (tasks nearly reconstructed). Evaluated with the mini-SWE-agent harness.
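The grading rule described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the benchmark's actual grader: the names `Observation`, `outputs_match`, `pass_fraction`, and `summarize` are hypothetical, and the `almost_threshold` default of 0.95 is an assumed placeholder, since the description does not say how "nearly reconstructed" is defined. Counting fully resolved tasks toward the almost-resolved rate is likewise an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Observation:
    """What one behavioral test records about a single program run."""
    stdout: bytes
    stderr: bytes
    exit_code: int
    files: dict = field(default_factory=dict)  # output path -> file contents


def outputs_match(reference: Observation, candidate: Observation) -> bool:
    """A test passes only if all four observed channels are identical."""
    return reference == candidate


def pass_fraction(results: list) -> float:
    """Fraction of one task's tests that passed (results is a list of bools)."""
    if not results:
        raise ValueError("a task needs at least one test")
    return sum(results) / len(results)


def summarize(tasks: list, almost_threshold: float = 0.95) -> dict:
    """Aggregate per-task test results into the two leaderboard-style rates.

    almost_threshold is an ASSUMED cutoff, not a value from the benchmark.
    """
    resolved = sum(1 for results in tasks if all(results))
    almost = sum(1 for results in tasks
                 if pass_fraction(results) >= almost_threshold)
    return {
        "resolved_rate": resolved / len(tasks),
        "almost_resolved_rate": almost / len(tasks),
    }
```

Calling `summarize` with one list of pass/fail booleans per task returns both rates, which makes it easy to see why a strict every-test-passes rule yields a near-zero resolved rate while a thresholded rate still separates models.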

Coding · almost-resolved rate · Higher is better
