evals.report

ProgramBench

A cleanroom software-reconstruction benchmark from Meta Superintelligence Labs, Stanford, and Harvard, made up of 200 heterogeneous tasks built from real tools such as jq, ripgrep, SQLite, and FFmpeg. The agent receives only a reference executable and its documentation (no source, no decompiling, no internet) and must choose a language, design the architecture, and rebuild the program. Submissions are graded by roughly 248,000 agent-fuzzed behavioral tests covering stdout, stderr, exit codes, and file outputs. A task is 'resolved' only if every test passes. The fully resolved rate is at most 0.5% for all frontier models, so the leaderboard's effective ranking is the almost-resolved rate (tasks nearly reconstructed). Evaluated with the mini-SWE-agent harness.

Coding · almost-resolved rate · Higher is better

What this benchmark measures

ProgramBench measures whether an agent can rebuild a working program from its observable behavior alone. Each of the 200 tasks supplies only a reference executable and its documentation, with no source, no decompiling, and no internet access. The agent chooses the language and the architecture, and its reconstruction is graded by agent-fuzzed behavioral tests that compare stdout, stderr, exit codes, and file outputs against the reference. A task counts as 'resolved' only if every test passes; because at most 0.5% of tasks are fully resolved by any frontier model, the almost-resolved rate (tasks nearly reconstructed) is what separates models in practice.
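The behavioral grading described above can be illustrated with a minimal sketch. This is not the ProgramBench harness: the names `Observed`, `observe`, `behaviour_matches`, and `task_resolved` are hypothetical, the 10-second timeout is an arbitrary choice, and the comparison of file outputs is left out for brevity. It only shows the general idea of running a reference and a candidate on the same input and requiring every comparison to match.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class Observed:
    """Behavior captured from one invocation (hypothetical record type)."""
    stdout: bytes
    stderr: bytes
    exit_code: int


def observe(argv, stdin=b"", timeout=10.0):
    """Run a command and capture stdout, stderr, and the exit code."""
    proc = subprocess.run(argv, input=stdin, capture_output=True, timeout=timeout)
    return Observed(proc.stdout, proc.stderr, proc.returncode)


def behaviour_matches(reference_argv, candidate_argv, stdin=b""):
    """True when the candidate reproduces the reference on this one input."""
    return observe(reference_argv, stdin) == observe(candidate_argv, stdin)


def task_resolved(test_results):
    """A task is resolved only if every test passed (and at least one ran)."""
    results = list(test_results)
    return bool(results) and all(results)


if __name__ == "__main__":
    # Stand-ins for a reference executable and a reconstruction: two tiny
    # Python one-liners that differ only in their exit code.
    reference = [sys.executable, "-c", "import sys; print('ok'); sys.exit(2)"]
    candidate = [sys.executable, "-c", "import sys; print('ok'); sys.exit(0)"]
    print("matches:", behaviour_matches(reference, candidate))
```

Comparing the whole `Observed` record at once mirrors the all-or-nothing rule: a candidate that prints the right text but returns the wrong exit code still fails that test.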

Rows on this page are sourced from public benchmark artifacts, leaderboard exports, or source-linked model reports. Each row keeps benchmark version, source model name, and available run details attached to the score.

The metric shown here is the almost-resolved rate. It should be interpreted within ProgramBench, not compared as part of a site-wide ranking.

What to be careful about

Scores are from the official ProgramBench leaderboard (mini-SWE-agent harness). Fully-resolved is ≤0.5% for every frontier model — the benchmark is effectively unsolved — so rows are ranked by the almost-resolved rate the leaderboard uses as its tiebreaker. GLM-5.2 and Opus 4.8 are not on the official leaderboard; vendor self-reports under different harnesses aren't comparable and are excluded.

No composite ranking
evals.report never combines benchmarks. The almost-resolved rate on ProgramBench is its own number; don't average it with other metrics.

Frequently asked

What is ProgramBench?

ProgramBench is a cleanroom software-reconstruction benchmark from Meta Superintelligence Labs, Stanford, and Harvard. Its 200 tasks are built from real tools such as jq, ripgrep, SQLite, and FFmpeg. Given only a reference executable and its documentation, with no source, no decompiling, and no internet, the agent must rebuild the program, which is then graded by roughly 248,000 agent-fuzzed behavioral tests under the mini-SWE-agent harness. It is a coding benchmark measured by almost-resolved rate.

What does almost-resolved rate mean on ProgramBench?

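ProgramBench reports the almost-resolved rate (%): the share of tasks that were nearly reconstructed, as opposed to fully resolved tasks, where every test must pass. Higher is better. Scores are shown only within ProgramBench and are never averaged with other benchmarks.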
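To make the difference between the two rates concrete, here is a small sketch of how they could be computed from per-task test counts. The page does not state the cutoff for "almost resolved", so the `almost_threshold` default of 0.95 is a placeholder assumption, as is the choice to count fully resolved tasks as almost resolved too. The function name `benchmark_rates` and the sample counts are made up for illustration.

```python
def benchmark_rates(tasks, almost_threshold=0.95):
    """Return (resolved %, almost-resolved %) for a list of (passed, total) pairs.

    almost_threshold is an assumed cutoff, not a documented ProgramBench value.
    Fully resolved tasks are counted as almost resolved as well (an assumption).
    """
    if not tasks:
        raise ValueError("at least one task is required")

    resolved = 0
    almost = 0
    for passed, total in tasks:
        if total <= 0 or not 0 <= passed <= total:
            raise ValueError(f"invalid counts: passed={passed}, total={total}")
        if passed == total:
            resolved += 1  # every test passed
        if passed / total >= almost_threshold:
            almost += 1  # nearly (or fully) reconstructed

    n = len(tasks)
    return 100.0 * resolved / n, 100.0 * almost / n


# Illustrative counts only, not real leaderboard data.
sample = [(100, 100), (96, 100), (50, 100), (0, 100)]
resolved_pct, almost_pct = benchmark_rates(sample)
print(f"resolved: {resolved_pct:.1f}%  almost-resolved: {almost_pct:.1f}%")
```

With strict all-tests-pass scoring, a task at 96 of 100 tests contributes nothing to the resolved rate, which is why a softer metric is needed to tell models apart when almost no task is fully resolved.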

What is the top reported ProgramBench score?

GPT-5.5 has the top reported score on ProgramBench: 13.5% (almost-resolved rate).

Why do ProgramBench scores differ across runs?

Harness, scaffold, reasoning effort, and prompt setup change results, so two runs of the same model can differ. evals.report keeps each score with its run context so the differences stay visible.
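One way to picture "keeping each score with its run context" is a record type that carries the context alongside the number, so only rows with identical setups are compared directly. The sketch below is an assumption about how such data might be modeled, not evals.report's actual schema: `RunContext`, `ScoreRow`, and `comparable_groups` are hypothetical names, and the rows use placeholder models and values.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RunContext:
    """Settings that can change a score for the same model (hypothetical)."""
    harness: str
    scaffold: str
    reasoning_effort: str
    prompt_setup: str


@dataclass(frozen=True)
class ScoreRow:
    """One reported score with its context attached (hypothetical)."""
    model: str
    benchmark_version: str
    almost_resolved_rate: float
    context: RunContext


def comparable_groups(rows):
    """Group rows so that only runs with the same version and context share a group."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row.benchmark_version, row.context)].append(row)
    return dict(groups)


# Placeholder data: model names and scores are invented for illustration.
official = RunContext("mini-SWE-agent", "default", "high", "default")
custom = RunContext("vendor-harness", "custom", "high", "custom")
rows = [
    ScoreRow("model-a", "v1", 10.0, official),
    ScoreRow("model-b", "v1", 8.0, official),
    ScoreRow("model-a", "v1", 12.0, custom),
]
for (version, ctx), members in comparable_groups(rows).items():
    print(version, ctx.harness, [m.model for m in members])
```

Because the dataclasses are frozen, a `RunContext` is hashable and can serve directly as part of the grouping key; the two `model-a` rows land in different groups instead of being treated as the same measurement.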

Does evals.report rank models across benchmarks?

No. ProgramBench scores are shown within their own metric; evals.report never combines benchmarks into a composite ranking or a single "best model".