evals.report
BenchmarksLabsCompareRun guides
BenchmarksReasoning

PutnamBench

A multilingual formal theorem-proving benchmark of hand-formalized William Lowell Putnam Mathematical Competition problems in Lean 4, Isabelle, and Coq, where a model's output is scored by whether the proof assistant's compiler verifies the proof.

ReasoningProblems solvedHigher is better

No run guide for this benchmark yet.