evals.report
BenchmarksLabsCompareRun guides
BenchmarksReasoning

PutnamBench

A multilingual formal theorem-proving benchmark of hand-formalized William Lowell Putnam Mathematical Competition problems in Lean 4, Isabelle, and Coq, where a model's output is scored by whether the proof assistant's compiler verifies the proof.

ReasoningProblems solvedHigher is better
ModelLabScoreSource modelStatusDate
GPT-5OpenAI28/660VerifiedAug 7, 2025Details
Gemini 2.5 ProGoogle DeepMind3VerifiedMar 25, 2025Details
o4-miniOpenAI2VerifiedApr 16, 2025Details
GPT-4oOpenAI1VerifiedMay 13, 2024Details
DeepSeek R1DeepSeek1VerifiedJan 20, 2025Details
Claude 3.7 SonnetAnthropic0VerifiedFeb 24, 2025Details
DeepSeek V3 0324DeepSeek0VerifiedMar 24, 2025Details

Each row reports the model’s Problems solved on PutnamBench. Click a row for the full run context.