BenchmarksReasoning
PutnamBench
A multilingual formal theorem-proving benchmark of hand-formalized William Lowell Putnam Mathematical Competition problems in Lean 4, Isabelle, and Coq, where a model's output is scored by whether the proof assistant's compiler verifies the proof.
ReasoningProblems solvedHigher is better
| Model | Lab | Score↓ | Source model | Status | Date | |
|---|---|---|---|---|---|---|
| GPT-5 | OpenAI | 28/660 | — | Verified | Aug 7, 2025 | Details |
| Gemini 2.5 Pro | Google DeepMind | 3 | — | Verified | Mar 25, 2025 | Details |
| o4-mini | OpenAI | 2 | — | Verified | Apr 16, 2025 | Details |
| GPT-4o | OpenAI | 1 | — | Verified | May 13, 2024 | Details |
| DeepSeek R1 | DeepSeek | 1 | — | Verified | Jan 20, 2025 | Details |
| Claude 3.7 Sonnet | Anthropic | 0 | — | Verified | Feb 24, 2025 | Details |
| DeepSeek V3 0324 | DeepSeek | 0 | — | Verified | Mar 24, 2025 | Details |
Each row reports the model’s Problems solved on PutnamBench. Click a row for the full run context.