AIME (OTIS Mock)
Competition-math reasoning benchmark with a consistent, frequently updated, independent leaderboard across all frontier models.
Structured data · Partial run guide · Public data
Source detail
Score source
Epoch AI Benchmarking Hub publishes per-model mean accuracy (epoch.ai/data/benchmarks.csv).
Run guide
Problems and methodology are documented on the Epoch AI benchmarks hub.
How it can be used
Use Epoch's per-model mean accuracy; keep reasoning effort as run context.
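A minimal sketch of that usage, assuming a CSV export with one row per model run. The column names (`model`, `benchmark`, `mean_accuracy`, `reasoning_effort`) and the sample rows are hypothetical placeholders, not the actual schema or scores in epoch.ai/data/benchmarks.csv; check the real file before adapting this.

```python
import csv
import io

# Placeholder data: the column names and values are assumptions for
# illustration only, not real Epoch AI scores or the real CSV schema.
SAMPLE_CSV = """model,benchmark,mean_accuracy,reasoning_effort
model-a,AIME (OTIS Mock),0.50,high
model-a,AIME (OTIS Mock),0.40,low
model-b,AIME (OTIS Mock),0.30,
"""


def load_scores(csv_text, benchmark):
    """Return one record per run of `benchmark`, with effort kept as context."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["benchmark"] != benchmark:
            continue
        records.append({
            "model": row["model"],
            "score": float(row["mean_accuracy"]),
            # An empty cell means the effort setting was not reported.
            "reasoning_effort": row["reasoning_effort"] or None,
        })
    return records


scores = load_scores(SAMPLE_CSV, "AIME (OTIS Mock)")
for record in scores:
    print(record)
```

Keeping one record per run, rather than collapsing to one score per model, means two runs of the same model at different effort settings stay distinguishable.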
Caveat
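AIME-style benchmarks are saturating at the top, so small score gaps between leading models carry little signal; keep the reasoning effort and run configuration attached to every score.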