evals.report

AIME (OTIS Mock)

Competition-math reasoning benchmark with a consistent, frequently updated, independent leaderboard covering frontier models.

Structured data, Partial run guide, Public data

Source detail

Score source

Epoch AI Benchmarking Hub publishes per-model mean accuracy (epoch.ai/data/benchmarks.csv).

Run guide

Problems and methodology are documented on the Epoch AI Benchmarking Hub.

How it can be used

Use Epoch's per-model mean accuracy as the score, and record each run's reasoning-effort setting as run context.
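
As a rough illustration of that usage, the sketch below reads per-model rows from CSV text and keeps the reasoning-effort setting attached to each score. The column names (`model`, `benchmark`, `reasoning_effort`, `mean_accuracy`), the benchmark label, and the sample rows are all hypothetical placeholders, not the actual schema or data of epoch.ai/data/benchmarks.csv; check the real file's header before adapting it.

```python
import csv
import io

# Hypothetical sample rows. The real benchmarks.csv may use different
# column names and benchmark labels; these values are made up.
SAMPLE_CSV = """model,benchmark,reasoning_effort,mean_accuracy
model-a,AIME (OTIS Mock),high,0.90
model-a,AIME (OTIS Mock),low,0.70
model-b,AIME (OTIS Mock),high,0.85
model-b,Other Benchmark,high,0.50
"""


def load_scores(csv_text, benchmark):
    """Return one record per run of `benchmark`, with effort kept as context."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [
        {
            "model": row["model"],
            "reasoning_effort": row["reasoning_effort"],
            "mean_accuracy": float(row["mean_accuracy"]),
        }
        for row in rows
        if row["benchmark"] == benchmark
    ]


def best_per_model(records):
    """Pick each model's highest-scoring run without dropping its effort setting."""
    best = {}
    for rec in records:
        current = best.get(rec["model"])
        if current is None or rec["mean_accuracy"] > current["mean_accuracy"]:
            best[rec["model"]] = rec
    return best


if __name__ == "__main__":
    records = load_scores(SAMPLE_CSV, "AIME (OTIS Mock)")
    for model, rec in sorted(best_per_model(records).items()):
        print(f"{model}: {rec['mean_accuracy']:.2f} (effort={rec['reasoning_effort']})")
```

Storing the whole record rather than a bare number means a score can never be reported without the effort setting it was obtained under.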

Caveat

AIME-style benchmarks are saturating at the top of the leaderboard, so keep each score's reasoning-effort setting and run configuration attached when comparing models.
