What does accuracy mean on AIME (OTIS Mock)?

AIME (OTIS Mock) reports accuracy (%); higher is better. Scores are shown only within AIME (OTIS Mock) and are never averaged with other benchmarks.

What is the top reported AIME (OTIS Mock) score?

GPT-5.5 Pro has the top reported score on AIME (OTIS Mock): 100.0% (accuracy).

Why do AIME (OTIS Mock) scores differ across runs?

Harness, scaffold, reasoning effort, and prompt setup change results, so two runs of the same model can differ. evals.report keeps each score with its run context so the differences stay visible.

Does evals.report rank models across benchmarks?

No. AIME (OTIS Mock) scores are shown within their own metric; evals.report never combines benchmarks into a composite ranking or a single "best model".

AIME (OTIS Mock)

Competition mathematics in the AIME format (Epoch AI's OTIS Mock AIME 2024-2025 set), a high-signal short-answer math reasoning benchmark.

Reasoning | accuracy | Higher is better

OTIS Mock AIME 2024-2025 is Epoch AI's 45-problem competition-math benchmark (integer answers from 0 to 999), implemented as an inspect_ai task. Run it with the UK AISI inspect_ai framework against the Hugging Face dataset EpochAI/otis-mock-aime-24-25, which is public but gated ('auto' approval, free and self-serve). Epoch's published gist serves as the task file, and an LLM grader scores the answers.

Attach the following to any score you report: the grader model; the number of epochs (the gist default is 16, but Epoch's per-model runs vary; its Grok 4 run, for example, used 8 samples per problem); the reducer reported (mean or pass@k); the generation parameters (Epoch uses each model's API-default temperature and, for at least its Grok 4 run, up to 128k output tokens); and the fact that you used Epoch's gist task definition rather than the unrelated inspect_evals aime2024/aime2025 tasks, which use the real AIME.

Benchmark: AIME (OTIS Mock)
Repository: gist.github.com/tadamcz/faf4681e154be2e4c8a6579d67aca7d3
Dataset: huggingface.co/datasets/EpochAI/otis-mock-aime-24-25
Metric: accuracy

1. Install

shell

pip install inspect-ai

shell

pip install huggingface_hub datasets

shell

huggingface-cli login   # required: the dataset is gated ('auto', free, self-serve); first accept the terms at https://huggingface.co/datasets/EpochAI/otis-mock-aime-24-25

shell

curl -sL https://gist.githubusercontent.com/tadamcz/faf4681e154be2e4c8a6579d67aca7d3/raw -o otis_mock_aime.py

2. Run evaluation

shell

# One required edit to the downloaded otis_mock_aime.py before running:
# Epoch's gist imports its internal package: `from bench.model import default_grader_model`.
# That `bench` package is NOT public. Replace it with inspect_ai's documented grader-role API:
#   1. Change the import to:   from inspect_ai.model import get_model
#   2. In the scorer body, replace `grader_model = default_grader_model()` with:
#        grader_model = get_model(role="grader", default="openai/gpt-4o")
#   (Per inspect_ai docs, call get_model(role=...) INSIDE the async scorer body, not at module load. Bind the role at runtime with --model-role grader=... )
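
The edit described above can also be applied with a small script instead of by hand. The following is a minimal sketch, not part of Epoch's gist: the helper names `patch_grader_import` and `patch_file` are hypothetical, and the sketch assumes the downloaded file contains the import and the `default_grader_model()` call exactly as quoted above. If the gist has changed, the function raises an error rather than guessing.

```python
from pathlib import Path

# Strings quoted in the step above; adjust them if the gist has changed.
OLD_IMPORT = "from bench.model import default_grader_model"
NEW_IMPORT = "from inspect_ai.model import get_model"
OLD_CALL = "default_grader_model()"
NEW_CALL = 'get_model(role="grader", default="openai/gpt-4o")'


def patch_grader_import(source: str) -> str:
    """Return the task source with the private grader import replaced."""
    if OLD_IMPORT not in source or OLD_CALL not in source:
        raise ValueError("expected import or call not found; edit the file by hand")
    return source.replace(OLD_IMPORT, NEW_IMPORT).replace(OLD_CALL, NEW_CALL)


def patch_file(path: str = "otis_mock_aime.py") -> None:
    """Rewrite the downloaded task file in place (hypothetical convenience wrapper)."""
    task_file = Path(path)
    task_file.write_text(patch_grader_import(task_file.read_text()))
```

Calling `patch_file()` in the directory that holds the downloaded file performs both replacements. Because the call site is rewritten where it already sits, `get_model(role=...)` stays inside the scorer body as long as the original call was there.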

shell

# Run the eval against your model, with an explicit grader model:
export OPENAI_API_KEY=...   # and/or ANTHROPIC_API_KEY, etc. for the model under test + grader

shell

inspect eval otis_mock_aime.py --model openai/gpt-4o --model-role grader=openai/gpt-4o

shell

# To reproduce the gist's default sampling setting (16 epochs -> mean + pass@16):
inspect eval otis_mock_aime.py --model anthropic/claude-sonnet-4-0 --model-role grader=openai/gpt-4o -T epochs=16 --max-tokens 128000

3. Score output

shell

# Scoring is built into the run: the gist's model_graded() scorer (metrics accuracy() + stderr()) calls grader_model.generate and checks for 'CORRECT', reduced over epochs (mean, pass_at_k). No separate scoring step.
# Inspect results in the log viewer:
inspect view

4. Expected output

An inspect_ai .eval log file (under ./logs by default) containing per-sample grades (CORRECT/INCORRECT from the model grader) and the aggregate accuracy() with stderr(), reduced across epochs as `mean` and `pass_at_{epochs}`. The headline metric is accuracy on the 45 problems. View the log with `inspect view`. Do not compare this number to inspect_evals aime2024/aime2025 (a different dataset: the real AIME, 30 problems, exact-match scoring) or to MATH Level 5 or FrontierMath.
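
To make the aggregate concrete, the sketch below computes an accuracy and a standard error from a list of per-problem scores. It is illustrative only: the function name `accuracy_with_stderr` is hypothetical, and it assumes the common definition of standard error (sample standard deviation divided by the square root of the number of problems), which may differ in detail from inspect_ai's stderr() metric.

```python
import math
from statistics import mean, stdev


def accuracy_with_stderr(scores: list[float]) -> tuple[float, float]:
    """Return (accuracy, standard error) for per-problem scores in [0, 1].

    Assumption: standard error = sample std dev / sqrt(n). This is the
    textbook formula, not a confirmed copy of inspect_ai's implementation.
    """
    if len(scores) < 2:
        raise ValueError("need at least two problems to estimate a standard error")
    return mean(scores), stdev(scores) / math.sqrt(len(scores))


# Made-up illustration, not a real result: 30 of 45 problems scored 1.0.
accuracy, stderr = accuracy_with_stderr([1.0] * 30 + [0.0] * 15)
print(f"accuracy={accuracy:.3f} stderr={stderr:.3f}")
```

With only 45 problems the standard error is several percentage points wide, which is one reason to report it alongside the headline accuracy.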

5. Submit results

There is no public submission or leaderboard API for community runs; Epoch AI populates the leaderboard at https://epoch.ai/benchmarks/otis-mock-aime-2024-2025 from its own internal runs. Report your result as a self-run number and always attach: the model under test; the grader model (the gist uses a model grader, not exact-match); the number of epochs (gist default 16) and the reducer reported (mean or pass@k); the temperature (Epoch uses each model's API default) and maximum output tokens (Epoch's Grok 4 run used 128000); and a note that you used Epoch's official gist task definition with the grader line adapted to inspect_ai's get_model(role='grader').
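
One way to keep that context attached is to store it in a small record next to the score. The sketch below is an assumption about how you might do this, not a format defined by Epoch AI or inspect_ai: the `RunContext` class and all of its field names are hypothetical, and the accuracy is left empty for you to fill in from your own log.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class RunContext:
    """Hypothetical record of the settings that must travel with a self-run score."""

    model: str
    grader_model: str
    epochs: int
    reducer: str                 # "mean" or "pass_at_k"
    temperature: str             # for example "api-default"
    max_tokens: int
    task_definition: str
    accuracy_pct: Optional[float] = None  # fill in from your own .eval log

    def to_json(self) -> str:
        """Serialize the record so it can be published with the score."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


record = RunContext(
    model="openai/gpt-4o",
    grader_model="openai/gpt-4o",
    epochs=16,
    reducer="mean",
    temperature="api-default",
    max_tokens=128000,
    task_definition="Epoch gist, grader adapted to get_model(role='grader')",
)
print(record.to_json())
```

Publishing the JSON next to the number lets a reader see at a glance whether two self-run scores used the same grader, epochs, and reducer.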

Gotchas

The dataset is gated ('auto'): you must be logged in via `huggingface-cli login` and have accepted the access agreement on the dataset page, or hf_dataset()/load_dataset() will fail with a 401 error. Access is free and self-serve.

The official gist imports `from bench.model import default_grader_model`, which comes from Epoch's private package and is not published. The eval will not run until you replace it with inspect_ai's `get_model(role="grader", default=...)` (called inside the scorer body) and bind the role with `--model-role grader=...`. This is the documented equivalent; scoring remains model-graded, not exact-match.

Do NOT confuse this with the inspect_evals aime2024/aime2025/aime2026 tasks: those use the real AIME competition (math-ai datasets, 30 problems, exact-match), a different benchmark. OTIS Mock AIME is Epoch-specific (45 problems) and only lives in this gist.

Results are sensitive to epochs and the grader model. The gist defaults to 16 epochs and reports mean + pass@16, but Epoch's per-model runs vary (its Grok 4 run used 8 samples/problem); a single-epoch run with a weak grader will not match the leaderboard. For comparability, use Epoch's generation parameters: API-default temperature and, to match its Grok 4 run, --max-tokens 128000.
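
The gap between the two reducers can be seen in a short sketch. The code below uses the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k) for n samples with c correct; it is assumed here, not confirmed, that this matches inspect_ai's pass_at reducer. The function names and the grades are illustrative, not taken from the gist or from any real run.

```python
from math import comb
from statistics import mean


def mean_reducer(grades: list[bool]) -> float:
    """Fraction of epochs graded CORRECT for one problem."""
    return mean(1.0 if g else 0.0 for g in grades)


def pass_at_k(grades: list[bool], k: int) -> float:
    """Unbiased estimate that at least one of k samples is correct.

    Assumption: standard estimator 1 - C(n-c, k) / C(n, k).
    """
    n, c = len(grades), sum(grades)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of epochs")
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# Made-up grades for one problem: 4 of 16 epochs correct.
grades = [True] * 4 + [False] * 12
print(mean_reducer(grades))   # 0.25
print(pass_at_k(grades, 16))  # 1.0
```

The same 16 grades produce 0.25 under `mean` and 1.0 under pass@16, which is why a score is not interpretable without the reducer named alongside it.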