Benchmarks
Official eval sources grouped by task type. Click any benchmark to see scores, sources, and run guides. No cross-benchmark ranking or aggregate score is computed.
A curated SWE-bench split for evaluating systems that resolve real software engineering issues.
A difficult subset of GPQA for graduate-level science question answering evaluation.
A live competitive-programming benchmark that rates LLMs with a Codeforces-style Elo on fresh contest problems.
A function-calling and tool-use benchmark covering single-turn, multi-turn, live, and agentic scenarios.
A frequently updated public benchmark suite spanning reasoning, coding, math, language, and instruction-following tasks.
A command-line agent benchmark for completing terminal tasks in reproducible task environments.
A harder public software-engineering agent benchmark built around professional repository tasks.
A long-horizon software-engineering benchmark with original tasks, broad repository coverage, and behavioral verifiers.
A broad expert-level academic question-answering benchmark for frontier reasoning systems.
The harder MMMU-Pro multimodal reasoning benchmark (college-level subject tasks with text and images); this is the variant that current frontier models report.
A public chat-preference evaluation surface with source-defined preference ratings and model comparisons.
The interactive ARC-AGI-3 generalization benchmark: agents must learn novel game environments from scratch (semi-private set).
The ARC-AGI-2 abstract-reasoning puzzle benchmark (semi-private set), the harder static successor to ARC-AGI-1.
A frontier math benchmark with constrained public access and source-linked result claims.
Competition mathematics in the AIME format (Epoch AI's OTIS Mock AIME 2024-2025 set), a high-signal short-answer math reasoning benchmark.
A factual short-answer QA benchmark measuring parametric knowledge and hallucination resistance (Epoch AI's SimpleQA Verified).