Benchmarks

Terminal-Bench 2.1 · Agents/task success/↑ higher

A command-line agent benchmark for completing terminal tasks in reproducible task environments.

github.com/laude-institute/terminal-bench · 29 reported · guide available

91.9%

Top reported · GPT-5.6 Sol Ultra

DeepSWE · Coding/% resolved/↑ higher

A long-horizon software-engineering benchmark with original tasks, broad repository coverage, and behavioral verifiers.

github.com/datacurve-ai/deep-swe · 20 reported · guide available

70.05%

GPQA Diamond · Reasoning/accuracy/↑ higher

A difficult subset of GPQA for graduate-level science question answering evaluation.

github.com/idavidrein/gpqa · 81 reported · guide available

95.5%

Top reported · Fugu

LiveCodeBench Pro · Coding/Codeforces Elo/↑ higher

A live competitive-programming benchmark that rates LLMs with a Codeforces-style Elo on fresh contest problems.

github.com/GavinZhengOI/LiveCodeBench-Pro · 22 reported · guide available

3298

Top reported · Gemini 3 Deep Think

Humanity's Last Exam · Reasoning/accuracy/↑ higher

A broad expert-level academic question-answering benchmark for frontier reasoning systems.

47 reported · guide available

56.8%

LiveBench · Reasoning/score/↑ higher

A frequently updated public benchmark suite spanning reasoning, coding, math, language, and instruction-following tasks.

github.com/LiveBench/LiveBench · 16 reported · guide available

80.71%

SWE-bench Pro · Coding/% resolved/↑ higher

A harder public software-engineering agent benchmark built around professional repository tasks.

github.com/scaleapi/SWE-bench_Pro-os · 50 reported · guide available

80.0%

Berkeley Function Calling Leaderboard · Tool use/accuracy/↑ higher

A function-calling and tool-use benchmark covering single-turn, multi-turn, live, and agentic scenarios.

github.com/ShishirPatil/gorilla/tree/main/berkeley-function-call-leaderboard · 22 reported · guide available

77.47%

Top reported · Claude Opus 4.5

MMMU-Pro · Multimodal/accuracy/↑ higher

A harder variant of the MMMU multimodal reasoning benchmark (college-level subject tasks with text and images); this is the variant that current frontier models report.

github.com/MMMU-Benchmark/MMMU · 24 reported · guide available

83.0%

Top reported · GPT-5.6 Sol

LMArena · Chat preference/source-defined rating/↑ higher

A public chat-preference evaluation surface with source-defined preference ratings and model comparisons.

41 reported

1499

ARC-AGI-1 · Reasoning/accuracy/↑ higher

The original ARC-AGI-1 abstract-reasoning puzzle benchmark (semi-private set): few-shot grid transformations that are easy for humans but resist memorization. Largely cleared by 2026 frontier reasoning models, which is what motivated the harder ARC-AGI-2.

29 reported

98%

ARC-AGI-2 · Reasoning/accuracy/↑ higher

The ARC-AGI-2 abstract-reasoning puzzle benchmark (semi-private set), the harder static successor to ARC-AGI-1.

29 reported · guide available

85%

ARC-AGI-3 · Reasoning/accuracy/↑ higher

The interactive ARC-AGI-3 generalization benchmark: agents must learn novel game environments from scratch (semi-private set).

10 reported · guide available

7.78%

Top reported · GPT-5.6 Sol

FrontierMath · Reasoning/accuracy/↑ higher

A frontier math benchmark with constrained public access and source-linked result claims.

44 reported

52.4%

AIME (OTIS Mock) · Reasoning/accuracy/↑ higher

Competition mathematics in the AIME format (Epoch AI's OTIS Mock AIME 2024-2025 set), a high-signal short-answer math reasoning benchmark.

48 reported · guide available

100.0%

SimpleQA Verified · Other/accuracy/↑ higher

A factual short-answer QA benchmark measuring parametric knowledge and hallucination resistance (Epoch AI's SimpleQA Verified).

36 reported · guide available

77.3%

GBA Eval · Coding/overall score/↑ higher

Frontier coding agents get 24 hours to write a complete Game Boy Advance emulator (Rust + WebAssembly) from scratch, graded against the Mesen2 reference emulator.

github.com/mechanize-work/gba-eval · 9 reported · guide available

70.9%

WeirdML · Coding/average accuracy/↑ higher

Tests whether LLMs can do machine learning on novel, unusual datasets: each model writes and iteratively debugs PyTorch code over 5 feedback rounds in a sandboxed GPU container, scored on held-out test accuracy across 19 tasks (6 public, 13 hidden).

44 reported

84.9%

MCP Atlas · Tool use/pass rate/↑ higher

Scale AI's large-scale tool-use benchmark: 1,000 expert-written natural-language tasks over 36 real Model Context Protocol (MCP) servers and 220+ tools, requiring agents to discover and orchestrate multi-step tool calls; scored by pass rate via an LLM judge.

github.com/scaleapi/mcp-atlas · 17 reported · guide available

88.1%

Top reported · Muse Spark 1.1

Remote Labor Index · Agents/automation rate/↑ higher

The Remote Labor Index (RLI), from CAIS and Scale Labs, measures how often AI agents can complete real, economically valuable freelance projects (3D & CAD, architecture, graphic design, video, audio, data analysis, web apps, and more) at a quality a paying client would accept. Each of the 240 projects has a real client brief, input files, and a gold-standard deliverable from a paid professional; every AI deliverable is judged by human evaluators. The headline automation rate is the share of projects where the AI's work is judged at least as good as the human's.

8 reported

16.1%

Artificial Analysis Intelligence Index · Reasoning/Index/↑ higher

A composite intelligence score (AAII v4.0) that aggregates a model's performance across 10 challenging evaluations spanning reasoning, knowledge, coding, agentic tasks, and instruction-following (GDPval-AA, τ²-Bench Telecom, Terminal-Bench Hard, SciCode, AA-LCR, AA-Omniscience, IFBench, Humanity's Last Exam, GPQA Diamond, and CritPt) into a single ~0–100 index.

58 reported

Epoch Capabilities Index · Reasoning/Index/↑ higher

A composite capability index from Epoch AI that statistically stitches together scores from 40+ benchmarks (using an Item-Response-Theory-style model) into a single saturation-resistant general-capability scale, calibrated so Claude 3.5 Sonnet=130 and GPT-5=150.

github.com/epoch-research/eci-public · 60 reported

159.3

Aider Polyglot · Coding/% correct/↑ higher

A coding benchmark that measures how reliably an LLM can solve and apply diff-based code edits across 225 challenging Exercism exercises spanning C++, Go, Java, JavaScript, Python, and Rust, with up to two attempts per problem.

github.com/Aider-AI/polyglot-benchmark · 24 reported

89.4%

Top reported · Claude Opus 4.5

SWE-rebench · Coding/Resolved rate (pass@1)/↑ higher

A continuously updated, contamination-free agentic software-engineering benchmark from Nebius that mines fresh post-cutoff GitHub issue/PR tasks and evaluates LLM agents under a fixed ReAct scaffold, reporting the monthly decontaminated resolved rate.

github.com/SWE-rebench/SWE-bench-fork · 11 reported

65.3%

MMLU-Pro · Reasoning/accuracy/↑ higher

A more robust and challenging successor to MMLU with over 12,000 reasoning-focused questions across 14 subjects, expanding answer choices from four to ten to better discriminate frontier large language models.

github.com/TIGER-AI-Lab/MMLU-Pro · 42 reported

90.99%

OSWorld · Agents/task success rate/↑ higher

OSWorld benchmarks multimodal AI agents on their ability to complete open-ended, real-world computer-use tasks (operating GUIs across web, files, and applications) in live operating-system environments via screenshots and mouse/keyboard control, measured by execution-based task success rate.

github.com/xlang-ai/OSWorld · 18 reported

85.0%

GAIA: A Benchmark for General AI Assistants · Agents/accuracy/↑ higher

GAIA is a benchmark of 450+ real-world questions requiring multi-step reasoning, web browsing, multi-modality handling, and tool use, designed to be easy for humans (~92%) but hard for AI assistants, scored across three difficulty levels.

huggingface.co/datasets/gaia-benchmark/GAIA · 29 reported

74.55%

Top reported · Claude Sonnet 4.5

BrowseComp · Agents/accuracy/↑ higher

A benchmark of 1,266 hard-to-find, multi-hop web-browsing questions whose answers are difficult to locate but easy to verify, measuring an agent's ability to persistently search and synthesize information from the web.

github.com/openai/simple-evals · 15 reported

92.2%

Top reported · GPT-5.6 Sol Ultra

τ²-bench (Telecom) · Tool use/pass^1/↑ higher

A dual-control, multi-turn tool-agent-user benchmark (telecom split) where both the AI agent and a simulated user invoke tools to coordinate and resolve technical-support troubleshooting tasks in a shared, dynamic environment.

github.com/sierra-research/tau2-bench · 35 reported

99.3%
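
The pass^1 metric named in the tags above is the k = 1 case of the pass^k family: the estimated chance that k independent trials of the same task all succeed, averaged over tasks. A minimal sketch, assuming every task is run the same number of times; the function name and the sample counts are illustrative and are not taken from the tau2-bench codebase.

```python
from math import comb


def pass_hat_k(successes_per_task, trials, k):
    """Average over tasks of C(c, k) / C(n, k).

    c is the number of successful trials for a task and n is the number of
    trials run per task. For k = 1 this reduces to the mean success rate.
    """
    per_task = [comb(c, k) / comb(trials, k) for c in successes_per_task]
    return sum(per_task) / len(per_task)


# Hypothetical results: three tasks, four trials each.
successes = [4, 2, 0]
pass_1 = pass_hat_k(successes, trials=4, k=1)
pass_2 = pass_hat_k(successes, trials=4, k=2)
```

Larger k rewards consistency: a task solved in only half of its trials contributes 0.5 to pass^1 but much less to pass^2.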

AIME 2026 · Reasoning/accuracy/↑ higher

Accuracy of LLMs on the 30 problems of the 2026 American Invitational Mathematics Examination (AIME I and II), a contamination-free competition-math benchmark requiring integer answers (0-999), evaluated live by MathArena.

github.com/eth-sri/matharena · 22 reported

100.00%

MathVista · Multimodal/accuracy/↑ higher

A benchmark of 6,141 examples (evaluated on the 1,000-example testmini split) that measures mathematical reasoning in visual contexts, spanning figure QA, geometry, math word problems, textbook QA, and visual QA, reported as answer accuracy.

github.com/lupantech/MathVista · 9 reported

86.8%

Top reported · o3

Video-MME · Multimodal/accuracy/↑ higher

A comprehensive evaluation benchmark for multimodal LLMs in video analysis, using 900 videos (254 hours) and 2,700 human-annotated multiple-choice QA pairs across short, medium, and long durations, scored by answer accuracy with and without subtitles.

github.com/MME-Benchmarks/Video-MME · 6 reported

87.4%

Top reported · Kimi K2.5

GDPval · Agents/Elo/↑ higher

GDPval evaluates AI models agentically (shell + web access via a sandbox harness) on real-world economically valuable knowledge-work deliverables — documents, spreadsheets, slides, diagrams — spanning 44 occupations across 9 major U.S. GDP industries, scored by blind pairwise quality comparison; the Artificial Analysis GDPval-AA variant reports results as an Elo rating.

huggingface.co/datasets/openai/gdpval · 70 reported

1932

LiveCodeBench · Coding/Pass@1/↑ higher

A holistic, contamination-free benchmark that continuously collects new competitive-programming problems from LeetCode, AtCoder, and Codeforces (released after model training cutoffs) and measures code-generation correctness via Pass@1.

github.com/LiveCodeBench/LiveCodeBench · 40 reported

93.2%

Top reported · Fugu Ultra

METR Task-Completion Time Horizons · Agents/50% time horizon/↑ higher

Measures the length of software/ML-engineering tasks (in human-expert minutes) that an AI agent can complete with 50% reliability, derived from a logistic fit over HCAST, RE-Bench, and SWAA task suites.

github.com/METR/eval-analysis-public · 18 reported

1044.8 min
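
Once a logistic curve has been fitted, the 50% horizon is the task length at which the predicted success probability crosses one half. A minimal sketch, assuming the fitted model has the form P(success) = sigmoid(intercept + slope * log2(minutes)); the coefficient values below are hypothetical, and the fitting step itself is left out.

```python
import math


def success_probability(minutes, intercept, slope):
    """Predicted success probability for a task of the given human-expert length."""
    z = intercept + slope * math.log2(minutes)
    return 1.0 / (1.0 + math.exp(-z))


def time_horizon(intercept, slope, reliability=0.5):
    """Task length (minutes) at which predicted success equals `reliability`."""
    logit = math.log(reliability / (1.0 - reliability))
    return 2.0 ** ((logit - intercept) / slope)


# Hypothetical fit: success falls as tasks get longer, so the slope is negative.
h50 = time_horizon(intercept=6.0, slope=-1.0)                   # 64.0 minutes
h80 = time_horizon(intercept=6.0, slope=-1.0, reliability=0.8)
```

Asking for higher reliability shortens the horizon, which is why the reliability level has to be stated alongside any horizon figure.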

SciCode · Coding/accuracy/↑ higher

A scientist-curated benchmark that evaluates language models on realistic scientific research coding problems, comprising 338 subproblems decomposed from 80 challenging main problems across 16 natural-science subfields (physics, math, chemistry, biology, materials science).

github.com/scicode-bench/SciCode · 62 reported

60.1%

Top reported · Fugu

MMMU (Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark) · Multimodal/accuracy/↑ higher

A benchmark of ~11.5K college-level multimodal questions spanning 30 subjects and 183 subfields across six disciplines, measuring a vision-language model's accuracy at jointly perceiving images (charts, diagrams, maps, tables, etc.) and reasoning with domain knowledge.

github.com/MMMU-Benchmark/MMMU · 19 reported

85.4%

Top reported · GPT-5.1

AA-Omniscience: Knowledge and Hallucination Benchmark · Reasoning/AA-Omniscience Index/↑ higher

A factuality and knowledge benchmark of 6,000 questions across 42 economically relevant topics in six domains, scoring models on the AA-Omniscience Index (-100 to 100) that rewards correct answers, penalizes hallucinations, and applies no penalty for abstaining.

huggingface.co/datasets/ArtificialAnalysis/AA-Omniscience-Public · 22 reported
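
One scoring rule consistent with the description above (range -100 to 100, reward for correct answers, penalty for hallucinations, no penalty for abstaining) is the share of correct answers minus the share of incorrect answers, scaled by 100. The sketch below assumes that rule; the exact formula used by Artificial Analysis is not stated here, and the counts are hypothetical.

```python
def omniscience_index(correct, incorrect, abstained):
    """Assumed rule: +1 per correct answer, -1 per incorrect answer, 0 per abstention."""
    total = correct + incorrect + abstained
    if total == 0:
        raise ValueError("no questions scored")
    return 100.0 * (correct - incorrect) / total


# Hypothetical model A guesses on everything; model B abstains when unsure.
index_a = omniscience_index(correct=50, incorrect=50, abstained=0)
index_b = omniscience_index(correct=45, incorrect=5, abstained=50)
```

Under this rule a model that abstains instead of guessing can outscore a model with more correct answers, which plain accuracy would not show.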

IFBench · Reasoning/accuracy/↑ higher

Ai2's instruction-following benchmark that measures precise instruction-following generalization on 58 diverse, verifiable out-of-domain output constraints designed to test whether models can obey novel rules rather than overfit to familiar constraint templates.

github.com/allenai/IFBench · 28 reported

83.3%

Top reported · Grok 4.3

MultiChallenge · Reasoning/accuracy/↑ higher

A realistic multi-turn conversation benchmark by Scale AI (SEAL) that evaluates whether frontier LLMs can follow instructions, retain user information, perform versioned editing, and stay self-coherent across multiple conversational turns.

arxiv.org/abs/2501.17399 · 26 reported

75.52%

Top reported · Muse Spark

OpenAI-MRCR v2 (Multi-Round Coreference Resolution) · Reasoning/accuracy (mean SequenceMatcher similarity)/↑ higher

A long-context retrieval benchmark in which a model must locate and reproduce a specific instance (the i-th 'needle') of repeated similar requests buried in a long synthetic multi-turn conversation, scored on the 8-needle variant across context lengths up to 1M tokens.

huggingface.co/datasets/openai/mrcr · 13 reported

93.0%
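
The metric name refers to Python's `difflib.SequenceMatcher`, whose `ratio()` returns a similarity between 0 and 1. A minimal sketch of averaging that ratio over responses; the helper names and the example strings are illustrative, and any extra checks the official grader applies before comparison are not reproduced here.

```python
from difflib import SequenceMatcher


def similarity(response: str, reference: str) -> float:
    """SequenceMatcher ratio: 2 * matched characters / total characters."""
    return SequenceMatcher(None, response, reference).ratio()


def mean_similarity(pairs):
    """Average similarity over (response, reference) pairs."""
    return sum(similarity(resp, ref) for resp, ref in pairs) / len(pairs)


# Hypothetical responses: one exact reproduction, one partially correct.
pairs = [("abcd", "abcd"), ("abcd", "abef")]
score = mean_similarity(pairs)  # (1.0 + 0.5) / 2
```

Because the ratio gives partial credit, a response that reproduces most of the target text still scores close to 1.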

LongBench v2 · Reasoning/accuracy/↑ higher

A long-context benchmark of 503 challenging multiple-choice questions with contexts from 8k to 2M words across six task categories, designed to test deep understanding and reasoning over realistic long-context multitasks.

github.com/THUDM/LongBench · 9 reported

63.3%

Top reported · Gemini 2.5 Pro

Global-MMLU · Reasoning/accuracy/↑ higher

A multilingual extension of MMLU covering 42 languages with culturally-sensitive and culturally-agnostic multiple-choice knowledge questions, measuring accuracy across diverse high-, mid-, and low-resource languages.

github.com/mrl-org/global-mmlu · 33 reported

93.2%

Video-MMMU · Multimodal/accuracy/↑ higher

A multi-discipline benchmark evaluating large multimodal models' ability to acquire and apply knowledge from expert-level professional videos across six disciplines through three cognitive stages (Perception, Comprehension, Adaptation), measured by question-answering accuracy.

github.com/EvolvingLMMs-Lab/VideoMMMU · 10 reported

87.6%

Top reported · Gemini 3 Pro

WebDev Arena · Chat preference/Elo/↑ higher

A live, community-driven leaderboard where two LLMs compete head-to-head to build interactive web applications from user-submitted prompts, with human votes ranking models by a Bradley-Terry (Elo-like) score.

43 reported

1562

Top reported · Claude Opus 4.7
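
Ratings on an Elo-like Bradley-Terry scale translate into head-to-head win probabilities. A minimal sketch, assuming the conventional Elo parameterization in which a 400-point gap corresponds to 10:1 odds; the arena's own scale constants may differ, and the ratings below are hypothetical.

```python
def win_probability(rating_a, rating_b, scale=400.0):
    """Probability that A is preferred over B under an Elo-scaled Bradley-Terry model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# Hypothetical ratings, not taken from any leaderboard.
p_equal = win_probability(1500, 1500)   # 0.5
p_gap = win_probability(1500, 1100)     # 10/11
```

Small rating gaps near the top of a leaderboard therefore imply win rates only slightly above 50%.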

Search Arena · Chat preference/Elo/↑ higher

A crowdsourced human-preference leaderboard from LMArena that ranks search-augmented LLMs via blind pairwise votes on grounded, web-search answers, reported as Bradley-Terry Elo-scale ratings.

github.com/lmarena/search-arena · 23 reported

1251

Arena-Hard-Auto v2.0 · Chat preference/% win rate/↑ higher

An automatic LLM benchmark of 500 hard real-world queries (plus 250 creative-writing prompts) sourced from Chatbot Arena, scored as a win rate against a baseline using LLM judges (GPT-4.1 and Gemini-2.5) as a cheap proxy for human preference.

github.com/lmarena/arena-hard-auto · 9 reported

85.9%

Top reported · o3

EQ-Bench Creative Writing v3 · Chat preference/Elo/↑ higher

An LLM-judged creative writing benchmark that scores models across 32 prompts (3 iterations each) using a hybrid of rubric scoring and pairwise Elo comparisons computed with a margin-weighted Glicko-2 rating system.

github.com/EQ-bench/creative-writing-bench · 47 reported

2206

Top reported · Claude Opus 4.7

Design Arena · Chat preference/Elo/↑ higher

A crowdsourced human-preference benchmark where top AI models receive identical design/frontend prompts and users vote head-to-head on the anonymized outputs, producing a Bradley-Terry (Elo) ranking of design taste across categories like websites, UI components, games, and data visualization.

62 reported

1344

AILuminate AI Safety Benchmark · Other/Safety grade/↑ higher

MLCommons' standardized AI safety benchmark that grades how often general-purpose chat models produce policy-violating responses across 12 hazard categories (e.g. violent crimes, CSAM, hate, self-harm, specialized advice), assigning an ordinal safety grade from Poor to Excellent relative to a sub-15B open-weight reference system.

github.com/mlcommons/ailuminate · 6 reported

Very Good

Top reported · Claude 3.5 Sonnet

MASK (Model Alignment between Statements and Knowledge) · Other/Honesty score/↑ higher

A human-collected honesty benchmark that first elicits a model's beliefs, then measures whether the model maintains truthful assertions when directly or indirectly pressured to lie, disentangling honesty from factual accuracy.

github.com/centerforaisafety/mask · 33 reported

96.28

MCP-Universe · Tool use/Overall Success Rate/↑ higher

A benchmark from Salesforce AI Research that evaluates LLMs and agents on real-world Model Context Protocol (MCP) server tasks across six domains (location navigation, repository management, financial analysis, 3D design, browser automation, web searching), measuring end-to-end task success rate.

github.com/SalesforceAIResearch/MCP-Universe · 24 reported

44.59%

Top reported · Gemini 3 Pro

CharXiv · Multimodal/accuracy/↑ higher

A multimodal benchmark of 2,323 real scientific charts from arXiv papers that evaluates chart understanding in MLLMs via descriptive questions and complex reasoning questions, with the reasoning split (CharXiv-R) measuring accuracy on questions that require synthesizing information across chart elements.

github.com/princeton-nlp/CharXiv · 21 reported

93.2%

OCRBench v2 · Multimodal/accuracy/↑ higher

A large-scale bilingual (English/Chinese) text-centric benchmark of ~10,000 human-verified QA pairs across 31 scenarios that evaluates large multimodal models on visual text localization, recognition, parsing, and reasoning.

github.com/Yuliang-Liu/MultimodalOCR · 10 reported

63.4

Top reported · Gemini 3 Pro

ScreenSpot-Pro · Multimodal/accuracy/↑ higher

A GUI grounding benchmark that measures how accurately a multimodal model can locate a referenced UI element (return its position) given a natural-language instruction and a full-screen, high-resolution screenshot of professional desktop software across 23 applications, 5 industries, and 3 operating systems.

github.com/likaixin2000/ScreenSpot-Pro-GUI-Grounding · 13 reported

87.9%

FACTS Grounding · Reasoning/Grounding accuracy/↑ higher

A Google DeepMind benchmark that measures how factually grounded an LLM's long-form responses are to a provided source document, scoring the share of responses that are eligible and fully supported by the context with no hallucinations.

18 reported

83.6%

Top reported · Gemini 2.0 Flash

BigCodeBench · Coding/calibrated Pass@1/↑ higher

A benchmark of 1,140 (Full) / 148 (Hard) function-level Python programming tasks requiring models to compose calls across 139 diverse libraries from complex instructions, scored by calibrated Pass@1 with greedy decoding.

github.com/bigcode-project/bigcodebench · 14 reported

40.5%

Top reported · DeepSeek V3

SWE-bench Multilingual · Coding/% resolved/↑ higher

A software-engineering benchmark of 300 curated GitHub issue-resolution tasks spanning 42 repositories and 9 programming languages (C, C++, Go, Java, JavaScript, TypeScript, PHP, Ruby, Rust), measuring the percentage of real-world issues a model can resolve so that fail-to-pass and pass-to-pass tests succeed.

github.com/SWE-bench/experiments · 17 reported

78.9%

Top reported · Ornith-1.0-397B
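
The "% resolved" criterion described above can be sketched as a check over two test lists: tests that failed before the fix and must now pass (fail-to-pass), and tests that passed before and must keep passing (pass-to-pass). The function names, status strings, and test names below are hypothetical and are not taken from the SWE-bench harness.

```python
def is_resolved(test_results, fail_to_pass, pass_to_pass):
    """True only if every fail-to-pass and pass-to-pass test passed after the patch."""
    required = list(fail_to_pass) + list(pass_to_pass)
    return all(test_results.get(name) == "PASSED" for name in required)


def percent_resolved(resolved_flags):
    """Share of task instances resolved, as a percentage."""
    return 100.0 * sum(resolved_flags) / len(resolved_flags)


# Hypothetical instance: the patch fixes the issue but breaks an existing test.
results = {"test_issue_fixed": "PASSED", "test_existing_a": "FAILED"}
fixed_but_regressed = is_resolved(results, ["test_issue_fixed"], ["test_existing_a"])
rate = percent_resolved([True, False, True, True])
```

A missing test result counts as a failure here, so a patch that prevents the suite from running is not credited.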

SWE-bench Multimodal · Coding/% resolved/↑ higher

A software-engineering benchmark of 517 real GitHub issues from visual JavaScript/web projects that include visual context (screenshots, UI mockups, diagrams), measuring whether AI systems can resolve issues whose fixes are verified by the repository's tests.

github.com/SWE-bench/SWE-bench · 9 reported

59.0%

SuperGPQA · Reasoning/accuracy/↑ higher

A large-scale knowledge-and-reasoning benchmark of ~26,000 graduate-level multiple-choice questions (up to 10 answer options each) spanning 285 academic disciplines, measuring overall answer accuracy.

github.com/SuperGPQA/SuperGPQA · 14 reported

73.9%

Top reported · Qwen 3.6 Max Preview

EnigmaEval · Reasoning/accuracy/↑ higher

A benchmark of 1,184 puzzle-hunt challenges spanning text and images that probes models' ability to perform implicit knowledge synthesis, lateral thinking, and multi-step deductive reasoning to uncover hidden solution paths.

arxiv.org/abs/2502.08859 · 25 reported

23.82%

Top reported · GPT-5.4 Pro

ZeroBench · Multimodal/accuracy/↑ higher

An intentionally 'impossible' visual reasoning benchmark of 100 hand-crafted main questions (plus 334 subquestions) on which contemporary large multimodal models score near zero, designed to provide maximum headroom for measuring genuine multi-step visual understanding.

github.com/jonathan-roberts1/zerobench · 28 reported

23.0% (pass@5)

Top reported · GPT-5.4

IMO-Bench · Reasoning/accuracy/↑ higher

A suite of IMO-level mathematical reasoning benchmarks from Google DeepMind, whose IMO-AnswerBench component tests models on 400 robustified Olympiad problems (Algebra, Combinatorics, Geometry, Number Theory) with verifiable short answers graded by an autograder.

github.com/google-deepmind/superhuman/tree/main/imobench · 11 reported

73.1%

Top reported · Grok 4

PutnamBench · Reasoning/Problems solved/↑ higher

A multilingual formal theorem-proving benchmark of hand-formalized William Lowell Putnam Mathematical Competition problems in Lean 4, Isabelle, and Coq, where a model's output is scored by whether the proof assistant's compiler verifies the proof.

github.com/trishullab/PutnamBench · 7 reported

28/660

Top reported · GPT-5

MathArena HMMT February 2026 · Reasoning/accuracy/↑ higher

Contamination-free evaluation of large language models on the 33 problems of the HMMT February 2026 mathematics competition, scoring final-answer accuracy (pass@1 estimated from 4 samples per problem) on problems released after model training.

github.com/eth-sri/matharena · 21 reported

97.73%

Top reported · GPT-5.4
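
Estimating pass@1 from 4 samples per problem amounts to averaging each problem's fraction of correct samples. A minimal sketch; the per-problem tally below is hypothetical, chosen only because it is one breakdown of 33 problems that reproduces the 97.73% figure listed above.

```python
def pass_at_1(correct_counts, samples_per_problem=4):
    """Mean over problems of (correct samples / samples drawn)."""
    fractions = [c / samples_per_problem for c in correct_counts]
    return sum(fractions) / len(fractions)


# Hypothetical tally: 30 problems solved in all 4 samples, 3 problems in 3 of 4.
counts = [4] * 30 + [3] * 3
score = round(100 * pass_at_1(counts), 2)  # 129 correct out of 132 samples
```

With 132 samples in total, each missed sample moves the score by roughly 0.76 percentage points, so differences smaller than that cannot occur between two runs.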

FrontierMath Tier 4 · Reasoning/accuracy/↑ higher

FrontierMath Tier 4 is Epoch AI's expansion set of 50 exceptionally difficult, original research-level mathematics problems—crafted and vetted by expert mathematicians—that can take a specialist days to solve, measuring an AI model's advanced mathematical reasoning by exact-answer accuracy.

35 reported

39.6%

Vectara Hallucination Leaderboard · Other/Hallucination Rate/↓ lower

Measures how often LLMs introduce hallucinations when summarizing short documents, scored by Vectara's HHEM-2.3 factual-consistency model, reported as a hallucination rate.

github.com/vectara/hallucination-leaderboard · 45 reported

4.5%

Top reported · Mistral Large

Gray Swan Arena (Agent Red-Teaming / Indirect Prompt Injection) · Agents/Attack Success Rate (ASR)/↓ lower

A large-scale public red-teaming competition run on the Gray Swan Arena platform that measures how often adversarial attackers can break frontier AI agents (via jailbreaks and indirect prompt injection across tool-use, coding, and computer-use settings), reported as an attack success rate where lower is better.

github.com/grayswansecurity/ipi_arena_os · 14 reported

0.5%

Top reported · Claude Opus 4.5

PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts · Reasoning/Difficulty-Weighted Accuracy (DW-ACC)/↑ higher

A multilingual mathematical reasoning benchmark of 9,000 parallel problems across 18 languages and 4 difficulty levels (K-12 to Olympiad/frontier), scored by difficulty-weighted accuracy.

github.com/QwenLM/PolyMath · 11 reported

65.1%

Top reported · Kimi K2 Instruct

Vibe Code Bench · Coding/Overall accuracy/↑ higher

An end-to-end web application development benchmark (by Vals AI / Replit) where models build complete full-stack web apps from natural-language specifications in a sandboxed environment with production services (Supabase, Stripe, email), then are scored by an autonomous browser agent on overall application pass accuracy.

33 reported

82.72%

Online-Mind2Web · Agents/Task success rate/↑ higher

A live web-agent benchmark of 300 realistic tasks across 136 real websites that measures whether an autonomous agent can complete end-to-end web tasks on dynamic, online pages, scored as task success rate.

github.com/OSU-NLP-Group/Online-Mind2Web · 12 reported

92.8%

Top reported · GPT-5.4

WebArena · Agents/Task success rate/↑ higher

A reproducible, self-hostable web environment of fully functional sites (e-commerce, content management, social forum, and software development) where autonomous agents are scored on the functional-correctness success rate of completing 812 realistic, long-horizon, multi-step web tasks.

github.com/web-arena-x/webarena · 8 reported

68.0%

GSO: Software Optimization Benchmark for SWE-Agents · Coding/Opt@1/↑ higher

GSO evaluates AI coding agents on 102 challenging real-world software performance optimization tasks across 10 codebases in 5 languages, measuring whether an agent's patch matches expert-developer speedups while remaining correct.

github.com/gso-bench/gso · 22 reported

44.12%

Top reported · Claude Opus 4.7

MultiNRC · Reasoning/accuracy/↑ higher

A native (non-translated) multilingual reasoning benchmark of 1,000+ questions written by native speakers in French, Spanish, and Chinese across four categories (language-specific linguistic reasoning, wordplay/riddles, cultural/tradition reasoning, and culturally relevant math), scoring LLMs on accuracy.

huggingface.co/datasets/ScaleAI/MultiNRC · 27 reported

64.74%

Terminal-Bench 2.0 · Agents/task success/↑ higher

An agentic benchmark measuring whether an AI model can complete real command-line / terminal software tasks end-to-end (version 2.0, the 89-task set), scored by task success rate. Distinct from the newer Terminal-Bench 2.1 (a different task set); most 2026 model cards self-report this 2.0 version.

github.com/laude-institute/terminal-bench · 19 reported

84.3%

SWE-Marathon · Agents/resolution rate (pass@1)/↑ higher

A long-horizon software-engineering benchmark of 20 realistic, multi-hour tasks (library reproductions, full-stack product clones, ML-engineering, and algorithmic optimization) that test whether frontier coding agents can autonomously complete ultra-long-horizon work; scored by binary pass@1 resolution rate with reward-hacking-resistant verifiers.

github.com/abundant-ai/swe-marathon · 15 reported

29.0%

Top reported · Grok 4.5

FrontierCode · Coding/weighted score (Diamond)/↑ higher

Cognition's benchmark for code mergeability and production quality, not just correctness. Tasks are drawn from 36 real open-source repositories and authored by their maintainers (40+ hours each), with concise, humanlike prompts (~1/3 the length of SWE-bench Pro). Solutions are graded against a maintainer-style rubric spanning behavioral correctness, regression safety, mechanical cleanliness, test correctness, scope, and code quality; the reported score is a weighted aggregate of the rubric items, and any solution that fails a 'blocker' criterion scores 0. Three nested subsets are published — Diamond (50 hardest tasks), Main (100), and Extended (150) — with each model run 5× at every available reasoning effort and the best effort reported. Tasks are kept private to avoid contamination.

13 reported

29.3%
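
The scoring rule described above (a weighted aggregate of rubric items, with any failed blocker forcing a score of 0) can be sketched as follows. The item names, weights, and the pass/fail representation are hypothetical; Cognition's actual rubric and weighting are not given here.

```python
def rubric_score(items):
    """Weighted share of rubric items passed, or 0.0 if any blocker item fails.

    Each item is a dict with keys: name, weight, passed, blocker.
    """
    if any(item["blocker"] and not item["passed"] for item in items):
        return 0.0
    total_weight = sum(item["weight"] for item in items)
    earned = sum(item["weight"] for item in items if item["passed"])
    return earned / total_weight


# Hypothetical rubric for one solution.
rubric = [
    {"name": "behavioral correctness", "weight": 3.0, "passed": True, "blocker": True},
    {"name": "regression safety", "weight": 2.0, "passed": True, "blocker": True},
    {"name": "scope", "weight": 1.0, "passed": False, "blocker": False},
]
partial = rubric_score(rubric)  # 5 / 6

# The same solution with a failed blocker scores zero regardless of the rest.
blocked = rubric_score([dict(rubric[0], passed=False)] + rubric[1:])
```

The blocker rule makes the score discontinuous: a solution can pass most of the rubric and still receive nothing.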

FrontierSWE · Agents/dominance score/↑ higher

Proximal Labs' ultra-long-horizon coding-agent benchmark: 17 open-ended technical projects spanning implementation, performance engineering, and applied ML research (e.g. optimizing a real compiler, inventing better ML optimizers, building a PostgreSQL-compatible server backed by SQLite). Agents get up to 20 hours per task and 5 trials each; tasks are graded 0–1 on partial progress, and frontier models barely make headway — making FrontierSWE one of the few unsaturated public coding benchmarks. Models are ranked by 'dominance' (win rate against a random opponent across tasks).

github.com/Proximal-Labs/frontier-swe · 13 reported

90%
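
A "win rate against a random opponent across tasks" can be sketched as the fraction of (opponent, task) pairings in which a model's task score beats the opponent's. The sketch below counts ties as half a win, which is an assumption, as are the model names and scores; the benchmark's exact procedure (for example, how the 5 trials are combined) is not specified here.

```python
def dominance(scores, model):
    """Win rate of `model` over every (other model, task) pairing; ties count 0.5.

    scores maps each model name to a list of per-task scores in the same task order.
    """
    points = 0.0
    pairings = 0
    for opponent, opponent_scores in scores.items():
        if opponent == model:
            continue
        for mine, theirs in zip(scores[model], opponent_scores):
            if mine > theirs:
                points += 1.0
            elif mine == theirs:
                points += 0.5
            pairings += 1
    return points / pairings


# Hypothetical per-task scores on a 0-1 partial-progress scale.
scores = {"a": [0.5, 0.2], "b": [0.1, 0.2], "c": [0.0, 0.9]}
dom_a = dominance(scores, "a")  # 2.5 wins out of 4 pairings
dom_b = dominance(scores, "b")  # 1.5 wins out of 4 pairings
```

Because dominance is relative, a high value says a model beats its peers on most tasks, not that its absolute task scores are high.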

ProgramBench · Coding/almost-resolved rate/↑ higher

A cleanroom software-reconstruction benchmark (Meta Superintelligence Labs, Stanford, Harvard) of 200 heterogeneous tasks built from real tools like jq, ripgrep, SQLite, and FFmpeg. Given only a reference executable and its documentation — no source, no decompiling, no internet — the agent must choose a language, design the architecture, and rebuild the program, graded by ~248,000 agent-fuzzed behavioral tests (stdout, stderr, exit codes, file outputs). A task is 'resolved' only if every test passes; fully-resolved is ≤0.5% for all frontier models, so the leaderboard's effective ranking is the almost-resolved rate (tasks nearly reconstructed). Evaluated with the mini-SWE-agent harness.

7 reported

13.5%

CursorBench · Agents/score/↑ higher

Cursor's agentic-coding benchmark built from real, anonymized Cursor sessions: ambiguous, multi-file tasks spanning codebase understanding, bug finding, planning, code review, editing, refactoring, and bug fixes. Each model is evaluated across reasoning-effort levels; alongside the headline pass score, Cursor reports average cost per task (USD), tokens per task, and steps per task. Cursor cautions that small score differences may not be statistically meaningful.

8 reported

72.9%

PostTrainBench · Agents/weighted average score/↑ higher

Measures AI R&D automation: can a coding agent autonomously post-train (fine-tune) a base LLM to improve it? Each agent gets 4 small base models (Qwen3 1.7B, Qwen3 4B, SmolLM3-3B, Gemma 3 4B), a single H100 GPU, and a 10-hour budget to maximize each model's performance using techniques of its choosing (SFT, RL/GRPO, LoRA/QLoRA, DPO, etc.) via its native CLI scaffold (Claude Code, Codex CLI, Gemini CLI, OpenCode). The post-trained models are then evaluated with Inspect — respecting each model's generation_config.json — across 7 benchmarks (AIME 2025, Arena Hard, BFCL, GPQA Main, GSM8K, HealthBench, HumanEval). The reported score is the weighted average across all 4 base models and 7 benchmarks. For reference, the officially-released instruct versions of the base models average 51.1% (without the 10h/1-GPU constraint) and the un-post-trained base models score 7.5% zero-shot.

github.com/aisa-group/PostTrainBench · 21 reported

37.23%