evals.report

xAI

Model provider for Grok-family public benchmark rows.

5 models · 11 results · x.ai

Models 5

Progress by benchmark

Grok 4 (Jul 9, 2025)
Grok 4.1 fast reasoning (Nov 1, 2025)
Grok 4.20 beta reasoning (Mar 5, 2026)
Grok 4.2 (Mar 9, 2026)
Grok 4.3 (Apr 17, 2026)
Single benchmark only
This view shows SWE-bench Verified (% resolved) only. Other benchmarks use different metrics and are not directly comparable.

Progress matrix

Columns (benchmark: metric)
SWE-bench Verified: % resolved
GPQA Diamond: accuracy
LiveCodeBench Pro: Codeforces Elo
Berkeley Function Calling Leaderboard: accuracy
LiveBench: score
Terminal-Bench 2.1: task success
SWE-bench Pro: % resolved
DeepSWE: % resolved
Humanity's Last Exam: accuracy
MMMU-Pro: accuracy
LMArena: source-defined rating
ARC-AGI-3: accuracy
ARC-AGI-2: accuracy
FrontierMath: accuracy
AIME (OTIS Mock): accuracy
SimpleQA Verified: accuracy

Rows (model, family: reported values in source order; empty cells are not shown)
Grok 4, Grok: 87.0%, 62.97%, 24.52%, 19.66%, 47.9%
Grok 4.1 fast reasoning, Grok: 69.57%
Grok 4.20 beta reasoning, Grok: 67.96%, 1453, 0.09%
Grok 4.2, Grok: 30.2%
Grok 4.3, Grok: 33.12%

Scores are not normalised across benchmarks. Each column uses its own metric. Compare columns independently.
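
The note above about comparing columns independently can be sketched in code. This is a minimal illustration, not how evals.report computes anything: the row layout, the `rank_within_benchmark` helper, and every model name, benchmark name, and number below are hypothetical placeholders, and the sketch assumes that a higher value is better for every metric.

```python
from collections import defaultdict

# Hypothetical rows: (model, benchmark, metric, value). These names and
# numbers are placeholders for illustration, not results from this page.
ROWS = [
    ("model-a", "bench-x", "accuracy", 0.62),
    ("model-b", "bench-x", "accuracy", 0.70),
    ("model-a", "bench-y", "elo", 1400.0),
    ("model-b", "bench-y", "elo", 1350.0),
]


def rank_within_benchmark(rows):
    """Group rows by benchmark and sort each group by value, highest first.

    Assumes higher is better for every metric. Values are never compared
    across benchmarks, because each benchmark uses its own scale.
    """
    by_benchmark = defaultdict(list)
    for model, benchmark, metric, value in rows:
        by_benchmark[benchmark].append((model, metric, value))
    return {
        benchmark: sorted(entries, key=lambda entry: entry[2], reverse=True)
        for benchmark, entries in by_benchmark.items()
    }


rankings = rank_within_benchmark(ROWS)
for benchmark, entries in rankings.items():
    print(benchmark, [(model, value) for model, _, value in entries])
```

Grouping first and sorting second keeps an Elo-style rating from ever being ordered against a percentage, which is the mistake the note warns about.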