evals.report

Meta

Model lab for Llama and Meta public benchmark rows.

4 models, 9 results, ai.meta.com

Models 4

Progress by benchmark

Models shown, by release date:
Llama 3.1 405B: Jul 23, 2024
Llama 4 Scout: Apr 5, 2025
Llama 4 Maverick: Apr 5, 2025
Muse Spark: Apr 1, 2026
Single benchmark only
This view shows SWE-bench Verified (% resolved) only. Other benchmarks use different metrics and are not directly comparable.

Progress matrix

Benchmark columns (metric):
SWE-bench Verified (% resolved)
GPQA Diamond (accuracy)
LiveCodeBench Pro (Codeforces Elo)
Berkeley Function Calling Leaderboard (accuracy)
LiveBench (score)
Terminal-Bench 2.1 (task success)
SWE-bench Pro (% resolved)
DeepSWE (% resolved)
Humanity's Last Exam (accuracy)
MMMU-Pro (accuracy)
LMArena (source-defined rating)
ARC-AGI-3 (accuracy)
ARC-AGI-2 (accuracy)
FrontierMath (accuracy)
AIME (OTIS Mock) (accuracy)
SimpleQA Verified (accuracy)

Reported values by model (blank cells not shown):
Llama 3.1 405B (Llama): 11.18%
Llama 4 Scout (Llama):
Llama 4 Maverick (Llama): 37.29%, 5.24%
Muse Spark (Muse): 89.8%, 55.00%, 80.4%, 1474, 88.9%, 66.3%

Scores are not normalised across benchmarks. Each column uses its own metric. Compare columns independently.
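
As a rough illustration of comparing columns independently, the sketch below groups rows by benchmark and metric and ranks models only inside each group. The model names, benchmark names, and numbers are hypothetical placeholders, not values from this page, and the helper names `rank_within_columns` and `compare` are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical rows: (model, benchmark, metric, value).
# Names and numbers are placeholders, not values from the matrix.
ROWS = [
    ("model-a", "bench-x", "accuracy", 0.62),
    ("model-b", "bench-x", "accuracy", 0.71),
    ("model-a", "bench-y", "elo", 1310.0),
    ("model-b", "bench-y", "elo", 1255.0),
]


def rank_within_columns(rows):
    """Group rows by (benchmark, metric) and rank models inside each group only."""
    columns = defaultdict(list)
    for model, benchmark, metric, value in rows:
        columns[(benchmark, metric)].append((model, value))
    # Sort each column by value, highest first; values never cross columns.
    return {
        key: [model for model, _ in sorted(entries, key=lambda e: e[1], reverse=True)]
        for key, entries in columns.items()
    }


def compare(rows, model_1, model_2, benchmark):
    """Return the score difference on one benchmark, or None if a row is missing."""
    scores = {m: v for m, b, _, v in rows if b == benchmark and m in (model_1, model_2)}
    if model_1 not in scores or model_2 not in scores:
        return None
    return scores[model_1] - scores[model_2]


rankings = rank_within_columns(ROWS)
```

Keying each group on both benchmark and metric means an Elo rating and an accuracy figure never end up in the same sorted list, and `compare` returns `None` when a model has no row for the benchmark rather than treating the blank cell as zero.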