Meta
Models 4
Llama 3.1 405B
Llama · llama 3.1 405b
2024-07-23
1 result
Llama 4 Scout
Llama · llama 4 scout
2025-04-05
0 results
Llama 4 Maverick
Llama · llama 4 maverick
2025-04-05
2 results
Muse Spark
Muse · muse-spark
2026-04-01
6 results
Progress by benchmark
| Model | Date | SWE-bench Verified (% resolved) |
|---|---|---|
| Llama 3.1 405B | Jul 23, 2024 | — |
| Llama 4 Scout | Apr 5, 2025 | — |
| Llama 4 Maverick | Apr 5, 2025 | — |
| Muse Spark | Apr 1, 2026 | — |
Single benchmark only
This view shows SWE-bench Verified (% resolved) only. Other benchmarks use different metrics and are not directly comparable.
Progress matrix
| Model | SWE-bench Verified (% resolved) | GPQA Diamond (accuracy) | LiveCodeBench Pro (Codeforces Elo) | Berkeley Function Calling Leaderboard (accuracy) | LiveBench (score) | Terminal-Bench 2.1 (task success) | SWE-bench Pro (% resolved) | DeepSWE (% resolved) | Humanity's Last Exam (accuracy) | MMMU-Pro (accuracy) | LMArena (source-defined rating) | ARC-AGI-3 (accuracy) | ARC-AGI-2 (accuracy) | FrontierMath (accuracy) | AIME (OTIS Mock) (accuracy) | SimpleQA Verified (accuracy) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama 3.1 405B (Llama) | — | — | — | — | — | — | 11.18% | — | — | — | — | — | — | — | — | — |
| Llama 4 Scout (Llama) | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
| Llama 4 Maverick (Llama) | — | — | — | 37.29% | — | — | 5.24% | — | — | — | — | — | — | — | — | — |
| Muse Spark (Muse) | — | 89.8% | — | — | — | — | 55.00% | — | — | 80.4% | 1474 | — | — | — | 88.9% | 66.3% |
Scores are not normalised across benchmarks. Each column uses its own metric. Compare columns independently.
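
The note above can be made concrete with a small sketch. It is illustrative only and not part of the tracker: the `SCORES` dictionary and the helpers `result_counts` and `rank_within` are hypothetical names, and the benchmark keys are shortened labels. The values are the non-empty cells of the progress matrix. The sketch assumes that a higher value is better on every benchmark listed. Models are ranked only within a single column, never across columns.

```python
# Non-empty cells copied from the progress matrix; empty cells are omitted.
# Units differ per column (percentages vs. a rating), so values are only
# ever compared within one benchmark.
SCORES = {
    "Llama 3.1 405B": {"SWE-bench Pro": 11.18},
    "Llama 4 Scout": {},
    "Llama 4 Maverick": {
        "Berkeley Function Calling Leaderboard": 37.29,
        "SWE-bench Pro": 5.24,
    },
    "Muse Spark": {
        "GPQA Diamond": 89.8,
        "SWE-bench Pro": 55.00,
        "MMMU-Pro": 80.4,
        "LMArena": 1474,
        "AIME (OTIS Mock)": 88.9,
        "SimpleQA Verified": 66.3,
    },
}


def result_counts(scores):
    """Number of benchmarks with a reported score, per model."""
    return {model: len(cells) for model, cells in scores.items()}


def rank_within(scores, benchmark):
    """Models with a score on one benchmark, best first.

    Assumption: higher is better for the chosen benchmark.
    """
    rows = [
        (model, cells[benchmark])
        for model, cells in scores.items()
        if benchmark in cells
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    print(result_counts(SCORES))
    print(rank_within(SCORES, "SWE-bench Pro"))
```

The per-model counts produced by `result_counts` line up with the "results" figures in the model list, and `rank_within` returns an empty list for a benchmark on which no model has a score, such as SWE-bench Verified.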