evals.report
Models (1 selected): Muse Spark (Meta)
Benchmarks (2 selected): SWE-bench Verified (Coding), GPQA Diamond (Reasoning)
Benchmark            Metric       Muse Spark (Meta)
SWE-bench Verified   % resolved
GPQA Diamond         accuracy     89.8%

No aggregate score is calculated. Each row uses its benchmark’s own metric. Compare rows independently.
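
As a rough illustration of the note above, the following minimal sketch keeps each benchmark row with its own metric and never combines scores across rows. The `Row` class and `format_row` helper are hypothetical names introduced here for illustration, not part of evals.report. The values are copied from the comparison above, and a missing score is stored as `None` rather than filled in.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Row:
    """One benchmark result (hypothetical structure, for illustration only)."""
    benchmark: str
    metric: str             # each benchmark keeps its own metric
    score: Optional[float]  # None when the page shows no value


# Values copied from the comparison above for Muse Spark (Meta).
rows = [
    Row("SWE-bench Verified", "% resolved", None),
    Row("GPQA Diamond", "accuracy", 89.8),
]


def format_row(row: Row) -> str:
    """Render one row on its own, without combining it with other rows."""
    value = "no value shown" if row.score is None else f"{row.score:.1f}%"
    return f"{row.benchmark} ({row.metric}): {value}"


# No sum or average is computed: the metrics differ, so rows stay separate.
lines = [format_row(r) for r in rows]
for line in lines:
    print(line)
```

Keeping the metric name next to each score makes it harder to accidentally average a "% resolved" figure with an "accuracy" figure.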