evals.report

StepFun

Model provider for Step-family public benchmark rows.

1 model · 5 results · stepfun.com

Models 1

Progress by benchmark

SWE-bench Verified (% resolved): 76.5% as of May 29, 2026.

Single benchmark only: this view shows SWE-bench Verified (% resolved) only. Other benchmarks use different metrics and are not directly comparable.

Progress matrix

Columns: Model, followed by one column per benchmark (benchmark | metric):

SWE-bench Verified | % resolved
Terminal-Bench 2.1 | task success
DeepSWE | % resolved
GPQA Diamond | accuracy
LiveCodeBench Pro | Codeforces Elo
Humanity's Last Exam | accuracy
LiveBench | score
SWE-bench Pro | % resolved
Berkeley Function Calling Leaderboard | accuracy
MMMU-Pro | accuracy
LMArena | source-defined rating
ARC-AGI-3 | accuracy
ARC-AGI-2 | accuracy
FrontierMath | accuracy
AIME (OTIS Mock) | accuracy
SimpleQA Verified | accuracy
GBA Eval | overall score
WeirdML | average accuracy
SWE-fficiency | speedup score
KernelBench (Hard) | fast₁
MCP Atlas | pass rate
Artificial Analysis Intelligence Index | Index
Epoch Capabilities Index | Index
Aider Polyglot | % correct
SWE-rebench | Resolved rate (pass@1)
MMLU-Pro | accuracy
OSWorld | task success rate
GAIA: A Benchmark for General AI Assistants | accuracy
BrowseComp | accuracy
τ²-bench (Telecom) | pass^1
AIME 2026 | accuracy
MathVista | accuracy
Video-MME | accuracy
GDPval | Elo
LiveCodeBench | Pass@1
METR Task-Completion Time Horizons | 50% time horizon
SWE-Lancer | IC SWE pass rate (Diamond)
SciCode | accuracy
MMMU (Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark) | accuracy
BIG-Bench Extra Hard | harmonic mean accuracy
AA-Omniscience: Knowledge and Hallucination Benchmark | AA-Omniscience Index
IFBench | accuracy
MultiChallenge | accuracy
RULER | accuracy
OpenAI-MRCR v2 (Multi-Round Coreference Resolution) | accuracy (mean SequenceMatcher similarity)
LongBench v2 | accuracy
Global-MMLU | accuracy
MMLU-ProX | accuracy
Video-MMMU | accuracy
WebDev Arena | Elo
Search Arena | Elo
Arena-Hard-Auto v2.0 | % win rate
EQ-Bench Creative Writing v3 | Elo
Design Arena | Elo
AgentHarm | Harm score
AgentDojo | utility under attack
AILuminate AI Safety Benchmark | Safety grade
MASK (Model Alignment between Statements and Knowledge) | Honesty score
MCP-Universe | Overall Success Rate
CharXiv | accuracy
OCRBench v2 | accuracy
ScreenSpot-Pro | accuracy
FACTS Grounding | Grounding accuracy
BigCodeBench | calibrated Pass@1
SWE-bench Multilingual | % resolved
SWE-bench Multimodal | % resolved
SuperGPQA | accuracy
EnigmaEval | accuracy
ZeroBench | accuracy
IMO-Bench | accuracy
PutnamBench | Problems solved
MathArena HMMT February 2026 | accuracy
FrontierMath Tier 4 | accuracy
Vectara Hallucination Leaderboard | Hallucination Rate
Gray Swan Arena (Agent Red-Teaming / Indirect Prompt Injection) | Attack Success Rate (ASR)
PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts | Difficulty-Weighted Accuracy (DW-ACC)
Step 3.7 Flash (StepFun Step series)
Results (populated columns only, in column order): 76.5% | 59.6% | 56.3% | 42.6 | 1298

Scores are not normalised across benchmarks. Each column uses its own metric. Compare columns independently.
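
One way to respect this rule when handling the matrix in code is to store each score together with its benchmark and metric, and to refuse any comparison that crosses columns. The sketch below is only an illustration of that idea: the `Result` class, the `difference` helper, and the `example-model` rows are hypothetical and not part of evals.report. Only the Step 3.7 Flash SWE-bench Verified value (76.5% resolved) comes from this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """One cell of the progress matrix: a score plus the metric it is in."""
    model: str
    benchmark: str
    metric: str
    value: float


def difference(a: Result, b: Result) -> float:
    """Return a.value - b.value, refusing to mix benchmark columns."""
    if (a.benchmark, a.metric) != (b.benchmark, b.metric):
        raise ValueError(
            f"not comparable: {a.benchmark} ({a.metric}) "
            f"vs {b.benchmark} ({b.metric})"
        )
    return a.value - b.value


# From the page: Step 3.7 Flash, SWE-bench Verified, 76.5% resolved.
step = Result("Step 3.7 Flash", "SWE-bench Verified", "% resolved", 76.5)
# Hypothetical rows, invented purely for illustration.
other = Result("example-model", "SWE-bench Verified", "% resolved", 70.0)
elo = Result("example-model", "LiveCodeBench Pro", "Codeforces Elo", 1500.0)

print(difference(step, other))  # same column, so the gap is meaningful

try:
    difference(step, elo)  # % resolved vs Codeforces Elo: rejected
except ValueError as err:
    print(err)
```

The helper returns a raw difference and does not say which model is "better", because the direction of improvement also varies by column: metrics such as Hallucination Rate or Attack Success Rate are lower-is-better.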