evals.report

Kimi K2 Thinking

Moonshot AI · Kimi K2. Released Nov 6, 2025.

13 results

Benchmark results

| Benchmark | Category | Score | Metric | Status | Date |
| --- | --- | --- | --- | --- | --- |
| SWE-bench Verified | Coding | 71.3% | % resolved | Verified | Nov 6, 2025 |
| GPQA Diamond | Reasoning | 84.5% | accuracy | Verified | Nov 6, 2025 |
| Humanity's Last Exam | Reasoning | 23.9% | accuracy | Verified | Nov 6, 2025 |
| Epoch Capabilities Index | Reasoning | 145.6 | Index | Official | Nov 6, 2025 |
| MMLU-Pro | Reasoning | 84.6% | accuracy | Unverified | Nov 6, 2025 |
| BrowseComp | Agents | 60.2% | accuracy | Verified | Nov 6, 2025 |
| GDPval | Agents | 992 | Elo | Official | Nov 6, 2025 |
| MultiChallenge | Reasoning | 55.42% | accuracy | Verified | Nov 6, 2025 |
| Global-MMLU | Reasoning | 73.5% | accuracy | Unverified | Nov 6, 2025 |
| WebDev Arena | Chat preference | 1329 | Elo | Verified | Nov 6, 2025 |
| EQ-Bench Creative Writing v3 | Chat preference | 1695 | Elo | Verified | Nov 6, 2025 |
| MCP-Universe | Tool use | 26.41% | Overall Success Rate | Verified | Nov 6, 2025 |
| Gray Swan Arena (Agent Red-Teaming / Indirect Prompt Injection) | Agents | 4.8% | Attack Success Rate (ASR) | Verified | Nov 6, 2025 |