evals.report

DeepSeek V3.1

DeepSeek · DeepSeek V3. Released Aug 21, 2025.

Benchmark results 19

| Benchmark | Category | Score | Metric | Status | Date |
| --- | --- | --- | --- | --- | --- |
| SWE-bench Verified | Coding | 66.0% | % resolved | Verified | Aug 21, 2025 |
| GPQA Diamond | Reasoning | 80.1% | accuracy | Verified | Aug 21, 2025 |
| Humanity's Last Exam | Reasoning | 15.9% | accuracy | Verified | Aug 21, 2025 |
| Artificial Analysis Intelligence Index | Reasoning | 28.1 | Index | Unverified | Aug 21, 2025 |
| Epoch Capabilities Index | Reasoning | 138.9 | Index | Official | Aug 21, 2025 |
| Aider Polyglot | Coding | 68.4% | % correct | Unverified | Aug 21, 2025 |
| MMLU-Pro | Reasoning | 85.1% | accuracy | Verified | Aug 21, 2025 |
| GAIA: A Benchmark for General AI Assistants | Agents | 11.5% | accuracy | Unverified | Aug 21, 2025 |
| GDPval | Agents | 1080 | Elo | Official | Aug 21, 2025 |
| LiveCodeBench | Coding | 57.7% | Pass@1 | Unverified | Aug 21, 2025 |
| SciCode | Coding | 36.7% | accuracy | Unverified | Aug 21, 2025 |
| MultiChallenge | Reasoning | 46.10% | accuracy | Verified | Aug 21, 2025 |
| Global-MMLU | Reasoning | 82.7% | accuracy | Unverified | Aug 21, 2025 |
| EQ-Bench Creative Writing v3 | Chat preference | 1420 | Elo | Verified | Aug 21, 2025 |
| Design Arena | Chat preference | 1166 | Elo | Verified | Aug 21, 2025 |
| MASK (Model Alignment between Statements and Knowledge) | Other | 46.27 | Honesty score | Verified | Aug 21, 2025 |
| MCP-Universe | Tool use | 22.08% | Overall Success Rate | Verified | Aug 21, 2025 |
| Vectara Hallucination Leaderboard | Other | 5.5% | Hallucination Rate | Official | Aug 21, 2025 |
| Gray Swan Arena (Agent Red-Teaming / Indirect Prompt Injection) | Agents | 5.4% | Attack Success Rate (ASR) | Verified | Aug 21, 2025 |