DeepSeek V3.1

DeepSeek · DeepSeek V3. Released Aug 21, 2025.

DeepSeek V3.1 is a model from DeepSeek in the DeepSeek V3 family, released Aug 21, 2025. evals.report tracks 20 reported DeepSeek V3.1 benchmark scores across SWE-bench Verified, GPQA Diamond, Humanity's Last Exam, Artificial Analysis Intelligence Index, Epoch Capabilities Index, Aider Polyglot, MMLU-Pro, GAIA: A Benchmark for General AI Assistants, and 12 more. Each score is shown with its benchmark, metric, source status, and date, and scores are never combined into a single ranking.

Benchmark results (20)

Benchmark	Category	Score	Metric	Status	Date
SWE-bench Verified	Coding	66.0%	% resolved	Verified	Aug 21, 2025
GPQA Diamond	Reasoning	80.1%	accuracy	Verified	Aug 21, 2025
Humanity's Last Exam	Reasoning	15.9%	accuracy	Verified	Aug 21, 2025
Artificial Analysis Intelligence Index	Reasoning	28.1	Index	Unverified	Aug 21, 2025
Epoch Capabilities Index	Reasoning	138.9	Index	Official	Aug 21, 2025
Aider Polyglot	Coding	68.4%	% correct	Unverified	Aug 21, 2025
MMLU-Pro	Reasoning	85.1%	accuracy	Verified	Aug 21, 2025
GAIA: A Benchmark for General AI Assistants	Agents	11.5%	accuracy	Unverified	Aug 21, 2025
GDPval	Agents	1080	Elo	Official	Aug 21, 2025
LiveCodeBench	Coding	57.7%	Pass@1	Unverified	Aug 21, 2025
SciCode	Coding	36.7%	accuracy	Unverified	Aug 21, 2025
MultiChallenge	Reasoning	46.10%	accuracy	Verified	Aug 21, 2025
Global-MMLU	Reasoning	82.7%	accuracy	Unverified	Aug 21, 2025
EQ-Bench Creative Writing v3	Chat preference	1420	Elo	Verified	Aug 21, 2025
Design Arena	Chat preference	1166	Elo	Verified	Aug 21, 2025
MASK (Model Alignment between Statements and Knowledge)	Other	46.27	Honesty score	Verified	Aug 21, 2025
MCP-Universe	Tool use	22.08%	Overall Success Rate	Verified	Aug 21, 2025
Vectara Hallucination Leaderboard	Other	5.5%	Hallucination Rate	Official	Aug 21, 2025
Gray Swan Arena (Agent Red-Teaming / Indirect Prompt Injection)	Agents	5.4%	Attack Success Rate (ASR)	Verified	Aug 21, 2025
MultiNRC	Reasoning	23.60%	accuracy	Official	Aug 21, 2025
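
The table mixes percentages, Elo ratings, and index values, which is one reason the scores are listed separately rather than merged into one ranking. As a rough illustration only, the Python sketch below parses a few rows copied from the table and groups them by metric. The function names `parse_rows` and `group_by_metric` are hypothetical and are not part of evals.report, and the tab-separated layout is assumed from the listing above.

```python
from collections import defaultdict

# A few rows copied from the table above, tab-separated in the order:
# benchmark, category, score, metric, status, date.
ROWS = """\
SWE-bench Verified\tCoding\t66.0%\t% resolved\tVerified\tAug 21, 2025
GDPval\tAgents\t1080\tElo\tOfficial\tAug 21, 2025
Epoch Capabilities Index\tReasoning\t138.9\tIndex\tOfficial\tAug 21, 2025
Vectara Hallucination Leaderboard\tOther\t5.5%\tHallucination Rate\tOfficial\tAug 21, 2025
"""

FIELDS = ("benchmark", "category", "score", "metric", "status", "date")


def parse_rows(text):
    """Parse tab-separated rows into dicts; any extra trailing fields are ignored."""
    results = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        parts = line.split("\t")
        results.append(dict(zip(FIELDS, parts[:len(FIELDS)])))
    return results


def group_by_metric(results):
    """Group (benchmark, score) pairs by metric name; scores stay as raw strings."""
    grouped = defaultdict(list)
    for row in results:
        grouped[row["metric"]].append((row["benchmark"], row["score"]))
    return dict(grouped)


results = parse_rows(ROWS)
by_metric = group_by_metric(results)
for metric, entries in by_metric.items():
    print(metric, entries)
```

Scores are kept as strings on purpose: converting "66.0%", "1080", and "138.9" to one numeric scale would imply a comparability that the listing does not claim.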