Question 1

What is DeepSWE?

Accepted Answer

DeepSWE is a long-horizon software-engineering benchmark with original tasks, broad repository coverage, and behavioral verifiers. Results are reported as % resolved.

Question 2

What does % resolved mean on DeepSWE?

Accepted Answer

DeepSWE reports results as % resolved, expressed as a percentage; higher is better. Scores are shown only within DeepSWE and are never averaged with other benchmarks.
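
As a rough illustration, and assuming % resolved is the share of benchmark tasks whose verifier passes (the answer above does not spell out the formula), the metric could be computed as in the sketch below. The function name and the sample counts are hypothetical and are not DeepSWE data.

```python
def percent_resolved(resolved: int, total: int) -> float:
    """Return resolved tasks as a percentage of all tasks, rounded to 2 decimals.

    Assumption: a task counts as resolved when its verifier passes.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= resolved <= total:
        raise ValueError("resolved must be between 0 and total")
    return round(100.0 * resolved / total, 2)


# Hypothetical counts, for illustration only
print(percent_resolved(7, 20))  # 35.0
```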

Question 3

What is the top reported DeepSWE score?

Accepted Answer

GPT-5.5 has the top reported DeepSWE score: 70.05% resolved.

Question 4

Are community DeepSWE runs official?

Accepted Answer

No. Community runs are independent reproductions shown separately from official scores, each labeled with its source and run caveats, and never merged with the official number.

Question 5

Why do DeepSWE scores differ across runs?

Accepted Answer

Differences in harness, scaffold, reasoning effort, and prompt setup all affect results, so two runs of the same model can differ. evals.report keeps each score with its run context so the differences stay visible.
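
One way to picture "keeping each score with its run context" is a record per run that is grouped by status and never averaged. This is a minimal sketch, not evals.report's actual data model: the class, field names, and helper are hypothetical, and the two sample rows are copied from the leaderboard on this page.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    # Hypothetical field names; values mirror leaderboard columns
    model: str
    score: float       # % resolved
    source_model: str  # source model label, including any run setup in brackets
    status: str        # "Official", "Verified", or "Community"


def split_by_status(records):
    """Group runs by status; scores are kept per run, never merged or averaged."""
    groups = defaultdict(list)
    for record in records:
        groups[record.status].append(record)
    return dict(groups)


runs = [
    RunRecord("DeepSeek V4 Pro", 7.52, "deepseek-v4-pro", "Official"),
    RunRecord("DeepSeek V4 Pro", 5.3, "DeepSeek V4 Pro [reasoning max]", "Community"),
]

grouped = split_by_status(runs)
for status, items in grouped.items():
    for run in items:
        print(f"{status}: {run.model} {run.score}% ({run.source_model})")
```

Because each record carries its own source model label and status, the two DeepSeek V4 Pro results stay side by side instead of collapsing into one number.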

Question 6

Does evals.report rank models across benchmarks?

Accepted Answer

No. DeepSWE scores are shown within their own metric; evals.report never combines benchmarks into a composite ranking or a single "best model".

Model	Lab	Score (% resolved)	Source model	Status	Date
GPT-5.5	OpenAI	70.05%	gpt-5-5	Official	Apr 23, 2026
Claude Opus 4.8	Anthropic	58%	Claude Opus 4.8 [max]	Verified	May 28, 2026
GPT-5.4	OpenAI	55.53%	gpt-5-4	Official	Mar 5, 2026
Claude Opus 4.7	Anthropic	54.20%	claude-opus-4-7	Official	Apr 16, 2026
GLM-5.2 (Open)	Z.ai	46.2%	GLM-5.2	Verified	Jun 16, 2026
Nex-N2-Pro (Open)	Nex AGI	33.6%	Nex-N2-Pro	Verified	Jun 2, 2026
Claude Sonnet 4.6	Anthropic	31.56%	claude-sonnet-4-6	Official	Feb 17, 2026
Gemini 3.5 Flash	Google DeepMind	28.32%	gemini-3-5-flash	Official	May 19, 2026
Claude Opus 4.6	Anthropic	27.06%	claude-opus-4-6	Official	Feb 5, 2026
Kimi K2.6 (Open)	Moonshot AI	23.89%	kimi-k2-6	Official	Apr 20, 2026
GLM-5.1 (Open)	Z.ai	17.48%	glm-5-1	Official	Apr 7, 2026
MiniMax M3 (Open)	MiniMax	13.3%	MiniMax-M3 [default]	Community	Jun 1, 2026
Gemini 3.1 Pro Preview	Google DeepMind	9.88%	gemini-3-1-pro-preview	Official	Feb 19, 2026
Nex-N2-mini (Open)	Nex AGI	8.0%	Nex-N2-mini	Verified	Jun 2, 2026
DeepSeek V4 Pro (Open)	DeepSeek	7.52%	deepseek-v4-pro	Official	Apr 24, 2026
Community run · @ivanfioravanti (X)	—	5.3% (▼2.2)	DeepSeek V4 Pro [reasoning max]	Community	Apr 24, 2026
Gemini 3 Flash	Google DeepMind	5.16%	gemini-3-flash-preview	Official	Dec 17, 2025
Qwen 3.6 Plus	Alibaba / Qwen	2.65%	qwen3-6-plus	Official	Apr 2, 2026
Qwen 3.6 27B (Open)	Alibaba / Qwen	1.79%	Qwen 3.6 27B (FP8)	Community	Apr 22, 2026
Claude Haiku 4.5	Anthropic	0.22%	claude-haiku-4-5	Official	Oct 15, 2025
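
The ▼2.2 marker on the community DeepSeek V4 Pro row appears to be the gap to the official score for the same model. That reading is an assumption, though the arithmetic is consistent with it, as the short sketch below shows. The function name is hypothetical.

```python
def delta_vs_official(community: float, official: float) -> float:
    """Difference in percentage points between a community run and the official score."""
    return round(community - official, 1)


# Values taken from the two DeepSeek V4 Pro rows in the table
print(delta_vs_official(5.3, 7.52))  # -2.2
```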