What does % resolved mean on DeepSWE?

DeepSWE reports % resolved: the share of tasks whose verifier passes. Higher is better. Scores are shown only within DeepSWE and are never averaged with other benchmarks.

What is the top reported DeepSWE score?

GPT-5.5 has the top reported score on DeepSWE: 70.05% (% resolved).

Are community DeepSWE runs official?

No. Community runs are independent reproductions shown separately from official scores, each labeled with its source and run caveats, and never merged with the official number.

Why do DeepSWE scores differ across runs?

Harness, scaffold, reasoning effort, and prompt setup change results, so two runs of the same model can differ. evals.report keeps each score with its run context so the differences stay visible.

Does evals.report rank models across benchmarks?

No. DeepSWE scores are shown within their own metric; evals.report never combines benchmarks into a composite ranking or a single "best model".


DeepSWE

A long-horizon software-engineering benchmark with original tasks, broad repository coverage, and behavioral verifiers.

Coding · % resolved · Higher is better


DeepSWE is a 113-task long-horizon software-engineering benchmark covering TypeScript, Go, Python, JavaScript, and Rust. Tasks use the Harbor task format and are graded by program-based behavioral verifiers. To run it, clone the public deep-swe task repo and execute it with the datacurve-pier runner ('pier'), which drives either mini-swe-agent (model-agnostic) or a CLI agent (claude-code, codex, gemini-cli, opencode) in isolated Docker or Modal sandboxes. Each task's verifier runs after the agent finishes and emits pass or fail; the aggregate pass rate is the headline % resolved.

Record the following with any score: the agent (mini-swe-agent for leaderboard parity), the exact --model name, the environment (docker or modal), the task count and sample seed if you ran a subset, and the pier and deep-swe commits. The official leaderboard uses mini-swe-agent on Modal, so a run that differs on any of these is not directly comparable to it.

Benchmark: DeepSWE

Repository: github.com/datacurve-ai/deep-swe

Dataset: github.com/datacurve-ai/deep-swe

Metric: % resolved

1. Install

shell

git clone https://github.com/datacurve-ai/deep-swe

shell

uv tool install datacurve-pier

2. Run evaluation

shell

export ANTHROPIC_API_KEY=...

shell

pier run -p deep-swe/tasks --agent mini-swe-agent --model anthropic/claude-opus-4-7

shell

# OpenAI model instead:
export OPENAI_API_KEY=...

shell

pier run -p deep-swe/tasks --agent mini-swe-agent --model openai/gpt-5.5

shell

# Deterministic random subset (e.g. 10 of 113 tasks):
pier run -p deep-swe/tasks --agent mini-swe-agent --n-tasks 10 --sample-seed 0
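Why the seed belongs in the run context can be illustrated with a small sketch of seeded sampling. This is not pier's sampler: the function name `sample_tasks` and the task ids are hypothetical, and pier's own selection for a given `--sample-seed` may differ. The sketch only shows that a fixed seed over a fixed, ordered task list yields the same subset every time.

```python
import random


def sample_tasks(task_ids: list[str], n_tasks: int, seed: int) -> list[str]:
    """Pick n_tasks ids reproducibly (illustrative; not pier's actual sampler)."""
    # Sort first so the result does not depend on directory listing order.
    ordered = sorted(task_ids)
    # A private Random instance keeps the draw independent of global RNG state.
    return random.Random(seed).sample(ordered, n_tasks)


# Hypothetical ids standing in for the 113 task directories.
tasks = [f"task-{i:03d}" for i in range(113)]
subset = sample_tasks(tasks, n_tasks=10, seed=0)

# The same seed gives the same subset even if the input order changes.
print(subset == sample_tasks(list(reversed(tasks)), n_tasks=10, seed=0))
```

A score from a subset is only reproducible if the seed, the subset size, and the task list (that is, the deep-swe commit) are all reported together.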

shell

# Single task:
pier run -p deep-swe/tasks/<task-id> --agent mini-swe-agent

shell

# Run in parallel Modal sandboxes (leaderboard parity):
pier run -p deep-swe/tasks --agent mini-swe-agent --env modal --env-file .env

3. Expected output

Per-task trials are written under jobs/<timestamp_or_name>/<trial_id>/. Grading is part of 'pier run': after the agent finishes, the task's program-based verifier (tests/test.sh, with tests/test.patch applied at grading time) exercises the described behavior through public APIs and observable outputs, then records pass/fail for that trial. The verifier accepts any solution with correct observable behavior; the held-out reference patch in solution/ is never used at grading time.

The headline metric is the aggregate pass rate (% resolved) over the 113 tasks. The official leaderboard reports it with confidence intervals (e.g. 'gpt-5.5 [xhigh] 70%±4%', 'claude-opus-4.7 [max] 54%±5%'). Treat your own pass rate as benchmark-local: do not compare it across different agents or environments, or against other SWE benchmarks (e.g. SWE-bench).
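As a rough sketch of the arithmetic behind a headline number, the snippet below turns a list of per-task pass/fail outcomes into % resolved with one binomial standard error. It assumes one trial per task and independent tasks; the function name `resolved_rate` is hypothetical, and the method behind the leaderboard's intervals is not stated here. For orientation, one binomial standard error at 113 tasks is about 4.3 points at a 70% pass rate and about 4.7 points at 54%, which is close to the quoted ±4% and ±5% but does not confirm how they were computed.

```python
import math


def resolved_rate(outcomes: list[bool]) -> tuple[float, float]:
    """Return (% resolved, one binomial standard error in percentage points)."""
    n = len(outcomes)
    if n == 0:
        raise ValueError("no trials to aggregate")
    p = sum(outcomes) / n            # fraction of tasks whose verifier passed
    se = math.sqrt(p * (1 - p) / n)  # binomial standard error of that fraction
    return 100 * p, 100 * se


# Hypothetical run: 79 of 113 tasks pass their verifier.
outcomes = [True] * 79 + [False] * 34
rate, se = resolved_rate(outcomes)
print(f"{rate:.1f}% ± {se:.1f}%")
```

On a 10-task subset the same formula gives an uncertainty of roughly 14 to 16 points for mid-range pass rates, which is why subset scores should be labeled as such.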

4. Submit results

There is no public auto-submission flow. To appear on the official DeepSWE leaderboard, contact serena@datacurve.ai with your results (per the /run page). Keep the run context attached to any reported number: agent (use mini-swe-agent for leaderboard comparability), exact --model name and reasoning effort, --env (docker vs modal), whether you ran the full 113-task corpus or a --n-tasks/--sample-seed subset, and the pier + deep-swe commit hashes. Official leaderboard scores were produced with Pier running mini-swe-agent on Modal.
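The run context listed above can be kept next to each score as a small structured record. The sketch below is one hypothetical shape for it: the class `RunContext`, its field names, and the placeholder commit strings are assumptions for illustration, not a format that pier or the leaderboard defines.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunContext:
    """Hypothetical record of the settings that make a score interpretable."""
    agent: str
    model: str
    reasoning_effort: str | None
    env: str                  # "docker" or "modal"
    n_tasks: int
    sample_seed: int | None   # None when the full corpus was run
    pier_commit: str
    deep_swe_commit: str

    def is_full_corpus(self, corpus_size: int = 113) -> bool:
        # A full run covers every task and involves no sampling.
        return self.n_tasks == corpus_size and self.sample_seed is None


ctx = RunContext(
    agent="mini-swe-agent",
    model="openai/gpt-5.5",
    reasoning_effort="xhigh",
    env="modal",
    n_tasks=113,
    sample_seed=None,
    pier_commit="<pier-commit>",          # placeholder, fill in from git
    deep_swe_commit="<deep-swe-commit>",  # placeholder, fill in from git
)
print(json.dumps(asdict(ctx), indent=2))
```

Freezing the dataclass keeps the record from being edited after the run, so the context stored with a score is the one it was actually produced under.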

Gotchas

Tasks run in sandboxes that need Docker (--env docker, the default) or Modal (--env modal), so configure your container runtime or Modal account before running. Tasks set allow_internet=false and pull prebuilt images from public ECR (task.toml shows docker_image = public.ecr.aws/.../swe-bench-202605:<ext_id>); the environment/Dockerfile is a rebuild fallback.

There is no separate score script: grading is built into 'pier run' and trials land under jobs/<...>/<trial_id>/. Pier does not auto-emit a single aggregate pass-rate file, so you must aggregate pass/fail across trials yourself; use 'pier view' / 'pier job' to inspect results.
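A minimal sketch of that manual aggregation follows. How a trial directory records pass/fail is not specified here, so the function takes a caller-supplied `read_outcome` callback rather than assuming a file name. `tally_trials`, `demo_reader`, and the `outcome.txt` marker used in the demo are all hypothetical; the marker is a made-up stand-in, not a pier artifact.

```python
import tempfile
from collections import Counter
from pathlib import Path
from typing import Callable, Optional


def tally_trials(job_dir: Path,
                 read_outcome: Callable[[Path], Optional[bool]]) -> Counter:
    """Count passed / failed / unknown trials under one job directory."""
    counts = Counter(passed=0, failed=0, unknown=0)
    for trial_dir in sorted(p for p in job_dir.iterdir() if p.is_dir()):
        outcome = read_outcome(trial_dir)  # True, False, or None if unreadable
        if outcome is None:
            counts["unknown"] += 1
        elif outcome:
            counts["passed"] += 1
        else:
            counts["failed"] += 1
    return counts


def demo_reader(trial_dir: Path) -> Optional[bool]:
    # Made-up marker file for this demo only; replace with a reader that
    # matches what your pier version actually writes per trial.
    marker = trial_dir / "outcome.txt"
    if not marker.exists():
        return None
    return marker.read_text().strip() == "pass"


with tempfile.TemporaryDirectory() as tmp:
    job = Path(tmp)
    for name, text in [("t1", "pass"), ("t2", "fail"), ("t3", None)]:
        (job / name).mkdir()
        if text is not None:
            (job / name / "outcome.txt").write_text(text)
    counts = tally_trials(job, demo_reader)
    print(dict(counts))
```

Counting unknown trials separately makes it visible when a crashed or missing trial would otherwise silently shrink the denominator.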

For leaderboard parity every model is run through mini-swe-agent (same bash tool and shared prompt, no per-vendor editing primitives). Switching to --agent claude-code/codex/gemini-cli/opencode changes the scaffold and makes scores non-comparable to the official leaderboard.

mini-swe-agent picks its adapter from the model-name prefix (openai/... -> litellm_response, openrouter/... -> openrouter BYOK, else LiteLLM auto), so the --model prefix matters.

Resource and timeout limits (per task.toml: verifier 1800s, agent 5400s, 2 CPU / 8GB) mean long-horizon tasks can be slow and costly.
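The prefix rule and the timeout budget can both be sketched in a few lines. This restates the description above and is not mini-swe-agent's code: `adapter_for` and `worst_case_hours` are hypothetical names, the returned adapter labels are descriptive strings only, and the time estimate assumes every task hits both timeouts and that tasks run in equal-sized parallel batches.

```python
def adapter_for(model: str) -> str:
    """Mirror the prefix rule described above (labels are descriptive only)."""
    if model.startswith("openai/"):
        return "litellm_response"
    if model.startswith("openrouter/"):
        return "openrouter"
    return "litellm-auto"


def worst_case_hours(n_tasks: int, agent_s: int = 5400, verifier_s: int = 1800,
                     parallel: int = 1) -> float:
    """Upper bound on wall-clock hours if every task hits both timeouts."""
    batches = -(-n_tasks // parallel)  # ceiling division
    return batches * (agent_s + verifier_s) / 3600


print(adapter_for("openai/gpt-5.5"))             # litellm_response
print(adapter_for("anthropic/claude-opus-4-7"))  # litellm-auto
print(worst_case_hours(113))                     # 226.0
print(worst_case_hours(113, parallel=16))        # 16.0
```

The bound of 226 sequential hours for the full corpus is a ceiling rather than an estimate, but it shows why parallel sandboxes matter for full runs.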