How to run DeepSWE — benchmark guide

DeepSWE is a 113-task, long-horizon software-engineering benchmark covering TypeScript, Go, Python, JavaScript, and Rust. It uses the Harbor task format with program-based behavioral verifiers. You clone the public deep-swe task repo and run it with the datacurve-pier runner ('pier'), which drives either mini-swe-agent (model-agnostic) or a CLI agent (claude-code, codex, gemini-cli, opencode) in isolated Docker or Modal sandboxes. Each task's verifier runs after the agent finishes and emits pass or fail; the aggregate pass rate is the headline % resolved. The official leaderboard uses mini-swe-agent on Modal, so keep the following attached to any score you report: the agent (mini-swe-agent for leaderboard parity), the exact --model name, the environment (docker vs modal), the task count and --sample-seed if you ran a subset, and the pier and deep-swe commits.

Benchmark: DeepSWE

Repository: github.com/datacurve-ai/deep-swe

Dataset: github.com/datacurve-ai/deep-swe

Metric: % resolved

1. Install

shell

git clone https://github.com/datacurve-ai/deep-swe

shell

uv tool install datacurve-pier

2. Run evaluation

shell

export ANTHROPIC_API_KEY=...

shell

pier run -p deep-swe/tasks --agent mini-swe-agent --model anthropic/claude-opus-4-7

shell

# OpenAI model instead:
export OPENAI_API_KEY=...

shell

pier run -p deep-swe/tasks --agent mini-swe-agent --model openai/gpt-5.5

shell

# Deterministic random subset (e.g. 10 of 113 tasks):
pier run -p deep-swe/tasks --agent mini-swe-agent --n-tasks 10 --sample-seed 0

shell

# Single task:
pier run -p deep-swe/tasks/<task-id> --agent mini-swe-agent

shell

# Run in parallel Modal sandboxes (leaderboard parity):
pier run -p deep-swe/tasks --agent mini-swe-agent --env modal --env-file .env

3. Expected output

Per-task trials are written under jobs/<timestamp_or_name>/<trial_id>/. Grading is folded into 'pier run': after the agent finishes, the task's program-based verifier (tests/test.sh, with tests/test.patch applied at grading time) exercises the described behavior through public APIs and observable outputs, then records pass or fail for that trial. The verifier accepts any solution with correct observable behavior; the held-out solution/ reference patch is never used at grading time. The headline metric is the aggregate pass rate (% resolved) over the 113 tasks, which the official leaderboard reports with confidence intervals (e.g. 'gpt-5.5 [xhigh] 70%±4%', 'claude-opus-4.7 [max] 54%±5%'). Treat your own pass rate as benchmark-local: do not compare it across different agents or environments, or against other SWE benchmarks (e.g. SWE-bench).
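
To make the "% resolved with a confidence interval" idea concrete, here is a minimal Python sketch. It assumes a normal-approximation (Wald) interval over one pass/fail outcome per task; this guide does not say how the leaderboard computes its intervals, so the method, the function name pass_rate_with_ci, and the sample counts are all illustrative assumptions.

```python
import math

def pass_rate_with_ci(outcomes, z=1.96):
    """Return (rate, half_width) for a list of pass/fail booleans.

    ASSUMPTION: uses a normal-approximation interval with z=1.96 (~95%).
    The leaderboard's actual interval method is not documented here.
    """
    n = len(outcomes)
    if n == 0:
        raise ValueError("no outcomes to aggregate")
    rate = sum(outcomes) / n
    half_width = z * math.sqrt(rate * (1 - rate) / n)
    return rate, half_width

# Hypothetical example: 61 passes out of 113 tasks (made-up counts).
outcomes = [True] * 61 + [False] * 52
rate, half_width = pass_rate_with_ci(outcomes)
print(f"{rate:.0%} ± {half_width:.0%} resolved (n={len(outcomes)})")
```

With a single trial per task, an interval computed this way will not necessarily match the widths shown on the leaderboard, which may rest on a different method or on repeated trials.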

4. Submit results

There is no public auto-submission flow. To appear on the official DeepSWE leaderboard, contact serena@datacurve.ai with your results (per the /run page). Keep the run context attached to any reported number: agent (use mini-swe-agent for leaderboard comparability), exact --model name and reasoning effort, --env (docker vs modal), whether you ran the full 113-task corpus or a --n-tasks/--sample-seed subset, and the pier + deep-swe commit hashes. Official leaderboard scores were produced with Pier running mini-swe-agent on Modal.
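
One way to keep that run context attached to a score is to store it next to the number as a small record. The sketch below is only an illustration: RunContext and its field names are hypothetical, not a submission schema defined by Datacurve, and the commit values are placeholders to fill in yourself.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class RunContext:
    """Hypothetical record of the settings that make a score reproducible."""
    agent: str                        # e.g. "mini-swe-agent"
    model: str                        # exact --model string
    reasoning_effort: Optional[str]   # e.g. "xhigh", or None if not set
    env: str                          # "docker" or "modal"
    n_tasks: int                      # 113 for the full corpus
    sample_seed: Optional[int]        # None unless a subset was sampled
    pier_commit: str
    deep_swe_commit: str

    def is_full_corpus(self, corpus_size: int = 113) -> bool:
        """True when every task was run and no subset sampling was used."""
        return self.n_tasks == corpus_size and self.sample_seed is None

context = RunContext(
    agent="mini-swe-agent",
    model="openai/gpt-5.5",
    reasoning_effort="xhigh",
    env="modal",
    n_tasks=113,
    sample_seed=None,
    pier_commit="<pier-commit>",          # placeholder, fill in
    deep_swe_commit="<deep-swe-commit>",  # placeholder, fill in
)
print(json.dumps(asdict(context), indent=2))
```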

Gotchas

Tasks run in sandboxes, so you need Docker (--env docker, the default) or Modal (--env modal). Tasks set allow_internet=false and pull prebuilt images from public ECR (task.toml shows docker_image = public.ecr.aws/.../swe-bench-202605:<ext_id>), with environment/Dockerfile as a rebuild fallback. Configure your container runtime or Modal account before running.

There is no separate score script: grading is built into 'pier run' and trials land under jobs/<...>/<trial_id>/. Pier does not auto-emit a single aggregate pass-rate file, so you must aggregate pass/fail across trials yourself; use 'pier view' / 'pier job' to inspect results.
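
Since the aggregation is left to you, a small script over the job directory is usually enough. The Python sketch below assumes each trial directory contains a JSON file with a boolean pass field. The file name result.json and the key passed are hypothetical: this guide does not document Pier's per-trial output, so inspect a real trial directory and adjust both constants first.

```python
import json
from pathlib import Path

# ASSUMPTION: hypothetical names, not confirmed Pier output.
RESULT_FILE = "result.json"
PASS_KEY = "passed"

def collect_outcomes(job_dir):
    """Map trial_id -> True/False for each trial directory under job_dir.

    Trials with no result file are counted as failures, so crashed or
    timed-out trials cannot inflate the pass rate.
    """
    outcomes = {}
    for trial_dir in sorted(Path(job_dir).iterdir()):
        if not trial_dir.is_dir():
            continue  # ignore stray files at the job level
        result_path = trial_dir / RESULT_FILE
        if result_path.is_file():
            data = json.loads(result_path.read_text())
            outcomes[trial_dir.name] = bool(data.get(PASS_KEY, False))
        else:
            outcomes[trial_dir.name] = False
    return outcomes

def percent_resolved(outcomes):
    """Aggregate pass rate in percent over all collected trials."""
    if not outcomes:
        raise ValueError("no trial results found")
    return 100.0 * sum(outcomes.values()) / len(outcomes)
```

Calling collect_outcomes on a directory such as jobs/<timestamp_or_name> and passing the result to percent_resolved gives one headline number. Counting missing results as failures is a deliberate, conservative choice; skip them instead if you intend to re-run those trials.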

For leaderboard parity every model is run through mini-swe-agent (same bash tool and shared prompt, no per-vendor editing primitives). Switching to --agent claude-code/codex/gemini-cli/opencode changes the scaffold and makes scores non-comparable to the official leaderboard.

mini-swe-agent picks its adapter from the model-name prefix (openai/... -> litellm_response, openrouter/... -> openrouter BYOK, anything else -> LiteLLM auto), so the --model prefix matters. Separately, the per-task limits in task.toml (verifier 1800s, agent 5400s, 2 CPU / 8GB) mean long-horizon tasks can be slow and costly.
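
The prefix rule can be pictured as a simple lookup. This Python sketch only restates the mapping described above; adapter_for and the returned labels are illustrative and are not taken from mini-swe-agent's source.

```python
def adapter_for(model: str) -> str:
    """Illustrative restatement of the prefix rule, not mini-swe-agent code."""
    if model.startswith("openai/"):
        return "litellm_response"
    if model.startswith("openrouter/"):
        return "openrouter (BYOK)"
    return "litellm (auto)"  # every other prefix falls through to LiteLLM

for name in ("openai/gpt-5.5", "anthropic/claude-opus-4-7"):
    print(f"{name} -> {adapter_for(name)}")
```

The practical consequence is that the same underlying model can be routed differently depending on how the --model string is written, so record that string exactly as you passed it.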