SWE-Lancer
An OpenAI benchmark of over 1,400 real freelance software engineering tasks from Upwork, worth $1M in total payouts. Models either act as individual contributor (IC) software engineers, writing code patches that are graded by end-to-end tests, or act as SWE managers, selecting the best technical proposal for a task. Performance is measured by task pass rate and dollars earned.
Coding · IC SWE pass rate (Diamond) · Higher is better
| Model | Lab | Score | Source model | Status | Date |
|---|---|---|---|---|---|
| Claude 3.5 Sonnet | Anthropic | 26.2% | — | Verified | Jun 20, 2024 |
| GPT-4o | OpenAI | 8.6% | — | Verified | May 13, 2024 |
Each row reports the model’s IC SWE pass rate (Diamond) on SWE-Lancer.
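The two metrics named in the description, task pass rate and dollars earned, can be illustrated with a minimal sketch. This is not the benchmark's actual grading harness: the `TaskResult` type, the helper functions, and the task IDs and payout values are all hypothetical. It assumes each task has a fixed payout that counts as earned only when the task passes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TaskResult:
    """Hypothetical record for one graded task (not the benchmark's schema)."""
    task_id: str
    payout_usd: float  # amount the task pays if completed
    passed: bool       # True if the submission passed grading


def pass_rate(results: List[TaskResult]) -> float:
    """Fraction of tasks that passed; 0.0 for an empty list."""
    if not results:
        return 0.0
    return sum(1 for r in results if r.passed) / len(results)


def dollars_earned(results: List[TaskResult]) -> float:
    """Sum of payouts over passed tasks only (assumed all-or-nothing)."""
    return sum(r.payout_usd for r in results if r.passed)


# Made-up example data, for illustration only.
results = [
    TaskResult("task-a", 250.0, True),
    TaskResult("task-b", 1000.0, False),
    TaskResult("task-c", 500.0, True),
    TaskResult("task-d", 250.0, False),
]

print(f"pass rate: {pass_rate(results):.1%}")
print(f"dollars earned: ${dollars_earned(results):,.2f}")
```

The two numbers can diverge: a model that passes many low-value tasks may show a higher pass rate but lower earnings than one that passes fewer high-value tasks.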