WeirdML
Tests whether LLMs can do machine learning on novel, unusual datasets. Each model writes and iteratively debugs PyTorch code over 5 feedback rounds in a sandboxed GPU container, and is scored on held-out test accuracy across 19 tasks (6 public, 13 hidden).
Category: Coding. Metric: average accuracy (higher is better).
| Model | Lab | Score | Source model | Status | Date |
|---|---|---|---|---|---|
| GPT-5.5 | OpenAI | 84.9% | gpt-5.5 (xhigh) | Official | — |
| Claude Opus 4.8 | Anthropic | 82.9% | claude-opus-4.8 (xhigh) | Official | — |
| Claude Opus 4.6 | Anthropic | 77.9% | claude-opus-4.6 (high) | Official | — |
| GPT-5.3-Codex | OpenAI | 77.9% | gpt-5.3-codex (xhigh) | Official | — |
| GPT-5.4 | OpenAI | 77.7% | gpt-5.4 (xhigh) | Official | — |
| Claude Opus 4.7 | Anthropic | 76.4% | claude-opus-4.7 (high) | Official | — |
| GPT-5.2 | OpenAI | 72.2% | gpt-5.2 (xhigh) | Official | — |
| Gemini 3.1 Pro Preview | Google DeepMind | 72.1% | gemini-3.1-pro-preview (high) | Official | — |
| Gemini 3 Pro | Google DeepMind | 69.9% | gemini-3-pro-preview (high) | Official | — |
| Claude Sonnet 4.6 | Anthropic | 66.1% | claude-sonnet-4.6 (medium) | Official | — |
| Claude Opus 4.5 | Anthropic | 63.7% | claude-opus-4.5 (high, 16k) | Official | — |
| Gemini 3.5 Flash | Google DeepMind | 62.6% | gemini-3.5-flash (high) | Official | — |
| Gemini 3 Flash | Google DeepMind | 61.6% | gemini-3-flash-preview (high) | Official | — |
| GPT-5.1 | OpenAI | 60.8% | gpt-5.1 (high) | Official | — |
| GPT-5 | OpenAI | 60.7% | gpt-5 (high) | Official | — |
| GLM-5.1 | Z.ai | 57.1% | glm-5.1 | Official | — |
| Kimi K2.6 | Moonshot AI | 55.9% | kimi-k2.6 | Official | — |
| Gemini 2.5 Pro | Google DeepMind | 54.0% | gemini-2.5-pro (thinking 16k) | Official | — |
| GPT-5 mini | OpenAI | 52.7% | gpt-5-mini (high) | Official | — |
| o4-mini | OpenAI | 52.6% | o4-mini (high) | Official | — |
| o3 | OpenAI | 52.4% | o3 (high) | Official | — |
| Grok 4.20 beta reasoning | xAI | 52.3% | grok-4.20-beta | Official | — |
| Grok 4.3 | xAI | 49.9% | grok-4.3 | Official | — |
| DeepSeek V4 Pro | DeepSeek | 48.9% | deepseek-v4-pro (max) | Official | — |
| GLM-5 | Z.ai | 48.2% | glm-5 (thinking) | Official | — |
| GPT-OSS-120B | OpenAI | 48.2% | gpt-oss-120b (high) | Official | — |
| Claude Sonnet 4.5 | Anthropic | 47.7% | claude-sonnet-4.5 (thinking 16k) | Official | — |
| Claude Sonnet 4 | Anthropic | 46.1% | claude-4-sonnet (thinking 16k) | Official | — |
| Claude Opus 4.1 | Anthropic | 45.9% | claude-opus-4.1 (thinking 16k) | Official | — |
| Grok 4 | xAI | 45.7% | grok-4-07-09 | Official | — |
| Kimi K2.5 | Moonshot AI | 45.6% | kimi-k2.5 | Official | — |
| Claude Haiku 4.5 | Anthropic | 45.4% | claude-haiku-4.5 (no thinking) | Official | — |
| Claude Opus 4 | Anthropic | 43.7% | claude-4-opus (thinking 16k) | Official | — |
| Qwen 3 Coder 480B | Alibaba / Qwen | 41.2% | qwen3-coder | Official | — |
| Gemini 2.5 Flash | Google DeepMind | 40.9% | gemini-2.5-flash (thinking 16k) | Official | — |
| Claude 3.5 Sonnet | Anthropic | 40.0% | claude-3.6-sonnet | Official | — |
| DeepSeek V3.2 | DeepSeek | 39.5% | deepseek-v3.2-exp (thinking) | Official | — |
| Kimi K2 Instruct | Moonshot AI | 39.4% | kimi-k2 | Official | — |
| GPT-4.1 | OpenAI | 39.0% | gpt-4.1 | Official | — |
| Qwen3 235B A22B Instruct 2507 | Alibaba / Qwen | 38.7% | qwen3-235b-a22b-07-25 | Official | — |
| DeepSeek R1 | DeepSeek | 36.5% | deepseek-r1 | Official | — |
| DeepSeek V3 0324 | DeepSeek | 36.1% | deepseek-v3-0324 | Official | — |
| Llama 4 Maverick | Meta | 24.5% | llama-4-maverick | Official | — |
Each row reports the model’s average accuracy on WeirdML.
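To make the scoring concrete, here is a minimal sketch of how a per-model score could be assembled from the description above: up to 5 feedback rounds per task, then an average over tasks. It rests on assumptions not confirmed by this page: that the best round's test accuracy counts for each task, that the feedback passed between rounds includes the previous accuracy, and that tasks are weighted equally. The names `run_task`, `average_accuracy`, and `make_attempt`, and the canned task data, are hypothetical stand-ins, not part of the benchmark's actual harness.

```python
from statistics import mean
from typing import Callable, Dict, List

NUM_ROUNDS = 5  # feedback rounds per task, as stated in the benchmark description


def run_task(attempt: Callable[[int, str], float], rounds: int = NUM_ROUNDS) -> float:
    """Run `rounds` attempts on one task and return the best test accuracy.

    `attempt(round_index, feedback)` is a hypothetical stand-in for
    "the model writes code, the sandbox runs it, accuracy is measured".
    Keeping the best round is an assumption of this sketch.
    """
    best = 0.0
    feedback = ""
    for i in range(rounds):
        accuracy = attempt(i, feedback)
        best = max(best, accuracy)
        # Assumed feedback format: the previous round's result as plain text.
        feedback = f"round {i + 1}: test accuracy {accuracy:.3f}"
    return best


def average_accuracy(per_task: Dict[str, float]) -> float:
    """Unweighted mean over tasks (equal weighting is an assumption)."""
    return mean(list(per_task.values()))


def make_attempt(scores: List[float]) -> Callable[[int, str], float]:
    """Build a fake attempt function that replays canned accuracies."""
    return lambda round_index, feedback: scores[round_index]


# Made-up accuracies for two illustrative tasks, one value per round.
canned = {
    "task_a": [0.20, 0.50, 0.40, 0.60, 0.55],
    "task_b": [0.00, 0.00, 0.80, 0.90, 0.85],
}

per_task = {name: run_task(make_attempt(scores)) for name, scores in canned.items()}
score = average_accuracy(per_task)
print(f"{score:.1%}")  # prints 75.0%
```

Under these assumptions, a failed early round (accuracy 0.0, for example from code that crashes) does not hurt the final score as long as a later round recovers, which is the behaviour iterative debugging is meant to reward.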