evals.report

WeirdML

Tests whether LLMs can do machine learning on novel, unusual datasets. Each model writes and iteratively debugs PyTorch code over 5 feedback rounds in a sandboxed GPU container, and is scored on held-out test accuracy across 19 tasks (6 public, 13 hidden).
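The write-then-debug loop described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual harness: `run_feedback_rounds`, `propose`, and `evaluate` are hypothetical names, and the stubs stand in for a real model call and a real sandboxed training run.

```python
from typing import Callable, List, Tuple


def run_feedback_rounds(
    propose: Callable[[str], str],
    evaluate: Callable[[str], Tuple[float, str]],
    rounds: int = 5,
) -> List[float]:
    """Run `rounds` submit-and-feedback iterations and return the accuracy of each.

    `propose` maps the previous round's feedback to new code (hypothetical
    stand-in for the model). `evaluate` runs that code and returns
    (test accuracy, feedback text) (hypothetical stand-in for the sandbox).
    """
    feedback = ""  # assumption: the first round starts with no feedback
    accuracies: List[float] = []
    for _ in range(rounds):
        code = propose(feedback)
        accuracy, feedback = evaluate(code)
        accuracies.append(accuracy)
    return accuracies


# Toy stubs so the sketch runs without a model or a GPU.
def stub_propose(feedback: str) -> str:
    # Pretend each round of feedback leads to a slightly longer program.
    return feedback + "x"


def stub_evaluate(code: str) -> Tuple[float, str]:
    # Made-up scoring rule: accuracy grows with program length, capped at 1.0.
    accuracy = min(1.0, len(code) / 5)
    return accuracy, code


history = run_feedback_rounds(stub_propose, stub_evaluate)
print(history)
```

How the per-round accuracies are combined into a task score is not stated here, so the sketch only returns the full history and leaves that choice open.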

Coding · average accuracy · Higher is better
| Model | Lab | Score | Source model | Status |
| --- | --- | --- | --- | --- |
| GPT-5.5 | OpenAI | 84.9% | gpt-5.5 (xhigh) | Official |
| Claude Opus 4.8 | Anthropic | 82.9% | claude-opus-4.8 (xhigh) | Official |
| Claude Opus 4.6 | Anthropic | 77.9% | claude-opus-4.6 (high) | Official |
| GPT-5.3-Codex | OpenAI | 77.9% | gpt-5.3-codex (xhigh) | Official |
| GPT-5.4 | OpenAI | 77.7% | gpt-5.4 (xhigh) | Official |
| Claude Opus 4.7 | Anthropic | 76.4% | claude-opus-4.7 (high) | Official |
| GPT-5.2 | OpenAI | 72.2% | gpt-5.2 (xhigh) | Official |
| Gemini 3.1 Pro Preview | Google DeepMind | 72.1% | gemini-3.1-pro-preview (high) | Official |
| Gemini 3 Pro | Google DeepMind | 69.9% | gemini-3-pro-preview (high) | Official |
| Claude Sonnet 4.6 | Anthropic | 66.1% | claude-sonnet-4.6 (medium) | Official |
| Claude Opus 4.5 | Anthropic | 63.7% | claude-opus-4.5 (high, 16k) | Official |
| Gemini 3.5 Flash | Google DeepMind | 62.6% | gemini-3.5-flash (high) | Official |
| Gemini 3 Flash | Google DeepMind | 61.6% | gemini-3-flash-preview (high) | Official |
| GPT-5.1 | OpenAI | 60.8% | gpt-5.1 (high) | Official |
| GPT-5 | OpenAI | 60.7% | gpt-5 (high) | Official |
| GLM-5.1 | Z.ai | 57.1% | glm-5.1 | Official |
| Kimi K2.6 | Moonshot AI | 55.9% | kimi-k2.6 | Official |
| Gemini 2.5 Pro | Google DeepMind | 54.0% | gemini-2.5-pro (thinking 16k) | Official |
| GPT-5 mini | OpenAI | 52.7% | gpt-5-mini (high) | Official |
| o4-mini | OpenAI | 52.6% | o4-mini (high) | Official |
| o3 | OpenAI | 52.4% | o3 (high) | Official |
| Grok 4.20 beta reasoning | xAI | 52.3% | grok-4.20-beta | Official |
| Grok 4.3 | xAI | 49.9% | grok-4.3 | Official |
| DeepSeek V4 Pro | DeepSeek | 48.9% | deepseek-v4-pro (max) | Official |
| GLM-5 | Z.ai | 48.2% | glm-5 (thinking) | Official |
| GPT-OSS-120B | OpenAI | 48.2% | gpt-oss-120b (high) | Official |
| Claude Sonnet 4.5 | Anthropic | 47.7% | claude-sonnet-4.5 (thinking 16k) | Official |
| Claude Sonnet 4 | Anthropic | 46.1% | claude-4-sonnet (thinking 16k) | Official |
| Claude Opus 4.1 | Anthropic | 45.9% | claude-opus-4.1 (thinking 16k) | Official |
| Grok 4 | xAI | 45.7% | grok-4-07-09 | Official |
| Kimi K2.5 | Moonshot AI | 45.6% | kimi-k2.5 | Official |
| Claude Haiku 4.5 | Anthropic | 45.4% | claude-haiku-4.5 (no thinking) | Official |
| Claude Opus 4 | Anthropic | 43.7% | claude-4-opus (thinking 16k) | Official |
| Qwen 3 Coder 480B | Alibaba / Qwen | 41.2% | qwen3-coder | Official |
| Gemini 2.5 Flash | Google DeepMind | 40.9% | gemini-2.5-flash (thinking 16k) | Official |
| Claude 3.5 Sonnet | Anthropic | 40.0% | claude-3.6-sonnet | Official |
| DeepSeek V3.2 | DeepSeek | 39.5% | deepseek-v3.2-exp (thinking) | Official |
| Kimi K2 Instruct | Moonshot AI | 39.4% | kimi-k2 | Official |
| GPT-4.1 | OpenAI | 39.0% | gpt-4.1 | Official |
| Qwen3 235B A22B Instruct 2507 | Alibaba / Qwen | 38.7% | qwen3-235b-a22b-07-25 | Official |
| DeepSeek R1 | DeepSeek | 36.5% | deepseek-r1 | Official |
| DeepSeek V3 0324 | DeepSeek | 36.1% | deepseek-v3-0324 | Official |
| Llama 4 Maverick | Meta | 24.5% | llama-4-maverick | Official |

Each row reports the model’s average accuracy on WeirdML.
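As a rough illustration of what an average-accuracy score means, the sketch below takes an unweighted mean over per-task accuracies and formats it the way the Score column does. The task names and numbers are invented for the example and are not WeirdML results; equal weighting across tasks is also an assumption.

```python
from typing import Dict


def average_accuracy(per_task: Dict[str, float]) -> float:
    """Unweighted mean of per-task test accuracies (each in the range 0 to 1)."""
    if not per_task:
        raise ValueError("need at least one task")
    return sum(per_task.values()) / len(per_task)


# Hypothetical per-task accuracies for one model (illustrative values only).
example_run = {
    "task_a": 0.5,
    "task_b": 0.75,
    "task_c": 1.0,
}

score = average_accuracy(example_run)
print(f"{score:.1%}")  # formatted as a percentage with one decimal place
```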