WeirdML
Tests whether LLMs can do machine learning on novel, unusual datasets: each model writes and iteratively debugs PyTorch code over 5 feedback rounds in a sandboxed GPU container, and is scored on held-out test accuracy across 19 tasks (6 public, 13 hidden).
What this benchmark measures
Rows on this page are sourced from public benchmark artifacts, leaderboard exports, or source-linked model reports. Each row keeps benchmark version, source model name, and available run details attached to the score.
The metric shown here is average accuracy. It should be interpreted within WeirdML, not compared as part of a site-wide ranking.
What to be careful about
Each score is the best test accuracy a run reaches within its 5 iterations, averaged over runs and over all 19 tasks; the six original tasks are public and the other thirteen form a hidden test set. When citing a score, keep the model's reasoning setting and the bootstrap standard error attached to it.
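As a rough illustration of how such a score could be computed, the sketch below takes the maximum accuracy within each run, averages runs within a task, and then averages over tasks. The data layout, the function names (`run_score`, `benchmark_score`, `bootstrap_se`), and the choice to bootstrap by resampling runs within each task are assumptions made for this example; the benchmark's own aggregation and resampling scheme may differ.

```python
import random
import statistics


def run_score(iteration_accuracies):
    """Best test accuracy reached in one run (max over its feedback iterations)."""
    return max(iteration_accuracies)


def benchmark_score(results):
    """Mean over tasks of the mean per-run maximum accuracy.

    `results` maps a task name to a list of runs; each run is a list of
    test accuracies, one per iteration (hypothetical layout).
    """
    task_means = [
        statistics.fmean(run_score(run) for run in runs)
        for runs in results.values()
    ]
    return statistics.fmean(task_means)


def bootstrap_se(results, n_resamples=1000, seed=0):
    """Bootstrap standard error of the benchmark score.

    Assumption: runs are resampled with replacement within each task,
    and the spread of the recomputed scores is reported.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        resampled = {
            task: rng.choices(runs, k=len(runs))
            for task, runs in results.items()
        }
        estimates.append(benchmark_score(resampled))
    return statistics.stdev(estimates)


if __name__ == "__main__":
    # Made-up accuracies for two tasks with two runs each.
    example = {
        "task_a": [[0.2, 0.5, 0.4], [0.1, 0.3, 0.6]],
        "task_b": [[0.9, 0.8], [0.7, 0.7]],
    }
    print(f"score = {benchmark_score(example):.3f}")
    print(f"bootstrap SE = {bootstrap_se(example):.3f}")
```

Because each run contributes its maximum rather than its final accuracy, a run that regresses in a later iteration is not penalized, which is one reason these scores should only be compared with others computed the same way.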