WeirdML
Tests whether LLMs can do machine learning on novel, unusual datasets. Each model writes and iteratively debugs PyTorch code over 5 feedback rounds in a sandboxed GPU container, and is scored on held-out test accuracy across 19 tasks (6 public, 13 hidden).
Coding · average accuracy · Higher is better
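As a rough illustration of how an "average accuracy" headline number could be assembled from per-round results, here is a minimal sketch. It assumes that each task is credited with its best held-out accuracy over the feedback rounds and that a round whose code fails counts as no result; neither rule is stated above. The function names, task names, and numbers are all hypothetical.

```python
from statistics import mean


def task_score(round_accuracies):
    """Score one task from its per-round test accuracies.

    None marks a round whose code failed to produce predictions.
    Assumption (not confirmed by the benchmark description): the task is
    credited with its best round, or 0.0 if every round failed.
    """
    valid = [acc for acc in round_accuracies if acc is not None]
    return max(valid, default=0.0)


def benchmark_score(results):
    """Unweighted mean of the per-task scores."""
    return mean(task_score(rounds) for rounds in results.values())


# Made-up accuracies for three hypothetical tasks, 5 rounds each.
example = {
    "task_a": [None, 0.40, 0.55, 0.50, 0.60],
    "task_b": [0.10, 0.10, None, 0.20, 0.15],
    "task_c": [None, None, None, None, None],
}

print(f"average accuracy: {benchmark_score(example):.3f}")
```

Taking the best round rather than the last one rewards a model for recovering from a bad edit. If the benchmark instead uses the final round, only `task_score` would need to change.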
No run guide for this benchmark yet.