evals.report

WeirdML

Tests whether LLMs can do machine learning on novel, unusual datasets. Each model writes and iteratively debugs PyTorch code over 5 feedback rounds in a sandboxed GPU container, and is scored on held-out test accuracy across 19 tasks (6 public, 13 hidden).
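As a rough illustration of how per-round test accuracies could roll up into the single average-accuracy figure reported for this benchmark, here is a minimal sketch. It assumes that each task keeps the best accuracy seen across its feedback rounds and that a round whose code fails counts as 0.0; neither rule is stated on this page. The function names, task names, and numbers are hypothetical.

```python
from statistics import mean


def task_score(round_accuracies):
    """Score one task from its per-round held-out accuracies.

    Assumption (not confirmed by the benchmark description): the task score
    is the best accuracy over all rounds, and a failed round (None) is 0.0.
    """
    return max((0.0 if a is None else a for a in round_accuracies), default=0.0)


def benchmark_score(results):
    """Unweighted mean of per-task scores across all tasks."""
    return mean(task_score(rounds) for rounds in results.values())


# Hypothetical results for two made-up tasks, five feedback rounds each.
example = {
    "task_a": [None, 0.40, 0.55, 0.52, 0.60],
    "task_b": [0.10, 0.10, None, 0.25, 0.20],
}

score = benchmark_score(example)
print(f"average accuracy: {score:.3f}")
```

Taking the best round rather than the last one rewards a model for any successful iteration, which is one plausible reading of "iteratively debugs"; a last-round rule would only require swapping `max` for the final element.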

Coding · average accuracy · Higher is better

No run guide for this benchmark yet.