WeirdML

Name: WeirdML
Creator: evals.report

Tests whether LLMs can do machine learning on novel, unusual datasets. Each model writes and iteratively debugs PyTorch code over 5 feedback rounds in a sandboxed GPU container, and is scored on held-out test accuracy across 19 tasks (6 public, 13 hidden).

Coding | average accuracy | Higher is better
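
The scoring described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual harness: the names `AttemptFn`, `score_task`, and `benchmark_score` are hypothetical, and the rule that a task's score is the best test accuracy seen across its feedback rounds is an assumption made for the example. Only the 5 rounds, the per-task test accuracy, and the averaging across tasks come from the description.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical interface: given the feedback from the previous round
# (None on the first round), one attempt returns (test_accuracy, feedback).
AttemptFn = Callable[[Optional[str]], Tuple[float, str]]


def score_task(attempt: AttemptFn, rounds: int = 5) -> float:
    """Run `rounds` attempts, passing each round's feedback to the next.

    Assumption: the task score is the best accuracy over all rounds.
    """
    best = 0.0
    feedback: Optional[str] = None
    for _ in range(rounds):
        accuracy, feedback = attempt(feedback)
        best = max(best, accuracy)
    return best


def benchmark_score(task_scores: Dict[str, float]) -> float:
    """Average the per-task accuracies into one headline number."""
    if not task_scores:
        raise ValueError("need at least one task score")
    return sum(task_scores.values()) / len(task_scores)
```

In this sketch a round that crashes would simply report an accuracy of 0.0 together with the error log as feedback, so a later round can still recover the task score.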

