GAIA: A Benchmark for General AI Assistants
GAIA is a benchmark of 450+ real-world questions that require multi-step reasoning, web browsing, multimodal input handling, and tool use. The questions are designed to be easy for humans (~92% accuracy) but hard for AI assistants, and results are scored across three difficulty levels.
Agents | accuracy | Higher is better
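As a rough illustration of how an accuracy metric split across three difficulty levels might be tallied, here is a minimal sketch. It is not the official GAIA scorer: the record fields (`level`, `prediction`, `answer`), the `normalize` helper, and the sample records are all hypothetical, and the matching rule is a plain case- and whitespace-insensitive string comparison chosen for simplicity.

```python
from collections import defaultdict


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (a simplifying assumption, not GAIA's rule)."""
    return " ".join(text.strip().lower().split())


def accuracy_by_level(records):
    """Return (per-level accuracy dict, overall accuracy) for a list of records.

    Each record is a dict with hypothetical keys: 'level', 'prediction', 'answer'.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        level = record["level"]
        total[level] += 1
        if normalize(record["prediction"]) == normalize(record["answer"]):
            correct[level] += 1
    per_level = {level: correct[level] / total[level] for level in sorted(total)}
    overall = sum(correct.values()) / sum(total.values()) if total else 0.0
    return per_level, overall


# Made-up records for illustration only; these are not GAIA questions or results.
records = [
    {"level": 1, "prediction": "Paris", "answer": "paris"},
    {"level": 1, "prediction": "42", "answer": "41"},
    {"level": 2, "prediction": " blue  whale", "answer": "Blue whale"},
    {"level": 3, "prediction": "unknown", "answer": "17"},
]

per_level, overall = accuracy_by_level(records)
print(per_level)  # accuracy for each difficulty level
print(overall)    # accuracy over all records
```

Reporting the per-level numbers alongside the overall figure keeps a strong result on the easiest level from hiding weak performance on the harder ones.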
No run guide for this benchmark yet.