GAIA: A Benchmark for General AI Assistants
GAIA is a benchmark of 450+ real-world questions that require multi-step reasoning, web browsing, multimodal input handling, and tool use. The questions are designed to be easy for humans (~92% accuracy) but hard for AI assistants, and they are grouped into three difficulty levels.
Category: Agents. Metric: accuracy (higher is better).
| Model | Lab | Score | Source model | Status | Date |
|---|---|---|---|---|---|
| Claude Sonnet 4.5 | Anthropic | 74.55% | — | Unverified | Sep 29, 2025 |
| Claude Opus 4.1 | Anthropic | 68.48% | — | Unverified | Aug 5, 2025 |
| Claude Opus 4 | Anthropic | 64.85% | — | Unverified | May 22, 2025 |
| Claude Haiku 4.5 | Anthropic | 56.36% | — | Unverified | Oct 15, 2025 |
| Claude Mythos Preview | Anthropic | 52.3% | — | Unverified | Apr 7, 2026 |
| GPT-5.4 Pro | OpenAI | 50.5% | — | Unverified | Mar 5, 2026 |
| GPT-4.1 | OpenAI | 50.30% | — | Unverified | Apr 14, 2025 |
| GPT-5.4 | OpenAI | 48.2% | — | Unverified | Mar 5, 2026 |
| Claude Opus 4.6 | Anthropic | 47.8% | — | Unverified | Feb 5, 2026 |
| Gemini 3.1 Pro Preview | Google DeepMind | 46.1% | — | Unverified | Feb 19, 2026 |
| Claude Sonnet 4.6 | Anthropic | 45.5% | — | Unverified | Feb 17, 2026 |
| GPT-5 mini | OpenAI | 44.8% | — | Unverified | Aug 7, 2025 |
| Claude 3.7 Sonnet | Anthropic | 43.9% | — | Unverified | Feb 24, 2025 |
| GPT-5 | OpenAI | 42.1% | — | Unverified | Aug 7, 2025 |
| GPT-5.2 | OpenAI | 40.3% | — | Unverified | Dec 11, 2025 |
| Gemini 3 Pro | Google DeepMind | 38.5% | — | Unverified | Nov 18, 2025 |
| Kimi K2.5 | Moonshot AI | 38.1% | — | Unverified | Jan 27, 2026 |
| Qwen 3.6 Plus | Alibaba / Qwen | 37.4% | — | Unverified | Apr 2, 2026 |
| o4-mini | OpenAI | 36.8% | — | Unverified | Apr 16, 2025 |
| Gemini 3 Flash | Google DeepMind | 35.2% | — | Unverified | Dec 17, 2025 |
| DeepSeek V3.2 | DeepSeek | 34.8% | — | Unverified | Dec 1, 2025 |
| GLM-5 | Z.ai | 33.8% | — | Unverified | Feb 11, 2026 |
| Gemini 2.5 Pro | Google DeepMind | 33.3% | — | Unverified | Mar 25, 2025 |
| o3 | OpenAI | 32.73% | — | Unverified | Apr 16, 2025 |
| Gemini 2.0 Flash | Google DeepMind | 32.73% | — | Unverified | Dec 11, 2024 |
| DeepSeek R1 | DeepSeek | 30.30% | — | Unverified | Jan 20, 2025 |
| DeepSeek V3 | DeepSeek | 29.39% | — | Unverified | Dec 26, 2024 |
| Llama 4 Maverick | Meta | 28.6% | — | Unverified | Apr 5, 2025 |
| DeepSeek V3.1 | DeepSeek | 11.5% | — | Unverified | Aug 21, 2025 |
Each row reports the model’s accuracy on GAIA.
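To make the metric concrete: accuracy is the share of questions answered correctly, expressed as a percentage. The sketch below computes an overall figure and a per-level breakdown for the three difficulty levels. It is only an illustration. The `(level, is_correct)` record shape, the function name `accuracy_by_level`, and the sample data are all assumptions, and the page does not say how answers are graded or whether the listed scores come from one run or an average over several.

```python
from collections import defaultdict


def accuracy_by_level(results):
    """Return (overall, per_level) accuracy as percentages.

    `results` is an iterable of (level, is_correct) pairs. This record
    shape is a hypothetical illustration, not an official GAIA format.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for level, is_correct in results:
        totals[level] += 1
        correct[level] += int(is_correct)

    n = sum(totals.values())
    if n == 0:
        raise ValueError("no results to score")

    overall = 100.0 * sum(correct.values()) / n
    per_level = {lvl: 100.0 * correct[lvl] / totals[lvl] for lvl in sorted(totals)}
    return overall, per_level


# Made-up graded results for illustration only (not real GAIA data).
sample = [(1, True), (1, True), (2, True), (2, False), (3, False)]
overall, per_level = accuracy_by_level(sample)
print(f"overall: {overall:.2f}%")
for lvl, acc in per_level.items():
    print(f"level {lvl}: {acc:.2f}%")
```

Note that the overall figure weights every question equally, so it is not the mean of the per-level percentages unless each level has the same number of questions.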