evals.report

GAIA: A Benchmark for General AI Assistants

GAIA is a benchmark of 450+ real-world questions that require multi-step reasoning, web browsing, multimodal input handling, and tool use. It is designed to be easy for humans (about 92% accuracy) but hard for AI assistants, and it is scored across three difficulty levels.

Agents | accuracy | Higher is better
Model                   Lab              Score   Status      Date
Claude Sonnet 4.5       Anthropic        74.55%  Unverified  Sep 29, 2025
Claude Opus 4.1         Anthropic        68.48%  Unverified  Aug 5, 2025
Claude Opus 4           Anthropic        64.85%  Unverified  May 22, 2025
Claude Haiku 4.5        Anthropic        56.36%  Unverified  Oct 15, 2025
Claude Mythos Preview   Anthropic        52.3%   Unverified  Apr 7, 2026
GPT-5.4 Pro             OpenAI           50.5%   Unverified  Mar 5, 2026
GPT-4.1                 OpenAI           50.30%  Unverified  Apr 14, 2025
GPT-5.4                 OpenAI           48.2%   Unverified  Mar 5, 2026
Claude Opus 4.6         Anthropic        47.8%   Unverified  Feb 5, 2026
Gemini 3.1 Pro Preview  Google DeepMind  46.1%   Unverified  Feb 19, 2026
Claude Sonnet 4.6       Anthropic        45.5%   Unverified  Feb 17, 2026
GPT-5 mini              OpenAI           44.8%   Unverified  Aug 7, 2025
Claude 3.7 Sonnet       Anthropic        43.9%   Unverified  Feb 24, 2025
GPT-5                   OpenAI           42.1%   Unverified  Aug 7, 2025
GPT-5.2                 OpenAI           40.3%   Unverified  Dec 11, 2025
Gemini 3 Pro            Google DeepMind  38.5%   Unverified  Nov 18, 2025
Kimi K2.5               Moonshot AI      38.1%   Unverified  Jan 27, 2026
Qwen 3.6 Plus           Alibaba / Qwen   37.4%   Unverified  Apr 2, 2026
o4-mini                 OpenAI           36.8%   Unverified  Apr 16, 2025
Gemini 3 Flash          Google DeepMind  35.2%   Unverified  Dec 17, 2025
DeepSeek V3.2           DeepSeek         34.8%   Unverified  Dec 1, 2025
GLM-5                   Z.ai             33.8%   Unverified  Feb 11, 2026
Gemini 2.5 Pro          Google DeepMind  33.3%   Unverified  Mar 25, 2025
o3                      OpenAI           32.73%  Unverified  Apr 16, 2025
Gemini 2.0 Flash        Google DeepMind  32.73%  Unverified  Dec 11, 2024
DeepSeek R1             DeepSeek         30.30%  Unverified  Jan 20, 2025
DeepSeek V3             DeepSeek         29.39%  Unverified  Dec 26, 2024
Llama 4 Maverick        Meta             28.6%   Unverified  Apr 5, 2025
DeepSeek V3.1           DeepSeek         11.5%   Unverified  Aug 21, 2025

Each row reports the model’s accuracy on GAIA: A Benchmark for General AI Assistants.
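For readers who want to see what an accuracy figure of this kind means in practice, here is a minimal Python sketch that computes overall and per-level accuracy from a list of graded answers. It is an illustration under stated assumptions, not the scorer used for the numbers on this page: the `normalize` helper, the `accuracy_by_level` function, and the toy `sample` data are all hypothetical, and the matching rule (case-insensitive, whitespace-collapsed string equality) is a simplified stand-in for whatever answer matching a real GAIA run applies.

```python
from collections import defaultdict


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace (simplified, assumed matching rule)."""
    return " ".join(answer.strip().lower().split())


def accuracy_by_level(results):
    """Compute accuracy percentages from (level, predicted, expected) tuples.

    Returns (overall_pct, {level: pct}).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for level, predicted, expected in results:
        total[level] += 1
        if normalize(predicted) == normalize(expected):
            correct[level] += 1
    per_level = {lvl: 100.0 * correct[lvl] / total[lvl] for lvl in sorted(total)}
    n = sum(total.values())
    overall = 100.0 * sum(correct.values()) / n if n else 0.0
    return overall, per_level


# Hypothetical toy data: (difficulty level, model answer, reference answer).
sample = [
    (1, "Paris", "paris"),
    (1, "42", "42"),
    (2, "blue whale", "Blue  Whale"),
    (2, "1999", "2001"),
    (3, "unknown", "Mount Everest"),
]

overall, per_level = accuracy_by_level(sample)
print(f"overall: {overall:.2f}%")
for level, pct in per_level.items():
    print(f"level {level}: {pct:.2f}%")
```

On the toy data, three of five answers match, so the overall figure is 60.00%, with 100.00%, 50.00%, and 0.00% on levels 1, 2, and 3. A single headline percentage can hide this kind of spread across difficulty levels, which is worth keeping in mind when comparing rows.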