Video-MME
A comprehensive evaluation benchmark for multimodal LLMs in video analysis, using 900 videos (254 hours) and 2,700 human-annotated multiple-choice QA pairs across short, medium, and long durations, scored by answer accuracy with and without subtitles.
Category: Multimodal. Metric: accuracy. Higher is better.
| Model | Lab | Score | Source model | Status | Date |
|---|---|---|---|---|---|
| Kimi K2.5 | Moonshot AI | 87.4% | — | Unverified | Jan 27, 2026 |
| Gemini 2.5 Pro | Google DeepMind | 84.8% | — | Verified | Mar 25, 2025 |
| Qwen 3.6 Plus | Alibaba / Qwen | 84.2% | — | Unverified | Apr 2, 2026 |
| Gemini 1.5 Pro | Google DeepMind | 75.0% | — | Official | Feb 15, 2024 |
| GPT-4o | OpenAI | 71.9% | — | Official | May 13, 2024 |
| Claude 3.5 Sonnet | Anthropic | 60.0% | — | Official | Jun 20, 2024 |
Each row reports the model’s accuracy on Video-MME.
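To make the metric concrete, here is a minimal sketch of how multiple-choice accuracy could be tallied per duration bucket and overall. It is not the benchmark's official scoring script: the function name `accuracy_by_duration`, the record keys (`duration`, `answer`, `prediction`), and the toy records are all assumptions invented for illustration.

```python
from collections import defaultdict


def accuracy_by_duration(records):
    """Return {bucket: accuracy}, including an "overall" entry.

    Each record is a dict with hypothetical keys:
      "duration"   - bucket label such as "short", "medium", or "long"
      "answer"     - gold option letter
      "prediction" - option letter chosen by the model
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        # Compare option letters case-insensitively, ignoring stray whitespace.
        hit = r["prediction"].strip().upper() == r["answer"].strip().upper()
        # Count every question once in its own bucket and once in "overall".
        for key in (r["duration"], "overall"):
            total[key] += 1
            correct[key] += int(hit)
    return {key: correct[key] / total[key] for key in total}


# Toy data, invented for illustration only (not real benchmark items).
toy = [
    {"duration": "short", "answer": "A", "prediction": "A"},
    {"duration": "short", "answer": "C", "prediction": "B"},
    {"duration": "medium", "answer": "D", "prediction": "d"},
    {"duration": "long", "answer": "B", "prediction": "B"},
]
scores = accuracy_by_duration(toy)
for bucket, acc in scores.items():
    print(f"{bucket}: {acc:.1%}")
```

Under these assumptions, the with-subtitles and without-subtitles settings would simply be two separate prediction sets passed through the same function, giving two accuracy figures per bucket.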