evals.report

Video-MME

A comprehensive evaluation benchmark for multimodal LLMs in video analysis, using 900 videos (254 hours) and 2,700 human-annotated multiple-choice QA pairs across short, medium, and long durations, scored by answer accuracy with and without subtitles.
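As a rough illustration of the scoring described above, the sketch below computes multiple-choice accuracy per duration bucket and overall. The record layout and the field names (`duration`, `answer`, `prediction`) are assumptions made for this example, not the benchmark's official data format. Running it once on predictions made with subtitles and once on predictions made without them would give the two figures.

```python
from collections import defaultdict


def accuracy_by_duration(records):
    """Return accuracy per duration bucket plus an overall figure.

    Each record is a dict with hypothetical keys:
      'duration'   -- e.g. 'short', 'medium', or 'long'
      'answer'     -- the annotated option letter
      'prediction' -- the option letter chosen by the model
    """
    records = list(records)
    if not records:
        raise ValueError("no records to score")

    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["duration"]] += 1
        # Compare option letters case-insensitively, ignoring stray whitespace.
        if r["prediction"].strip().upper() == r["answer"].strip().upper():
            correct[r["duration"]] += 1

    scores = {d: correct[d] / total[d] for d in total}
    scores["overall"] = sum(correct.values()) / sum(total.values())
    return scores


# Tiny made-up sample, for illustration only (not real benchmark data).
sample = [
    {"duration": "short", "answer": "A", "prediction": "A"},
    {"duration": "short", "answer": "B", "prediction": "C"},
    {"duration": "medium", "answer": "D", "prediction": "d"},
    {"duration": "long", "answer": "B", "prediction": "B"},
]
scores = accuracy_by_duration(sample)
```

The overall figure here is pooled over all questions rather than averaged across buckets; the two coincide only when every bucket holds the same number of questions.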

Multimodal · accuracy · Higher is better
| Model | Lab | Score | Source model | Status | Date |
| --- | --- | --- | --- | --- | --- |
| Kimi K2.5 | Moonshot AI | 87.4% | | Unverified | Jan 27, 2026 |
| Gemini 2.5 Pro | Google DeepMind | 84.8% | | Verified | Mar 25, 2025 |
| Qwen 3.6 Plus | Alibaba / Qwen | 84.2% | | Unverified | Apr 2, 2026 |
| Gemini 1.5 Pro | Google DeepMind | 75.0% | | Official | Feb 15, 2024 |
| GPT-4o | OpenAI | 71.9% | | Official | May 13, 2024 |
| Claude 3.5 Sonnet | Anthropic | 60.0% | | Official | Jun 20, 2024 |

Each row reports the model’s accuracy on Video-MME.