evals.report

SWE-bench Multimodal

A software-engineering benchmark of 517 real GitHub issues drawn from visual JavaScript/web projects. The issues include visual context (screenshots, UI mockups, diagrams), and the benchmark measures whether AI systems can resolve them, with each fix verified by the repository's tests.

Coding | % resolved | Higher is better
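
As a rough illustration of the "% resolved" metric, the sketch below computes the share of the 517 issues that a run resolved. The function name `percent_resolved` and the example count of 100 resolved issues are hypothetical, chosen only for illustration; this is not the benchmark's official scoring code.

```python
TOTAL_ISSUES = 517  # benchmark size stated in the description


def percent_resolved(resolved: int, total: int = TOTAL_ISSUES) -> float:
    """Return the share of issues resolved, as a percentage (higher is better)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= resolved <= total:
        raise ValueError("resolved must be between 0 and total")
    return 100.0 * resolved / total


# Hypothetical run: 100 resolved issues is an illustrative number, not a real result.
score = percent_resolved(100)
print(f"{score:.1f}% resolved")
```

An issue counts toward the numerator only when the fix passes the repository's tests, so a partial fix contributes nothing to the score.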

No run guide for this benchmark yet.