SWE-bench Multimodal
A software-engineering benchmark of 517 real GitHub issues from JavaScript and web projects, each accompanied by visual context (screenshots, UI mockups, or diagrams). It measures whether AI systems can resolve these issues, with every fix verified by the repository's tests.
Coding · % resolved · Higher is better
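As a rough illustration of how a "% resolved" score could be computed, the sketch below takes the share of task instances whose fix passed the repository's tests. The function name `percent_resolved`, the instance IDs, and the outcomes are all hypothetical. This is not the benchmark's official harness, only a minimal sketch that assumes each issue has a single pass/fail result.

```python
def percent_resolved(outcomes: dict[str, bool]) -> float:
    """Return the share of task instances marked resolved, as a percentage.

    `outcomes` maps a task instance ID to True when the submitted fix
    passed the repository's tests, and False otherwise (assumed format).
    """
    if not outcomes:
        raise ValueError("no task outcomes supplied")
    # True counts as 1 and False as 0, so sum() gives the resolved count.
    return 100.0 * sum(outcomes.values()) / len(outcomes)


# Hypothetical instance IDs and outcomes, for illustration only.
example = {"task-a": True, "task-b": False, "task-c": True, "task-d": False}
score = percent_resolved(example)
print(f"{score:.1f}% resolved")  # prints "50.0% resolved"
```

Because higher is better, a system that resolves more issues scores closer to 100.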
No run guide for this benchmark yet.