evals.report

MultiNRC

MultiNRC is a native (non-translated) multilingual reasoning benchmark of 1,000+ questions written by native speakers in French, Spanish, and Chinese. The questions span four categories: language-specific linguistic reasoning, wordplay and riddles, cultural and tradition reasoning, and culturally relevant math. LLMs are scored on accuracy.

Category: Reasoning. Metric: accuracy (higher is better).
| Model | Lab | Score | Status | Date |
|---|---|---|---|---|
| Gemini 3.1 Pro Preview | Google DeepMind | 64.74% | Official | Feb 19, 2026 |
| GPT-5.4 Pro | OpenAI | 62.27% | Official | Mar 5, 2026 |
| Muse Spark | Meta | 59.05% | Official | Apr 8, 2026 |
| Gemini 3 Pro | Google DeepMind | 58.96% | Official | Nov 18, 2025 |
| GPT-5.4 | OpenAI | 58.29% | Official | Mar 5, 2026 |
| Claude Opus 4.6 | Anthropic | 57.06% | Official | Feb 5, 2026 |
| GPT-5 | OpenAI | 52.13% | Official | Aug 7, 2025 |
| GPT-5.1 | OpenAI | 49.00% | Official | Nov 12, 2025 |
| OpenAI o3-pro | OpenAI | 49.00% | Official | Jun 10, 2025 |
| Claude Opus 4.5 | Anthropic | 48.63% | Official | Nov 24, 2025 |
| o3 | OpenAI | 45.50% | Official | Apr 16, 2025 |
| Gemini 2.5 Pro | Google DeepMind | 45.12% | Official | Mar 25, 2025 |
| GPT-5.2 | OpenAI | 42.18% | Official | Dec 11, 2025 |
| Claude Opus 4.1 | Anthropic | 38.39% | Official | Aug 5, 2025 |
| Claude Sonnet 4.5 | Anthropic | 35.83% | Official | Sep 29, 2025 |
| Kimi K2.5 | Moonshot AI | 35.17% | Official | Jan 27, 2026 |
| Claude Opus 4 | Anthropic | 33.93% | Official | May 22, 2025 |
| Claude 3.7 Sonnet | Anthropic | 27.77% | Official | Feb 24, 2025 |
| DeepSeek R1 | DeepSeek | 24.27% | Official | Jan 20, 2025 |
| GPT-5 mini | OpenAI | 23.89% | Official | Aug 7, 2025 |
| DeepSeek V3.1 | DeepSeek | 23.60% | Official | Aug 21, 2025 |
| o4-mini | OpenAI | 22.18% | Official | Apr 16, 2025 |
| GPT-4.1 | OpenAI | 21.23% | Official | Apr 14, 2025 |
| Claude Sonnet 4 | Anthropic | 18.39% | Official | May 22, 2025 |
| GPT-OSS-120B | OpenAI | 15.17% | Official | Aug 5, 2025 |
| GPT-4o | OpenAI | 12.42% | Official | May 13, 2024 |
| Llama 4 Maverick | Meta | 8.44% | Official | Apr 5, 2025 |

Each row reports the model’s accuracy on MultiNRC.
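As a rough illustration of what an accuracy score of this kind means, the sketch below computes the percentage of questions graded correct. It assumes each answer is graded as simply right or wrong; the page does not describe how MultiNRC answers are actually graded, and the function name `accuracy_percent` and the sample grades are hypothetical.

```python
from typing import Sequence


def accuracy_percent(graded: Sequence[bool]) -> float:
    """Return the share of correct answers as a percentage.

    Assumes binary per-question grading (an assumption, not a
    documented detail of the MultiNRC evaluation).
    """
    if not graded:
        raise ValueError("no graded answers")
    # True counts as 1 and False as 0 when summed.
    return 100.0 * sum(graded) / len(graded)


# Hypothetical grades for five questions (not real MultiNRC data).
example = [True, False, True, True, False]
print(f"{accuracy_percent(example):.2f}%")  # prints "60.00%"
```

Under this reading, a score such as 64.74% would correspond to roughly two out of every three questions answered correctly.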