Fugu-MT Paper Translation (Abstract): Macaron: Controlled, Human-Written Benchmark for Multilingual and Multicultural Reasoning via Template-Filling

Paper overview: Macaron: Controlled, Human-Written Benchmark for Multilingual and Multicultural Reasoning via Template-Filling

arxiv url: http://arxiv.org/abs/2602.10732v1
Date: Wed, 11 Feb 2026 10:45:46 GMT
Status: Translation complete
System update date: 2026-02-12 21:44:01.774904
Title: Macaron: Controlled, Human-Written Benchmark for Multilingual and Multicultural Reasoning via Template-Filling
Title (reference translation): Macaron: A Controlled, Human-Written Benchmark for Multilingual and Multicultural Reasoning via Template Filling
Authors: Alaa Elsetohy, Sama Hadhoud, Haryo Akbarianto Wibowo, Chenxi Whitehouse, Genta Indra Winata, Fajri Koto, Alham Fikri Aji
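Abstract summary: We propose a template-first benchmark that factorizes reasoning type and cultural aspect across question languages. Using 100 language-agnostic templates covering 7 reasoning types and 22 cultural aspects, native annotators create scenario-aligned multiple-choice questions in English and in the local language.
Reference score (independently computed attention score): 34.84162687685434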
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Multilingual benchmarks rarely test reasoning over culturally grounded premises: translated datasets keep English-centric scenarios, while culture-first datasets often lack control over the reasoning required. We propose Macaron, a template-first benchmark that factorizes reasoning type and cultural aspect across question languages. Using 100 language-agnostic templates that cover 7 reasoning types and 22 cultural aspects, native annotators create scenario-aligned English and local-language multiple-choice questions and systematically derived True/False questions. Macaron contains 11,862 instances spanning 20 countries/cultural contexts, 10 scripts, and 20 languages (including low-resource ones like Amharic, Yoruba, Zulu, Kyrgyz, and some Arabic dialects). In zero-shot evaluation of 21 multilingual LLMs, reasoning-mode models achieve the strongest performance and near-parity between English and local languages, while open-weight models degrade substantially in local languages and often approach chance on T/F tasks. Culture-grounded mathematical and counting templates are consistently the hardest. The data can be accessed at https://huggingface.co/datasets/AlaaAhmed2444/Macaron.
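Abstract (reference translation): Translated datasets retain English-centric scenarios, while culture-first datasets often lack control over the reasoning required. We propose Macaron, a template-first benchmark that factorizes reasoning type and cultural aspect across question languages. Using 100 language-agnostic templates covering 7 reasoning types and 22 cultural aspects, native annotators create scenario-aligned multiple-choice questions in English and in the local language, and systematically derive True/False questions. Macaron contains 11,862 instances spanning 20 countries/cultural contexts, 10 scripts, and 20 languages (including Amharic, Yoruba, Zulu, Kyrgyz, and some Arabic dialects). In zero-shot evaluation of 21 multilingual LLMs, reasoning-mode models achieve the strongest performance and near-parity between English and local languages, while open-weight models degrade substantially in local languages and approach chance on T/F tasks. Culture-grounded mathematical and counting templates are consistently the hardest. The data can be accessed at https://huggingface.co/datasets/AlaaAhmed2444/Macaron.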


Related papers