Fugu-MT Paper Translation (Abstract): MGSM-Pro: A Simple Strategy for Robust Multilingual Mathematical Reasoning Evaluation

Paper overview: MGSM-Pro: A Simple Strategy for Robust Multilingual Mathematical Reasoning Evaluation

arxiv url: http://arxiv.org/abs/2601.21225v1
Date: Thu, 29 Jan 2026 03:40:28 GMT
Status: Translation complete
Last updated in system: 2026-01-30 16:22:49.551681
Title: MGSM-Pro: A Simple Strategy for Robust Multilingual Mathematical Reasoning Evaluation
Authors: Tianyi Xu, Kosei Uemura, Alfred Malengo Kondoro, Tadesse Destaw Belay, Catherine Nana Nyaah Essuman, Ifeoma Okoh, Ganiyat Afolabi, Ayodele Awokoya, David Ifeoluwa Adelani
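Abstract summary: We introduce MGSM-Pro, an extension of the MGSM dataset using the GSM-Symbolic approach. Our dataset provides five instantiations per MGSM question by varying names, digits, and irrelevant context. Evaluations across nine languages reveal that many low-resource languages suffer large performance drops when tested on digit instantiations different from those in the original test set.
Reference score (independently calculated attention score): 13.39496848562168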
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models have made substantial progress in mathematical reasoning. However, benchmark development for multilingual evaluation has lagged behind English in both difficulty and recency. Recently, GSM-Symbolic showed strong evidence of high variance when models are evaluated on different instantiations of the same question; however, the evaluation was conducted only in English. In this paper, we introduce MGSM-Pro, an extension of the MGSM dataset using the GSM-Symbolic approach. Our dataset provides five instantiations per MGSM question by varying names, digits, and irrelevant context. Evaluations across nine languages reveal that many low-resource languages suffer large performance drops when tested on digit instantiations different from those in the original test set. We further find that some proprietary models, notably Gemini 2.5 Flash and GPT-4.1, are less robust to digit instantiation, whereas Claude 4.0 Sonnet is more robust. Among open models, GPT-OSS 120B and DeepSeek V3 show stronger robustness. Based on these findings, we recommend evaluating each problem using at least five digit-varying instantiations to obtain a more robust and realistic assessment of math reasoning.
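The idea of evaluating over several instantiations of one question can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the template, the name list, the sampling ranges, and the helpers `instantiate`, `make_instantiations`, `accuracy_per_instantiation`, and `toy_model` are all hypothetical names introduced here for illustration. The sketch turns one word problem into a template with name and digit placeholders, samples five instantiations, and reports accuracy separately for each instantiation index so that variance across instantiations becomes visible.

```python
import random
import re
from statistics import mean

# Hypothetical template (not taken from MGSM-Pro): the name and the two
# numbers are placeholders, and the gold answer is recomputed per sample.
TEMPLATE = "{name} has {a} apples and buys {b} more. How many apples does {name} have now?"
NAMES = ["Amina", "Kofi", "Yuki", "Tadesse", "Maria"]


def instantiate(seed: int) -> dict:
    """Sample one instantiation of the template from a fixed seed."""
    rng = random.Random(seed)
    name = rng.choice(NAMES)
    a, b = rng.randint(2, 99), rng.randint(2, 99)
    return {"question": TEMPLATE.format(name=name, a=a, b=b), "answer": a + b}


def make_instantiations(n: int = 5) -> list:
    """Return n instantiations of the same underlying problem."""
    return [instantiate(seed) for seed in range(n)]


def accuracy_per_instantiation(model, problems: list) -> list:
    """problems[i][k] is the k-th instantiation of problem i.

    Returns one accuracy value per instantiation index k, so the spread
    across instantiations can be inspected instead of a single number.
    """
    n_inst = len(problems[0])
    return [
        mean(1.0 if model(p[k]["question"]) == p[k]["answer"] else 0.0 for p in problems)
        for k in range(n_inst)
    ]


def toy_model(question: str) -> int:
    """Stand-in for a real model: adds every number found in the question."""
    return sum(int(x) for x in re.findall(r"\d+", question))


if __name__ == "__main__":
    problems = [make_instantiations(5)]  # a single templated problem
    scores = accuracy_per_instantiation(toy_model, problems)
    print("accuracy per instantiation:", scores)
    print("mean:", mean(scores), "min:", min(scores), "max:", max(scores))
```

In a real evaluation, `toy_model` would be replaced by a call to the model under test, and the minimum and maximum across instantiations would be reported alongside the mean.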


Related papers