Fugu-MT Paper Translation (Summary): MathRobust-LV: Evaluation of Large Language Models' Robustness to Linguistic Variations in Mathematical Reasoning

Paper summary: MathRobust-LV: Evaluation of Large Language Models' Robustness to Linguistic Variations in Mathematical Reasoning

arxiv url: http://arxiv.org/abs/2510.06430v1
Date: Tue, 07 Oct 2025 20:09:29 GMT
Status: Translation complete
Last updated in system: 2025-10-09 16:41:20.183242
Title: MathRobust-LV: Evaluation of Large Language Models' Robustness to Linguistic Variations in Mathematical Reasoning
Title (reference translation): MathRobust-LV: Evaluation of Large Language Models' Robustness to Linguistic Variations in Mathematical Reasoning
Authors: Neeraja Kirtane, Yuvraj Khanna, Peter Relan
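Abstract summary: Large language models excel on math benchmarks, but the robustness of their math reasoning to linguistic variation is underexplored. This work introduces MathRobust-LV, a test set and evaluation methodology that mirrors how instructors rephrase problems across assessments. The results indicate that robustness to linguistic variation is a fundamental challenge and expose reasoning vulnerabilities in models.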
Reference score (independently computed attention score): 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models excel on math benchmarks, but their math reasoning robustness to linguistic variation is underexplored. While recent work increasingly treats high-difficulty competitions like the IMO as the gold standard for evaluating reasoning, we believe in comprehensive benchmarking of high-school-level math problems in real educational settings. We introduce MathRobust-LV, a test set and evaluation methodology that mirrors how instructors rephrase problems across assessments while keeping difficulty constant: we change surface details (names, contexts, variables) while preserving numerical structure and answers. In contrast to prior efforts that alter problem content or emphasize IMO-level tasks, we focus on high-school-level dataset problems at the difficulty level where models are currently deployed in educational settings: tutoring and assessment systems. In these applications, instructors rephrase identical concepts in varied ways, making linguistic robustness essential for reliable deployment. Although MATH data benchmarking is often regarded as saturated, our experiment on 34 models reveals that accuracy declines when moving from the baseline to the variants. These drops are severe for smaller models (9-11%), while stronger models also show measurable degradation. Frontier models like GPT-5 and Gemini-2.5pro remain comparatively stable. Our results highlight that robustness to linguistic variation is a fundamental challenge, exposing reasoning vulnerabilities in models.
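Abstract (reference translation): Large language models excel on math benchmarks, but the robustness of their mathematical reasoning to linguistic variation is underexplored. While recent work tends to treat high-difficulty competitions such as the IMO as the gold standard for evaluating reasoning, we believe in comprehensive benchmarking of high-school-level math problems in real educational settings. We introduce MathRobust-LV, a test set and evaluation methodology that mirrors how instructors rephrase problems across assessments while keeping difficulty constant: surface details (names, contexts, variables) are changed while numerical structure and answers are preserved. In contrast to prior efforts that alter problem content or emphasize IMO-level tasks, we focus on high-school-level dataset problems at the difficulty level where models are currently deployed in educational settings. In these applications, instructors express the same concepts in varied ways, so linguistic robustness is essential for reliable deployment. Although benchmarking on MATH data is often regarded as saturated, an experiment on 34 models shows that accuracy declines when moving from the baseline to the variants. These drops are severe for smaller models (9-11%), while stronger models also show measurable degradation. Frontier models such as GPT-5 and Gemini-2.5pro remain comparatively stable. The results indicate that robustness to linguistic variation is a fundamental challenge and expose reasoning vulnerabilities in models.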

Related papers