Fugu-MT Paper Translation (Abstract): Rethinking Math Reasoning Evaluation: A Robust LLM-as-a-Judge Framework Beyond Symbolic Rigidity

Paper abstract: Rethinking Math Reasoning Evaluation: A Robust LLM-as-a-Judge Framework Beyond Symbolic Rigidity

arxiv url: http://arxiv.org/abs/2604.22597v1
Date: Fri, 24 Apr 2026 14:25:01 GMT
Status: Translation complete
Last updated in system: 2026-04-27 15:36:26.502023
Title: Rethinking Math Reasoning Evaluation: A Robust LLM-as-a-Judge Framework Beyond Symbolic Rigidity
Title (reference translation): Rethinking Mathematical Reasoning Evaluation: A Robust LLM-as-a-Judge Framework Beyond Symbolic Rigidity
Authors: Erez Yosef, Oron Anschel, Shunit Haviv Hakimi, Asaf Gendler, Adam Botach, Nimrod Berman, Igor Kviatkovsky
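Abstract summary: This work proposes an alternative to rule-based symbolic mathematics comparison for evaluating model-generated answers. The framework enables more reliable evaluation and benchmarking, leading to more accurate performance monitoring.
Reference score (independently computed attention score): 6.81322477138385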
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Recent advancements in large language models have led to significant improvements across various tasks, including mathematical reasoning, which is used to assess models' intelligence in logical reasoning and problem-solving. Models are evaluated on mathematical reasoning benchmarks by verifying the correctness of the final answer against a ground truth answer. A common approach for this verification is based on symbolic mathematics comparison, which fails to generalize across diverse mathematical representations and solution formats. In this work, we offer a robust and flexible alternative to rule-based symbolic mathematics comparison. We propose an LLM-based evaluation framework for evaluating model-generated answers, enabling accurate evaluation across diverse mathematical representations and answer formats. We present failure cases of symbolic evaluation in two popular frameworks, Lighteval and SimpleRL, and compare them to our approach, demonstrating clear improvements over commonly used methods. Our framework enables more reliable evaluation and benchmarking, leading to more accurate performance monitoring, which is important for advancing mathematical problem-solving and intelligent systems.
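Abstract (reference translation): Recent advances in large language models have brought significant improvements across a variety of tasks, including mathematical reasoning, which is used to assess a model's intelligence in logical reasoning and problem solving. Models are evaluated on mathematical reasoning benchmarks by verifying the correctness of the final answer against a ground-truth answer. A common approach to this verification is based on symbolic mathematics comparison, which fails to generalize across diverse mathematical representations and solution formats. This work offers a robust and flexible alternative to rule-based symbolic mathematics comparison. We propose an LLM-based framework for evaluating model-generated answers, enabling accurate evaluation across diverse mathematical representations and answer formats. We present failure cases of symbolic evaluation in two popular frameworks, Lighteval and SimpleRL, and compare them with our approach, showing clear improvements over commonly used methods. Our framework enables more reliable evaluation and benchmarking, leading to more accurate performance monitoring.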
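Illustrative sketch (not from the paper): the rigidity the abstract attributes to rule-based comparison can be shown with a toy checker. The function names `rule_based_equal` and `build_judge_prompt` are hypothetical, the checker is far simpler than what Lighteval or SimpleRL actually do, and the prompt wording is an assumption rather than the authors' prompt. The sketch only shows how a fixed parsing rule can reject answers that are mathematically equivalent but formatted differently, and how the same triple could instead be handed to an LLM judge.

```python
from fractions import Fraction


def rule_based_equal(predicted: str, reference: str) -> bool:
    """Toy rule-based checker (hypothetical): exact match, then numeric match."""
    p = predicted.replace(" ", "")
    r = reference.replace(" ", "")
    if p == r:
        return True
    try:
        # Fraction accepts strings such as "0.5" or "1/2", but not LaTeX or prose.
        return Fraction(p) == Fraction(r)
    except (ValueError, ZeroDivisionError):
        return False


def build_judge_prompt(question: str, predicted: str, reference: str) -> str:
    """Assemble a judge prompt (wording is an assumption, not the paper's)."""
    return (
        "You are grading a math answer.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Model answer: {predicted}\n"
        "Are the two answers mathematically equivalent? Reply YES or NO."
    )


if __name__ == "__main__":
    reference = "1/2"
    for candidate in ["0.5", r"\frac{1}{2}", "x = 1/2", "50%"]:
        print(candidate, rule_based_equal(candidate, reference))
    print(build_judge_prompt("Solve 2x = 1 for x.", r"\frac{1}{2}", reference))
```

In this toy setup only `0.5` is accepted; the LaTeX, equation-style, and percentage forms are rejected even though they express the same value as the reference. Sending the prompt to a model is deliberately left out, since the choice of judge model and API is outside what the abstract states.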
