Fugu-MT Paper Translation (Summary): RIDE: Difficulty Evolving Perturbation with Item Response Theory for Mathematical Reasoning

Paper summary: RIDE: Difficulty Evolving Perturbation with Item Response Theory for Mathematical Reasoning

arxiv url: http://arxiv.org/abs/2511.04120v1
Date: Thu, 06 Nov 2025 07:10:17 GMT
Status: Translation complete
Last updated in system: 2025-11-07 20:17:53.340217
Title: RIDE: Difficulty Evolving Perturbation with Item Response Theory for Mathematical Reasoning
Title (reference translation): RIDE: Difficulty-Evolving Perturbation with Item Response Theory for Mathematical Reasoning
Authors: Xinyuan Li, Murong Xu, Wenbiao Tao, Hanlun Zhu, Yike Zhao, Jipeng Zhang, Yunshi Lan
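Abstract summary: Large language models (LLMs) achieve high performance on mathematical reasoning. Current rule-based perturbation methods often generate ill-posed questions. This paper proposes RIDE, a novel adversarial question-rewriting framework.
Reference score (independently computed attention measure): 26.91583214616048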
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) achieve high performance on mathematical reasoning, but these results can be inflated by training data leakage or superficial pattern matching rather than genuine reasoning. To this end, an adversarial perturbation-based evaluation is needed to measure true mathematical reasoning ability. Current rule-based perturbation methods often generate ill-posed questions and impede the systematic evaluation of question difficulty and the evolution of benchmarks. To bridge this gap, we propose RIDE, a novel adversarial question-rewriting framework that leverages Item Response Theory (IRT) to rigorously measure question difficulty and to generate intrinsically more challenging, well-posed variations of mathematical problems. We employ 35 LLMs to simulate students and build a difficulty ranker from their responses. This ranker provides a reward signal during reinforcement learning and guides a question-rewriting model to reformulate existing questions across difficulty levels. Applying RIDE to competition-level mathematical benchmarks yields perturbed versions that degrade advanced LLM performance, with experiments showing an average 21.73% drop across 26 models, thereby exposing limited robustness in mathematical reasoning and confirming the validity of our evaluation approach.
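Abstract (reference translation): Large language models (LLMs) achieve high performance on mathematical reasoning, but these results may be inflated by training data leakage or superficial pattern matching rather than by genuine reasoning. Measuring true mathematical reasoning ability therefore requires an evaluation based on adversarial perturbation. Current rule-based perturbation methods often generate ill-posed questions, and they hinder the systematic evaluation of question difficulty and the evolution of benchmarks. To bridge this gap, the paper proposes RIDE, a novel adversarial question-rewriting framework that uses Item Response Theory (IRT) to rigorously measure question difficulty and to generate intrinsically more challenging, well-posed variations of mathematical problems. The authors use 35 LLMs to simulate students and build a difficulty ranker from their responses. The ranker provides a reward signal during reinforcement learning and guides a question-rewriting model to reformulate existing questions across difficulty levels. Applying RIDE to competition-level mathematical benchmarks yields perturbed versions that degrade the performance of advanced LLMs, with an average drop of 21.73% across 26 models, which exposes limited robustness in mathematical reasoning and confirms the validity of the evaluation approach.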
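
Illustrative note: the abstract does not say which IRT variant RIDE uses or how its difficulty ranker is trained. The following is only a minimal sketch of the general idea, assuming the one-parameter (Rasch) IRT model fitted by plain gradient ascent on a binary response matrix (rows are simulated students, i.e. LLMs; columns are questions). All function names, the toy data, and the hyperparameters are hypothetical and are not taken from the paper.

```python
import math


def rasch_prob(ability, difficulty):
    """Probability of a correct answer under the 1PL (Rasch) model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def fit_rasch(responses, steps=500, lr=0.1):
    """Jointly estimate abilities and difficulties by gradient ascent.

    responses[m][q] is 1 if model m answered question q correctly, else 0.
    Assumption: every row and column contains both 0s and 1s; otherwise
    the maximum-likelihood estimate is unbounded.
    """
    n_models = len(responses)
    n_items = len(responses[0])
    abilities = [0.0] * n_models
    difficulties = [0.0] * n_items
    for _ in range(steps):
        grad_a = [0.0] * n_models
        grad_d = [0.0] * n_items
        for m in range(n_models):
            for q in range(n_items):
                # Residual = observed outcome minus predicted probability.
                resid = responses[m][q] - rasch_prob(abilities[m], difficulties[q])
                grad_a[m] += resid  # d(log-likelihood) / d(ability)
                grad_d[q] -= resid  # d(log-likelihood) / d(difficulty)
        abilities = [a + lr * g for a, g in zip(abilities, grad_a)]
        difficulties = [d + lr * g for d, g in zip(difficulties, grad_d)]
        # The model is only identified up to a common shift, so center the
        # difficulties at zero and shift abilities by the same amount.
        mean_d = sum(difficulties) / n_items
        difficulties = [d - mean_d for d in difficulties]
        abilities = [a - mean_d for a in abilities]
    return abilities, difficulties


def rank_by_difficulty(difficulties):
    """Return question indices ordered from easiest to hardest."""
    return sorted(range(len(difficulties)), key=lambda q: difficulties[q])


# Toy data (hypothetical): 4 simulated students, 3 questions.
toy_responses = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
    [1, 0, 0],
]
toy_abilities, toy_difficulties = fit_rasch(toy_responses)
toy_ranking = rank_by_difficulty(toy_difficulties)
```

In a setup like the one the abstract describes, estimated difficulties of this kind could serve as the basis for a scalar reward that favors rewritten questions judged harder than the original; how RIDE actually constructs its ranker and reward is specified in the paper, not here.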


Related papers