Fugu-MT Paper Translation (Abstract): Decomposing Elements of Problem Solving: What "Math" Does RL Teach?

Paper overview: Decomposing Elements of Problem Solving: What "Math" Does RL Teach?

arxiv url: http://arxiv.org/abs/2505.22756v1
Date: Wed, 28 May 2025 18:18:49 GMT
Status: Translation complete
Last updated in system: 2025-05-30 18:14:07.468877
Title: Decomposing Elements of Problem Solving: What "Math" Does RL Teach?
Title (reference translation): Decomposing the Elements of Problem Solving: What "Math" Does RL Teach?
Authors: Tian Qin, Core Francisco Park, Mujin Kwun, Aaron Walsman, Eran Malach, Nikhil Anand, Hidenori Tanaka, David Alvarez-Melis
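Abstract summary: We decompose problem solving into fundamental capabilities: Plan, Execute, and Verify. We show that RL-trained models struggle with fundamentally new problems, hitting a 'coverage wall' due to insufficient planning skills. Our findings provide insights into the role of RL in enhancing LLM reasoning, expose key limitations, and suggest a path toward overcoming these barriers.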
Reference score (independently computed attention score): 22.517954679764244
License: http://creativecommons.org/publicdomain/zero/1.0/
Abstract: Mathematical reasoning tasks have become prominent benchmarks for assessing the reasoning capabilities of LLMs, especially with reinforcement learning (RL) methods such as GRPO showing significant performance gains. However, accuracy metrics alone do not support fine-grained assessment of capabilities and fail to reveal which problem-solving skills have been internalized. To better understand these capabilities, we propose to decompose problem solving into fundamental capabilities: Plan (mapping questions to sequences of steps), Execute (correctly performing solution steps), and Verify (identifying the correctness of a solution). Empirically, we find that GRPO mainly enhances the execution skill, improving execution robustness on problems the model already knows how to solve, a phenomenon we call temperature distillation. More importantly, we show that RL-trained models struggle with fundamentally new problems, hitting a 'coverage wall' due to insufficient planning skills. To explore RL's impact more deeply, we construct a minimal, synthetic solution-tree navigation task as an analogy for mathematical problem-solving. This controlled setup replicates our empirical findings, confirming RL primarily boosts execution robustness. Importantly, in this setting, we identify conditions under which RL can potentially overcome the coverage wall through improved exploration and generalization to new solution paths. Our findings provide insights into the role of RL in enhancing LLM reasoning, expose key limitations, and suggest a path toward overcoming these barriers. Code is available at https://github.com/cfpark00/RL-Wall.
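Abstract (reference translation): Mathematical reasoning tasks have become prominent benchmarks for evaluating the reasoning capabilities of LLMs, particularly as reinforcement learning (RL) methods such as GRPO show significant performance gains. However, accuracy metrics alone do not support fine-grained assessment of capabilities and cannot reveal which problem-solving skills have been internalized. To better understand these capabilities, we propose decomposing problem solving into fundamental capabilities: Plan (mapping questions to sequences of steps), Execute (correctly performing solution steps), and Verify (identifying the correctness of a solution). Experimentally, we find that GRPO mainly enhances the execution skill, improving execution robustness on problems the model already knows how to solve, a phenomenon we call temperature distillation. More importantly, we show that RL-trained models struggle with fundamentally new problems, hitting a 'coverage wall' due to insufficient planning skills. To explore the impact of RL more deeply, we construct a minimal, synthetic solution-tree navigation task as an analogy for mathematical problem solving. This controlled setup reproduces our empirical findings and confirms that RL primarily boosts execution robustness. Importantly, in this setting, we identify conditions under which RL can potentially overcome the coverage wall through improved exploration and generalization to new solution paths. Our findings provide insights into the role of RL in enhancing LLM reasoning, expose key limitations, and suggest a path toward overcoming these barriers. Code is available at https://github.com/cfpark00/RL-Wall.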


Related papers