Fugu-MT Paper Translation (Abstract): RL Squeezes, SFT Expands: A Comparative Study of Reasoning LLMs

Paper overview: RL Squeezes, SFT Expands: A Comparative Study of Reasoning LLMs

arxiv url: http://arxiv.org/abs/2509.21128v1
Date: Thu, 25 Sep 2025 13:18:57 GMT
Status: Translation complete
Last updated in system: 2025-09-26 20:58:12.926653
Title: RL Squeezes, SFT Expands: A Comparative Study of Reasoning LLMs
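Title (reference translation): RL Squeezes, SFT Expands: A Comparative Study of Reasoning LLMs
Authors: Kohsei Matsutani, Shota Takashiro, Gouki Minegishi, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo
Abstract summary: Large language models (LLMs) are typically trained with reinforcement learning (RL) with verifiable rewards (RLVR) to improve their reasoning abilities. This paper proposes a novel analysis framework that quantifies reasoning paths and captures their qualitative changes under each training process.
Reference score (attention measure computed by this site): 40.196347794452485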
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) are typically trained by reinforcement learning (RL) with verifiable rewards (RLVR) and supervised fine-tuning (SFT) on reasoning traces to improve their reasoning abilities. However, how these methods shape reasoning capabilities remains largely elusive. Going beyond an accuracy-based investigation of how these two components sculpt the reasoning process, this paper introduces a novel analysis framework that quantifies reasoning paths and captures their qualitative changes under each training process (with models of 1.5B, 7B, and 14B parameters on mathematical domains). Specifically, we investigate the reasoning process at two levels of granularity: the trajectory-level, which examines complete reasoning outputs, and the step-level, which analyzes reasoning graphs whose nodes correspond to individual reasoning steps. Notably, clustering of unique reasoning trajectories shows complementary effects: RL compresses incorrect trajectories, whereas SFT expands correct ones. Step-level analysis reveals that RL steepens (about 2.5 times), while SFT flattens (reduced to about one-third), the decay rates of node visitation frequency, degree, and betweenness centrality distributions in the reasoning graph. This indicates that RL concentrates reasoning functionality into a small subset of steps, while SFT homogenizes it across many steps. Furthermore, by evaluating the reasoning graph topologies from multiple perspectives, we delineate the shared and distinct characteristics of RL and SFT. Our work presents a novel reasoning path perspective that explains why the current best practice of two-stage training, with SFT followed by RL, is successful, and offers practical implications for data construction and more efficient learning approaches.
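Abstract (reference translation): Large language models (LLMs) are typically trained with reinforcement learning (RL) with verifiable rewards (RLVR) and with supervised fine-tuning (SFT) on reasoning traces to improve their reasoning abilities. However, how these methods shape reasoning capabilities remains largely unexplained. Going beyond an accuracy-based investigation of how these two components shape the reasoning process, this paper proposes a novel analysis framework that quantifies reasoning paths and captures their qualitative changes under each training process (using models of 1.5B, 7B, and 14B parameters on mathematical domains). Specifically, the reasoning process is examined at two levels of granularity: the trajectory level, which examines complete reasoning outputs, and the step level, which analyzes reasoning graphs whose nodes correspond to individual reasoning steps. Notably, clustering of unique reasoning trajectories shows complementary effects: RL compresses incorrect trajectories, whereas SFT expands correct ones. Step-level analysis reveals that RL steepens (by about 2.5 times), while SFT flattens (to about one-third), the decay rates of the node visitation frequency, degree, and betweenness centrality distributions in the reasoning graph. This indicates that RL concentrates reasoning functionality into a small subset of steps, while SFT homogenizes it across many steps. Furthermore, by evaluating the reasoning graph topologies from multiple perspectives, the paper clarifies the shared and distinct characteristics of RL and SFT. The work presents a reasoning-path perspective that explains why the current best practice of two-stage training, with SFT followed by RL, is successful, and it offers practical implications for data construction and more efficient learning approaches.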
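
The step-level analysis in the abstract compares how quickly node statistics fall off across the reasoning graph. The following is a minimal sketch of that idea, not the authors' implementation. It assumes each trajectory is already given as a list of step IDs (for example, cluster labels of reasoning steps), and it estimates a decay rate by fitting log(value) against rank with ordinary least squares. The function names, the toy trajectories, and the exponential rank-decay fit are all assumptions made for illustration; the abstract does not specify the fitting procedure. Betweenness centrality is omitted for brevity.

```python
import math
from collections import Counter


def visitation_counts(trajectories):
    """Count how often each node (step ID) is visited across all trajectories."""
    counts = Counter()
    for traj in trajectories:
        counts.update(traj)
    return counts


def degree_counts(trajectories):
    """Degree of each node in the undirected graph of consecutive-step transitions."""
    neighbors = {}
    for traj in trajectories:
        for a, b in zip(traj, traj[1:]):
            if a == b:
                continue  # ignore self-loops in this sketch
            neighbors.setdefault(a, set()).add(b)
            neighbors.setdefault(b, set()).add(a)
    return {node: len(adj) for node, adj in neighbors.items()}


def decay_rate(values):
    """Fit log(value) ~ c - rate * rank by least squares and return rate.

    Assumption: an exponential decay over rank; a larger rate means the
    statistic is concentrated in fewer nodes.
    """
    ranked = sorted((v for v in values if v > 0), reverse=True)
    n = len(ranked)
    if n < 2:
        raise ValueError("need at least two positive values to fit a decay rate")
    xs = list(range(n))
    ys = [math.log(v) for v in ranked]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx


# Hypothetical toy data: one set of trajectories keeps returning to node "a",
# the other spreads visits evenly over all nodes.
concentrated = [["a", "b", "a"], ["a", "c", "a"], ["a", "b", "a"]]
spread = [["a", "b", "c"], ["b", "c", "a"], ["c", "a", "b"]]

rate_concentrated = decay_rate(visitation_counts(concentrated).values())
rate_spread = decay_rate(visitation_counts(spread).values())
print(rate_concentrated > rate_spread)
```

In this toy setting the concentrated trajectories give a steeper decay than the evenly spread ones, which mirrors in miniature the contrast the abstract reports between RL (steeper) and SFT (flatter). The same decay_rate function can be applied to the values of degree_counts.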

Related papers