Fugu-MT Paper Translation (Abstract): SARL: Label-Free Reinforcement Learning by Rewarding Reasoning Topology

Paper overview: SARL: Label-Free Reinforcement Learning by Rewarding Reasoning Topology

arxiv url: http://arxiv.org/abs/2603.27977v1
Date: Mon, 30 Mar 2026 02:54:48 GMT
Status: Translation complete
Last updated in system: 2026-03-31 23:18:45.20264
Title: SARL: Label-Free Reinforcement Learning by Rewarding Reasoning Topology
Title (reference translation): SARL: Label-Free Reinforcement Learning by Rewarding Reasoning Topology
Authors: Yifan Wang, Bolian Li, David Cho, Ruqi Zhang, Fanping Sui, Ananth Grama
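Abstract summary: We introduce structure-aware reinforcement learning (SARL), a label-free framework that constructs a per-response Reasoning Map from intermediate thinking steps. Experiments on Qwen3-4B show that SARL surpasses ground-truth-based RL and prior label-free RL baselines.
Reference score (independently computed attention measure): 29.219491041433375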
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement learning has become central to improving large reasoning models, but its success still relies heavily on verifiable rewards or labeled supervision. This limits its applicability to open-ended domains where correctness is ambiguous and cannot be verified. Moreover, reasoning trajectories remain largely unconstrained, and optimization toward the final answer can favor early exploitation over generalization. In this work, we ask whether general reasoning ability can be improved by teaching models how to think (the structure of reasoning) rather than what to produce (the outcome of reasoning), and we extend traditional RLVR to open-ended settings. We introduce structure-aware reinforcement learning (SARL), a label-free framework that constructs a per-response Reasoning Map from intermediate thinking steps and rewards its small-world topology, inspired by complex networks and the functional organization of the human brain. SARL encourages reasoning trajectories that are both locally coherent and globally efficient, shifting supervision from destination to path. Our experiments on Qwen3-4B show that SARL surpasses ground-truth-based RL and prior label-free RL baselines, achieving the best average gains of 9.1% under PPO and 11.6% under GRPO on math tasks, and 34.6% under PPO and 30.4% under GRPO on open-ended tasks. Beyond good performance, SARL also exhibits lower KL divergence and higher policy entropy, indicating more stable and exploratory training and generalized reasoning ability.
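Abstract (reference translation): Reinforcement learning has become central to improving large reasoning models, but its success still depends heavily on verifiable rewards or labeled supervision. This restricts its use in open-ended domains where correctness is ambiguous and cannot be verified. In addition, reasoning trajectories are largely unconstrained, and optimizing toward the final answer can favor early exploitation over generalization. This work asks whether general reasoning ability can be improved by teaching a model how to think (the structure of reasoning) rather than what to produce (the outcome of reasoning), and extends conventional RLVR to open-ended settings. SARL (structure-aware reinforcement learning) is a label-free framework, inspired by complex networks and the functional organization of the human brain, that builds a per-response Reasoning Map from intermediate thinking steps and rewards its small-world topology. SARL encourages reasoning trajectories that are locally coherent and globally efficient, shifting supervision from the destination to the path. In experiments on Qwen3-4B, SARL outperforms ground-truth-based RL and prior label-free RL baselines, with best average gains of 9.1% under PPO and 11.6% under GRPO on math tasks, and 34.6% under PPO and 30.4% under GRPO on open-ended tasks. Beyond good performance, SARL shows lower KL divergence and higher policy entropy, indicating more stable and exploratory training and generalized reasoning ability.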
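
The abstract describes rewarding the small-world topology of a per-response Reasoning Map, favoring trajectories that are locally coherent and globally efficient. The sketch below is only an illustration of that idea and is not the authors' method: it assumes the map is already available as an undirected edge list over reasoning steps, uses the average clustering coefficient as a stand-in for local coherence and global efficiency (mean inverse shortest-path length) as a stand-in for global efficiency, and combines them by a simple product. The function names, the choice of metrics, and the product are all assumptions; how thinking steps are segmented and linked is not specified in the abstract and is left out.

```python
from collections import deque
from itertools import combinations


def average_clustering(adj):
    """Mean local clustering coefficient of an undirected graph."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue  # nodes with fewer than two neighbours contribute 0
        # Count edges that exist among this node's neighbours.
        links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


def global_efficiency(adj):
    """Mean inverse shortest-path length over all ordered node pairs."""
    n = len(adj)
    if n < 2:
        return 0.0
    total = 0.0
    for source in adj:
        # Breadth-first search gives hop distances from `source`.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        # Unreachable nodes are absent from `dist` and contribute 0.
        total += sum(1.0 / d for d in dist.values() if d > 0)
    return total / (n * (n - 1))


def topology_score(edges):
    """Hypothetical structural score: clustering times global efficiency.

    `edges` is an iterable of (step_a, step_b) pairs; self-loops are ignored.
    The product is an illustrative assumption, not the paper's reward.
    """
    adj = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    if not adj:
        return 0.0
    return average_clustering(adj) * global_efficiency(adj)
```

Under these assumptions, a purely linear chain of steps scores zero because it has no clustering, while a map whose steps link back to one another scores higher. A label-free reward of this kind needs no ground-truth answer, only the structure of the response itself.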

Related papers