Fugu-MT Paper Translation (Abstract): Reinforced Efficient Reasoning via Semantically Diverse Exploration

Paper overview: Reinforced Efficient Reasoning via Semantically Diverse Exploration

arxiv url: http://arxiv.org/abs/2601.05053v1
Date: Thu, 08 Jan 2026 15:56:44 GMT
Status: Translation complete
Last updated in system: 2026-01-09 17:01:53.266561
Title: Reinforced Efficient Reasoning via Semantically Diverse Exploration
Authors: Ziqi Zhao, Zhaochun Ren, Jiahong Zou, Liu Yang, Zhiwei Xu, Xuri Ge, Zhumin Chen, Xinyu Ma, Daiting Shi, Shuaiqiang Wang, Dawei Yin, Xin Xin
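Abstract summary: Reinforcement learning with verifiable rewards (RLVR) has proven effective in enhancing the reasoning of large language models (LLMs). This work proposes ROSE, an efficient reasoning method for LLMs based on semantically diverse exploration. The method incorporates a semantic-entropy-based branching strategy and an $\varepsilon$-exploration mechanism.
Reference score (independently computed attention score): 73.41112984160992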
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement learning with verifiable rewards (RLVR) has proven effective in enhancing the reasoning of large language models (LLMs). Monte Carlo Tree Search (MCTS)-based extensions improve upon vanilla RLVR (e.g., GRPO) by providing tree-based reasoning rollouts that enable fine-grained and segment-level credit assignment. However, existing methods still suffer from limited exploration diversity and inefficient reasoning. To address these challenges, we propose reinforced efficient reasoning via semantically diverse explorations, i.e., ROSE, for LLMs. To encourage more diverse reasoning exploration, our method incorporates a semantic-entropy-based branching strategy and an $\varepsilon$-exploration mechanism. The former operates on already sampled reasoning rollouts to capture semantic uncertainty and select branching points with high semantic divergence to generate new successive reasoning paths, whereas the latter stochastically initiates reasoning rollouts from the root, preventing the search process from becoming overly local. To improve efficiency, we design a length-aware segment-level advantage estimator that rewards concise and correct reasoning while penalizing unnecessarily long reasoning chains. Extensive experiments on various mathematical reasoning benchmarks with Qwen and Llama models validate the effectiveness and efficiency of ROSE. Code is available at https://github.com/ZiqiZhao1/ROSE-rl.
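Illustrative sketch (not the authors' implementation): the abstract names three ingredients, namely branching at points of high semantic entropy, an $\varepsilon$-exploration step that sometimes restarts from the root, and a length-aware advantage. The Python below is a minimal toy version of each, written only from the abstract's description. Every function name, the `penalty` coefficient, the use of precomputed semantic cluster labels, and the rollout-level (rather than segment-level) advantage are assumptions made for illustration; the paper and the linked repository define the actual method.

```python
import math
import random
from collections import Counter


def semantic_entropy(cluster_ids):
    """Shannon entropy (in nats) of the empirical distribution over semantic clusters.

    `cluster_ids` holds one label per sampled continuation; continuations with the
    same meaning are assumed to share a label (the clustering itself is not shown).
    """
    counts = Counter(cluster_ids)
    total = len(cluster_ids)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def pick_branch_point(entropy_per_position):
    """Return the index of the candidate position with the highest semantic entropy."""
    return max(range(len(entropy_per_position)), key=entropy_per_position.__getitem__)


def choose_start(entropy_per_position, epsilon, rng):
    """Epsilon-exploration: with probability `epsilon` restart from the root
    (returned as None); otherwise branch at the highest-entropy position."""
    if rng.random() < epsilon:
        return None
    return pick_branch_point(entropy_per_position)


def length_aware_advantages(rewards, lengths, penalty=0.5):
    """Hypothetical length-penalized advantage with a group-mean baseline.

    Each reward is reduced in proportion to the rollout's relative length, then the
    group mean is subtracted, so short correct rollouts score highest.
    """
    max_len = max(lengths)
    shaped = [r - penalty * (n / max_len) for r, n in zip(rewards, lengths)]
    baseline = sum(shaped) / len(shaped)
    return [s - baseline for s in shaped]


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy cluster labels for continuations sampled at three candidate positions.
    samples = [["a", "a", "a", "a"], ["a", "b", "c", "a"], ["a", "a", "b", "a"]]
    entropies = [semantic_entropy(s) for s in samples]
    print("branch at:", choose_start(entropies, epsilon=0.1, rng=rng))
    print("advantages:", length_aware_advantages([1.0, 1.0, 0.0], [100, 200, 200]))
```

In this toy setup, a position where all sampled continuations mean the same thing has zero entropy and is never preferred for branching, and of two correct rollouts the shorter one receives the larger advantage.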
