Fugu-MT Paper Translation (Summary): Breaking the Exploration Bottleneck: Rubric-Scaffolded Reinforcement Learning for General LLM Reasoning

Paper summary: Breaking the Exploration Bottleneck: Rubric-Scaffolded Reinforcement Learning for General LLM Reasoning

arxiv url: http://arxiv.org/abs/2508.16949v1
Date: Sat, 23 Aug 2025 08:47:31 GMT
Status: Translation complete
Last updated in system: 2025-08-26 18:43:45.269164
Title: Breaking the Exploration Bottleneck: Rubric-Scaffolded Reinforcement Learning for General LLM Reasoning
Title (reference translation): Rubric-Scaffolded Reinforcement Learning for General LLM Reasoning
Authors: Yang Zhou, Sunzhu Li, Shunyu Liu, Wenkai Fang, Jiale Zhao, Jingwen Yang, Jianwei Lv, Kongcheng Zhang, Yihe Zhou, Hengtong Lu, Wei Chen, Yan Xie, Mingli Song
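Abstract summary: Recent advances in large language models (LLMs) have underscored the potential of reinforcement learning (RL) to facilitate the emergence of reasoning capabilities. This paper proposes a novel instructional scaffolding framework for breaking the exploration bottleneck.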
Reference score (independently computed attention measure): 43.585741773885424
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advances in Large Language Models (LLMs) have underscored the potential of Reinforcement Learning (RL) to facilitate the emergence of reasoning capabilities. Despite the encouraging results, a fundamental dilemma persists as RL improvement relies on learning from high-quality samples, yet the exploration for such samples remains bounded by the inherent limitations of LLMs. This, in effect, creates an undesirable cycle in which what cannot be explored cannot be learned. In this work, we propose Rubric-Scaffolded Reinforcement Learning (RuscaRL), a novel instructional scaffolding framework designed to break the exploration bottleneck for general LLM reasoning. Specifically, RuscaRL introduces checklist-style rubrics as (1) explicit scaffolding for exploration during rollout generation, where different rubrics are provided as external guidance within task instructions to steer diverse high-quality responses. This guidance is gradually decayed over time, encouraging the model to internalize the underlying reasoning patterns; (2) verifiable rewards for exploitation during model training, where we can obtain robust LLM-as-a-Judge scores using rubrics as references, enabling effective RL on general reasoning tasks. Extensive experiments demonstrate the superiority of the proposed RuscaRL across various benchmarks, effectively expanding reasoning boundaries under the best-of-N evaluation. Notably, RuscaRL significantly boosts Qwen-2.5-7B-Instruct from 23.6 to 50.3 on HealthBench-500, surpassing GPT-4.1. Furthermore, our fine-tuned variant on Qwen3-30B-A3B-Instruct achieves 61.1 on HealthBench-500, outperforming leading LLMs including OpenAI-o3.
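Abstract (reference translation): Recent advances in large language models (LLMs) have highlighted the potential of reinforcement learning (RL) to promote the emergence of reasoning capabilities. Despite encouraging results, a fundamental dilemma remains: RL improvement depends on learning from high-quality samples, yet the search for such samples is bounded by the inherent limitations of the LLM itself. In effect, this creates an undesirable cycle in which what cannot be explored cannot be learned. This work proposes Rubric-Scaffolded Reinforcement Learning (RuscaRL), a new instructional scaffolding framework that breaks the exploration bottleneck for general LLM reasoning. Specifically, RuscaRL introduces checklist-style rubrics as (1) explicit scaffolding for exploration during rollout generation, with guidance that gradually decays over time to encourage the model to internalize the underlying reasoning patterns; and (2) verifiable rewards for exploitation during model training, where using rubrics as references yields robust LLM-as-a-Judge scores and enables effective RL on general reasoning tasks. Extensive experiments demonstrate the superiority of RuscaRL across various benchmarks, effectively expanding reasoning boundaries under best-of-N evaluation. Notably, RuscaRL raises Qwen-2.5-7B-Instruct from 23.6 to 50.3 on HealthBench-500, surpassing GPT-4.1. In addition, the variant fine-tuned on Qwen3-30B-A3B-Instruct achieves 61.1 on HealthBench-500, outperforming leading LLMs including OpenAI-o3.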
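The two roles the abstract assigns to rubrics (decaying guidance at rollout time, and a reference for scoring at training time) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the decay schedule, how rubric items are chosen per rollout, or how judge verdicts are aggregated. The linear decay, the random subset sampling, the point-weighted clipped score, and all names (`Criterion`, `scaffold_ratio`, `build_prompt`, `rubric_reward`) are assumptions made for illustration. The judge call itself is left out; its per-criterion verdicts are passed in as booleans.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One checklist item of a rubric (hypothetical structure)."""
    text: str      # statement shown to the policy as guidance and to the judge
    points: float  # assumed weight; negative for behavior that should be avoided


def scaffold_ratio(step: int, total_steps: int) -> float:
    """Share of rubric items revealed at a training step.

    Assumed schedule: linear decay from 1.0 to 0.0. The abstract only says
    the guidance is "gradually decayed over time".
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    return max(0.0, 1.0 - step / total_steps)


def build_prompt(instruction: str, rubric: list[Criterion],
                 ratio: float, rng: random.Random) -> str:
    """Append a random subset of rubric items to the task instruction.

    Sampling a fresh subset per rollout is one assumed way to provide
    "different rubrics" that steer diverse responses.
    """
    k = round(ratio * len(rubric))
    if k == 0:
        return instruction  # scaffolding fully removed
    shown = rng.sample(rubric, k)
    hints = "\n".join(f"- {c.text}" for c in shown)
    return f"{instruction}\n\nYour answer should satisfy:\n{hints}"


def rubric_reward(rubric: list[Criterion], satisfied: list[bool]) -> float:
    """Turn per-criterion judge verdicts into a reward in [0, 1].

    Assumed aggregation: earned points divided by the maximum achievable
    positive points, clipped to [0, 1].
    """
    if len(rubric) != len(satisfied):
        raise ValueError("one verdict per criterion is required")
    max_points = sum(c.points for c in rubric if c.points > 0)
    if max_points == 0:
        return 0.0
    earned = sum(c.points for c, ok in zip(rubric, satisfied) if ok)
    return min(1.0, max(0.0, earned / max_points))
```

In a training loop, `build_prompt` would be called once per rollout with the current `scaffold_ratio`, while `rubric_reward` would always score the response against the full rubric, so the reward signal stays the same as the guidance fades.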

Related papers