Fugu-MT Paper Translation (Abstract): You Need Reasoning to Learn Reasoning: The Limitations of Label-Free RL in Weak Base Models

Paper overview: You Need Reasoning to Learn Reasoning: The Limitations of Label-Free RL in Weak Base Models

arxiv url: http://arxiv.org/abs/2511.04902v1
Date: Fri, 07 Nov 2025 01:05:11 GMT
Status: Translation complete
Last updated in system: 2025-11-10 21:00:44.632679
Title: You Need Reasoning to Learn Reasoning: The Limitations of Label-Free RL in Weak Base Models
Title (reference translation): Reasoning to Learn Reasoning: The Limits of Label-Free RL in Weak Base Models
Authors: Shuvendu Roy, Hossein Hajimirsadeghi, Mengyao Zhai, Golnoosh Samei,
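Abstract summary: We investigate how well label-free RL approaches generalize to base models with limited reasoning capability. We find that label-free RL depends heavily on the model's pre-existing reasoning capability. We propose a simple yet effective label-free RL method that uses curriculum learning to introduce harder problems progressively.
Reference score (independently computed attention score): 12.14455026524814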
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Recent advances in large language models have demonstrated the promise of unsupervised reinforcement learning (RL) methods for enhancing reasoning capabilities without external supervision. However, the generalizability of these label-free RL approaches to smaller base models with limited reasoning capabilities remains unexplored. In this work, we systematically investigate the performance of label-free RL methods across different model sizes and reasoning strengths, from 0.5B to 7B parameters. Our empirical analysis reveals critical limitations: label-free RL is highly dependent on the base model's pre-existing reasoning capability, with performance often degrading below baseline levels for weaker models. We find that smaller models fail to generate sufficiently long or diverse chain-of-thought reasoning to enable effective self-reflection, and that training data difficulty plays a crucial role in determining success. To address these challenges, we propose a simple yet effective method for label-free RL that uses curriculum learning to progressively introduce harder problems and masks no-majority rollouts during training. Additionally, we introduce a data curation pipeline to generate samples with predefined difficulty. Our approach demonstrates consistent improvements across all model sizes and reasoning capabilities, providing a path toward more robust unsupervised RL that can bootstrap reasoning abilities in resource-constrained models. We make our code available at https://github.com/BorealisAI/CuMa
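Abstract (reference translation): Recent advances in large language models have demonstrated the potential of unsupervised reinforcement learning (RL) methods for improving reasoning capabilities without external supervision. However, the generalizability of these label-free RL approaches to smaller base models with limited reasoning capability has not yet been established. In this work, we systematically examine the performance of label-free RL methods across different model sizes and reasoning strengths, from 0.5B to 7B parameters. Label-free RL depends heavily on the base model's pre-existing reasoning capability, and performance often degrades below baseline levels for weaker models. We find that smaller models cannot generate chain-of-thought reasoning that is long or diverse enough to enable effective self-reflection, and that training data difficulty plays an important role in determining success. To address these challenges, we propose a simple yet effective label-free RL method that uses curriculum learning. We also introduce a data curation pipeline to generate samples with predefined difficulty. The proposed method shows consistent improvements across all model sizes and reasoning capabilities, offering a path toward more robust unsupervised RL that can bootstrap reasoning abilities in resource-constrained models. The code is published on GitHub: https://github.com/BorealisAI/CuMa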
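
The two training-time ideas named in the abstract, masking no-majority rollouts and a difficulty curriculum, can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). It assumes that the label-free reward is a majority vote over a prompt's sampled answers, that "no majority" means the top answer does not exceed half of the votes, and that the curriculum unlocks difficulty levels on a linear schedule. The function names `majority_vote_rewards` and `curriculum_stage`, the `min_share` threshold, and the schedule are all hypothetical.

```python
from collections import Counter
from typing import List, Optional, Tuple


def majority_vote_rewards(
    answers: List[str], min_share: float = 0.5
) -> Tuple[Optional[str], List[float], bool]:
    """Score one prompt's rollouts against their own majority answer.

    Returns (pseudo_label, rewards, keep). `keep` is False when no answer
    exceeds `min_share` of the votes, in which case the prompt is masked.
    The 0.5 threshold is an assumption, not a value taken from the paper.
    """
    if not answers:
        return None, [], False
    top_answer, top_count = Counter(answers).most_common(1)[0]
    if top_count / len(answers) <= min_share:
        # No clear majority: assume the prompt gives no training signal.
        return None, [0.0] * len(answers), False
    # Rollouts that agree with the majority answer get reward 1, others 0.
    rewards = [1.0 if a == top_answer else 0.0 for a in answers]
    return top_answer, rewards, True


def curriculum_stage(step: int, total_steps: int, num_levels: int) -> int:
    """Highest difficulty level (1-based) unlocked at `step`.

    Assumed linear schedule: level 1 at the start, all levels by the end.
    """
    frac = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return min(num_levels, 1 + int(frac * num_levels))


if __name__ == "__main__":
    print(majority_vote_rewards(["4", "4", "5", "4"]))  # clear majority, kept
    print(majority_vote_rewards(["1", "2", "3", "1"]))  # no majority, masked
    print([curriculum_stage(s, 100, 5) for s in (0, 50, 100)])
```

In a training loop under these assumptions, prompts with `keep == False` would be excluded from the policy-gradient loss, and each batch would be sampled only from problems whose difficulty level is at or below `curriculum_stage(step, total_steps, num_levels)`.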

Related papers