Fugu-MT Paper Translation (Summary): KnowRL: Boosting LLM Reasoning via Reinforcement Learning with Minimal-Sufficient Knowledge Guidance

Paper summary: KnowRL: Boosting LLM Reasoning via Reinforcement Learning with Minimal-Sufficient Knowledge Guidance

arxiv url: http://arxiv.org/abs/2604.12627v1
Date: Tue, 14 Apr 2026 11:53:23 GMT
Status: Translation complete
Last updated in system: 2026-04-15 19:11:32.420443
Title: KnowRL: Boosting LLM Reasoning via Reinforcement Learning with Minimal-Sufficient Knowledge Guidance
Authors: Linhao Yu, Tianmeng Yang, Siyu Ding, Renren Jin, Naibin Gu, Xiangzhao Hao, Shuaiyi Nie, Deyi Xiong, Weichong Yin, Yu Sun, Hua Wu
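Abstract summary: We propose KnowRL (Knowledge-Guided Reinforcement Learning), an RL training framework that treats hint design as a minimal-sufficient guidance problem. KnowRL decomposes guidance into atomic knowledge points (KPs) and uses Constrained Subset Search (CSS) to construct compact, interaction-aware subsets for training. Across eight reasoning benchmarks at the 1.5B scale, KnowRL-Nemotron-1.5B consistently outperforms strong RL and hinting baselines.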
Reference score (independently computed attention score): 50.70511573232489
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: RLVR improves reasoning in large language models, but its effectiveness is often limited by severe reward sparsity on hard problems. Recent hint-based RL methods mitigate sparsity by injecting partial solutions or abstract templates, yet they typically scale guidance by adding more tokens, which introduce redundancy, inconsistency, and extra training overhead. We propose KnowRL (Knowledge-Guided Reinforcement Learning), an RL training framework that treats hint design as a minimal-sufficient guidance problem. During RL training, KnowRL decomposes guidance into atomic knowledge points (KPs) and uses Constrained Subset Search (CSS) to construct compact, interaction-aware subsets for training. We further identify a pruning interaction paradox -- removing one KP may help while removing multiple such KPs can hurt -- and explicitly optimize for robust subset curation under this dependency structure. We train KnowRL-Nemotron-1.5B from OpenMath-Nemotron-1.5B. Across eight reasoning benchmarks at the 1.5B scale, KnowRL-Nemotron-1.5B consistently outperforms strong RL and hinting baselines. Without KP hints at inference, KnowRL-Nemotron-1.5B reaches 70.08 average accuracy, already surpassing Nemotron-1.5B by +9.63 points; with selected KPs, performance improves to 74.16, establishing a new state of the art at this scale. The model, curated training data, and code are publicly available at https://github.com/Hasuer/KnowRL.
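
The abstract names the pruning interaction paradox and Constrained Subset Search (CSS) but does not describe how CSS is implemented. What follows is a hypothetical toy sketch, not the authors' method (their code is in the linked repository). It assumes a made-up oracle `solve_rate` that returns a policy's pass rate, in percent, when the policy is given a subset of KP hints. In this toy, k1 is required, k2 and k3 are interchangeable, and every extra hint costs a little accuracy. Under those assumptions, each of k2, k3, and k4 can be removed on its own without loss, yet removing all three at once drops the pass rate to zero. A brute-force search for the smallest subset that keeps at least the full-hint pass rate stands in for CSS here. It avoids that trap because it scores subsets jointly.

```python
from itertools import combinations

# Hypothetical knowledge points (KPs) for a single training problem.
KPS = ("k1", "k2", "k3", "k4")


def solve_rate(subset: frozenset) -> int:
    """Toy oracle (an assumption, not from the paper): pass rate in percent
    when the policy is given the KP hints in `subset`.

    k1 is required, at least one of the interchangeable k2/k3 is required,
    and every hint shown costs 5 points (a stand-in for redundancy overhead).
    """
    if "k1" not in subset or not ({"k2", "k3"} & subset):
        return 0
    return 80 - 5 * len(subset)


def individually_prunable(full: frozenset) -> list:
    """KPs whose removal *on its own* does not lower the pass rate."""
    base = solve_rate(full)
    return [k for k in sorted(full) if solve_rate(full - {k}) >= base]


def minimal_sufficient(full: frozenset, min_rate: int):
    """Brute-force stand-in for a constrained subset search: return the
    smallest subset whose pass rate is at least `min_rate`, preferring the
    highest pass rate among subsets of that size. Returns None if none qualify."""
    for size in range(len(full) + 1):
        candidates = [frozenset(c) for c in combinations(sorted(full), size)]
        feasible = [c for c in candidates if solve_rate(c) >= min_rate]
        if feasible:
            return max(feasible, key=solve_rate)
    return None


if __name__ == "__main__":
    full = frozenset(KPS)
    prunable = individually_prunable(full)
    print("full-hint rate:", solve_rate(full))  # 60
    print("individually prunable:", prunable)  # ['k2', 'k3', 'k4']
    # Pruning all "safe" KPs together leaves only k1: the interaction paradox.
    print("rate after joint pruning:", solve_rate(full - set(prunable)))  # 0
    best = minimal_sufficient(full, min_rate=solve_rate(full))
    print("minimal-sufficient subset:", sorted(best), solve_rate(best))  # ['k1', 'k2'] 70
```

Exhaustive enumeration grows exponentially with the number of KPs and is only meant to make the idea concrete. The search the paper actually uses may differ.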

List of related papers