Fugu-MT Paper Translation (Abstract): Reinforcement Learning vs. Distillation: Understanding Accuracy and Capability in LLM Reasoning

Paper summary: Reinforcement Learning vs. Distillation: Understanding Accuracy and Capability in LLM Reasoning

arxiv url: http://arxiv.org/abs/2505.14216v1
Date: Tue, 20 May 2025 11:22:34 GMT
Status: translation complete
Last updated in system: 2025-05-21 14:49:53.130555
Title: Reinforcement Learning vs. Distillation: Understanding Accuracy and Capability in LLM Reasoning
Authors: Minwu Kim, Anubhav Shrestha, Safal Shrestha, Aadim Nepal, Keith Ross
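Abstract summary: We show that reinforcement learning with verifiable rewards (RLVR) improves overall accuracy but fails to improve capability. Distillation reliably improves accuracy by learning strong reasoning patterns, but it improves capability only when new knowledge is introduced.
Reference score (attention score computed by this site): 0.0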
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent studies have shown that reinforcement learning with verifiable rewards (RLVR) enhances overall accuracy but fails to improve capability, while distillation can improve both. In this paper, we investigate the mechanisms behind these phenomena. First, we demonstrate that RLVR does not improve capability because it focuses on improving the accuracy of the less-difficult questions to the detriment of the accuracy of the most difficult questions, thereby leading to no improvement in capability. Second, we find that RLVR does not merely increase the success probability for the less difficult questions, but in our small model settings produces quality responses that were absent in its output distribution before training. In addition, we show these responses are neither noticeably longer nor feature more reflection-related keywords, underscoring the need for more reliable indicators of response quality. Third, we show that while distillation reliably improves accuracy by learning strong reasoning patterns, it only improves capability when new knowledge is introduced. Moreover, when distilling only with reasoning patterns and no new knowledge, the accuracy of the less-difficult questions improves to the detriment of the most difficult questions, similar to RLVR. Together, these findings offer a clearer understanding of how RLVR and distillation shape reasoning behavior in language models.
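
The distinction between accuracy and capability in the abstract can be illustrated with a small numerical sketch. The abstract does not define either term, so the code below assumes that accuracy means the mean per-question success probability of a single sample and that capability means pass@k (the fraction of questions solved at least once in k independent samples). The function names and the per-question probabilities are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def accuracy(success_probs):
    """Mean single-sample success probability over all questions (assumed definition)."""
    return sum(success_probs) / len(success_probs)


def pass_at_k(success_probs, k):
    """Expected fraction of questions solved at least once in k independent samples
    (assumed definition of capability)."""
    return sum(1.0 - (1.0 - p) ** k for p in success_probs) / len(success_probs)


# Hypothetical per-question success probabilities: two easier questions, one hard one.
before = [0.50, 0.50, 0.05]
# After training (hypothetical): easier questions improve, the hard one is lost entirely.
after = [0.95, 0.95, 0.00]

K = 256
print("accuracy before:", accuracy(before), "after:", accuracy(after))
print("pass@k   before:", pass_at_k(before, K), "after:", pass_at_k(after, K))
```

Under these made-up numbers, accuracy rises while pass@k falls, because the hard question's success probability drops to zero and no amount of sampling recovers it. This matches the pattern the abstract describes, where gains on less-difficult questions come at the expense of the most difficult ones.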

Related papers

The Invisible Leash: Why RLVR May Not Escape Its Origin [48.915013455847856]
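Recent advances in large reasoning models highlight Reinforcement Learning with Verifiable Rewards (RLVR) as a promising way to improve AI capabilities. This work is a theoretical and empirical study that offers new insight into the potential limits of RLVR. It identifies an entropy-reward trade-off: while RLVR reliably improves precision, it may progressively narrow exploration and overlook correct but underrepresented solutions.
Paper reference translation (metadata) (2025-07-20T07:04:08Z)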
SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning [95.28059121743831]
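Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for training large language models (LLMs) on complex reasoning tasks. This paper proposes a Self-aware Weakness-driven problem Synthesis framework (SwS) that systematically identifies model deficiencies and leverages them for problem augmentation. SwS enables robust generalization by letting the model identify and address its own weaknesses during RL, yielding average performance gains of 10.0% and 7.7% on 7B and 32B models, respectively.
Paper reference translation (metadata) (2025-06-10T17:02:00Z)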
AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning [50.02117478165099]
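We show that large-scale reinforcement learning can significantly improve the reasoning capabilities of strong small- and mid-sized models. Training proceeds first on math-only prompts, then on code-only prompts.
Paper reference translation (metadata) (2025-05-22T08:50:47Z)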
Concise Reasoning via Reinforcement Learning [13.657506042120167]
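We revisit the core principles of reinforcement learning (RL) and uncover a natural correlation between conciseness and accuracy. We show that introducing a secondary stage of RL training, using a very small set of problems, significantly shortens chains of thought.
Paper reference translation (metadata) (2025-04-07T15:35:54Z)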
Are LLMs Really Not Knowledgable? Mining the Submerged Knowledge in LLMs' Memory [15.986679553468989]
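Large language models (LLMs) have shown promise as potential knowledge bases. Yet LLMs often struggle with question-answering tasks and are prone to hallucination. We develop SkipUnsure, a method that improves answer accuracy by leveraging knowledge that the model has captured but does not express.
Paper reference translation (metadata) (2024-12-30T10:29:18Z)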
Rejection Improves Reliability: Training LLMs to Refuse Unknown Questions Using RL from Knowledge Feedback [14.120154004011084]
Large Language Models (LLMs) often generate erroneous outputs, known as hallucinations. We propose a novel alignment framework called Reinforcement Learning from Knowledge Feedback (RLKF).
Paper reference translation (metadata) (2024-03-27T08:39:56Z)
R-Tuning: Instructing Large Language Models to Say `I Don't Know' [66.11375475253007]
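Large language models (LLMs) have revolutionized many domains with their strong performance, but they still face challenges. Previous instruction tuning methods force the model to complete a sentence regardless of whether it actually has the relevant knowledge. We propose a new approach called Refusal-Aware Instruction Tuning (R-Tuning). Experiments show that R-Tuning effectively improves a model's ability to answer known questions and to refrain from answering unknown ones.
Paper reference translation (metadata) (2023-11-16T08:45:44Z)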
Improving the Reliability of Large Language Models by Leveraging Uncertainty-Aware In-Context Learning [76.98542249776257]
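Large language models often face the challenge of "hallucination". This study proposes an uncertainty-aware in-context learning framework that enables the model to enhance or reject its output in response to uncertainty.
Paper reference translation (metadata) (2023-10-07T12:06:53Z)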
CCLF: A Contrastive-Curiosity-Driven Learning Framework for Sample-Efficient Reinforcement Learning [56.20123080771364]
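We develop a model-agnostic Contrastive-Curiosity-driven Learning Framework (CCLF) for reinforcement learning. CCLF fully exploits sample importance and improves learning efficiency in a self-supervised manner. We evaluate the approach on the DeepMind Control Suite, Atari, and MiniGrid benchmarks.
Paper reference translation (metadata) (2022-05-02T14:42:05Z)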

The related-papers list is generated automatically from the titles and abstracts of papers on this site.
