Fugu-MT paper translation (abstract): Reinforcement Learning Fine-Tuning Enhances Activation Intensity and Diversity in the Internal Circuitry of LLMs

Paper overview: Reinforcement Learning Fine-Tuning Enhances Activation Intensity and Diversity in the Internal Circuitry of LLMs

arxiv url: http://arxiv.org/abs/2509.21044v1
Date: Thu, 25 Sep 2025 11:51:05 GMT
Status: Translation complete
Last updated in system: 2025-09-26 20:58:12.879475
Title: Reinforcement Learning Fine-Tuning Enhances Activation Intensity and Diversity in the Internal Circuitry of LLMs
Title (reference translation): Reinforcement Learning of Activation Intensity and Diversity in the Internal Circuitry of LLMs
Authors: Honglin Zhang, Qianyue Hao, Fengli Xu, Yong Li
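Abstract summary: Large language models (LLMs) acquire extensive prior knowledge through large-scale pretraining and can be further enhanced through supervised fine-tuning (SFT) or reinforcement learning (RL)-based post-training. A growing body of evidence shows that RL fine-tuning improves LLM capabilities beyond what SFT alone achieves. However, the mechanisms by which RL fine-tuning enhances the capabilities of various LLMs with distinct intrinsic characteristics remain underexplored.
Reference score (independently computed attention metric): 13.036236161537147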
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) acquire extensive prior knowledge through large-scale pretraining and can be further enhanced via supervised fine-tuning (SFT) or reinforcement learning (RL)-based post-training. A growing body of evidence has shown that RL fine-tuning improves the capability of LLMs beyond what SFT alone achieves. However, the underlying mechanisms why RL fine-tuning is able to enhance the capability of various LLMs with distinct intrinsic characteristics remain underexplored. In this study, we draw inspiration from prior work on edge attribution patching (EAP) to investigate the internal differences of LLMs before and after RL fine-tuning. Our analysis across multiple model families shows two robust effects of online RL post-training: (i) an overall increase in activation intensity, indicating that more internal pathways are engaged and their signals become stronger, and (ii) greater diversity in activation patterns, reflected by higher entropy and less concentrated edge distributions. These changes suggest that RL reshapes information flow to be both more redundant and more flexible, which may explain its advantage in generalization. Notably, models fine-tuned with Direct Preference Optimization (DPO) deviate from these trends, exhibiting substantially weaker or inconsistent internal changes compared to PPO- and GRPO-based training. Together, our findings provide a unified view of how RL fine-tuning systematically alters the internal circuitry of LLMs and highlight the methodological distinctions between online RL and preference-based approaches. Our code is open source at https://anonymous.4open.science/r/llm_rl_probing_analysis-F673.
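Abstract (reference translation): Large language models (LLMs) acquire extensive prior knowledge through large-scale pretraining and can be further enhanced through supervised fine-tuning (SFT) or reinforcement learning (RL)-based post-training. A growing body of evidence shows that RL fine-tuning improves LLM capabilities beyond what SFT alone achieves. However, the mechanisms by which RL fine-tuning enhances the capabilities of various LLMs with distinct intrinsic characteristics remain underexplored. In this study, drawing inspiration from prior work on edge attribution patching (EAP), we examine the internal differences of LLMs before and after RL fine-tuning. Our analysis across multiple model families shows two robust effects of online RL post-training: (i) an overall increase in activation intensity, indicating that more internal pathways are engaged and their signals become stronger, and (ii) greater diversity in activation patterns, reflected by higher entropy and less concentrated edge distributions. These changes suggest that RL makes information flow both more redundant and more flexible, which may explain its advantage in generalization. Notably, models fine-tuned with Direct Preference Optimization (DPO) deviate from these trends, showing substantially weaker or inconsistent internal changes compared with PPO- and GRPO-based training. Together, these results provide a unified view of how RL fine-tuning systematically alters the internal circuitry of LLMs and highlight the methodological distinctions between online RL and preference-based approaches. Our code is open source at https://anonymous.4open.science/r/llm_rl_probing_analysis-F673.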
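The two effects named in the abstract (stronger overall activation, and higher entropy with less concentrated edge distributions) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the choice of mean absolute score as an "intensity" proxy, Shannon entropy over normalized absolute scores as a "diversity" proxy, top-k share as a "concentration" proxy, and the toy score lists. The abstract does not give the paper's exact metric definitions, so this is not the authors' implementation.

```python
import math
from typing import Sequence


def activation_intensity(edge_scores: Sequence[float]) -> float:
    """Mean absolute edge score (hypothetical proxy for activation intensity)."""
    if not edge_scores:
        raise ValueError("edge_scores must be non-empty")
    return sum(abs(s) for s in edge_scores) / len(edge_scores)


def edge_entropy(edge_scores: Sequence[float]) -> float:
    """Shannon entropy in nats of the normalized absolute edge scores
    (hypothetical proxy for diversity of activation patterns)."""
    magnitudes = [abs(s) for s in edge_scores]
    total = sum(magnitudes)
    if total == 0.0:
        raise ValueError("at least one edge score must be non-zero")
    probs = [m / total for m in magnitudes]
    # Zero-probability edges contribute nothing to the entropy.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def top_k_share(edge_scores: Sequence[float], k: int) -> float:
    """Fraction of total absolute score carried by the k largest edges
    (hypothetical proxy for how concentrated the edge distribution is)."""
    if k <= 0:
        raise ValueError("k must be positive")
    magnitudes = sorted((abs(s) for s in edge_scores), reverse=True)
    total = sum(magnitudes)
    if total == 0.0:
        raise ValueError("at least one edge score must be non-zero")
    return sum(magnitudes[:k]) / total


# Toy, made-up attribution scores for four edges (not data from the paper).
scores_before = [0.9, 0.05, 0.03, 0.02]  # one dominant edge
scores_after = [0.6, 0.5, 0.4, 0.5]      # stronger and more evenly spread

for name, scores in (("before", scores_before), ("after", scores_after)):
    print(
        name,
        round(activation_intensity(scores), 3),
        round(edge_entropy(scores), 3),
        round(top_k_share(scores, 1), 3),
    )
```

On these toy inputs the "after" list has a higher mean absolute score, a higher entropy, and a lower top-1 share than the "before" list, which is the direction of change the abstract reports for online RL post-training.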


Related papers