Fugu-MT paper translation (abstract): Shattering the Autoregressive Curse: Dynamic Epistemic Entropy Orchestrated Erasable Reinforcement Learning for LLMs

Paper overview: Shattering the Autoregressive Curse: Dynamic Epistemic Entropy Orchestrated Erasable Reinforcement Learning for LLMs

arxiv url: http://arxiv.org/abs/2606.17735v1
Date: Tue, 16 Jun 2026 09:55:45 GMT
Status: Translation complete
Last updated in system: 2026-06-17 17:15:32.383923
Title: Shattering the Autoregressive Curse: Dynamic Epistemic Entropy Orchestrated Erasable Reinforcement Learning for LLMs
Title (reference translation): Shattering the Autoregressive Curse: Dynamic Epistemic Entropy Orchestrated Erasable Reinforcement Learning for LLMs
Authors: Ziliang Wang, Kang An, Faqiang Qian, Jialu Cai, Cijun Ouyang, Yuhang Wang, Qibing Ren, Yichao Wu,
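Abstract summary: We propose erasable reinforcement learning for long-horizon logical reasoning. $\text{E}^3\text{RL}$ eliminates reliance on external signals by grounding itself in the model's endogenous local autoregressive cross-entropy. We train $\text{E}^3\text{RL}$ on the DeepMath-103k dataset.
Reference score (independently computed attention score): 21.321550377588427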
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Although reinforcement learning (RL) has expanded the cognitive boundaries of large language models (LLMs), it often remains vulnerable to the autoregressive curse in long-horizon logical reasoning: small epistemic perturbations introduced early in generation can propagate irreversibly along the Markov decision process flow, triggering cascading failures that drive the reasoning trajectory toward collapse. To overcome this autoregressive cascade, in which a single early mistake can compromise all subsequent reasoning steps, we propose dynamic epistemic entropy orchestrated erasable reinforcement learning ($\text{E}^3\text{RL}$). $\text{E}^3\text{RL}$ eliminates reliance on external signals by grounding the model's endogenous local autoregressive cross-entropy as an intrinsic coordinate of epistemic uncertainty. By introducing segment-level adaptive dynamic thresholds and advantage allocation, $\text{E}^3\text{RL}$ enables the model to precisely excise localized logical defects while reusing historical key-value (KV) cache streams, thereby endowing the reasoning process with a self-healing capability. We train $\text{E}^3\text{RL}$ on the DeepMath-103k dataset. Experimental results show that $\text{E}^3\text{RL}$ reshapes the exploration efficiency of long-sequence reasoning and improves sample efficiency while maintaining linear memory overhead. On mathematical reasoning benchmarks such as AIME, $\text{E}^3\text{RL}$ achieves substantial performance gains, with the 4B and 8B parameter models surpassing previous state-of-the-art (SOTA) results by 5.349\% and 6.514\%, respectively. These findings suggest that $\text{E}^3\text{RL}$ shatters the autoregressive curse in long-sequence reasoning and establishes a theoretical and systems-level foundation for the next generation of self-healing artificial general intelligence (AGI).
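Abstract (reference translation): Reinforcement learning (RL) has expanded the cognitive boundaries of large language models (LLMs), but it remains vulnerable to the autoregressive curse in long-horizon logical reasoning. To overcome this autoregressive cascade, in which a single early mistake can compromise all subsequent reasoning steps, we propose dynamic epistemic entropy orchestrated erasable reinforcement learning ($\text{E}^3\text{RL}$). $\text{E}^3\text{RL}$ eliminates reliance on external signals by grounding the model's endogenous local autoregressive cross-entropy as an intrinsic coordinate of epistemic uncertainty. By introducing segment-level adaptive dynamic thresholds and advantage allocation, $\text{E}^3\text{RL}$ precisely excises localized logical defects while reusing historical key-value (KV) cache streams, giving the reasoning process a self-healing capability. We train $\text{E}^3\text{RL}$ on the DeepMath-103k dataset. Experimental results show that $\text{E}^3\text{RL}$ reshapes the exploration efficiency of long-sequence reasoning and improves sample efficiency while maintaining linear memory overhead. On mathematical reasoning benchmarks such as AIME, $\text{E}^3\text{RL}$ achieves substantial performance gains, with the 4B and 8B parameter models surpassing previous state-of-the-art (SOTA) results by 5.349\% and 6.514\%, respectively. These results suggest that $\text{E}^3\text{RL}$ shatters the autoregressive curse in long-sequence reasoning and establishes a theoretical and systems-level foundation for the next generation of self-healing artificial general intelligence (AGI).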
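The abstract names segment-level adaptive thresholds over a local cross-entropy signal but gives no formulas, so the following is only a hypothetical illustration of that general idea, not the authors' method. It assumes the per-token signal is the negative log-probability of each generated token, that segments have a fixed length, and that a segment is flagged when its mean exceeds the running mean plus `k` standard deviations of the earlier segments. All function names, `seg_len`, `k`, and `warmup` are invented for this sketch.

```python
from statistics import mean, pstdev


def segment_scores(token_logprobs, seg_len):
    """Mean negative log-probability of each consecutive fixed-length segment.

    Assumption: -log p(token) stands in for the "local autoregressive
    cross-entropy" mentioned in the abstract.
    """
    nll = [-lp for lp in token_logprobs]
    return [mean(nll[i:i + seg_len]) for i in range(0, len(nll), seg_len)]


def first_segment_to_erase(scores, k=1.0, warmup=2):
    """Index of the first segment above an adaptive threshold, else None.

    Assumption: the threshold is mean + k * std of all earlier segments;
    the first `warmup` segments are never flagged.
    """
    for i in range(warmup, len(scores)):
        history = scores[:i]
        threshold = mean(history) + k * pstdev(history)
        if scores[i] > threshold:
            return i
    return None


def rollback_point(token_logprobs, seg_len, k=1.0, warmup=2):
    """Number of leading tokens to keep, or None if nothing is flagged.

    In a real decoder, the cached state for this prefix could be reused
    and generation resumed from there (hypothetical usage).
    """
    scores = segment_scores(token_logprobs, seg_len)
    idx = first_segment_to_erase(scores, k=k, warmup=warmup)
    return None if idx is None else idx * seg_len


# Toy trajectory: eight confident tokens, four uncertain ones, four confident.
toy_logprobs = [-0.1] * 8 + [-3.0] * 4 + [-0.1] * 4
keep = rollback_point(toy_logprobs, seg_len=4)
```

In this toy run the third segment is flagged, so `keep` marks the end of the second segment as the point from which generation would be retried. How the paper actually sets thresholds, allocates advantages, or manages the KV cache is not stated in the abstract.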


Related papers