Fugu-MT Paper Translation (Abstract): LEPO: Latent Reasoning Policy Optimization for Large Language Models

Paper overview: LEPO: Latent Reasoning Policy Optimization for Large Language Models

arxiv url: http://arxiv.org/abs/2604.17892v2
Date: Tue, 21 Apr 2026 03:14:50 GMT
Status: Translation complete
Last updated in system: 2026-04-22 14:04:47.936308
Title: LEPO: Latent Reasoning Policy Optimization for Large Language Models
Title (reference translation): LEPO: Latent Reasoning Policy Optimization for Large Language Models
Authors: Yuyan Zhou, Jiarui Yu, Hande Dong, Zhezheng Hao, Hong Wang, Jianqing Zhang, Qiang Lin
Abstract summary: The paper introduces controllable stochasticity into latent reasoning via Gumbel-Softmax and proposes Latent Reasoning Policy Optimization (LEPO). In experiments, LEPO significantly outperforms existing RL methods for discrete and latent reasoning.
Reference score (independently computed attention metric): 11.032175358561162
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recently, latent reasoning has been introduced into large language models (LLMs) to leverage rich information within a continuous space. However, without stochastic sampling, these methods inevitably collapse to deterministic inference, failing to discover diverse reasoning paths. To bridge the gap, we inject controllable stochasticity into latent reasoning via Gumbel-Softmax, restoring LLMs' exploratory capacity and enhancing their compatibility with Reinforcement Learning (RL). Building on this, we propose Latent Reasoning Policy Optimization (LEPO), a novel framework that applies RL directly to continuous latent representations. Specifically, in the rollout stage, LEPO maintains stochasticity to enable diverse trajectory sampling, while in the optimization stage, LEPO constructs a unified gradient estimation for both latent representations and discrete tokens. Extensive experiments show that LEPO significantly outperforms existing RL methods for discrete and latent reasoning.
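Abstract (reference translation): Recently, methods have been proposed that introduce latent reasoning into large language models (LLMs) to exploit rich information in a continuous space. Without stochastic sampling, however, these methods inevitably collapse to deterministic inference and fail to discover diverse reasoning paths. To bridge this gap, controllable stochasticity is injected into latent reasoning via Gumbel-Softmax, restoring the LLMs' exploratory capacity and improving their compatibility with Reinforcement Learning (RL). Building on this, the paper proposes Latent Reasoning Policy Optimization (LEPO), a new framework that applies RL directly to continuous latent representations. Specifically, in the rollout stage LEPO maintains stochasticity to enable diverse trajectory sampling, and in the optimization stage it constructs a unified gradient estimate for both latent representations and discrete tokens. Extensive experiments show that LEPO significantly outperforms existing RL methods for discrete and latent reasoning.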
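
The abstract mentions injecting controllable stochasticity into latent reasoning via Gumbel-Softmax. As a rough illustration only, the sketch below shows the standard Gumbel-Softmax relaxation applied to a toy vocabulary, with the sampled weights used to mix token embeddings into one continuous vector. It is not the authors' implementation: the function names `gumbel_softmax` and `latent_embedding`, the temperature `tau`, and the toy logits and embedding values are all assumptions made for this example, and the abstract does not say how LEPO forms its latent representations.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Return a relaxed one-hot sample over the vocabulary (weights sum to 1)."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise: g = -log(-log(u)) with u ~ Uniform(0, 1).
        # u is clamped away from 0 and 1 so both logarithms are defined.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        g = -math.log(-math.log(u))
        # A lower tau pushes the sample toward a hard one-hot choice.
        noisy.append((logit + g) / tau)
    # Numerically stable softmax over the perturbed logits.
    m = max(noisy)
    exps = [math.exp(x - m) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def latent_embedding(weights, embedding_table):
    """Mix token embeddings with the sampled weights into one continuous vector."""
    dim = len(embedding_table[0])
    return [
        sum(w * row[d] for w, row in zip(weights, embedding_table))
        for d in range(dim)
    ]


# Toy setup (hypothetical values): 3-token vocabulary, 2-dimensional embeddings.
logits = [2.0, 1.0, 0.1]
embedding_table = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
rng = random.Random(0)
for _ in range(3):
    weights = gumbel_softmax(logits, tau=0.5, rng=rng)
    # Each draw yields a different latent vector from the same logits.
    print(latent_embedding(weights, embedding_table))
```

Because fresh noise is drawn on every call, repeated calls with the same logits give different latent vectors, which is the kind of sampling diversity the abstract contrasts with deterministic inference. The temperature is the control knob in this sketch: small values approach discrete token sampling, and large values approach a uniform mixture.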

Related papers