Fugu-MT Paper Translation (Summary): Hybrid Latent Reasoning with Decoupled Policy Optimization

Paper summary: Hybrid Latent Reasoning with Decoupled Policy Optimization

arxiv url: http://arxiv.org/abs/2604.20328v1
Date: Wed, 22 Apr 2026 08:22:23 GMT
Status: Translation complete
Last updated in system: 2026-04-23 15:36:11.040316
Title: Hybrid Latent Reasoning with Decoupled Policy Optimization
Authors: Tao Cheng, Shi-Zhe Chen, Hao Zhang, Yixin Qin, Jinwen Luo, Zheng Wei,
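Abstract summary: HyLaR (Hybrid Latent Reasoning) is a framework that seamlessly interleaves discrete text generation with continuous visual latent representations. The authors show that HyLaR outperforms state-of-the-art latent reasoning approaches on fine-grained perception and general multimodal understanding benchmarks.
Reference score (attention score computed independently by this site): 19.348125016748018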
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Chain-of-Thought (CoT) reasoning significantly elevates the complex problem-solving capabilities of multimodal large language models (MLLMs). However, adapting CoT to vision typically discretizes signals to fit LLM inputs, causing early semantic collapse and discarding fine-grained details. While external tools can mitigate this, they introduce a rigid bottleneck, confining reasoning to predefined operations. Although recent latent reasoning paradigms internalize visual states to overcome these limitations, optimizing the resulting hybrid discrete-continuous action space remains challenging. In this work, we propose HyLaR (Hybrid Latent Reasoning), a framework that seamlessly interleaves discrete text generation with continuous visual latent representations. Specifically, following an initial cold-start supervised fine-tuning (SFT), we introduce DePO (Decoupled Policy Optimization) to enable effective reinforcement learning within this hybrid space. DePO decomposes the policy gradient objective, applying independent trust-region constraints to the textual and latent components, alongside an exact closed-form von Mises-Fisher (vMF) KL regularizer. Extensive experiments demonstrate that HyLaR outperforms standard MLLMs and state-of-the-art latent reasoning approaches across fine-grained perception and general multimodal understanding benchmarks. Code is available at https://github.com/EthenCheng/HyLaR.
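The abstract describes DePO only at a high level, so the following is a minimal sketch of one possible reading, not the authors' implementation (their code is at the repository linked above). It assumes that the "independent trust-region constraints" are PPO-style clip ranges with separate widths for text tokens and latent steps, and that the latent policy is a von Mises-Fisher distribution whose concentration `kappa` is shared between the current and reference policies. Under that assumption the standard closed-form result is KL = kappa * A_d(kappa) * (1 - mu_p . mu_q), where A_d(kappa) = I_{d/2}(kappa) / I_{d/2-1}(kappa). All function names and the values of `eps_text`, `eps_latent`, `kappa`, and `beta` are hypothetical.

```python
import math


def clipped_surrogate(ratios, advantages, eps):
    """PPO-style clipped surrogate, averaged over steps."""
    total = 0.0
    for r, a in zip(ratios, advantages):
        clipped = min(max(r, 1.0 - eps), 1.0 + eps)
        total += min(r * a, clipped * a)
    return total / len(ratios)


def _log_bessel_i(nu, kappa, terms=200):
    """log I_nu(kappa) from the power series, summed in log space.

    Truncated at `terms`, so it is only intended for moderate kappa > 0.
    """
    log_terms = [
        (2 * m + nu) * math.log(kappa / 2.0)
        - math.lgamma(m + 1)
        - math.lgamma(m + nu + 1)
        for m in range(terms)
    ]
    peak = max(log_terms)
    return peak + math.log(sum(math.exp(t - peak) for t in log_terms))


def bessel_ratio(nu, kappa):
    """I_{nu+1}(kappa) / I_nu(kappa)."""
    return math.exp(_log_bessel_i(nu + 1, kappa) - _log_bessel_i(nu, kappa))


def vmf_kl(mu_p, mu_q, kappa):
    """KL(vMF(mu_p, kappa) || vMF(mu_q, kappa)) for unit vectors, shared kappa."""
    d = len(mu_p)
    dot = sum(p * q for p, q in zip(mu_p, mu_q))
    mean_resultant = bessel_ratio(d / 2.0 - 1.0, kappa)  # A_d(kappa)
    return kappa * mean_resultant * (1.0 - dot)


def depo_style_objective(text_ratios, text_adv, latent_ratios, latent_adv,
                         mu_new, mu_ref,
                         eps_text=0.2, eps_latent=0.1, kappa=10.0, beta=0.01):
    """Hypothetical decoupled objective: two clipped terms minus a vMF KL penalty."""
    text_term = clipped_surrogate(text_ratios, text_adv, eps_text)
    latent_term = clipped_surrogate(latent_ratios, latent_adv, eps_latent)
    return text_term + latent_term - beta * vmf_kl(mu_new, mu_ref, kappa)
```

Keeping the two clip widths separate means a large update to the continuous latent directions cannot be hidden behind small text-token ratios, or the reverse. Whether the paper uses this exact form is not stated in the abstract.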


Related papers