Fugu-MT Paper Translation (Abstract): Visual Enhanced Depth Scaling for Multimodal Latent Reasoning

Paper overview: Visual Enhanced Depth Scaling for Multimodal Latent Reasoning

arxiv url: http://arxiv.org/abs/2604.10500v2
Date: Thu, 16 Apr 2026 01:21:13 GMT
Status: Translation complete
System update date: 2026-04-17 16:09:14.144974
Title: Visual Enhanced Depth Scaling for Multimodal Latent Reasoning
Title (reference translation): Visual Depth Scaling for Multimodal Latent Reasoning
Authors: Yudong Han, Yong Wang, Zaiquan Yang, Zhen Qu, Liyuan Pan, Xiangxiang Chu
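Abstract summary: Multimodal latent reasoning has emerged as a promising paradigm that replaces explicit Chain-of-Thought decoding with implicit feature propagation. We propose a visual replay module and routing depth scaling to collaboratively enhance visual perception and refine complicated latents for deeper contextual reasoning. Our framework achieves state-of-the-art performance across diverse benchmarks while delivering substantial inference speedups over explicit CoT baselines.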
Reference score (independently computed attention measure): 32.211888127924446
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal latent reasoning has emerged as a promising paradigm that replaces explicit Chain-of-Thought (CoT) decoding with implicit feature propagation, simultaneously enhancing representation informativeness and reducing inference latency. By analyzing token-level gradient dynamics during latent training, we reveal two critical observations: (1) visual tokens exhibit significantly higher and more volatile gradient norms than their textual counterparts due to inherent language bias, resulting in systematic visual under-optimization; and (2) semantically simple tokens converge rapidly, whereas complex tokens exhibit persistent gradient instability constrained by fixed architectural depths. To address these limitations, we propose a visual replay module and routing depth scaling to collaboratively enhance visual perception and refine complicated latents for deeper contextual reasoning. The former module leverages causal self-attention to estimate token saliency, reinforcing fine-grained grounding through spatially-coherent constraints. Complementarily, the latter mechanism adaptively allocates additional reasoning steps to complex tokens, enabling deeper contextual refinement. Guided by a curriculum strategy that progressively internalizes explicit CoT into compact latent representations, our framework achieves state-of-the-art performance across diverse benchmarks while delivering substantial inference speedups over explicit CoT baselines.
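Abstract (reference translation): Multimodal latent reasoning has emerged as a promising paradigm that replaces explicit Chain-of-Thought (CoT) decoding with implicit feature propagation, improving the informativeness of representations while reducing inference latency. Analyzing token-level gradient dynamics during latent training yields two key observations: (1) because of inherent language bias, visual tokens show markedly higher and more volatile gradient norms than textual tokens, which leads to systematic visual under-optimization; and (2) semantically simple tokens converge quickly, whereas complex tokens show persistent gradient instability, constrained by the fixed architectural depth. To address these limitations, we propose a visual replay module and routing depth scaling, which together strengthen visual perception and refine complicated latents for deeper contextual reasoning. The former uses causal self-attention to estimate token saliency and reinforces fine-grained grounding through spatially coherent constraints. Complementarily, the latter adaptively allocates additional reasoning steps to complex tokens, enabling deeper contextual refinement. Guided by a curriculum strategy that progressively internalizes explicit CoT into compact latent representations, our framework achieves state-of-the-art performance across diverse benchmarks while delivering substantial inference speedups over explicit CoT baselines.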


Related papers