Fugu-MT paper translation (abstract): Visual Latents Know More Than They Say: Unsilencing Latent Reasoning in MLLMs

Paper overview: Visual Latents Know More Than They Say: Unsilencing Latent Reasoning in MLLMs

arxiv url: http://arxiv.org/abs/2605.02735v1
Date: Mon, 04 May 2026 15:36:12 GMT
Status: translation complete
Last updated in system: 2026-05-05 20:33:50.378784
Title: Visual Latents Know More Than They Say: Unsilencing Latent Reasoning in MLLMs
Authors: Xin Zhang, Qiqi Tao, Jiawei Du, Moyun Liu, Joey Tianyi Zhou
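Abstract summary: Continuous latent-space reasoning offers a compact alternative to textual chain-of-thought for multimodal models. The paper identifies a previously overlooked optimization pathology in existing latent visual reasoning methods. It shows that inference-time latent optimization, without any parameter updates, effectively unleashes the suppressed reasoning capacity of visual latents.
Reference score (independently computed attention measure): 54.16324124242172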
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Continuous latent-space reasoning offers a compact alternative to textual chain-of-thought for multimodal models, enabling high-dimensional visual evidence to be integrated without explicit reasoning tokens. However, we identify a previously overlooked optimization pathology in existing latent visual reasoning methods: although visual latents become semantically enriched during training, their contribution to final answer prediction is systematically suppressed. Within the shared parameter space, the autoregressive objective favors shortcut reliance on direct visual input, driving latent tokens toward transition-like states rather than informative reasoning content. We term this phenomenon Silenced Visual Latents. To address it, we disentangle the two conflicting objectives by directly optimizing the latent reasoning at inference time, keeping backbone parameters frozen. In Stage I, visual latents are warmed up via query-guided contrastive latent-visual alignment, improving semantic quality while preventing latent collapse. In Stage II, the latent reasoning is further optimized via a confidence-progression reward, which incentivizes predicted token distributions along the latent span to become progressively more concentrated, routing predictions through the latent reasoning rather than bypassing it. Experiments across eight benchmarks and four model backbones show that inference-time latent optimization, without any parameter updates, effectively unleashes the suppressed reasoning capacity of visual latents.
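
The two stages named in the abstract can be illustrated with a small sketch. The abstract does not give the actual formulas, so everything below is an assumption made for illustration: Stage I is approximated with an InfoNCE-style loss over cosine similarities, and Stage II with a reward that sums entropy drops between consecutive latent steps and penalizes entropy increases. The function names, the `temperature` and `increase_penalty` parameters, and the entropy-based notion of "concentration" are hypothetical and are not taken from the paper. The gradient-based update of the latents against a frozen backbone is omitted.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_alignment_loss(latent, visual_feats, positive_idx, temperature=0.1):
    """Assumed Stage I stand-in: InfoNCE-style loss.

    Pulls `latent` toward the visual feature at `positive_idx` (taken here
    to be the query-relevant one) and away from the other features.
    """
    sims = [cosine(latent, v) / temperature for v in visual_feats]
    probs = softmax(sims)
    return -math.log(probs[positive_idx])


def confidence_progression_reward(logits_per_step, increase_penalty=2.0):
    """Assumed Stage II stand-in: reward for progressively sharper predictions.

    `logits_per_step` holds one logit vector per position along the latent
    span. Each entropy drop between consecutive positions adds to the
    reward; each entropy increase is subtracted with extra weight.
    """
    entropies = [entropy(softmax(step)) for step in logits_per_step]
    reward = 0.0
    for prev, cur in zip(entropies, entropies[1:]):
        drop = prev - cur
        reward += drop if drop >= 0.0 else increase_penalty * drop
    return reward


if __name__ == "__main__":
    sharpening = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [3.0, 0.0, 0.0]]
    print("reward (sharpening):", confidence_progression_reward(sharpening))
    print("reward (flattening):", confidence_progression_reward(sharpening[::-1]))
    print("alignment loss:", contrastive_alignment_loss([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], 0))
```

In this sketch, a sequence of distributions that sharpens along the latent span earns a positive reward and one that flattens earns a negative reward. In an inference-time setting, such a scalar could serve as the objective for updating the latent vectors while the model weights stay fixed.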
