Fugu-MT Paper Translation (Abstract): Imagination Helps Visual Reasoning, But Not Yet in Latent Space

Paper abstract: Imagination Helps Visual Reasoning, But Not Yet in Latent Space

arxiv url: http://arxiv.org/abs/2602.22766v1
Date: Thu, 26 Feb 2026 08:56:23 GMT
Status: Translation complete
Last updated in system: 2026-02-27 18:41:22.60845
Title: Imagination Helps Visual Reasoning, But Not Yet in Latent Space
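Title (reference translation): Imagination helps visual reasoning, but not yet in latent space
Authors: You Li, Chi Chen, Yanghao Li, Fanhu Zeng, Kaiyu Huang, Jinan Xu, Maosong Sun
Abstract summary: The paper investigates the validity of latent reasoning using Causal Mediation Analysis. It shows that latent tokens encode limited visual information and exhibit high similarity. It proposes a straightforward alternative named CapImagine, which teaches the model to explicitly imagine using text.
Reference score (independently computed attention score): 65.80396132375571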
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Latent visual reasoning aims to mimic the human imagination process by meditating through the hidden states of Multimodal Large Language Models. While recognized as a promising paradigm for visual reasoning, the underlying mechanisms driving its effectiveness remain unclear. Motivated to demystify the true source of its efficacy, we investigate the validity of latent reasoning using Causal Mediation Analysis. We model the process as a causal chain: the input as the treatment, the latent tokens as the mediator, and the final answer as the outcome. Our findings uncover two critical disconnections: (a) Input-Latent Disconnect: dramatic perturbations on the input result in negligible changes to the latent tokens, suggesting that latent tokens do not effectively attend to the input sequence. (b) Latent-Answer Disconnect: perturbations on the latent tokens yield minimal impact on the final answer, indicating that latent tokens have only a limited causal effect on the outcome. Furthermore, extensive probing analysis reveals that latent tokens encode limited visual information and exhibit high similarity. Consequently, we challenge the necessity of latent reasoning and propose a straightforward alternative named CapImagine, which teaches the model to explicitly imagine using text. Experiments on vision-centric benchmarks show that CapImagine significantly outperforms complex latent-space baselines, highlighting the superior potential of visual reasoning through explicit imagination.
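Abstract (reference translation): Latent visual reasoning aims to mimic the human imagination process through the hidden states of Multimodal Large Language Models. Although it is recognized as a promising paradigm for visual reasoning, the mechanisms that drive its effectiveness remain unclear. To clarify the source of its efficacy, this work examines the validity of latent reasoning using Causal Mediation Analysis. The process is modeled as a causal chain, with the input as the treatment, the latent tokens as the mediator, and the final answer as the outcome. The results reveal two critical disconnections. (a) Input-Latent Disconnect: dramatic perturbations of the input produce negligible changes in the latent tokens, suggesting that the latent tokens do not effectively attend to the input sequence. (b) Latent-Answer Disconnect: perturbations of the latent tokens have minimal impact on the final answer, indicating that the latent tokens have only a limited causal effect on the outcome. In addition, extensive probing analysis shows that latent tokens encode limited visual information and exhibit high similarity. The authors therefore challenge the necessity of latent reasoning and propose a straightforward alternative named CapImagine, which teaches the model to explicitly imagine using text. Experiments on vision-centric benchmarks show that CapImagine significantly outperforms complex latent-space baselines, highlighting the superior potential of visual reasoning through explicit imagination.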
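The causal chain described in the abstract (input as treatment, latent tokens as mediator, answer as outcome) can be illustrated with a small perturbation experiment. The following is only a minimal sketch on a hypothetical toy model, not the authors' procedure or code: the weight matrices, the sizes, the 0.01 coupling strength, and all function names are assumptions chosen so that the toy model shows both disconnections by construction. It measures how much the latents move when the input is replaced, how much the answer distribution moves when the latents are perturbed, and how similar the latent tokens are to each other.

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_LAT, N_LAT, N_ANS = 8, 16, 4, 5  # illustrative sizes (assumed)

# Hypothetical toy "model": latents are dominated by a constant vector and
# depend only weakly on the input; the answer depends mostly on the input.
W_in2lat = rng.uniform(-1, 1, size=(N_LAT, D_LAT, D_IN))
W_in2ans = rng.uniform(-1, 1, size=(N_ANS, D_IN))
W_lat2ans = rng.uniform(-1, 1, size=(N_ANS, D_LAT))

def latents(x):
    """Mediator: N_LAT latent tokens of width D_LAT."""
    return np.ones((N_LAT, D_LAT)) + 0.01 * (W_in2lat @ x)

def answer_probs(x, z):
    """Outcome: softmax distribution over N_ANS candidate answers."""
    logits = W_in2ans @ x + 0.01 * (W_lat2ans @ z.mean(axis=0))
    e = np.exp(logits - logits.max())
    return e / e.sum()

def rowwise_cosine(a, b):
    """Mean cosine similarity between matching rows of a and b."""
    num = (a * b).sum(axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return float((num / den).mean())

def pairwise_cosine(z):
    """Mean cosine similarity between distinct latent tokens."""
    zn = z / np.linalg.norm(z, axis=1, keepdims=True)
    sims = zn @ zn.T
    return float(sims[~np.eye(len(z), dtype=bool)].mean())

x = rng.uniform(-1, 1, size=D_IN)
x_perturbed = rng.uniform(-1, 1, size=D_IN)  # "dramatic" change: a new input

# (a) Treatment -> mediator: do the latents react to the input?
input_latent_sim = rowwise_cosine(latents(x), latents(x_perturbed))

# (b) Mediator -> outcome: does the answer react to the latents?
z = latents(x)
z_perturbed = z + rng.uniform(-1, 1, size=z.shape)
p, p_perturbed = answer_probs(x, z), answer_probs(x, z_perturbed)
latent_answer_tv = 0.5 * float(np.abs(p - p_perturbed).sum())

# Probing-style check: how similar are the latent tokens to each other?
latent_pairwise_sim = pairwise_cosine(z)

print("input->latent cosine similarity:", input_latent_sim)
print("latent->answer total variation:", latent_answer_tv)
print("pairwise latent similarity:", latent_pairwise_sim)
```

In this toy setup, a similarity close to 1 in (a) corresponds to an Input-Latent Disconnect, and a total variation distance close to 0 in (b) corresponds to a Latent-Answer Disconnect. With a real model, `latents` and `answer_probs` would be replaced by hooks into the model's hidden states and output distribution, which this sketch does not attempt.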

Related papers

How Do Latent Reasoning Methods Perform Under Weak and Strong Supervision? [45.11635323173876]
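We conduct a comprehensive analysis of latent reasoning methods to better understand the role and behavior of latent representations in the process. Latent representations can encode multiple possibilities, but the reasoning process does not faithfully implement structured search. Stronger supervision mitigates shortcut behavior but limits the ability of latent representations to maintain diverse hypotheses.
Paper reference translation (metadata) (2026-02-25T22:00:59Z)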
Seeing Through the Chain: Mitigate Hallucination in Multimodal Reasoning Models via CoT Compression and Contrastive Preference Optimization [78.94590726578014]
Multimodal reasoning models (MLRMs) are prone to hallucination, and effective solutions have yet to be found. We propose C3PO, a training-based mitigation framework that combines CoT Compression with Contrastive Preference Optimization.
Paper reference translation (metadata) (2026-02-03T11:00:55Z)
Forest Before Trees: Latent Superposition for Efficient Visual Reasoning [61.29300723302152]
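Laser is a new paradigm that reformulates visual reasoning through Dynamic Window Alignment Learning (DWAL). Laser achieves state-of-the-art performance among latent reasoning methods, outperforming the strong baseline Monet by 5.03% on average.
Paper reference translation (metadata) (2026-01-11T08:30:49Z)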
Analyzing Reasoning Consistency in Large Multimodal Models under Cross-Modal Conflicts [74.47786985522762]
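We identify a critical failure mode termed textual inertia, in which models tend to adhere blindly to erroneous text while ignoring conflicting visual evidence. We propose the LogicGraph Perturbation Protocol, which structurally injects perturbations into the reasoning chains of diverse LMMs. The results show that self-correction succeeds in fewer than 10% of cases, with failures mainly attributable to the propagation of textual errors.
Paper reference translation (metadata) (2026-01-07T16:39:34Z)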
On Epistemic Uncertainty of Visual Tokens for Object Hallucinations in Large Vision-Language Models [27.228426342808486]
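We argue that uncertain visual tokens within the vision encoder (VE) are a key factor contributing to object hallucination. We propose a simple yet effective strategy that mitigates object hallucination by modifying only the VE.
Paper reference translation (metadata) (2025-10-10T05:12:52Z)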
Hacking Hallucinations of MLLMs with Causal Sufficiency and Necessity [25.725999088297392]
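Multimodal Large Language Models (MLLMs) demonstrate impressive capabilities across vision-language tasks. However, they suffer from hallucinations, producing outputs that are semantically inconsistent with the input image or text. We propose a new reinforcement learning framework based on causal completeness.
Paper reference translation (metadata) (2025-08-06T08:09:12Z)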
A Survey on Latent Reasoning [100.54120559169735]
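Large language models (LLMs) have demonstrated impressive reasoning capabilities. Chain-of-thought (CoT) reasoning, which verbalizes intermediate steps, limits the model's expressive bandwidth. Latent reasoning addresses this bottleneck by performing multi-step reasoning entirely in the model's continuous hidden states.
Paper reference translation (metadata) (2025-07-08T17:29:07Z)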
I Predict Therefore I Am: Is Next Token Prediction Enough to Learn Human-Interpretable Concepts from Data? [76.15163242945813]
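Large language models (LLMs) have led many to conclude that they exhibit a form of intelligence. We propose a new generative model that generates tokens based on human-interpretable concepts represented as latent discrete variables.
Paper reference translation (metadata) (2025-03-12T01:21:17Z)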

The related papers list is generated automatically from the titles and abstracts of papers on this site.
