Fugu-MT Paper Translation (Abstract): Retrieve, Integrate, and Synthesize: Spatial-Semantic Grounded Latent Visual Reasoning

Paper abstract: Retrieve, Integrate, and Synthesize: Spatial-Semantic Grounded Latent Visual Reasoning

arxiv url: http://arxiv.org/abs/2605.07106v1
Date: Fri, 08 May 2026 01:33:58 GMT
Status: Translation complete
Last updated in system: 2026-05-11 19:43:38.721446
Title: Retrieve, Integrate, and Synthesize: Spatial-Semantic Grounded Latent Visual Reasoning
Title (reference translation): Retrieve, Integrate, and Synthesize: Space-Semantic Grounded Latent Visual Reasoning
Authors: Jin Cui, Xinyue Long, Xunyong Zhang, Yadong Zhang, Chuanchang Su, Jingye Gan, Boran Zhao, Pengju Ren
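Abstract summary: This paper proposes RIS (Retrieve, Integrate, and Synthesize), a spatial-semantic grounded framework that develops latent reasoning as a compatible extension of pretrained MLLM computation. RIS anchors latent tokens to both spatial and semantic evidence, enforces their causal role through a progressive attention bottleneck, and introduces short language transition tokens to bridge synthesized latent states back to vocabulary-aligned decoding.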
Reference score (independently computed attention score): 11.05919811646786
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal Large Language Models (MLLMs) have made remarkable progress on vision-language reasoning, yet most methods still compress visual evidence into discrete textual thoughts, creating an information bottleneck for fine-grained perception. Recent latent visual reasoning methods attempt to reason in continuous hidden states, but we find that they suffer from insufficient manifold compatibility: latent trajectories drift away from pretrained reasoning circuits, collapse into instance-agnostic patterns, and are often bypassed during answer generation. To address these issues, we propose RIS (Retrieve, Integrate, and Synthesize), a spatial-semantic grounded framework that develops latent reasoning as a compatible extension of pretrained MLLM computation. We first construct a step-wise grounded reasoning dataset with bounding boxes and region-specific semantic descriptions. Built on this supervision, RIS anchors latent tokens to both spatial and semantic evidence, enforces their causal role through a progressive attention bottleneck, and introduces short language transition tokens to bridge synthesized latent states back to vocabulary-aligned decoding. Experiments on V*, HRBench4K, HRBench8K, MMVP, and BLINK show consistent improvements over closed/open-source and latent reasoning baselines. Further analyses demonstrate that RIS learns diverse, interpretable, and progressively integrated latent trajectories, offering a practical path toward faithful internal visual reasoning in MLLMs.
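Abstract (reference translation): Multimodal Large Language Models (MLLMs) have made remarkable progress on vision-language reasoning, yet most methods compress visual evidence into discrete textual thoughts, creating an information bottleneck for fine-grained perception. Recent latent visual reasoning methods attempt to reason in continuous hidden states, but we find that they suffer from insufficient manifold compatibility. To address these issues, we propose RIS (Retrieve, Integrate, and Synthesize), a spatial-semantic grounded framework that develops latent reasoning as a compatible extension of pretrained MLLM computation. We first construct a step-wise grounded reasoning dataset with bounding boxes and region-specific semantic descriptions. Built on this supervision, RIS anchors latent tokens to both spatial and semantic evidence and enforces their causal role through a progressive attention bottleneck. Experiments on V*, HRBench4K, HRBench8K, MMVP, and BLINK show consistent improvements over closed/open-source and latent reasoning baselines. Further analyses show that RIS learns diverse, interpretable, and progressively integrated latent trajectories, offering a practical path toward faithful internal visual reasoning in MLLMs.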

Related papers