Fugu-MT Paper Translation (Abstract): Reasoning in the Dark: Interleaved Vision-Text Reasoning in Latent Space

Paper summary: Reasoning in the Dark: Interleaved Vision-Text Reasoning in Latent Space

arxiv url: http://arxiv.org/abs/2510.12603v1
Date: Tue, 14 Oct 2025 14:58:25 GMT
Status: Translation complete
System update date: 2025-10-15 19:02:32.359782
Title: Reasoning in the Dark: Interleaved Vision-Text Reasoning in Latent Space
Authors: Chao Chen, Zhixin Ma, Yongqi Li, Yupeng Hu, Yinwei Wei, Wenjie Li, Liqiang Nie
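Abstract summary: Multimodal reasoning aims to enhance the capabilities of MLLMs by incorporating intermediate reasoning steps before reaching the final answer. This paper proposes Interleaved Vision-Text Latent Reasoning (IVT-LR), which injects both visual and textual information into the reasoning process within the latent space. Experiments on M3CoT and ScienceQA show that IVT-LR improves accuracy by 5.45% on average while running more than 5 times faster than existing approaches.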
Reference score (independently calculated attention score): 66.76138204796497
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multimodal reasoning aims to enhance the capabilities of MLLMs by incorporating intermediate reasoning steps before reaching the final answer. It has evolved from text-only reasoning to the integration of visual information, enabling the thought process to be conveyed through both images and text. Despite its effectiveness, current multimodal reasoning methods depend on explicit reasoning steps that require labor-intensive vision-text annotations and inherently introduce significant inference latency. To address these issues, we introduce multimodal latent reasoning with the advantages of multimodal representation, reduced annotation, and inference efficiency. To facilitate it, we propose Interleaved Vision-Text Latent Reasoning (IVT-LR), which injects both visual and textual information in the reasoning process within the latent space. Specifically, IVT-LR represents each reasoning step by combining two implicit parts: latent text (the hidden states from the previous step) and latent vision (a set of selected image embeddings). We further introduce a progressive multi-stage training strategy to enable MLLMs to perform the above multimodal latent reasoning steps. Experiments on M3CoT and ScienceQA demonstrate that our IVT-LR method achieves an average performance increase of 5.45% in accuracy, while simultaneously achieving a speed increase of over 5 times compared to existing approaches. Code available at https://github.com/FYYDCC/IVT-LR.
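Illustrative sketch (not the authors' code): the abstract describes each latent reasoning step as the combination of latent text (the hidden state from the previous step) and latent vision (a set of selected image embeddings). The toy NumPy example below shows one way such a step could be assembled. The abstract does not say how image embeddings are selected, so the dot-product top-k selection here is an assumption, and `toy_model`, `select_latent_vision`, `latent_reasoning_step`, and all dimensions are hypothetical names and values chosen only for illustration.

```python
import numpy as np


def select_latent_vision(hidden, image_embeds, k):
    """Pick the k image embeddings that score highest against the hidden state.

    ASSUMPTION: dot-product scoring is used for illustration only; the
    abstract does not specify the selection rule.
    """
    scores = image_embeds @ hidden            # one score per image patch
    top = np.argsort(scores)[::-1][:k]        # indices of the k highest scores
    return image_embeds[np.sort(top)]         # keep the original patch order


def latent_reasoning_step(sequence, hidden, image_embeds, k):
    """Append one latent step: latent text followed by latent vision."""
    latent_text = hidden[None, :]             # previous hidden state, shape (1, dim)
    latent_vision = select_latent_vision(hidden, image_embeds, k)
    return np.concatenate([sequence, latent_text, latent_vision], axis=0)


def toy_model(sequence):
    """Hypothetical stand-in for an MLLM forward pass returning a last hidden state."""
    return np.tanh(sequence.mean(axis=0))


rng = np.random.default_rng(0)
dim, num_patches, k, num_steps = 8, 16, 4, 3
image_embeds = rng.normal(size=(num_patches, dim))   # toy image patch embeddings
sequence = rng.normal(size=(5, dim))                 # toy prompt embeddings

for _ in range(num_steps):
    hidden = toy_model(sequence)
    sequence = latent_reasoning_step(sequence, hidden, image_embeds, k)

# Each step adds 1 latent-text vector and k latent-vision vectors.
print(sequence.shape)  # (20, 8)
```

No text tokens are decoded between steps in this sketch; the input sequence simply grows by `1 + k` vectors per step, which is the intuition behind the reduced inference latency the abstract reports. The real model, selection rule, and training procedure are in the authors' repository.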

Related papers