Reasoning in the Dark: Interleaved Vision-Text Reasoning in Latent Space
- URL: http://arxiv.org/abs/2510.12603v1
- Date: Tue, 14 Oct 2025 14:58:25 GMT
- Title: Reasoning in the Dark: Interleaved Vision-Text Reasoning in Latent Space
- Authors: Chao Chen, Zhixin Ma, Yongqi Li, Yupeng Hu, Yinwei Wei, Wenjie Li, Liqiang Nie
- Abstract summary: Multimodal reasoning aims to enhance the capabilities of MLLMs by incorporating intermediate reasoning steps before reaching the final answer. We propose Interleaved Vision-Text Latent Reasoning (IVT-LR), which injects both visual and textual information into the reasoning process within the latent space. Experiments on M3CoT and ScienceQA demonstrate that our IVT-LR method achieves an average accuracy increase of 5.45% while running more than 5 times faster than existing approaches.
- Score: 66.76138204796497
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal reasoning aims to enhance the capabilities of MLLMs by incorporating intermediate reasoning steps before reaching the final answer. It has evolved from text-only reasoning to the integration of visual information, enabling the thought process to be conveyed through both images and text. Despite its effectiveness, current multimodal reasoning methods depend on explicit reasoning steps that require labor-intensive vision-text annotations and inherently introduce significant inference latency. To address these issues, we introduce multimodal latent reasoning with the advantages of multimodal representation, reduced annotation, and inference efficiency. To facilitate it, we propose Interleaved Vision-Text Latent Reasoning (IVT-LR), which injects both visual and textual information into the reasoning process within the latent space. Specifically, IVT-LR represents each reasoning step by combining two implicit parts: latent text (the hidden states from the previous step) and latent vision (a set of selected image embeddings). We further introduce a progressive multi-stage training strategy to enable MLLMs to perform the above multimodal latent reasoning steps. Experiments on M3CoT and ScienceQA demonstrate that our IVT-LR method achieves an average accuracy increase of 5.45% while simultaneously running over 5 times faster than existing approaches. Code available at https://github.com/FYYDCC/IVT-LR.
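Since the abstract spells out the step structure (latent text = the previous step's hidden states, latent vision = a selected subset of image embeddings), here is a minimal PyTorch sketch of what one such interleaved latent step could look like. It assumes a HuggingFace-style decoder that accepts inputs_embeds; the similarity-based top-k selection rule and every name below are illustrative guesses, not the released implementation (see the linked repository for the actual code).

```python
import torch

def latent_reasoning_step(model, prev_hidden, image_embeds, k=8):
    """One interleaved latent step (illustrative sketch only):
    latent text = hidden states carried over from the previous step;
    latent vision = a small set of selected image embeddings."""
    # Score image patches by similarity to the current latent-text state
    # (a plausible selection criterion; the paper's rule may differ).
    query = prev_hidden[:, -1:, :]                               # (B, 1, D)
    scores = (image_embeds @ query.transpose(1, 2)).squeeze(-1)  # (B, N)
    topk = scores.topk(k, dim=-1).indices                        # (B, k)
    latent_vision = torch.gather(
        image_embeds, 1,
        topk.unsqueeze(-1).expand(-1, -1, image_embeds.size(-1)))

    # Feed latent text + latent vision back through the LM as embeddings,
    # never decoding into explicit text tokens.
    step_inputs = torch.cat([prev_hidden, latent_vision], dim=1)
    out = model(inputs_embeds=step_inputs, output_hidden_states=True)
    return out.hidden_states[-1]  # becomes the next step's latent text
```

Looping this function for a fixed number of steps before decoding the final answer is what plausibly yields the reported latency win: no intermediate rationale tokens are ever generated.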
Related papers
- SwimBird: Eliciting Switchable Reasoning Mode in Hybrid Autoregressive MLLMs [27.042422801803344]
We introduce SwimBird, a reasoning-switchable MLLM that switches among three reasoning modes conditioned on the input. We show that SwimBird achieves state-of-the-art results and robust gains over prior fixed-pattern multimodal reasoning methods.
arXiv Detail & Related papers (2026-02-05T18:59:51Z)
- Vision-aligned Latent Reasoning for Multi-modal Large Language Model [82.26044667101011]
Vision-aligned Latent Reasoning (VaLR) is a framework that dynamically generates vision-aligned latent tokens before each Chain-of-Thought reasoning step. VaLR is trained to preserve visual knowledge during reasoning by aligning intermediate embeddings of the MLLM with those from vision encoders; a sketch of such an alignment objective appears after this entry.
arXiv Detail & Related papers (2026-02-04T12:04:02Z)
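Going only by the summary above, an alignment objective of this kind could look like the following minimal PyTorch sketch; the projection head, the pairing of embeddings, and the cosine form of the loss are all assumptions, not VaLR's actual training code.

```python
import torch.nn.functional as F

def vision_alignment_loss(mllm_latents, vision_feats, proj):
    """mllm_latents: (B, T, D_lm) intermediate LM embeddings;
    vision_feats: (B, T, D_v) paired vision-encoder features;
    proj: an nn.Linear(D_lm, D_v) mapping LM space into vision space."""
    z = F.normalize(proj(mllm_latents), dim=-1)
    v = F.normalize(vision_feats, dim=-1)
    # Maximize cosine similarity between paired embeddings.
    return (1.0 - (z * v).sum(dim=-1)).mean()
```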
- Learning Modal-Mixed Chain-of-Thought Reasoning with Latent Embeddings [39.4633015395276]
We study how to extend chain-of-thought (CoT) beyond language to better handle multimodal reasoning. We introduce modal-mixed CoT, which interleaves textual tokens with compact visual sketches represented as latent embeddings. Our method yields better performance than language-only and other CoT methods.
arXiv Detail & Related papers (2026-01-31T07:36:38Z)
- Monet: Reasoning in Latent Visual Space Beyond Images and Language [55.424507246294326]
"Thinking with images" has emerged as an effective paradigm for advancing visual reasoning.<n>Existing methods fall short of human-like abstract visual thinking.<n>We introduce Monet, a training framework that enables multimodal large language models to reason directly within the latent visual space.
arXiv Detail & Related papers (2025-11-26T13:46:39Z)
- Rethinking the Text-Vision Reasoning Imbalance in MLLMs through the Lens of Training Recipes [54.374410871041164]
Multimodal large language models (MLLMs) have demonstrated strong capabilities on vision-and-language tasks. Recent findings reveal an imbalance in their reasoning capabilities across visual and textual modalities. We refer to this phenomenon as the modality gap, defined as the performance disparity between text-centric and vision-centric inputs; a toy calculation of this gap appears after the entry.
arXiv Detail & Related papers (2025-10-26T21:06:13Z)
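As a toy illustration of that definition (not from the paper), the gap is simply the score difference between the two input framings:

```python
def modality_gap(acc_text_centric: float, acc_vision_centric: float) -> float:
    """Performance disparity between text-centric and vision-centric inputs."""
    return acc_text_centric - acc_vision_centric

# Invented numbers, purely for illustration:
print(round(modality_gap(78.4, 61.9), 1))  # 16.5 points in favor of text
```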
- Perceptual Decoupling for Scalable Multi-modal Reasoning via Reward-Optimized Captioning [78.17782197231325]
We propose a reasoning-guided reinforcement learning strategy that aligns the extractor's captioning behavior with the reasoning objective; a sketch of such a reward appears after the entry. Experiments on multi-modal math and science benchmarks show that the proposed RACRO method achieves state-of-the-art average performance.
arXiv Detail & Related papers (2025-06-05T02:28:07Z)
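From the summary alone, a reasoning-guided caption reward could be sketched as below; `captioner`, `reasoner`, and their methods are hypothetical stand-ins, and RACRO's actual reward and RL update may differ.

```python
def caption_reward(captioner, reasoner, image, question, gold_answer):
    """Reward the extractor's caption by whether a downstream reasoner,
    given only that caption, answers the question correctly.
    (Hypothetical interfaces; illustrative only.)"""
    caption = captioner.generate(image)                  # extractor proposes a caption
    answer = reasoner.answer(question, context=caption)  # reasoner sees text only
    return 1.0 if answer.strip() == gold_answer.strip() else 0.0
```

The scalar reward can then drive a standard policy-gradient update of the captioner, so its descriptions come to carry exactly what the reasoner needs.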
- Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning [53.790502697674754]
We propose Take-along Visual Conditioning (TVC), a strategy that shifts image input to critical reasoning stages; a sketch of this idea appears after the entry. TVC helps the model retain attention to the visual components throughout the reasoning process. Our approach achieves state-of-the-art performance on average across five mathematical reasoning benchmarks.
arXiv Detail & Related papers (2025-03-17T16:45:12Z)
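A hedged sketch of the take-along idea as described above: re-inject the image embeddings at each reasoning stage so the model's recent context always contains visual evidence. `decode_stage` is a hypothetical helper, not TVC's published interface.

```python
import torch

def generate_with_tvc(model, prompt_embeds, image_embeds, n_stages=3):
    """decode_stage is assumed to autoregressively decode one reasoning
    stage and return the embeddings of its tokens (hypothetical helper)."""
    ctx = torch.cat([image_embeds, prompt_embeds], dim=1)
    for _ in range(n_stages):
        stage_embeds = model.decode_stage(inputs_embeds=ctx)
        # "Take along" the image: re-inject its embeddings before the next
        # stage, refreshing attention to the visual input.
        ctx = torch.cat([ctx, stage_embeds, image_embeds], dim=1)
    return ctx
```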
- Interleaved-Modal Chain-of-Thought [14.342351827047862]
Chain-of-Thought prompting elicits a series of intermediate reasoning steps before arriving at the final answer. We propose an image-incorporated multimodal Chain-of-Thought, named Interleaved-modal Chain-of-Thought (ICoT). ICoT generates sequential reasoning steps consisting of paired visual and textual rationales to infer the final answer.
arXiv Detail & Related papers (2024-11-29T06:06:35Z)
- Browse and Concentrate: Comprehending Multimodal Content via prior-LLM Context Fusion [70.9767518332692]
Multimodal Large Language Models (MLLMs) that incorporate LLMs with pre-trained vision models have recently demonstrated impressive performance across diverse vision-language tasks. However, they fall short in comprehending context that involves multiple images. We propose a two-phase paradigm, browse-and-concentrate, to enable in-depth multimodal context fusion; a sketch of the two phases appears after the entry.
arXiv Detail & Related papers (2024-02-19T14:59:07Z)
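From the summary alone, the two phases might be sketched as follows; `brief_look` and `answer` are hypothetical stand-ins for the paper's browsing and fusion modules.

```python
def browse_and_concentrate(mllm, images, question):
    """Two-phase flow (illustrative): a quick 'browse' over each image
    collects context notes; a 'concentrate' pass answers with the fused
    context available before (prior to) the main LLM reasoning."""
    notes = [mllm.brief_look(img, question) for img in images]  # phase 1: browse
    fused_context = " ".join(notes)
    return mllm.answer(images, question, context=fused_context)  # phase 2: concentrate
```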
This list is automatically generated from the titles and abstracts of the papers on this site.