Learning Modal-Mixed Chain-of-Thought Reasoning with Latent Embeddings
- URL: http://arxiv.org/abs/2602.00574v1
- Date: Sat, 31 Jan 2026 07:36:38 GMT
- Title: Learning Modal-Mixed Chain-of-Thought Reasoning with Latent Embeddings
- Authors: Yifei Shao, Kun Zhou, Ziming Xu, Mohammad Atif Quamar, Shibo Hao, Zhen Wang, Zhiting Hu, Biwei Huang
- Abstract summary: We study how to extend chain-of-thought (CoT) beyond language to better handle multimodal reasoning. We introduce modal-mixed CoT, which interleaves textual tokens with compact visual sketches represented as latent embeddings. Our method yields better performance than language-only and other CoT methods.
- Score: 39.4633015395276
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study how to extend chain-of-thought (CoT) beyond language to better handle multimodal reasoning. While CoT helps LLMs and VLMs articulate intermediate steps, its text-only form often fails on vision-intensive problems where key intermediate states are inherently visual. We introduce modal-mixed CoT, which interleaves textual tokens with compact visual sketches represented as latent embeddings. To bridge the modality gap without eroding the original knowledge and capability of the VLM, we use the VLM itself as an encoder and train the language backbone to reconstruct its own intermediate vision embeddings, guaranteeing semantic alignment of the visual latent space. We further attach a diffusion-based latent decoder, invoked by a special control token and conditioned on hidden states from the VLM. In this way, the diffusion head carries fine-grained perceptual details while the VLM specifies high-level intent, which cleanly disentangles roles and reduces the optimization pressure on the VLM. Training proceeds in two stages: supervised fine-tuning on traces that interleave text and latents with a joint next-token and latent-reconstruction objective, followed by reinforcement learning that teaches when to switch modalities and how to compose long reasoning chains. Extensive experiments across 11 diverse multimodal reasoning tasks demonstrate that our method outperforms language-only and other CoT methods. Our code will be publicly released.
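The abstract specifies the stage-one objective concretely enough to sketch: next-token cross-entropy on text positions plus reconstruction of the VLM's own intermediate vision embeddings at latent positions, with a special control token marking the latent slots. The PyTorch sketch below is a minimal rendering under assumed shapes; `LATENT_TOKEN_ID`, `LAMBDA_REC`, the MSE form of the reconstruction term, and the `latent_proj` head are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F

LATENT_TOKEN_ID = 32000   # hypothetical id for the special control token
LAMBDA_REC = 1.0          # assumed weight on the latent-reconstruction term

def joint_sft_loss(logits, hidden, input_ids, labels, target_latents, latent_proj):
    """Joint next-token + latent-reconstruction objective from the abstract.
    logits: (B, T, V); hidden: (B, T, D) backbone hidden states;
    input_ids/labels: (B, T), with labels = -100 at latent slots;
    target_latents: (N, E) the VLM's own intermediate vision embeddings;
    latent_proj: nn.Linear(D, E) head predicting latents from hidden states."""
    # Text positions: standard shifted cross-entropy; latent slots masked out.
    ce = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )
    # Latent positions: reconstruct the VLM's own vision embeddings so the
    # visual latent space stays semantically aligned with its encoder.
    latent_mask = input_ids == LATENT_TOKEN_ID       # (B, T) bool
    pred = latent_proj(hidden[latent_mask])          # (N, E)
    rec = F.mse_loss(pred, target_latents)
    return ce + LAMBDA_REC * rec
```

Per the abstract, at inference the same control token invokes a diffusion-based latent decoder conditioned on the backbone's hidden states, so the VLM specifies high-level intent while the diffusion head carries the perceptual detail.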
Related papers
- Seeing Beyond Words: Self-Supervised Visual Learning for Multimodal Large Language Models [53.06230963851451]
JARVIS is a JEPA-inspired framework for self-supervised visual enhancement in MLLMs.
arXiv Detail & Related papers (2025-12-17T19:01:34Z)
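The JARVIS entry above identifies the mechanism only as "JEPA-inspired". For context, the sketch below shows a generic joint-embedding predictive (JEPA) objective: predict a momentum-updated target encoder's latents for the full view from a masked view. The module layout, predictor shape, and EMA rate are generic JEPA conventions, not details taken from the JARVIS paper.

```python
import copy
import torch
import torch.nn as nn

class JEPALoss(nn.Module):
    """Generic JEPA-style objective (illustrative, not JARVIS-specific)."""
    def __init__(self, encoder: nn.Module, dim: int, ema: float = 0.996):
        super().__init__()
        self.context_encoder = encoder
        self.target_encoder = copy.deepcopy(encoder)  # momentum copy, no grads
        for p in self.target_encoder.parameters():
            p.requires_grad_(False)
        self.predictor = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.ema = ema

    def forward(self, masked_view: torch.Tensor, full_view: torch.Tensor):
        # Predict latent features of the full view from a masked/partial view.
        pred = self.predictor(self.context_encoder(masked_view))
        with torch.no_grad():
            target = self.target_encoder(full_view)
        return nn.functional.mse_loss(pred, target)

    @torch.no_grad()
    def update_target(self):
        # Momentum (EMA) update of the target encoder after each step.
        for p_t, p_c in zip(self.target_encoder.parameters(),
                            self.context_encoder.parameters()):
            p_t.mul_(self.ema).add_(p_c, alpha=1 - self.ema)
```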
- Semore: VLM-guided Enhanced Semantic Motion Representations for Visual Reinforcement Learning [11.901989132359676]
We introduce Enhanced Semantic Motion Representations (Semore), a new VLM-based framework for visual reinforcement learning (RL). Semore simultaneously extracts semantic and motion representations from RGB flows through a dual-path backbone. Our method is more efficient and adaptive than state-of-the-art methods.
arXiv Detail & Related papers (2025-12-04T16:54:41Z)
- Reasoning in the Dark: Interleaved Vision-Text Reasoning in Latent Space [66.76138204796497]
Multimodal reasoning aims to enhance the capabilities of MLLMs by incorporating intermediate reasoning steps before reaching the final answer. We propose Interleaved Vision-Text Latent Reasoning (IVT-LR), which injects both visual and textual information into the reasoning process within the latent space. Experiments on M3CoT and ScienceQA demonstrate that IVT-LR achieves an average accuracy increase of 5.45% while running more than 5 times faster than existing approaches.
arXiv Detail & Related papers (2025-10-14T14:58:25Z)
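The IVT-LR entry above says reasoning steps run in latent space with both modalities injected, but gives no recipe. One plausible rendering is sketched below: at designated latent steps, feed the last hidden state plus pooled visual features back in as the next input embedding instead of decoding a token. The additive mixing rule, the HF-style model interface, and `latent_steps` are assumptions for illustration only.

```python
import torch

def latent_reasoning_steps(model, inputs_embeds, visual_feats, latent_steps=4):
    """Run a few reasoning steps in latent space before resuming decoding.
    model: any HF-style causal LM accepting `inputs_embeds`;
    inputs_embeds: (B, T, D); visual_feats: (B, N, D), D = hidden size.
    No KV cache is used here for simplicity."""
    for _ in range(latent_steps):
        out = model(inputs_embeds=inputs_embeds, output_hidden_states=True)
        last_hidden = out.hidden_states[-1][:, -1:, :]   # (B, 1, D)
        visual = visual_feats.mean(dim=1, keepdim=True)  # (B, 1, D) pooled
        next_embed = last_hidden + visual                # inject both modalities
        inputs_embeds = torch.cat([inputs_embeds, next_embed], dim=1)
    return inputs_embeds  # continue with normal token decoding afterwards
```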
- Rationale-Enhanced Decoding for Multi-modal Chain-of-Thought [11.538345159297839]
Chain-of-thought (CoT) prompting has been adapted for large vision-language models (LVLMs) to enhance multi-modal reasoning. However, existing LVLMs often ignore the contents of the generated rationales during CoT reasoning. We propose rationale-enhanced decoding (RED), a novel plug-and-play inference-time decoding strategy.
arXiv Detail & Related papers (2025-07-10T12:07:13Z)
- CAMA: Enhancing Multimodal In-Context Learning with Context-Aware Modulated Attention [32.07189678228538]
Multimodal in-context learning (ICL) is emerging as a key capability that enables large vision-language models (LVLMs) to adapt to novel tasks without parameter updates. Yet ICL remains unstable even with well-matched in-context demonstrations (ICDs), suggesting that LVLMs struggle to fully utilize the provided context. We propose Context-Aware Modulated Attention (CAMA), a plug-and-play, training-free method that dynamically modulates an LVLM's attention logits based on the input in-context sequence.
arXiv Detail & Related papers (2025-05-21T04:25:23Z)
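The CAMA entry above states only that attention logits are modulated from the in-context sequence. A generic rendering of that idea follows: bias pre-softmax attention scores toward demonstration positions. The additive form and the scalar `beta` are illustrative assumptions, not CAMA's published formula.

```python
import torch

def modulated_attention(q, k, icd_mask, beta=1.0):
    """Attention with a bias toward in-context-demonstration (ICD) tokens.
    q: (B, H, Tq, d) queries; k: (B, H, Tk, d) keys;
    icd_mask: (B, Tk) with 1.0 at ICD token positions, 0.0 elsewhere."""
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5  # (B, H, Tq, Tk)
    bias = beta * icd_mask[:, None, None, :]              # broadcast over heads/queries
    return torch.softmax(scores + bias, dim=-1)
```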
- AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Document Understanding [79.43306110124875]
AlignVLM is a vision-text alignment method that maps visual features to a weighted average of text embeddings. Our experiments show that AlignVLM achieves state-of-the-art performance compared to prior alignment methods.
arXiv Detail & Related papers (2025-02-03T13:34:51Z)
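The AlignVLM entry above states the mechanism directly: each visual feature becomes a weighted average of text embeddings, so projected features land inside the language latent space. A minimal sketch of such a connector is below; the linear vision-to-vocabulary projection and the frozen embedding table are illustrative choices (the actual method may differ in both).

```python
import torch
import torch.nn as nn

class AlignConnector(nn.Module):
    """Map visual features to convex combinations of the LLM's text embeddings."""
    def __init__(self, vision_dim: int, text_embeddings: torch.Tensor):
        super().__init__()
        vocab_size, _ = text_embeddings.shape
        self.to_vocab = nn.Linear(vision_dim, vocab_size)  # vision -> vocab weights
        self.register_buffer("text_emb", text_embeddings)  # (V, D), frozen here

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # (B, N, vision_dim) -> (B, N, V) convex weights -> (B, N, D)
        weights = torch.softmax(self.to_vocab(vision_feats), dim=-1)
        return weights @ self.text_emb
```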
- Interleaved-Modal Chain-of-Thought [14.342351827047862]
Chain-of-thought (CoT) prompting elicits a series of intermediate reasoning steps before arriving at the final answer. We propose an image-incorporated multimodal chain-of-thought, named Interleaved-modal Chain-of-Thought (ICoT). ICoT generates sequential reasoning steps consisting of paired visual and textual rationales to infer the final answer.
arXiv Detail & Related papers (2024-11-29T06:06:35Z)
- Towards Multimodal In-Context Learning for Vision & Language Models [21.69457980865084]
State-of-the-art Vision-Language Models (VLMs) ground the vision and language modalities. We propose a simple yet surprisingly effective multi-turn, curriculum-based learning methodology with effective data mixes.
arXiv Detail & Related papers (2024-03-19T13:53:37Z)
- Prompt Highlighter: Interactive Control for Multi-Modal LLMs [50.830448437285355]
This study targets a critical aspect of multi-modal LLMs' (LLMs & VLMs) inference: explicit, controllable text generation. We introduce a novel inference method, Prompt Highlighter, which lets users highlight specific prompt spans to interactively control the focus during generation. We find that guiding the model toward the highlighted tokens through the attention weights leads to more desired outputs.
arXiv Detail & Related papers (2023-12-07T13:53:29Z)
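The Prompt Highlighter entry above says highlighted spans steer generation through the attention weights. A minimal sketch of that idea follows: build a token mask from user-selected spans and add a log-scale bias to the pre-softmax scores, which multiplies the highlighted keys' attention weights by `gamma` before renormalization. Both the helper and the multiplicative form are illustrative assumptions, not the paper's exact mechanism.

```python
import math
import torch

def highlight_mask(input_ids: torch.Tensor, spans):
    """Build a per-token mask from highlighted prompt spans
    (half-open start/end token indices). Hypothetical helper."""
    mask = torch.zeros_like(input_ids, dtype=torch.float)
    for start, end in spans:
        mask[:, start:end] = 1.0
    return mask

def reweight_scores(scores: torch.Tensor, mask: torch.Tensor, gamma: float = 2.0):
    # scores: (B, H, Tq, Tk) pre-softmax attention logits; mask: (B, Tk).
    # Adding log(gamma) at highlighted keys scales their post-softmax
    # attention weight by gamma (before renormalization).
    return scores + math.log(gamma) * mask[:, None, None, :]
```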
- LION : Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge [58.82222646803248]
Multimodal Large Language Models (MLLMs) have endowed LLMs with the ability to perceive and understand multi-modal signals.
Most existing MLLMs adopt vision encoders pretrained on coarsely aligned image-text pairs, leading to insufficient extraction and reasoning of visual knowledge.
We propose a dual-Level vIsual knOwledge eNhanced Multimodal Large Language Model (LION), which empowers the MLLM by injecting visual knowledge at two levels.
arXiv Detail & Related papers (2023-11-20T15:56:44Z)