CoCoVa: Chain of Continuous Vision-Language Thought for Latent Space Reasoning
- URL: http://arxiv.org/abs/2511.02360v1
- Date: Tue, 04 Nov 2025 08:28:46 GMT
- Title: CoCoVa: Chain of Continuous Vision-Language Thought for Latent Space Reasoning
- Authors: Jizheng Ma, Xiaofei Zhou, Yanlong Song, Han Yan
- Abstract summary: CoCoVa is a novel framework for vision-language models that leverages continuous cross-modal reasoning for diverse vision-language tasks. The core of CoCoVa is an iterative reasoning cycle, in which a novel Latent Q-Former acts as a dynamic reasoning engine. We train the model with a multi-task objective that combines contrastive learning and diffusion-based reconstruction.
- Score: 22.835301879575002
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In human cognition, there exist numerous thought processes that are tacit and beyond verbal expression, enabling us to understand and interact with the world in multiple ways. However, contemporary Vision-Language Models (VLMs) remain constrained to reasoning within the discrete and rigid space of linguistic tokens, thereby bottlenecking the rich, high-dimensional nature of visual perception. To bridge this gap, we propose CoCoVa (Chain of Continuous Vision-Language Thought), a novel framework for vision-language models that leverages continuous cross-modal reasoning for diverse vision-language tasks. The core of CoCoVa is an iterative reasoning cycle, where a novel Latent Q-Former (LQ-Former) acts as a dynamic reasoning engine, iteratively refining a chain of latent thought vectors through cross-modal fusion. To focus this process, a token selection mechanism dynamically identifies salient visual regions, mimicking attentional focus. To ensure these latent thoughts remain grounded, we train the model with a multi-task objective that combines contrastive learning and diffusion-based reconstruction, enforcing alignment between latent representations and both visual and textual modalities. Evaluations show CoCoVa improves accuracy and token efficiency over strong baselines. With a 1.5B backbone, it competes with or surpasses larger 7B-9B models on almost all benchmarks. When scaled to 7B LLM backbones, it remains competitive with state-of-the-art models. Qualitative analysis validates that the learned latent space captures interpretable and structured reasoning patterns, highlighting the potential of CoCoVa to bridge the representational gap between discrete language processing and the continuous nature of visual understanding.
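The abstract describes an iterative cycle: select salient visual tokens, then let the LQ-Former refine a chain of latent thought vectors via cross-modal fusion. The toy NumPy sketch below illustrates only that control flow under invented assumptions; all function names, the attention form, and the dimensions are hypothetical placeholders, and the real LQ-Former is a trained transformer module, not this hand-rolled update.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def select_salient_tokens(visual_tokens, latents, k):
    # Toy stand-in for the token selection mechanism: score each visual
    # token by similarity to the mean latent thought, keep the top-k.
    scores = visual_tokens @ latents.mean(axis=0)
    idx = np.argsort(scores)[-k:]
    return visual_tokens[idx]

def lq_former_step(latents, visual_tokens, text_emb):
    # Toy stand-in for one LQ-Former fusion step: latents attend over the
    # selected visual tokens plus a text embedding, with a residual update.
    context = np.vstack([visual_tokens, text_emb[None, :]])
    attn = softmax(latents @ context.T / np.sqrt(latents.shape[1]))
    return latents + attn @ context

def cocova_cycle(visual_tokens, text_emb, n_latents=4, steps=3, k=8):
    # Iterative reasoning cycle: repeat (select salient tokens, refine
    # the chain of latent thought vectors) for a fixed number of steps.
    d = visual_tokens.shape[1]
    latents = rng.normal(size=(n_latents, d)) * 0.02
    for _ in range(steps):
        focus = select_salient_tokens(visual_tokens, latents, k)
        latents = lq_former_step(latents, focus, text_emb)
    return latents

visual = rng.normal(size=(32, 16))   # toy patch embeddings
text = rng.normal(size=(16,))        # toy instruction embedding
thoughts = cocova_cycle(visual, text)
print(thoughts.shape)  # (4, 16): one row per latent thought vector
```

The grounding losses (contrastive alignment and diffusion-based reconstruction) are omitted here; in the paper they are what keep such latent thoughts tied to the visual and textual modalities.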
Related papers
- Multimodal Latent Reasoning via Hierarchical Visual Cues Injection [16.779425236020433]
This work posits that robust reasoning should evolve within a latent space, integrating multimodal signals seamlessly. We propose a novel framework that instills deliberate, "slow thinking" without depending on superficial textual rationales. We show that test-time scaling is effective when incorporating vision knowledge, and that integrating hierarchical information significantly enhances the model's understanding of complex scenes.
arXiv Detail & Related papers (2026-02-05T06:31:12Z) - Toward Cognitive Supersensing in Multimodal Large Language Model [67.15559571626747]
We introduce Cognitive Supersensing, a training paradigm that endows MLLMs with human-like visual imagery capabilities. In experiments, MLLMs trained with Cognitive Supersensing significantly outperform state-of-the-art baselines on CogSense-Bench. We will open-source the CogSense-Bench and our model weights.
arXiv Detail & Related papers (2026-02-02T02:19:50Z) - Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space [46.05748768260013]
We propose a test-time Dynamic Multimodal Latent Reasoning (DMLR) framework. It applies confidence-guided latent policy gradient optimization to latent think tokens for in-depth reasoning. Experiments across seven multimodal reasoning benchmarks and various model architectures demonstrate that DMLR significantly improves reasoning and perception performance.
arXiv Detail & Related papers (2025-12-14T10:07:45Z) - Monet: Reasoning in Latent Visual Space Beyond Images and Language [55.424507246294326]
"Thinking with images" has emerged as an effective paradigm for advancing visual reasoning. Existing methods fall short of human-like abstract visual thinking. We introduce Monet, a training framework that enables multimodal large language models to reason directly within the latent visual space.
arXiv Detail & Related papers (2025-11-26T13:46:39Z) - ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning [76.95203056566191]
Multimodal reasoning requires iterative coordination between language and vision, yet it remains unclear what constitutes a meaningful interleaved chain of thought. We build ThinkMorph, a unified model fine-tuned on approximately 24K high-quality interleaved reasoning traces spanning tasks with varying visual engagement. ThinkMorph learns to generate progressive text-image reasoning steps that concretely manipulate visual content while maintaining coherent verbal logic.
arXiv Detail & Related papers (2025-10-30T17:51:38Z) - From Perception to Cognition: A Survey of Vision-Language Interactive Reasoning in Multimodal Large Language Models [66.36007274540113]
Multimodal Large Language Models (MLLMs) strive to achieve a profound, human-like understanding of and interaction with the physical world. However, they often exhibit a shallow and incoherent integration when acquiring information (Perception) and conducting reasoning (Cognition). This survey introduces a novel and unified analytical framework: "From Perception to Cognition".
arXiv Detail & Related papers (2025-09-29T18:25:40Z) - Multimodal Chain of Continuous Thought for Latent-Space Reasoning in Vision-Language Models [1.9950682531209158]
We propose the Multimodal Chain of Continuous Thought (MCOUT), which enables reasoning directly in a joint latent space rather than in natural language. We show that MCOUT consistently improves multimodal reasoning, yielding up to 8.23% accuracy gains over strong baselines. These findings highlight latent continuous reasoning as a promising direction for advancing LMMs beyond language-bound CoT.
arXiv Detail & Related papers (2025-08-18T02:50:20Z) - Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing [62.447497430479174]
Drawing to reason in space is a novel paradigm that enables LVLMs to reason through elementary drawing operations in the visual space. Our model, named VILASR, consistently outperforms existing methods across diverse spatial reasoning benchmarks.
arXiv Detail & Related papers (2025-06-11T17:41:50Z) - Imagine while Reasoning in Space: Multimodal Visualization-of-Thought [70.74453180101365]
Chain-of-Thought (CoT) prompting has proven highly effective for enhancing complex reasoning in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs). We propose a new reasoning paradigm, Multimodal Visualization-of-Thought (MVoT). It enables visual thinking in MLLMs by generating image visualizations of their reasoning traces.
arXiv Detail & Related papers (2025-01-13T18:23:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.