ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning
- URL: http://arxiv.org/abs/2510.27492v2
- Date: Tue, 04 Nov 2025 13:29:38 GMT
- Title: ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning
- Authors: Jiawei Gu, Yunzhuo Hao, Huichen Will Wang, Linjie Li, Michael Qizhe Shieh, Yejin Choi, Ranjay Krishna, Yu Cheng,
- Abstract summary: Multimodal reasoning requires iterative coordination between language and vision, yet it remains unclear what constitutes a meaningful interleaved chain of thought.<n>We build ThinkMorph, a unified model fine-tuned on approximately 24K high-quality interleaved reasoning traces spanning tasks with varying visual engagement.<n>ThinkMorph learns to generate progressive text-image reasoning steps that concretely manipulate visual content while maintaining coherent verbal logic.
- Score: 76.95203056566191
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal reasoning requires iterative coordination between language and vision, yet it remains unclear what constitutes a meaningful interleaved chain of thought. We posit that text and image thoughts should function as complementary rather than isomorphic modalities that mutually advance reasoning. Guided by this principle, we build ThinkMorph, a unified model fine-tuned on approximately 24K high-quality interleaved reasoning traces spanning tasks with varying visual engagement. ThinkMorph learns to generate progressive text-image reasoning steps that concretely manipulate visual content while maintaining coherent verbal logic. It delivers large gains on vision-centric benchmarks (averaging 34.7 percent over the base model) and generalizes to out-of-domain tasks, matching or surpassing larger and proprietary VLMs. Beyond performance, ThinkMorph exhibits emergent multimodal intelligence, including unseen visual manipulation skills, adaptive switching between reasoning modes, and better test-time scaling through diversified multimodal thoughts. These findings suggest promising directions for characterizing the emergent capabilities of unified models for multimodal reasoning.
Related papers
- SwimBird: Eliciting Switchable Reasoning Mode in Hybrid Autoregressive MLLMs [27.042422801803344]
We introduce SwimBird, a reasoning-switchable MLLM that switches among three reasoning modes conditioned on the input.<n>We show that SwimBird achieves state-of-the-art results and robust gains over prior fixed-pattern multimodal reasoning methods.
arXiv Detail & Related papers (2026-02-05T18:59:51Z) - Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space [46.05748768260013]
We propose a test-time Dynamic Multimodal Latent Reasoning framework.<n>It employs confidence-guided latent policy gradient optimization to latent think tokens for in-depth reasoning.<n> Experiments across seven multimodal reasoning benchmarks and various model architectures demonstrate that DMLR significantly improves reasoning and perception performance.
arXiv Detail & Related papers (2025-12-14T10:07:45Z) - ROVER: Benchmarking Reciprocal Cross-Modal Reasoning for Omnimodal Generation [79.17352367219736]
ROVER tests the use of one modality to guide, verify, or refine outputs in the other.<n>ROVER is a human-annotated benchmark that explicitly targets reciprocal cross-modal reasoning.
arXiv Detail & Related papers (2025-11-03T02:27:46Z) - BLINK-Twice: You see, but do you observe? A Reasoning Benchmark on Visual Perception [67.89135437537179]
We introduce BLINK-Twice, a vision-centric reasoning benchmark grounded in challenging perceptual tasks.<n>Instead of relying on external knowledge, our tasks require models to reason from visual content alone.<n>Compared to prior perception benchmarks, it moves beyond shallow perception and requires fine-grained observation and analytical reasoning.
arXiv Detail & Related papers (2025-10-10T13:14:13Z) - DeFacto: Counterfactual Thinking with Images for Enforcing Evidence-Grounded and Faithful Reasoning [11.952788515297913]
DeFacto is a counterfactual reasoning framework that jointly enforces accurate answering and faithful reasoning.<n>We develop a pipeline that automatically localizes question-relevant evidence and constructs positive, counterfactual, and random variants.<n> Experiments on diverse benchmarks demonstrate that DeFacto substantially improves both answer accuracy and reasoning faithfulness.
arXiv Detail & Related papers (2025-09-25T08:58:10Z) - Multimodal Chain of Continuous Thought for Latent-Space Reasoning in Vision-Language Models [1.9950682531209158]
We propose the Multimodal Chain of Continuous Thought (MCOUT), which enables reasoning directly in a joint latent space rather than in natural language.<n>We show that MCOUT consistently improves multimodal reasoning, yielding up to 8.23% accuracy gains over strong baselines.<n>These findings highlight latent continuous reasoning as a promising direction for advancing LMMs beyond language-bound CoT.
arXiv Detail & Related papers (2025-08-18T02:50:20Z) - Visual Thoughts: A Unified Perspective of Understanding Multimodal Chain-of-Thought [72.28041169759454]
multimodal chain-of-thought (MCoT) improves performance and interpretability of large vision-language models (LVLMs)<n>We show that MCoT boosts LVLMs by incorporating visual thoughts, which convey image information to the reasoning process regardless of the MCoT format.<n>We also explore the internal nature of visual thoughts, finding that visual thoughts serve as intermediaries between the input image and reasoning to deeper transformer layers.
arXiv Detail & Related papers (2025-05-21T13:29:58Z) - Imagine while Reasoning in Space: Multimodal Visualization-of-Thought [70.74453180101365]
Chain-of-Thought (CoT) prompting has proven highly effective for enhancing complex reasoning in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs)<n>We propose a new reasoning paradigm, Multimodal Visualization-of-Thought (MVoT)<n>It enables visual thinking in MLLMs by generating image visualizations of their reasoning traces.
arXiv Detail & Related papers (2025-01-13T18:23:57Z) - Cantor: Inspiring Multimodal Chain-of-Thought of MLLM [83.6663322930814]
We argue that converging visual context acquisition and logical reasoning is pivotal for tackling visual reasoning tasks.
We propose an innovative multimodal CoT framework, termed Cantor, characterized by a perception-decision architecture.
Our experiments demonstrate the efficacy of the proposed framework, showing significant improvements in multimodal CoT performance.
arXiv Detail & Related papers (2024-04-24T17:59:48Z) - Soft-Prompting with Graph-of-Thought for Multi-modal Representation Learning [45.517215214938844]
Chain-of-thought technique has been received well in multi-modal tasks.
We propose a novel Aggregation-Graph-of-Thought (AGoT) mechanism for soft-prompt tuning in multi-modal representation learning.
arXiv Detail & Related papers (2024-04-06T07:39:44Z) - Visual Chain of Thought: Bridging Logical Gaps with Multimodal
Infillings [61.04460792203266]
We introduce VCoT, a novel method that leverages chain-of-thought prompting with vision-language grounding to bridge the logical gaps within sequential data.
Our method uses visual guidance to generate synthetic multimodal infillings that add consistent and novel information to reduce the logical gaps for downstream tasks.
arXiv Detail & Related papers (2023-05-03T17:58:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.