Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers
- URL: http://arxiv.org/abs/2506.23918v3
- Date: Thu, 03 Jul 2025 16:49:39 GMT
- Title: Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers
- Authors: Zhaochen Su, Peng Xia, Hangyu Guo, Zhenhua Liu, Yan Ma, Xiaoye Qu, Jiaqi Liu, Yanshu Li, Kaide Zeng, Zhengyuan Yang, Linjie Li, Yu Cheng, Heng Ji, Junxian He, Yi R. Fung
- Abstract summary: A similar evolution is now unfolding in AI, marking a paradigm shift from models that merely think about images to those that can truly think with images. This emerging paradigm is characterized by models leveraging visual information as intermediate steps in their thought process, transforming vision from a passive input into a dynamic, manipulable cognitive workspace.
- Score: 90.4459196223986
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Recent progress in multimodal reasoning has been significantly advanced by textual Chain-of-Thought (CoT), a paradigm where models conduct reasoning within language. This text-centric approach, however, treats vision as a static, initial context, creating a fundamental "semantic gap" between rich perceptual data and discrete symbolic thought. Human cognition often transcends language, utilizing vision as a dynamic mental sketchpad. A similar evolution is now unfolding in AI, marking a fundamental paradigm shift from models that merely think about images to those that can truly think with images. This emerging paradigm is characterized by models leveraging visual information as intermediate steps in their thought process, transforming vision from a passive input into a dynamic, manipulable cognitive workspace. In this survey, we chart this evolution of intelligence along a trajectory of increasing cognitive autonomy, which unfolds across three key stages: from external tool exploration, through programmatic manipulation, to intrinsic imagination. To structure this rapidly evolving field, our survey makes four key contributions. (1) We establish the foundational principles of the think-with-images paradigm and its three-stage framework. (2) We provide a comprehensive review of the core methods that characterize each stage of this roadmap. (3) We analyze the critical landscape of evaluation benchmarks and transformative applications. (4) We identify significant challenges and outline promising future directions. By providing this structured overview, we aim to offer a clear roadmap for future research towards more powerful and human-aligned multimodal AI.
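The abstract's notion of vision as a "dynamic, manipulable cognitive workspace" can be made concrete with a small sketch. The following is a minimal, hypothetical illustration of the first stage (external tool exploration): a reasoning loop in which a multimodal model interleaves textual thoughts with visual operations that rewrite its working image. The `MultimodalModel.generate_step` interface, the `crop`/`zoom` tools, and the step format are assumptions made for illustration, not an API from the survey or any of the cited papers.

```python
# Hypothetical sketch of "thinking with images" (stage 1: external tool exploration).
# The model alternates between textual thought and visual actions that modify a
# working image, which is fed back as context for the next step.
# `model.generate_step`, `crop`, and `zoom` are illustrative stand-ins, not a
# real API from the survey or the papers it cites.

from dataclasses import dataclass, field
from PIL import Image


@dataclass
class VisualWorkspace:
    image: Image.Image                             # current working image
    history: list = field(default_factory=list)    # interleaved thoughts and actions


def crop(img: Image.Image, box: tuple) -> Image.Image:
    return img.crop(box)                           # box = (left, upper, right, lower)


def zoom(img: Image.Image, factor: float) -> Image.Image:
    w, h = img.size
    return img.resize((int(w * factor), int(h * factor)))


TOOLS = {"crop": crop, "zoom": zoom}


def think_with_images(model, question: str, image: Image.Image, max_steps: int = 8) -> str:
    ws = VisualWorkspace(image=image)
    for _ in range(max_steps):
        # The model sees the question, the current image, and the trace so far,
        # and emits either a textual thought, a tool call, or a final answer.
        step = model.generate_step(question, ws.image, ws.history)
        ws.history.append(step)
        if step["type"] == "tool_call":
            fn = TOOLS[step["name"]]
            ws.image = fn(ws.image, *step["args"])  # vision as a mutable workspace
        elif step["type"] == "answer":
            return step["text"]
    return "no answer within step budget"
```

Under this reading, the later stages described in the abstract would replace the fixed tool table with model-generated programs (programmatic manipulation) or with images the model synthesizes itself (intrinsic imagination).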
Related papers
- Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing [62.447497430479174]
Drawing to reason in space is a novel paradigm that enables LVLMs to reason through elementary drawing operations in the visual space. Our model, named VILASR, consistently outperforms existing methods across diverse spatial reasoning benchmarks.
arXiv Detail & Related papers (2025-06-11T17:41:50Z)
- Thinking with Generated Images [30.28526622443551]
We present Thinking with Generated Images, a novel paradigm that transforms how large multimodal models (LMMs) engage with visual reasoning. Our approach enables AI models to engage in the kind of visual imagination and iterative refinement that characterizes human creative, analytical, and strategic thinking.
arXiv Detail & Related papers (2025-05-28T16:12:45Z)
- Visual Thoughts: A Unified Perspective of Understanding Multimodal Chain-of-Thought [72.93910800095757]
Multimodal chain-of-thought (MCoT) improves the performance and interpretability of large vision-language models (LVLMs). We show that MCoT boosts LVLMs by incorporating visual thoughts, which convey image information to the reasoning process regardless of the MCoT format. We also explore the internal nature of visual thoughts, finding that they serve as intermediaries between the input image and reasoning in deeper transformer layers.
arXiv Detail & Related papers (2025-05-21T13:29:58Z)
- Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities [22.476740954286836]
We present a comprehensive survey aimed at guiding future research. We review existing unified models, categorizing them into three main architectural paradigms. We also discuss the key challenges facing this nascent field, including tokenization strategy, cross-modal attention, and data.
arXiv Detail & Related papers (2025-05-05T11:18:03Z)
- Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation [14.157948867532832]
We propose a framework for perspective-aware reasoning in vision-language models (VLMs) through mental imagery simulation, named Abstract Perspective Change (APC). Our experiments on synthetic and real-image benchmarks, compared with various VLMs, demonstrate significant improvements in perspective-aware reasoning with our framework.
arXiv Detail & Related papers (2025-04-24T02:41:34Z)
- CMMCoT: Enhancing Complex Multi-Image Comprehension via Multi-Modal Chain-of-Thought and Memory Augmentation [12.008690947774015]
We propose a multi-step reasoning framework that mimics human-like "slow thinking" for multi-image understanding. Its key components are the construction of interleaved multimodal multi-step reasoning chains, which use critical visual region tokens as supervisory signals, and a test-time memory augmentation module that expands the model's reasoning capacity during inference.
arXiv Detail & Related papers (2025-03-07T09:13:17Z)
- Autoregressive Models in Vision: A Survey [119.23742136065307]
This survey comprehensively examines the literature on autoregressive models applied to vision. We divide visual autoregressive models into three general sub-categories: pixel-based, token-based, and scale-based models. We also present a multifaceted categorization of autoregressive models in computer vision, covering image generation, video generation, 3D generation, and multimodal generation.
arXiv Detail & Related papers (2024-11-08T17:15:12Z)
- Where Am I and What Will I See: An Auto-Regressive Model for Spatial Localization and View Prediction [60.964512894143475]
We present Generative Spatial Transformer (GST), a novel auto-regressive framework that jointly addresses spatial localization and view prediction.
Our model simultaneously estimates the camera pose from a single image and predicts the view from a new camera pose, effectively bridging the gap between spatial awareness and visual prediction.
arXiv Detail & Related papers (2024-10-24T17:58:05Z)
- Visualizing and Understanding Contrastive Learning [22.553990823550784]
We design visual explanation methods that contribute towards understanding similarity learning tasks from pairs of images.
We also adapt existing metrics, used to evaluate visual explanations of image classification systems, to suit pairs of explanations.
arXiv Detail & Related papers (2022-06-20T13:01:46Z)
- WenLan 2.0: Make AI Imagine via a Multimodal Foundation Model [74.4875156387271]
We develop a novel foundation model pre-trained with huge multimodal (visual and textual) data.
We show that state-of-the-art results can be obtained on a wide range of downstream tasks.
arXiv Detail & Related papers (2021-10-27T12:25:21Z)
- New Ideas and Trends in Deep Multimodal Content Understanding: A Review [24.576001583494445]
The focus of this survey is on the analysis of two modalities of multimodal deep learning: image and text.
This paper will examine recent multimodal deep models and structures, including auto-encoders, generative adversarial nets and their variants.
arXiv Detail & Related papers (2020-10-16T06:50:54Z)