Visualizing Thought: Conceptual Diagrams Enable Robust Planning in LMMs
- URL: http://arxiv.org/abs/2503.11790v3
- Date: Mon, 29 Sep 2025 09:11:45 GMT
- Title: Visualizing Thought: Conceptual Diagrams Enable Robust Planning in LMMs
- Authors: Nasim Borazjanizadeh, Roei Herzig, Eduard Oks, Trevor Darrell, Rogerio Feris, Leonid Karlinsky,
- Abstract summary: Conceptual diagrams externalize mental models, abstracting irrelevant details to efficiently capture how entities interact.<n>Large Language Models (LLMs) and Large MultiModal Models (LMMs) predominantly reason through text.<n>We propose Visual Thinking, a generalizable framework that enables LMMs to reason through multiple chains of self-generated conceptual diagrams.
- Score: 59.66595230543127
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Human reasoning relies on constructing and manipulating mental models -- simplified internal representations of situations used to understand and solve problems. Conceptual diagrams (e.g., a sketch drawn to aid reasoning) externalize these mental models, abstracting irrelevant details to efficiently capture how entities interact. In contrast, Large Language Models (LLMs) and Large MultiModal Models (LMMs) predominantly reason through text, limiting their effectiveness on complex multi-step tasks. In this paper, we propose Visual Thinking, a generalizable framework that enables LMMs to reason through multiple chains of self-generated conceptual diagrams, significantly enhancing their combinatorial planning capabilities. Our approach requires no human input beyond the natural language description of the task. It integrates textual and diagrammatic reasoning within an optimized Graph-of-Thought inference framework, enhanced by beam search and depth-wise backtracking. Evaluated on multiple challenging PDDL planning domains, our method substantially improves LMM performance (e.g., GPT-4o: 35.5% -> 90.2% in Blocksworld) and consistently outperforms text-only search-based inference methods. On more difficult domains with solution depths up to 40, it also surpasses the o1-preview reasoning model (e.g., 16 percentage points improvement in Floor Tiles). These results demonstrate the power of conceptual diagrams as a reasoning medium in LMMs.
Related papers
- On the Out-of-Distribution Generalization of Reasoning in Multimodal LLMs for Simple Visual Planning Tasks [56.98385132295952]
We evaluate how well chain-of-thought approaches generalize on a simple planning task.<n>We find that reasoning traces which combine multiple text formats yield the best (and non-trivial) OOD generalization.<n> purely text-based models consistently outperform those utilizing image-based inputs.
arXiv Detail & Related papers (2026-02-17T09:51:40Z) - Decoupling Reasoning and Perception: An LLM-LMM Framework for Faithful Visual Reasoning [34.940968264459805]
We introduce a training-free visual-reasoning pipeline for Large Language Models (LLMs)<n>A powerful LLM orchestrates the high-level reasoning, strategically interrogating a LMM to extract specific visual information required for its logical chain.<n>Our framework effectively governs the visual reasoning process, leading to a significant reduction in visually-unfounded reasoning steps and a substantial improvement in reasoning fidelity.
arXiv Detail & Related papers (2025-09-27T14:13:41Z) - Directed-Tokens: A Robust Multi-Modality Alignment Approach to Large Language-Vision Models [28.82265769298008]
We introduce a simple but efficient learning mechanism for improving the robust alignment between visual and textual modalities.<n>The proposed approach consistently achieves state-of-the-art (SoTA) performance compared with prior LMMs.
arXiv Detail & Related papers (2025-08-19T20:53:24Z) - PixelThink: Towards Efficient Chain-of-Pixel Reasoning [70.32510083790069]
PixelThink is a simple yet effective scheme that integrates externally estimated task difficulty and internally measured model uncertainty.<n>It learns to compress reasoning length in accordance with scene complexity and predictive confidence.<n> Experimental results demonstrate that the proposed approach improves both reasoning efficiency and overall segmentation performance.
arXiv Detail & Related papers (2025-05-29T17:55:49Z) - Decoupled Visual Interpretation and Linguistic Reasoning for Math Problem Solving [57.22004912994658]
Current large vision-language models (LVLMs) typically employ a connector module to link visual features with text embeddings of large language models (LLMs)<n>This paper proposes a paradigm shift: instead of training end-to-end vision-language reasoning models, we advocate for developing a decoupled reasoning framework.
arXiv Detail & Related papers (2025-05-23T08:18:00Z) - Embodied-R: Collaborative Framework for Activating Embodied Spatial Reasoning in Foundation Models via Reinforcement Learning [58.86928947970342]
Embodied-R is a framework combining large-scale Vision-Language Models for perception and small-scale Language Models for reasoning.
After training on only 5k embodied video samples, Embodied-R with a 3B LM matches state-of-the-art multimodal reasoning models.
Embodied-R also exhibits emergent thinking patterns such as systematic analysis and contextual integration.
arXiv Detail & Related papers (2025-04-17T06:16:11Z) - Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning [53.790502697674754]
We propose Take-along Visual Conditioning (TVC), a strategy that shifts image input to critical reasoning stages.
TVC helps the model retain attention to the visual components throughout the reasoning.
Our approach achieves state-of-the-art performance on average across five mathematical reasoning benchmarks.
arXiv Detail & Related papers (2025-03-17T16:45:12Z) - ReasonGraph: Visualisation of Reasoning Paths [28.906801344540458]
ReasonGraph is a web-based platform for visualizing and analyzing Large Language Models (LLMs) reasoning processes.
It supports both sequential and tree-based reasoning methods while integrating with major LLM providers and over fifty state-of-the-art models.
arXiv Detail & Related papers (2025-03-06T00:03:55Z) - VOILA: Evaluation of MLLMs For Perceptual Understanding and Analogical Reasoning [63.0285363282581]
Multimodal Large Language Models (MLLMs) have become a powerful tool for integrating visual and textual information.<n>We introduce VOILA, a benchmark designed to evaluate MLLMs' perceptual understanding and abstract relational reasoning.<n>We reveal that current MLLMs struggle to comprehend inter-image relationships and exhibit limited capabilities in high-level relational reasoning.
arXiv Detail & Related papers (2025-02-25T23:36:19Z) - GraphThought: Graph Combinatorial Optimization with Thought Generation [16.457287060520454]
Graph optimization (GCO) problems are central to domains like logistics and bioinformatics.<n>We first formalize the Optimal Thoughts Design (OTD) problem, which provides a structured guidance for producing high-quality intermediate reasoning steps.<n>We introduce GraphThought, a novel framework that generates effective reasoning sequences through either-guided forward search or solver-aligned backward reasoning.
arXiv Detail & Related papers (2025-02-17T09:50:41Z) - VisionGraph: Leveraging Large Multimodal Models for Graph Theory Problems in Visual Context [41.11701706312843]
We design a benchmark named VisionGraph, used to explore the capabilities of advanced LMMs in solving multimodal graph theory problems.
We present a Description-Program-Reasoning (DPR) chain to enhance the logical accuracy of reasoning processes.
Our study shows that GPT-4V outperforms Gemini Pro in multi-step graph reasoning.
arXiv Detail & Related papers (2024-05-08T10:42:48Z) - Cantor: Inspiring Multimodal Chain-of-Thought of MLLM [83.6663322930814]
We argue that converging visual context acquisition and logical reasoning is pivotal for tackling visual reasoning tasks.
We propose an innovative multimodal CoT framework, termed Cantor, characterized by a perception-decision architecture.
Our experiments demonstrate the efficacy of the proposed framework, showing significant improvements in multimodal CoT performance.
arXiv Detail & Related papers (2024-04-24T17:59:48Z) - Advancing Large Multi-modal Models with Explicit Chain-of-Reasoning and Visual Question Generation [34.45251681923171]
This paper presents a novel approach to develop a large Vision-and-Language Models (VLMs)
We introduce a system that can ask a question to acquire necessary knowledge, thereby enhancing the robustness and explicability of the reasoning process.
The dataset covers a range of tasks, from common ones like caption generation to specialized VQA tasks that require expert knowledge.
arXiv Detail & Related papers (2024-01-18T14:21:56Z) - Guiding Language Model Reasoning with Planning Tokens [122.43639723387516]
Large language models (LLMs) have recently attracted considerable interest for their ability to perform complex reasoning tasks.
We propose a hierarchical generation scheme to encourage a more structural generation of chain-of-thought steps.
Our approach requires a negligible increase in trainable parameters (0.001%) and can be applied through either full fine-tuning or a more parameter-efficient scheme.
arXiv Detail & Related papers (2023-10-09T13:29:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.