Visualizing Thought: Conceptual Diagrams Enable Robust Planning in LMMs
- URL: http://arxiv.org/abs/2503.11790v1
- Date: Fri, 14 Mar 2025 18:27:02 GMT
- Title: Visualizing Thought: Conceptual Diagrams Enable Robust Planning in LMMs
- Authors: Nasim Borazjanizadeh, Roei Herzig, Eduard Oks, Trevor Darrell, Rogerio Feris, Leonid Karlinsky
- Abstract summary: Large Language Models (LLMs) and Large Multimodal Models (LMMs) predominantly reason through textual representations. We propose a zero-shot, fully automatic framework that enables LMMs to reason through multiple chains of self-generated conceptual diagrams.
- Score: 57.66267515456075
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Human reasoning relies on constructing and manipulating mental models, simplified internal representations of situations that we use to understand and solve problems. Conceptual diagrams (for example, sketches drawn by humans to aid reasoning) externalize these mental models, abstracting irrelevant details to efficiently capture relational and spatial information. In contrast, Large Language Models (LLMs) and Large Multimodal Models (LMMs) predominantly reason through textual representations, limiting their effectiveness in complex multi-step combinatorial and planning tasks. In this paper, we propose a zero-shot, fully automatic framework that enables LMMs to reason through multiple chains of self-generated intermediate conceptual diagrams, significantly enhancing their combinatorial planning capabilities. Our approach does not require any human initialization beyond a natural language description of the task. It integrates both textual and diagrammatic reasoning within an optimized graph-of-thought inference framework, enhanced by beam search and depth-wise backtracking. Evaluated on multiple challenging PDDL planning domains, our method substantially improves GPT-4o's performance (for example, from 35.5% to 90.2% in Blocksworld). On more difficult planning domains with solution depths up to 40, our approach outperforms even the o1-preview reasoning model (for example, over 13% improvement in Parking). These results highlight the value of conceptual diagrams as a complementary reasoning medium in LMMs.
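To make the inference procedure concrete, below is a minimal, self-contained Python sketch of depth-wise beam search with backtracking over a graph of thought nodes, where each node pairs a textual step with a conceptual diagram. This is not the authors' implementation: `expand`, `render_diagram`, and the position-matching score are hypothetical stand-ins (a three-block stacking puzzle with ASCII diagrams) for the LMM calls the paper uses to propose and evaluate candidate steps.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# A world state for the toy problem: a tuple of stacks, bottom block first.
Stacks = Tuple[Tuple[str, ...], ...]


@dataclass
class ThoughtNode:
    """One node in the graph of thought: a textual step plus a conceptual diagram."""
    state: Stacks
    text: str        # natural-language description of the step
    diagram: str     # conceptual diagram (here: an ASCII sketch of the stacks)
    parent: Optional["ThoughtNode"] = None
    score: float = 0.0

    def plan(self) -> List[str]:
        """Read the plan back by walking parent links to the root."""
        node, steps = self, []
        while node.parent is not None:
            steps.append(node.text)
            node = node.parent
        return list(reversed(steps))


def render_diagram(state: Stacks) -> str:
    """Stand-in for a self-generated conceptual diagram: draw the stacks in ASCII."""
    height = max((len(s) for s in state), default=0)
    rows = [" ".join(s[lvl] if lvl < len(s) else " " for s in state)
            for lvl in range(height - 1, -1, -1)]
    return "\n".join(rows + ["-" * (2 * len(state) - 1)])


GOAL: Stacks = (("A", "B", "C"), (), ())


def expand(node: ThoughtNode) -> List[ThoughtNode]:
    """Stand-in for the LMM proposing and scoring successor steps: enumerate
    legal block moves and score how many blocks already sit in their goal slot."""
    children = []
    for i, src in enumerate(node.state):
        for j in range(len(node.state)):
            if i == j or not src:
                continue
            stacks = [list(s) for s in node.state]
            block = stacks[i].pop()
            stacks[j].append(block)
            new_state = tuple(tuple(s) for s in stacks)
            score = sum(1 for s, g in zip(new_state, GOAL)
                        for a, b in zip(s, g) if a == b)
            children.append(ThoughtNode(new_state,
                                        f"move {block} from stack {i} to stack {j}",
                                        render_diagram(new_state), node, score))
    return children


def beam_search_with_backtracking(root: ThoughtNode,
                                  is_goal: Callable[[ThoughtNode], bool],
                                  beam_width: int = 3,
                                  max_depth: int = 40) -> Optional[ThoughtNode]:
    """Depth-wise beam search: expand the best `beam_width` candidates at each
    depth; when a depth dead-ends or the depth limit is reached, back up one
    level and retry the parent depth with its next-best beam (backtracking)."""
    if is_goal(root):
        return root
    stack = [([root], 0)]  # one entry per depth: (sorted candidates, beam offset)
    while stack:
        candidates, offset = stack[-1]
        if offset >= len(candidates) or len(stack) - 1 >= max_depth:
            stack.pop()                                  # abandon this depth
            if stack:                                    # depth-wise backtracking:
                prev, prev_off = stack.pop()             # advance the previous depth
                stack.append((prev, prev_off + beam_width))  # to its next-best beam
            continue
        children = []
        for node in candidates[offset:offset + beam_width]:
            for child in expand(node):
                if is_goal(child):
                    return child
                children.append(child)
        children.sort(key=lambda n: n.score, reverse=True)
        stack.append((children, 0))                      # descend one depth
    return None


if __name__ == "__main__":
    start: Stacks = (("A",), ("B",), ("C",))
    root = ThoughtNode(start, "start", render_diagram(start))
    goal = beam_search_with_backtracking(root, lambda n: n.state == GOAL)
    if goal is not None:
        print("plan:", goal.plan())
        print(goal.diagram)
```

In the paper's setting, `expand` would instead query an LMM such as GPT-4o for candidate (text, diagram) continuations and the scores would come from model-based evaluation of each diagram; the toy heuristic here only keeps the example runnable end to end.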
Related papers
- Embodied-R: Collaborative Framework for Activating Embodied Spatial Reasoning in Foundation Models via Reinforcement Learning [58.86928947970342]
Embodied-R is a framework combining large-scale Vision-Language Models for perception and small-scale Language Models for reasoning.
After training on only 5k embodied video samples, Embodied-R with a 3B LM matches state-of-the-art multimodal reasoning models.
Embodied-R also exhibits emergent thinking patterns such as systematic analysis and contextual integration.
arXiv Detail & Related papers (2025-04-17T06:16:11Z) - Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning [53.790502697674754]
We propose Take-along Visual Conditioning (TVC), a strategy that shifts image input to critical reasoning stages.
TVC helps the model retain attention to the visual components throughout the reasoning.
Our approach achieves state-of-the-art performance on average across five mathematical reasoning benchmarks.
arXiv Detail & Related papers (2025-03-17T16:45:12Z) - ReasonGraph: Visualisation of Reasoning Paths [28.906801344540458]
ReasonGraph is a web-based platform for visualizing and analyzing the reasoning processes of Large Language Models (LLMs).
It supports both sequential and tree-based reasoning methods while integrating with major LLM providers and over fifty state-of-the-art models.
arXiv Detail & Related papers (2025-03-06T00:03:55Z) - VOILA: Evaluation of MLLMs For Perceptual Understanding and Analogical Reasoning [63.0285363282581]
Multimodal Large Language Models (MLLMs) have become a powerful tool for integrating visual and textual information.
We introduce VOILA, a benchmark designed to evaluate MLLMs' perceptual understanding and abstract relational reasoning.
We reveal that current MLLMs struggle to comprehend inter-image relationships and exhibit limited capabilities in high-level relational reasoning.
arXiv Detail & Related papers (2025-02-25T23:36:19Z) - VisionGraph: Leveraging Large Multimodal Models for Graph Theory Problems in Visual Context [41.11701706312843]
We design a benchmark named VisionGraph to explore the capabilities of advanced LMMs in solving multimodal graph theory problems.
We present a Description-Program-Reasoning (DPR) chain to enhance the logical accuracy of reasoning processes.
Our study shows that GPT-4V outperforms Gemini Pro in multi-step graph reasoning.
arXiv Detail & Related papers (2024-05-08T10:42:48Z) - Cantor: Inspiring Multimodal Chain-of-Thought of MLLM [83.6663322930814]
We argue that converging visual context acquisition and logical reasoning is pivotal for tackling visual reasoning tasks.
We propose an innovative multimodal CoT framework, termed Cantor, characterized by a perception-decision architecture.
Our experiments demonstrate the efficacy of the proposed framework, showing significant improvements in multimodal CoT performance.
arXiv Detail & Related papers (2024-04-24T17:59:48Z) - Advancing Large Multi-modal Models with Explicit Chain-of-Reasoning and Visual Question Generation [34.45251681923171]
This paper presents a novel approach to developing large Vision-and-Language Models (VLMs).
We introduce a system that can ask a question to acquire necessary knowledge, thereby enhancing the robustness and explicability of the reasoning process.
The dataset covers a range of tasks, from common ones like caption generation to specialized VQA tasks that require expert knowledge.
arXiv Detail & Related papers (2024-01-18T14:21:56Z) - Guiding Language Model Reasoning with Planning Tokens [122.43639723387516]
Large language models (LLMs) have recently attracted considerable interest for their ability to perform complex reasoning tasks.
We propose a hierarchical generation scheme to encourage a more structural generation of chain-of-thought steps.
Our approach requires a negligible increase in trainable parameters (0.001%) and can be applied through either full fine-tuning or a more parameter-efficient scheme.
arXiv Detail & Related papers (2023-10-09T13:29:37Z)