VisuoThink: Empowering LVLM Reasoning with Multimodal Tree Search
- URL: http://arxiv.org/abs/2504.09130v1
- Date: Sat, 12 Apr 2025 08:37:30 GMT
- Title: VisuoThink: Empowering LVLM Reasoning with Multimodal Tree Search
- Authors: Yikun Wang, Siyin Wang, Qinyuan Cheng, Zhaoye Fei, Liang Ding, Qipeng Guo, Dacheng Tao, Xipeng Qiu
- Abstract summary: VisuoThink is a novel framework that seamlessly integrates visuospatial and linguistic domains. It enables progressive visual-textual reasoning and incorporates test-time scaling through look-ahead tree search.
- Score: 89.43196232124883
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in Large Vision-Language Models have showcased remarkable capabilities. However, they often falter when confronted with complex reasoning tasks that humans typically address through visual aids and deliberate, step-by-step thinking. While existing methods have explored text-based slow thinking or rudimentary visual assistance, they fall short of capturing the intricate, interleaved nature of human visual-verbal reasoning processes. To overcome these limitations and inspired by the mechanisms of slow thinking in human cognition, we introduce VisuoThink, a novel framework that seamlessly integrates visuospatial and linguistic domains. VisuoThink facilitates multimodal slow thinking by enabling progressive visual-textual reasoning and incorporates test-time scaling through look-ahead tree search. Extensive experiments demonstrate that VisuoThink significantly enhances reasoning capabilities via inference-time scaling, even without fine-tuning, achieving state-of-the-art performance in tasks involving geometry and spatial reasoning.
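The look-ahead tree search that the abstract credits for test-time scaling can be sketched generically. The `Node`, `expand`, and scoring pieces below are hypothetical stand-ins (in VisuoThink an LVLM would propose and evaluate interleaved visual-textual steps), not the authors' implementation:

```python
# Minimal sketch of look-ahead tree search over reasoning states.
# NOT the paper's implementation: the state representation and the
# expand()/score logic are hypothetical stand-ins for an LVLM
# proposing candidate reasoning steps and estimating their value.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Node:
    steps: List[str]      # reasoning trace so far
    score: float = 0.0    # value estimate of this trace


def lookahead_tree_search(
    root: Node,
    expand: Callable[[Node], List[Node]],  # propose candidate next steps
    depth: int,
    beam_width: int,
) -> Node:
    """Expand the tree up to `depth` levels, keeping the `beam_width`
    best nodes at each level, and return the highest-scoring node."""
    frontier = [root]
    for _ in range(depth):
        children = [c for node in frontier for c in expand(node)]
        if not children:
            break
        children.sort(key=lambda n: n.score, reverse=True)
        frontier = children[:beam_width]
    return max(frontier, key=lambda n: n.score)


# Toy usage: each "step" appends a digit; the score sums the digits,
# so look-ahead with a beam should keep the all-3s trace.
def toy_expand(node: Node) -> List[Node]:
    if len(node.steps) >= 3:
        return []
    return [Node(node.steps + [str(d)], node.score + d) for d in (1, 2, 3)]


best = lookahead_tree_search(Node([]), toy_expand, depth=3, beam_width=2)
print(best.steps)  # -> ['3', '3', '3']
```

More thinking time maps naturally onto larger `depth` and `beam_width`, which is one plausible reading of the "inference-time scaling" the abstract describes.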
Related papers
- Hyperphantasia: A Benchmark for Evaluating the Mental Visualization Capabilities of Multimodal LLMs [22.46006112029019]
Mental visualization is a critical cognitive skill in humans, supporting abilities such as spatial navigation, predicting physical trajectories, and solving complex visual problems through imaginative simulation. We introduce Hyperphantasia, a synthetic benchmark designed to evaluate the mental visualization abilities of Multimodal Large Language Models (MLLMs) through four carefully constructed puzzles. Our comprehensive evaluation of state-of-the-art models reveals a substantial gap between the performance of humans and MLLMs.
arXiv Detail & Related papers (2025-07-16T05:54:37Z) - Reasoning in machine vision: learning to think fast and slow [10.430190333487957]
Reasoning is a hallmark of human intelligence, enabling adaptive decision-making in complex and unfamiliar scenarios. Machine intelligence remains bound to training data, lacking the ability to dynamically refine solutions at inference time. Here we present a novel learning paradigm that enables machine reasoning in vision by allowing performance improvement with increasing thinking time.
arXiv Detail & Related papers (2025-06-27T10:03:05Z) - Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing [62.447497430479174]
Drawing to reason in space is a novel paradigm that enables LVLMs to reason through elementary drawing operations in the visual space. Our model, named VILASR, consistently outperforms existing methods across diverse spatial reasoning benchmarks.
arXiv Detail & Related papers (2025-06-11T17:41:50Z) - Visual Abstract Thinking Empowers Multimodal Reasoning [11.70318717106245]
Images usually convey richer detail than text, but often include redundant information that degrades multimodal reasoning performance. Inspired by this cognitive strategy, we introduce Visual Abstract Thinking (VAT). VAT prompts Multimodal Large Language Models (MLLMs) with visual abstracts instead of explicit verbal thoughts or elaborate guidance. Experimental results show that VAT consistently empowers different models, and achieves an average gain of 17% over the GPT-4o baseline.
arXiv Detail & Related papers (2025-05-26T16:06:35Z) - DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning [11.242852367476015]
DeepEyes is a model with "thinking with images" capabilities incentivized through end-to-end reinforcement learning. We propose a tool-use-oriented data selection mechanism and a reward strategy to encourage successful tool-assisted reasoning trajectories. DeepEyes achieves significant performance gains on fine-grained perception and reasoning benchmarks.
arXiv Detail & Related papers (2025-05-20T13:48:11Z) - Imagine while Reasoning in Space: Multimodal Visualization-of-Thought [70.74453180101365]
Chain-of-Thought (CoT) prompting has proven highly effective for enhancing complex reasoning in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs). We propose a new reasoning paradigm, Multimodal Visualization-of-Thought (MVoT). It enables visual thinking in MLLMs by generating image visualizations of their reasoning traces.
arXiv Detail & Related papers (2025-01-13T18:23:57Z) - Think More, Hallucinate Less: Mitigating Hallucinations via Dual Process of Fast and Slow Thinking [124.69672273754144]
HaluSearch is a novel framework that incorporates tree search-based algorithms. It frames text generation as a step-by-step reasoning process. We introduce a hierarchical thinking system switch mechanism inspired by the dual process theory in cognitive science.
arXiv Detail & Related papers (2025-01-02T15:36:50Z) - Dual Thinking and Logical Processing -- Are Multi-modal Large Language Models Closing the Gap with Human Vision? [5.076961098583674]
We introduce a novel adversarial dataset to provide evidence for the dual thinking framework in human vision. Our psychophysical studies show the presence of multiple inferences in rapid succession. Analysis of errors shows that the early stopping of visual processing can result in missing relevant information.
arXiv Detail & Related papers (2024-06-11T05:50:34Z) - Cantor: Inspiring Multimodal Chain-of-Thought of MLLM [83.6663322930814]
We argue that converging visual context acquisition and logical reasoning is pivotal for tackling visual reasoning tasks.
We propose an innovative multimodal CoT framework, termed Cantor, characterized by a perception-decision architecture.
Our experiments demonstrate the efficacy of the proposed framework, showing significant improvements in multimodal CoT performance.
arXiv Detail & Related papers (2024-04-24T17:59:48Z) - Mind's Eye of LLMs: Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models [71.93366651585275]
Large language models (LLMs) have exhibited impressive performance in language comprehension and various reasoning tasks.
We propose Visualization-of-Thought (VoT) to elicit spatial reasoning of LLMs by visualizing their reasoning traces.
VoT significantly enhances the spatial reasoning abilities of LLMs.
arXiv Detail & Related papers (2024-04-04T17:45:08Z) - What if...?: Thinking Counterfactual Keywords Helps to Mitigate Hallucination in Large Multi-modal Models [50.97705264224828]
We propose Counterfactual Inception, a novel method that implants counterfactual thinking into Large Multi-modal Models.
We aim for the models to engage with and generate responses that span a wider contextual scene understanding.
Comprehensive analyses across various LMMs, including both open-source and proprietary models, corroborate that counterfactual thinking significantly reduces hallucination.
arXiv Detail & Related papers (2024-03-20T11:27:20Z) - Visual cognition in multimodal large language models [12.603212933816206]
Recent advancements have rekindled interest in the potential to emulate human-like cognitive abilities.
This paper evaluates the current state of vision-based large language models in the domains of intuitive physics, causal reasoning, and intuitive psychology.
arXiv Detail & Related papers (2023-11-27T18:58:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.