Mind's Eye of LLMs: Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models
- URL: http://arxiv.org/abs/2404.03622v2
- Date: Fri, 24 May 2024 04:07:44 GMT
- Title: Mind's Eye of LLMs: Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models
- Authors: Wenshan Wu, Shaoguang Mao, Yadong Zhang, Yan Xia, Li Dong, Lei Cui, Furu Wei
- Abstract summary: We propose Visualization-of-Thought (VoT) prompting for large language models (LLMs).
VoT elicits spatial reasoning of LLMs by visualizing their reasoning traces, thereby guiding subsequent reasoning steps.
We employ VoT for multi-hop spatial reasoning tasks, including natural language navigation, visual navigation, and visual tiling in 2D grid worlds.
- Score: 71.93366651585275
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have exhibited impressive performance in language comprehension and various reasoning tasks. However, their abilities in spatial reasoning, a crucial aspect of human cognition, remain relatively unexplored. Humans possess a remarkable ability to create mental images of unseen objects and actions through a process known as the Mind's Eye, enabling the imagination of the unseen world. Inspired by this cognitive capacity, we propose Visualization-of-Thought (VoT) prompting. VoT aims to elicit spatial reasoning in LLMs by visualizing their reasoning traces, thereby guiding subsequent reasoning steps. We employed VoT for multi-hop spatial reasoning tasks, including natural language navigation, visual navigation, and visual tiling in 2D grid worlds. Experimental results demonstrated that VoT significantly enhances the spatial reasoning abilities of LLMs. Notably, VoT outperformed existing multimodal large language models (MLLMs) on these tasks. While VoT works surprisingly well on LLMs, its ability to generate mental images to facilitate spatial reasoning resembles the Mind's Eye process, suggesting its potential viability in MLLMs.
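To make the idea concrete: VoT asks the model to emit a textual "mental map" of the grid state after each reasoning step, rather than reasoning in words alone. The toy renderer below sketches what such a per-step visualization trace looks like for a grid navigation task; the grid size, symbols, and function names are illustrative choices, not taken from the paper.

```python
# Sketch of a VoT-style trace for 2D grid navigation: after every move, the
# current grid state is rendered as ASCII, mimicking the visualization the
# model is prompted to produce at each step. All names here are illustrative.

MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def render(pos, size=3):
    """Render the grid with the agent marked 'A' and empty cells '.'."""
    return "\n".join(
        " ".join("A" if (r, c) == pos else "." for c in range(size))
        for r in range(size)
    )

def vot_trace(start, moves, size=3):
    """Return the final position and one rendered grid per reasoning step."""
    pos = start
    frames = [render(pos, size)]
    for m in moves:
        dr, dc = MOVES[m]
        pos = (max(0, min(size - 1, pos[0] + dr)),
               max(0, min(size - 1, pos[1] + dc)))
        frames.append(render(pos, size))
    return pos, frames

final, frames = vot_trace((0, 0), ["right", "down"])
print(frames[-1])
```

In the actual prompting setup, the LLM itself generates these intermediate grids as part of its output; interleaving them with the textual reasoning is what grounds each subsequent step.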
Related papers
- Language-Specific Neurons: The Key to Multilingual Capabilities in Large Language Models [117.20416338476856]
Large language models (LLMs) demonstrate remarkable multilingual capabilities without being pre-trained on specially curated multilingual parallel corpora.
We propose a novel detection method, language activation probability entropy (LAPE), to identify language-specific neurons within LLMs.
Our findings indicate that LLMs' proficiency in processing a particular language is predominantly due to a small subset of neurons.
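The core of LAPE can be sketched as an entropy score over a neuron's per-language activation probabilities: a neuron that fires mostly for one language has low entropy and is flagged as language-specific. This is a minimal sketch under that assumption; the function name and selection threshold are illustrative, not from the paper.

```python
import math

def lape(probs):
    """Language activation probability entropy for one neuron.

    `probs` holds the neuron's activation probability under each language.
    The probabilities are normalized into a distribution over languages;
    low entropy means the neuron activates almost only for one language.
    """
    total = sum(probs)
    if total == 0:
        return float("inf")  # never activates: not informative
    p = [x / total for x in probs]
    return -sum(x * math.log(x) for x in p if x > 0)

# A neuron dominated by one language scores much lower than a shared neuron:
specific = lape([0.90, 0.01, 0.01])
shared = lape([0.50, 0.50, 0.50])
```

Ranking neurons by this score and keeping the lowest-entropy ones would then yield the small language-specific subset the paper describes.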
arXiv Detail & Related papers (2024-02-26T09:36:05Z) - Rethinking Interpretability in the Era of Large Language Models [76.1947554386879]
Large language models (LLMs) have demonstrated remarkable capabilities across a wide array of tasks.
The capability to explain in natural language allows LLMs to expand the scale and complexity of patterns that can be given to a human.
These new capabilities raise new challenges, such as hallucinated explanations and immense computational costs.
arXiv Detail & Related papers (2024-01-30T17:38:54Z) - VCoder: Versatile Vision Encoders for Multimodal Large Language Models [46.95488342139727]
Multimodal Large Language Models (MLLM) have recently achieved impressive performance on vision-language tasks.
However, when prompted to identify or count (perceive) the entities in a given image, existing MLLM systems fail.
We propose using Versatile vision enCoders (VCoder) as perception eyes for Multimodal LLMs.
arXiv Detail & Related papers (2023-12-21T18:49:47Z) - CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding [66.52659447360104]
We propose CoVLM, which guides the LLM to explicitly compose visual entities and relationships among the text.
arXiv Detail & Related papers (2023-11-06T18:59:44Z) - Large Language Models: The Need for Nuance in Current Debates and a Pragmatic Perspective on Understanding [1.3654846342364308]
Large Language Models (LLMs) are unparalleled in their ability to generate grammatically correct, fluent text.
This position paper critically assesses three points recurring in critiques of LLM capacities.
We outline a pragmatic perspective on the issue of 'real' understanding and intentionality in LLMs.
arXiv Detail & Related papers (2023-10-30T15:51:04Z) - Large Language Models are In-Context Semantic Reasoners rather than Symbolic Reasoners [75.85554779782048]
Large Language Models (LLMs) have excited the natural language and machine learning community over recent years.
Despite numerous successful applications, the underlying mechanism of such in-context capabilities still remains unclear.
In this work, we hypothesize that the learned semantics of language tokens do most of the heavy lifting during the reasoning process.
arXiv Detail & Related papers (2023-05-24T07:33:34Z) - Are LLMs the Master of All Trades?: Exploring Domain-Agnostic Reasoning Skills of LLMs [0.0]
This study aims to investigate the performance of large language models (LLMs) on different reasoning tasks.
My findings indicate that LLMs excel at analogical and moral reasoning, yet struggle to perform as proficiently on spatial reasoning tasks.
arXiv Detail & Related papers (2023-03-22T22:53:44Z) - Imagination-Augmented Natural Language Understanding [71.51687221130925]
We introduce an Imagination-Augmented Cross-modal Encoder (iACE) to solve natural language understanding tasks.
iACE enables visual imagination with external knowledge transferred from the powerful generative and pre-trained vision-and-language models.
Experiments on GLUE and SWAG show that iACE achieves consistent improvement over visually-supervised pre-trained models.
arXiv Detail & Related papers (2022-04-18T19:39:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.