Landscape of Thoughts: Visualizing the Reasoning Process of Large Language Models
- URL: http://arxiv.org/abs/2503.22165v3
- Date: Fri, 31 Oct 2025 12:56:41 GMT
- Title: Landscape of Thoughts: Visualizing the Reasoning Process of Large Language Models
- Authors: Zhanke Zhou, Zhaocheng Zhu, Xuan Li, Mikhail Galkin, Xiao Feng, Sanmi Koyejo, Jian Tang, Bo Han,
- Abstract summary: We introduce landscape of thoughts (LoT) to inspect the reasoning trajectories with certain reasoning methods on any multi-choice dataset.<n>LoT distinguishes between strong and weak models, correct and incorrect answers, as well as different reasoning tasks.<n>We showcase this advantage by adapting LoT to a lightweight verifier that evaluates the correctness of trajectories.
- Score: 58.64449765678416
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Numerous applications of large language models (LLMs) rely on their ability to perform step-by-step reasoning. However, the reasoning behavior of LLMs remains poorly understood, posing challenges to research, development, and safety. To address this gap, we introduce landscape of thoughts (LoT), the first landscape visualization tool to inspect the reasoning trajectories with certain reasoning methods on any multi-choice dataset. We represent the textual states in a trajectory as numerical features that quantify the states' distances to the answer choices. These features are then visualized in two-dimensional plots using t-SNE. Qualitative and quantitative analysis with the landscape of thoughts effectively distinguishes between strong and weak models, correct and incorrect answers, as well as different reasoning tasks. It also uncovers undesirable reasoning patterns, such as low consistency and high uncertainty. Additionally, users can adapt LoT to a model that predicts the property they observe. We showcase this advantage by adapting LoT to a lightweight verifier that evaluates the correctness of trajectories. Empirically, this verifier boosts the reasoning accuracy and the test-time scaling effect. The code is publicly available at: https://github.com/tmlr-group/landscape-of-thoughts.
Related papers
- Latent Chain-of-Thought as Planning: Decoupling Reasoning from Verbalization [9.193078163792427]
Chain-of-Thought (CoT) empowers Large Language Models (LLMs) to tackle complex problems.<n>Recent latent reasoning approaches attempt to optimize efficiency by performing reasoning within continuous hidden states.<n>We introduce PLaT, a framework that reformulates latent reasoning as planning by fundamentally decouple reasoning from verbalization.
arXiv Detail & Related papers (2026-01-29T07:38:18Z) - Deep But Reliable: Advancing Multi-turn Reasoning for Thinking with Images [53.373427633330515]
We propose DRIM, a model that enables deep but reliable multi-turn reasoning when thinking with images in its multimodal CoT.<n>Based on a high-resolution image dataset, we construct high-difficulty and verifiable visual question-answer pairs.<n>In the SFT stage, we collect tool trajectories as cold-start data, guiding a multi-turn reasoning pattern.<n>In the RL stage, we introduce redundancy-penalized policy optimization, which incentivizes the model to develop a self-reflective reasoning pattern.
arXiv Detail & Related papers (2025-12-19T07:44:43Z) - Visual Reasoning Tracer: Object-Level Grounded Reasoning Benchmark [112.46338388724116]
We introduce the Visual Reasoning Tracer (VRT) task, which requires models to not only localize the target object but also explicitly predict the intermediate objects that form the reasoning path.<n>To advance research in this area, we contribute: (1) VRT-Bench, a human-annotated benchmark for evaluating visual reasoning; (2) a new metric for assessing the quality of reasoning traces; and (3) VRT-80k, a large-scale dataset for reasoning model training.
arXiv Detail & Related papers (2025-12-04T18:55:34Z) - Do Reasoning Vision-Language Models Inversely Scale in Test-Time Compute? A Distractor-centric Empirical Analysis [17.897469162097085]
We investigate how visual distractors affect test-time scaling in vision-language models.<n>We find that visual distractors differ fundamentally from textual ones.<n>We propose a simple prompting strategy to mitigate bias-driven predictions in reasoning models.
arXiv Detail & Related papers (2025-11-26T13:49:08Z) - ERGO: Efficient High-Resolution Visual Understanding for Vision-Language Models [11.263321053154364]
ERGO is a reasoning-driven perception-leveraging multimodal context to determine where to focus.<n>We develop simple yet effective reward components in a reinforcement learning framework for coarse-to-fine perception.<n>Our approach delivers higher accuracy than the original model and competitive methods, with greater efficiency.
arXiv Detail & Related papers (2025-09-26T07:15:19Z) - Seeing is Not Reasoning: MVPBench for Graph-based Evaluation of Multi-path Visual Physical CoT [24.085953089267772]
We show how OpenAI o3 and GPT-4o fail to grasp basic physical laws, spatial interactions, and causal effects in complex scenes.<n>We introduce MVPBench, a benchmark designed to rigorously evaluate visual physical reasoning through the lens of visual chain-of-thought (CoT)<n> Experimental results reveal a concerning trend: even cutting-edge MLLMs exhibit poor visual reasoning accuracy and weak image-text alignment in physical domains.
arXiv Detail & Related papers (2025-05-30T03:48:59Z) - Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual Reasoning from Transit Maps [56.76175383189738]
We introduce ReasonMap, a benchmark designed to assess the fine-grained visual understanding and spatial reasoning abilities of MLLMs.<n>ReasonMap encompasses high-resolution transit maps from 30 cities across 13 countries and includes 1,008 question-answer pairs spanning two question types and three templates.<n> Comprehensive evaluations of 15 popular MLLMs, including both base and reasoning variants, reveal a counterintuitive pattern.
arXiv Detail & Related papers (2025-05-24T12:33:52Z) - Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models [67.87579664988199]
TON is a two-stage training strategy for vision-language models (VLMs)<n>It introduces a think-or-not format that serves as a cold start for selective reasoning.<n>TON can reduce the completion length by up to 90% compared to vanilla GRPO.
arXiv Detail & Related papers (2025-05-22T16:13:29Z) - LongPerceptualThoughts: Distilling System-2 Reasoning for System-1 Perception [105.78609483419115]
We introduce LongPerceptualThoughts, a new synthetic dataset with 30K long-thought traces for perceptual tasks.
We propose a novel three-stage data synthesis framework that first synthesizes verifiable multiple-choice questions.
We demonstrate notable improvements over existing visual reasoning data-generation methods.
arXiv Detail & Related papers (2025-04-21T18:10:38Z) - Do Larger Language Models Imply Better Reasoning? A Pretraining Scaling Law for Reasoning [89.17086632436363]
We introduce a synthetic multihop reasoning environment designed to replicate the structure and distribution of real-world large-scale knowledge graphs.
Our reasoning task involves completing missing edges in the graph, which requires advanced multi-hop reasoning and mimics real-world reasoning scenarios.
To predict the optimal model size for a specific knowledge graph, we find an empirical scaling that linearly maps the knowledge graph search entropy to the optimal model size.
arXiv Detail & Related papers (2025-04-04T17:57:22Z) - Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning [53.790502697674754]
We propose Take-along Visual Conditioning (TVC), a strategy that shifts image input to critical reasoning stages.<n>TVC helps the model retain attention to the visual components throughout the reasoning.<n>Our approach achieves state-of-the-art performance on average across five mathematical reasoning benchmarks.
arXiv Detail & Related papers (2025-03-17T16:45:12Z) - VERIFY: A Benchmark of Visual Explanation and Reasoning for Investigating Multimodal Reasoning Fidelity [34.29409506366145]
VERIFY is a benchmark designed to isolate and rigorously evaluate the visual reasoning capabilities of state-of-the-art MLLMs.<n>Each problem is accompanied by a human-annotated reasoning path, making it the first to provide in-depth evaluation of model decision-making processes.<n>We propose novel metrics that assess visual reasoning fidelity beyond mere accuracy, highlighting critical imbalances in current model reasoning patterns.
arXiv Detail & Related papers (2025-03-14T16:26:11Z) - Sketch-of-Thought: Efficient LLM Reasoning with Adaptive Cognitive-Inspired Sketching [60.04718679054704]
We introduce Sketch-of-Thought (SoT), a novel prompting framework.<n>It combines cognitive-inspired reasoning paradigms with linguistic constraints to minimize token usage.<n>SoT achieves token reductions of 76% with negligible accuracy impact.
arXiv Detail & Related papers (2025-03-07T06:57:17Z) - Unveiling Reasoning Thresholds in Language Models: Scaling, Fine-Tuning, and Interpretability through Attention Maps [3.8936716676293917]
This study investigates the in-context learning capabilities of various decoder-only transformer-based language models with different model sizes and training data.<n>We identify a critical parameter threshold (1.6 billion), beyond which reasoning performance improves significantly in tasks such as commonsense reasoning in multiple-choice question answering and deductive reasoning.
arXiv Detail & Related papers (2025-02-21T00:48:32Z) - DRIVINGVQA: Analyzing Visual Chain-of-Thought Reasoning of Vision Language Models in Real-World Scenarios with Driving Theory Tests [69.00444996464662]
We present DrivingVQA, a new benchmark derived from driving theory tests to evaluate visual chain-of-thought reasoning in complex real-world scenarios.<n>Our experiments reveal that open-source and proprietary LVLMs struggle with visual chain-of-thought reasoning under zero-shot settings.<n>We investigate training strategies that leverage relevant entities to improve visual reasoning.
arXiv Detail & Related papers (2025-01-08T18:31:16Z) - Take A Step Back: Rethinking the Two Stages in Visual Reasoning [57.16394309170051]
This paper revisits visual reasoning with a two-stage perspective.
It is more efficient to implement symbolization via separated encoders for different data domains while using a shared reasoner.
The proposed two-stage framework achieves impressive generalization ability on various visual reasoning tasks.
arXiv Detail & Related papers (2024-07-29T02:56:19Z) - Self-Contradictory Reasoning Evaluation and Detection [31.452161594896978]
We investigate self-contradictory (Self-Contra) reasoning, where the model reasoning does not support its answers.
We find that LLMs often contradict themselves in reasoning tasks involving contextual information understanding or commonsense.
We find that GPT-4 can detect Self-Contra with a 52.2% F1 score, much lower compared to 66.7% for humans.
arXiv Detail & Related papers (2023-11-16T06:22:17Z) - Chain of Images for Intuitively Reasoning [23.692458865558486]
We present a Chain of Images (CoI) approach to convert complex language reasoning problems to simple pattern recognition.
We have developed a CoI evaluation dataset encompassing 15 distinct domains where images can intuitively aid problem-solving.
In supporting our CoI reasoning, we introduce a symbolic multimodal large language model (SyMLLM) that generates images strictly based on language instructions.
arXiv Detail & Related papers (2023-11-09T11:14:51Z) - Minding Language Models' (Lack of) Theory of Mind: A Plug-and-Play
Multi-Character Belief Tracker [72.09076317574238]
ToM is a plug-and-play approach to investigate the belief states of characters in reading comprehension.
We show that ToM enhances off-the-shelf neural network theory mind in a zero-order setting while showing robust out-of-distribution performance compared to supervised baselines.
arXiv Detail & Related papers (2023-06-01T17:24:35Z) - Faithful Reasoning Using Large Language Models [12.132449274592668]
We show how LMs can be made to perform faithful multi-step reasoning via a process whose causal structure mirrors the underlying logical structure of the problem.
Our approach works by chaining together reasoning steps, where each step results from calls to two fine-tuned LMs.
We demonstrate the effectiveness of our model on multi-step logical deduction and scientific question-answering, showing that it outperforms baselines on final answer accuracy.
arXiv Detail & Related papers (2022-08-30T13:44:41Z) - Object-Centric Diagnosis of Visual Reasoning [118.36750454795428]
This paper presents a systematical object-centric diagnosis of visual reasoning on grounding and robustness.
We develop a diagnostic model, namely Graph Reasoning Machine.
Our model replaces purely symbolic visual representation with probabilistic scene graph and then applies teacher-forcing training for the visual reasoning module.
arXiv Detail & Related papers (2020-12-21T18:59:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.