TRACE: A Framework for Analyzing and Enhancing Stepwise Reasoning in Vision-Language Models
- URL: http://arxiv.org/abs/2512.05943v3
- Date: Thu, 11 Dec 2025 19:26:00 GMT
- Title: TRACE: A Framework for Analyzing and Enhancing Stepwise Reasoning in Vision-Language Models
- Authors: Shima Imani, Seungwhan Moon, Lambert Mathias, Lu Zhang, Babak Damavandi,
- Abstract summary: We introduce TRACE, a framework for Transparent Reasoning And Consistency Evaluation.<n>At its core, TRACEleverages Auxiliary Reasoning Sets to decompose complex problems.<n>Our experiments show that consistency across ARS correlates with final-answer correctness.<n>TRACE defines confidence regions that distinguish reliable from unreliable reasoning paths.
- Score: 9.607579442309639
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Reliable mathematical and scientific reasoning remains an open challenge for large vision-language models. Standard final-answer evaluation often masks reasoning errors, allowing silent failures to persist. To address this gap, we introduce TRACE, a framework for Transparent Reasoning And Consistency Evaluation that diagnoses reasoning trajectories rather than only end results. At its core, TRACE leverages Auxiliary Reasoning Sets, compact sub question answer pairs that decompose complex problems, evaluate intermediate steps through consistency-based metrics, and expose failures overlooked by standard evaluation. Our experiments show that consistency across ARS correlates with final-answer correctness and helps pinpoint the reasoning steps where failures arise, offering actionable signals for model improvement. Furthermore, TRACE defines confidence regions that distinguish reliable from unreliable reasoning paths, supporting effective filtering, debugging, and model refinement.
Related papers
- Balancing Faithfulness and Performance in Reasoning via Multi-Listener Soft Execution [79.98699884805636]
Reasoning Execution by Multiple Listeners (REMUL) is a multi-party reinforcement learning approach.<n>REMUL builds on the hypothesis that reasoning traces which other parties can follow will be more faithful.<n>Speakers are rewarded for producing reasoning that is clear to listeners.
arXiv Detail & Related papers (2026-02-18T02:55:55Z) - Reward Modeling for Reinforcement Learning-Based LLM Reasoning: Design, Challenges, and Evaluation [46.38008143057758]
Large Language Models (LLMs) demonstrate transformative potential, yet their reasoning remains inconsistent and unreliable.<n>This work argues that reward modeling is not merely an implementation detail but a central architect of reasoning alignment.<n>Within this framework, we present a taxonomy of reward mechanisms, analyze reward hacking as a pervasive failure mode, and examine how reward signals unify challenges.
arXiv Detail & Related papers (2026-02-10T00:45:24Z) - CoG: Controllable Graph Reasoning via Relational Blueprints and Failure-Aware Refinement over Knowledge Graphs [53.199517625701475]
CoG is a training-free framework inspired by Dual-Process Theory that mimics the interplay between intuition and deliberation.<n>CoG significantly outperforms state-of-the-art approaches in both accuracy and efficiency.
arXiv Detail & Related papers (2026-01-16T07:27:40Z) - Adversarial Yet Cooperative: Multi-Perspective Reasoning in Retrieved-Augmented Language Models [72.4149653187766]
We propose a Reasoner-Verifier framework named Adrialversa Reasoning RAG (ARR)<n>The Reasoner and Verifier engage in reasoning on retrieved evidence and critiquing each other's logic while being guided by process-aware advantage.<n> Experiments on multiple benchmarks demonstrate the effectiveness of our method.
arXiv Detail & Related papers (2026-01-08T06:57:03Z) - Reasoning Relay: Evaluating Stability and Interchangeability of Large Language Models in Mathematical Reasoning [8.01259760303241]
We investigate whether a partially completed reasoning chain can be reliably continued by another model.<n>We use token-level log-probability thresholds to truncate reasoning at early, mid, and late stages from our baseline models.<n>Our findings point towards interchangeability as an emerging behavioral property of reasoning models.
arXiv Detail & Related papers (2025-12-16T02:56:44Z) - Journey Before Destination: On the importance of Visual Faithfulness in Slow Thinking [11.763473690046721]
Reasoning-augmented vision language models generate explicit chains of thought that promise greater capability and transparency.<n>Models may reach correct answers via visually unfaithful intermediate steps, or reason faithfully yet fail on the final prediction.<n>We introduce the visual faithfulness of reasoning chains as a distinct evaluation dimension, focusing on whether the perception steps of a reasoning chain are grounded in the image.
arXiv Detail & Related papers (2025-12-13T07:04:42Z) - ReasonBENCH: Benchmarking the (In)Stability of LLM Reasoning [2.1461777157838724]
We introduce ReasonBENCH, the first benchmark designed to quantify the underlying instability in large language models (LLMs) reasoning.<n>Across tasks from different domains, we find that the vast majority of reasoning strategies and models exhibit high instability.<n>We further analyze the impact of prompts, model families, and scale on the trade-off between solve rate and stability.
arXiv Detail & Related papers (2025-12-08T18:26:58Z) - Beware of Reasoning Overconfidence: Pitfalls in the Reasoning Process for Multi-solution Tasks [54.31998314008198]
Large Language Models (LLMs) excel in reasoning tasks requiring a single correct answer, but they perform poorly in multi-solution tasks.<n>We attribute this limitation to textbfreasoning overconfidence: a tendency to express undue certainty in an incomplete solution set.<n>We propose the textbfcognitive-rigidity hypothesis, which posits that overconfidence arises when the reasoning process prematurely converges on a narrow set of thought paths.
arXiv Detail & Related papers (2025-12-01T14:35:06Z) - Incentivizing Consistent, Effective and Scalable Reasoning Capability in Audio LLMs via Reasoning Process Rewards [24.40159537923851]
We develop a principled method for developing robust and scalable reasoning in Audio Large Language Models.<n>We achieve state-of-the-art results on MMAU Test-mini, substantially outperforming Gemini 2.5 Pro and GPT-4o Audio, and near-human-level performance on MMSU reasoning tasks.
arXiv Detail & Related papers (2025-10-23T06:18:10Z) - FaithCoT-Bench: Benchmarking Instance-Level Faithfulness of Chain-of-Thought Reasoning [62.452350134196934]
FaithCoT-Bench is a unified benchmark for instance-level CoT unfaithfulness detection.<n>Our framework formulates unfaithfulness detection as a discriminative decision problem.<n>FaithCoT-Bench sets a solid basis for future research toward more interpretable and trustworthy reasoning in LLMs.
arXiv Detail & Related papers (2025-10-05T05:16:54Z) - A Closer Look at Bias and Chain-of-Thought Faithfulness of Large (Vision) Language Models [58.32070787537946]
Chain-of-thought (CoT) reasoning enhances performance of large language models.<n>We present the first comprehensive study of CoT faithfulness in large vision-language models.
arXiv Detail & Related papers (2025-05-29T18:55:05Z) - ConCISE: Confidence-guided Compression in Step-by-step Efficient Reasoning [64.93140713419561]
Large Reasoning Models (LRMs) perform strongly in complex reasoning tasks via Chain-of-Thought (CoT) prompting, but often suffer from verbose outputs.<n>Existing fine-tuning-based compression methods either operate post-hoc pruning, risking disruption to reasoning coherence, or rely on sampling-based selection.<n>We introduce ConCISE, a framework designed to generate concise reasoning chains, integrating Confidence Injection to boost reasoning confidence, and Early Stopping to terminate reasoning when confidence is sufficient.
arXiv Detail & Related papers (2025-05-08T01:40:40Z) - Retrieval is Not Enough: Enhancing RAG Reasoning through Test-Time Critique and Optimization [58.390885294401066]
Retrieval-augmented generation (RAG) has become a widely adopted paradigm for enabling knowledge-grounded large language models (LLMs)<n>RAG pipelines often fail to ensure that model reasoning remains consistent with the evidence retrieved, leading to factual inconsistencies or unsupported conclusions.<n>We propose AlignRAG, a novel iterative framework grounded in Critique-Driven Alignment (CDA)<n>We introduce AlignRAG-auto, an autonomous variant that dynamically terminates refinement, removing the need to pre-specify the number of critique iterations.
arXiv Detail & Related papers (2025-04-21T04:56:47Z) - Enhancing LLM Reliability via Explicit Knowledge Boundary Modeling [41.19330514054401]
Large language models (LLMs) are prone to hallucination stemming from misaligned self-awareness.<n>We propose the Explicit Knowledge Boundary Modeling framework to integrate fast and slow reasoning systems to harmonize reliability and usability.
arXiv Detail & Related papers (2025-03-04T03:16:02Z) - Calibrating Reasoning in Language Models with Internal Consistency [18.24350001344488]
Large language models (LLMs) have demonstrated impressive capabilities in various reasoning tasks.<n>LLMs often generate text with obvious mistakes and contradictions.<n>In this work, we investigate reasoning in LLMs through the lens of internal representations.
arXiv Detail & Related papers (2024-05-29T02:44:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.