Balancing Faithfulness and Performance in Reasoning via Multi-Listener Soft Execution
- URL: http://arxiv.org/abs/2602.16154v1
- Date: Wed, 18 Feb 2026 02:55:55 GMT
- Title: Balancing Faithfulness and Performance in Reasoning via Multi-Listener Soft Execution
- Authors: Nithin Sivakumaran, Shoubin Yu, Hyunji Lee, Yue Zhang, Ali Payani, Mohit Bansal, Elias Stengel-Eskin
- Abstract summary: Reasoning Execution by Multiple Listeners (REMUL) is a multi-party reinforcement learning approach. REMUL builds on the hypothesis that reasoning traces which other parties can follow will be more faithful. Speakers are rewarded for producing reasoning that is clear to listeners.
- Score: 79.98699884805636
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Chain-of-thought (CoT) reasoning sometimes fails to faithfully reflect the true computation of a large language model (LLM), hampering its utility in explaining how LLMs arrive at their answers. Moreover, optimizing for faithfulness and interpretability in reasoning often degrades task performance. To address this tradeoff and improve CoT faithfulness, we propose Reasoning Execution by Multiple Listeners (REMUL), a multi-party reinforcement learning approach. REMUL builds on the hypothesis that reasoning traces which other parties can follow will be more faithful. A speaker model generates a reasoning trace, which is truncated and passed to a pool of listener models who "execute" the trace, continuing the trace to an answer. Speakers are rewarded for producing reasoning that is clear to listeners, with additional correctness regularization via masked supervised finetuning to counter the tradeoff between faithfulness and performance. On multiple reasoning benchmarks (BIG-Bench Extra Hard, MuSR, ZebraLogicBench, and FOLIO), REMUL consistently and substantially improves three measures of faithfulness -- hint attribution, early answering area over the curve (AOC), and mistake injection AOC -- while also improving accuracy. Our analysis finds that these gains are robust across training domains, translate to legibility gains, and are associated with shorter and more direct CoTs.
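As a concrete illustration of the reward described above, the following is a minimal sketch of the listener-execution loop: truncate the speaker's trace, let each listener continue it to an answer, and reward the speaker by listener agreement with the gold answer. The truncation fraction, the stub listeners, and the exact reward form are illustrative assumptions, not the paper's implementation (which also adds masked SFT regularization).
```python
# Sketch of a REMUL-style listener-execution reward. Truncation fraction,
# stub listeners, and reward form are assumptions for illustration only.
from typing import Callable, List

def remul_reward(
    trace: str,                              # speaker's chain-of-thought
    gold_answer: str,
    listeners: List[Callable[[str], str]],   # each continues a partial trace to an answer
    truncate_frac: float = 0.5,              # assumed truncation point
) -> float:
    """Reward the speaker by how reliably listeners can 'execute' its trace."""
    # Truncate so listeners must continue the reasoning, not copy the answer.
    steps = trace.split("\n")
    cut = max(1, int(len(steps) * truncate_frac))
    partial = "\n".join(steps[:cut])
    # Fraction of listeners whose continuation lands on the gold answer.
    hits = sum(listener(partial) == gold_answer for listener in listeners)
    return hits / len(listeners)

# Toy usage with a stub "listener" that answers from the partial trace.
if __name__ == "__main__":
    trace = "2 apples + 3 apples\nso total = 5\ncheck: 2 + 3 = 5\nAnswer: 5"
    stub = lambda partial: "5" if "5" in partial else "unknown"
    print(remul_reward(trace, "5", [stub] * 4))  # -> 1.0
```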
Related papers
- On Robustness and Chain-of-Thought Consistency of RL-Finetuned VLMs [15.301640007799735]
We show that simple, controlled textual perturbations -- misleading captions or incorrect chain-of-thought (CoT) traces -- cause substantial drops in robustness and confidence. To better understand these vulnerabilities, we analyze RL fine-tuning dynamics and uncover an accuracy-faithfulness trade-off.
arXiv Detail & Related papers (2026-02-13T01:12:00Z)
- Are Reasoning LLMs Robust to Interventions on Their Chain-of-Thought? [79.86483056611105]
Reasoning LLMs generate step-by-step chains of thought before giving an answer. How robust are these reasoning traces to disruptions that occur within them? We introduce a controlled evaluation framework that perturbs a model's own CoT at fixed timesteps.
arXiv Detail & Related papers (2026-02-07T10:02:58Z)
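A minimal sketch of what such a fixed-timestep intervention could look like; the continuation function, the injected mistake, and the recovery metric below are assumptions for illustration, not the paper's protocol.
```python
# Sketch of perturbing a model's own CoT at a fixed timestep and measuring
# whether the final answer survives. All names here are illustrative.
from typing import Callable, List

def answer_after_intervention(
    cot_steps: List[str],
    t: int,                              # fixed timestep at which to intervene
    mistake: str,                        # disruption spliced into the trace
    continue_fn: Callable[[str], str],   # model continues a partial trace to an answer
) -> str:
    """Replace everything from step t with a mistake, then let the model finish."""
    perturbed = cot_steps[:t] + [mistake]
    return continue_fn("\n".join(perturbed))

def recovery_rate(base_answer: str, perturbed_answers: List[str]) -> float:
    # Fraction of intervened runs that still reach the original answer.
    return sum(a == base_answer for a in perturbed_answers) / len(perturbed_answers)
```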
- Stop Rewarding Hallucinated Steps: Faithfulness-Aware Step-Level Reinforcement Learning for Small Reasoning Models [59.6715047267181]
Small reasoning models (SRMs) are prone to hallucinations, especially in intermediate reasoning steps. Existing mitigation methods based on online reinforcement learning rely on outcome-based rewards or coarse-grained chain-of-thought evaluation. We propose Faithfulness-Aware Step-Level Reinforcement Learning (FaithRL), introducing step-level supervision via explicit faithfulness rewards from a process reward model.
arXiv Detail & Related papers (2026-02-05T17:15:12Z)
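A minimal sketch of the reward shape this summary points at: a process reward model scores each intermediate step for faithfulness and the result is mixed with the outcome reward. The PRM stub, the mean aggregation, and the mixing weight are assumptions.
```python
# Sketch of mixing outcome reward with step-level faithfulness scores from a
# process reward model (PRM). Aggregation and weighting are assumptions.
from typing import Callable, List

def faithrl_style_reward(
    steps: List[str],
    outcome_correct: bool,
    prm: Callable[[str], float],   # scores one step's faithfulness in [0, 1]
    alpha: float = 0.5,            # assumed outcome/step mixing weight
) -> float:
    # Mean faithfulness over intermediate steps (0 if the trace is empty).
    faith = sum(prm(s) for s in steps) / len(steps) if steps else 0.0
    return alpha * float(outcome_correct) + (1.0 - alpha) * faith
```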
- Incentivizing Consistent, Effective and Scalable Reasoning Capability in Audio LLMs via Reasoning Process Rewards [24.40159537923851]
We develop a principled method for building robust and scalable reasoning in Audio Large Language Models. We achieve state-of-the-art results on MMAU Test-mini, substantially outperforming Gemini 2.5 Pro and GPT-4o Audio, and near-human-level performance on MMSU reasoning tasks.
arXiv Detail & Related papers (2025-10-23T06:18:10Z)
- From <Answer> to <Think>: Multidimensional Supervision of Reasoning Process for LLM Optimization [62.07990937720985]
Dimension-level Reward Model (DRM) is a new supervision framework for Large Language Models. DRM evaluates the quality of a reasoning process along three fundamental, complementary, and interpretable dimensions. Experimental results show that DRM provides effective supervision signals, guides the optimization of LLMs, and enhances their reasoning ability.
arXiv Detail & Related papers (2025-10-13T14:29:15Z)
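The summary does not name DRM's three dimensions, so the sketch below uses purely hypothetical placeholders; only the pattern of scoring one trace along several interpretable axes and aggregating is the point.
```python
# Sketch of dimension-level supervision: score one reasoning trace along
# several axes and combine. The dimension names are hypothetical placeholders.
from typing import Callable, Dict

def dimension_level_reward(
    trace: str,
    scorers: Dict[str, Callable[[str], float]],  # one scorer per dimension
    weights: Dict[str, float],                   # assumed linear aggregation
) -> float:
    return sum(weights[name] * score(trace) for name, score in scorers.items())

# Hypothetical dimensions, purely for illustration.
scorers = {
    "coherence": lambda t: 1.0 if "therefore" in t else 0.5,
    "groundedness": lambda t: 0.8,
    "completeness": lambda t: min(1.0, len(t.split("\n")) / 5),
}
weights = {"coherence": 0.4, "groundedness": 0.3, "completeness": 0.3}
```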
- FRIT: Using Causal Importance to Improve Chain-of-Thought Faithfulness [7.721663297811698]
Chain-of-thought (CoT) reasoning has emerged as a powerful tool for improving large language model performance on complex tasks. Recent work shows that reasoning steps often fail to causally influence the final answer, creating brittle and untrustworthy outputs. We introduce Faithful Reasoning via Intervention Training (FRIT), a scalable alignment method that trains models to produce causally consistent reasoning.
arXiv Detail & Related papers (2025-09-10T07:07:17Z)
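A minimal sketch of the causal-importance check underlying this idea: intervene on one step and see whether the final answer changes; a step whose corruption never moves the answer is not causally load-bearing. The corruption and answer functions are illustrative stand-ins.
```python
# Sketch of testing whether step i causally influences the final answer.
# The corruption and answer functions are illustrative stand-ins.
from typing import Callable, List

def is_causally_important(
    steps: List[str],
    i: int,
    corrupt: Callable[[str], str],     # e.g. negate or scramble step i
    answer_fn: Callable[[str], str],   # maps a full trace to a final answer
) -> bool:
    base = answer_fn("\n".join(steps))
    intervened = steps[:i] + [corrupt(steps[i])] + steps[i + 1:]
    # If the answer flips under intervention, the step mattered causally.
    return answer_fn("\n".join(intervened)) != base
```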
- CoLD: Counterfactually-Guided Length Debiasing for Process Reward Models [29.95434387343843]
We propose a unified framework that mitigates length bias through three components. CoLD consistently reduces reward-length correlation, improves accuracy in step selection, and encourages more concise, logically valid reasoning.
arXiv Detail & Related papers (2025-07-21T15:07:59Z)
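The three components are not spelled out in this summary; as one assumed stand-in for the effect CoLD targets, the sketch below removes the linear reward-length trend and keeps the residual, which by construction drives the reward-length correlation toward zero.
```python
# Sketch of length debiasing by residualizing rewards against length.
# This is an assumed stand-in, not CoLD's actual three-component method.
from typing import List

def length_debiased(rewards: List[float], lengths: List[int]) -> List[float]:
    n = len(rewards)
    mean_x = sum(lengths) / n
    mean_y = sum(rewards) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(lengths, rewards))
    var = sum((x - mean_x) ** 2 for x in lengths) or 1.0
    slope = cov / var
    # Residual after removing the fitted linear length trend.
    return [y - mean_y - slope * (x - mean_x) for x, y in zip(lengths, rewards)]
```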
- Consistent Paths Lead to Truth: Self-Rewarding Reinforcement Learning for LLM Reasoning [87.7836502955847]
We propose a novel self-rewarding reinforcement learning framework to enhance Large Language Model (LLM) reasoning. Our key insight is that correct responses often exhibit consistent trajectory patterns in terms of model likelihood. We introduce CoVo, an intrinsic reward mechanism that integrates Consistency and Volatility via a robust vector-space aggregation strategy.
arXiv Detail & Related papers (2025-06-10T12:40:39Z)
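A minimal sketch in the spirit of the consistency/volatility signal: compare a sampled trajectory's per-step likelihood profile to a robust aggregate of other samples, and penalize erratic profiles. The cosine/variance choices and their combination are assumptions, not CoVo's actual aggregation.
```python
# Sketch of a consistency-minus-volatility intrinsic reward over per-step
# likelihood profiles (assumed equal length). Details are illustrative.
import math
from statistics import median, pvariance
from typing import List

def covo_style_reward(traj: List[float], others: List[List[float]]) -> float:
    # Robust aggregate profile: elementwise median over the other samples.
    agg = [median(col) for col in zip(*others)]
    dot = sum(a * b for a, b in zip(traj, agg))
    norm = math.sqrt(sum(a * a for a in traj)) * math.sqrt(sum(b * b for b in agg))
    consistency = dot / norm if norm else 0.0   # cosine similarity to aggregate
    volatility = pvariance(traj)                # spread of step likelihoods
    return consistency - volatility             # assumed combination
```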
- ConCISE: Confidence-guided Compression in Step-by-step Efficient Reasoning [64.93140713419561]
Large Reasoning Models (LRMs) perform strongly on complex reasoning tasks via Chain-of-Thought (CoT) prompting, but often suffer from verbose outputs. Existing fine-tuning-based compression methods either operate via post-hoc pruning, risking disruption to reasoning coherence, or rely on sampling-based selection. We introduce ConCISE, a framework designed to generate concise reasoning chains, integrating Confidence Injection to boost reasoning confidence and Early Stopping to terminate reasoning when confidence is sufficient.
arXiv Detail & Related papers (2025-05-08T01:40:40Z)
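A minimal sketch of the early-stopping half of ConCISE as summarized above: keep proposing steps and stop once confidence clears a threshold. The step generator, confidence estimator, and threshold are placeholders.
```python
# Sketch of confidence-guided early stopping for concise CoT generation.
# Step generator, confidence estimator, and threshold are placeholders.
from typing import Callable, List

def generate_concise_cot(
    next_step: Callable[[List[str]], str],      # proposes the next reasoning step
    confidence: Callable[[List[str]], float],   # confidence in the trace so far
    threshold: float = 0.9,                     # assumed stopping threshold
    max_steps: int = 32,
) -> List[str]:
    steps: List[str] = []
    for _ in range(max_steps):
        steps.append(next_step(steps))
        if confidence(steps) >= threshold:      # stop early once confident
            break
    return steps
```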