Stop Rewarding Hallucinated Steps: Faithfulness-Aware Step-Level Reinforcement Learning for Small Reasoning Models
- URL: http://arxiv.org/abs/2602.05897v1
- Date: Thu, 05 Feb 2026 17:15:12 GMT
- Title: Stop Rewarding Hallucinated Steps: Faithfulness-Aware Step-Level Reinforcement Learning for Small Reasoning Models
- Authors: Shuo Nie, Hexuan Deng, Chao Wang, Ruiyu Fang, Xuebo Liu, Shuangyong Song, Yu Li, Min Zhang, Xuelong Li
- Abstract summary: Small reasoning models (SRMs) are prone to hallucinations, especially in intermediate reasoning steps. Existing mitigation methods based on online reinforcement learning rely on outcome-based rewards or coarse-grained chain-of-thought evaluation. We propose Faithfulness-Aware Step-Level Reinforcement Learning (FaithRL), introducing step-level supervision via explicit faithfulness rewards from a process reward model.
- Score: 59.6715047267181
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As large language models become smaller and more efficient, small reasoning models (SRMs) are crucial for enabling chain-of-thought (CoT) reasoning in resource-constrained settings. However, they are prone to faithfulness hallucinations, especially in intermediate reasoning steps. Existing mitigation methods based on online reinforcement learning rely on outcome-based rewards or coarse-grained CoT evaluation, which can inadvertently reinforce unfaithful reasoning when the final answer is correct. To address these limitations, we propose Faithfulness-Aware Step-Level Reinforcement Learning (FaithRL), introducing step-level supervision via explicit faithfulness rewards from a process reward model, together with an implicit truncated resampling strategy that generates contrastive signals from faithful prefixes. Experiments across multiple SRMs and Open-Book QA benchmarks demonstrate that FaithRL consistently reduces hallucinations in both the CoT and final answers, leading to more faithful and reliable reasoning. Code is available at https://github.com/Easy195/FaithRL.
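The abstract names two components: explicit step-level faithfulness rewards from a process reward model (PRM), and an implicit truncated-resampling strategy that regenerates from faithful prefixes to obtain contrastive signals. The sketch below illustrates how such a reward mix and resampling step could look; the interfaces (`policy.generate`, `prm.score_step`), the additive reward blend, and the `faith_weight`/`threshold` parameters are hypothetical assumptions, not the paper's released implementation.

```python
# Minimal sketch of step-level faithfulness rewards plus truncated resampling,
# assuming hypothetical interfaces (policy.generate, prm.score_step).
# Illustration only; not the authors' code.

from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    steps: List[str]           # CoT split into reasoning steps
    answer: str                # final answer extracted from the rollout
    step_rewards: List[float]  # per-step faithfulness scores from the PRM
    outcome_reward: float      # 1.0 if the final answer is correct, else 0.0


def score_rollout(question, steps, answer, prm, reference_answer,
                  faith_weight=0.5):
    """Blend outcome correctness with per-step faithfulness (hypothetical mix)."""
    step_rewards = [prm.score_step(question, steps[:i + 1])
                    for i in range(len(steps))]
    outcome = float(answer.strip() == reference_answer.strip())
    # Combine the sparse outcome reward with the mean step-level faithfulness signal.
    blended = (1 - faith_weight) * outcome + faith_weight * (
        sum(step_rewards) / max(len(step_rewards), 1))
    return Rollout(steps, answer, step_rewards, outcome), blended


def truncated_resample(question, rollout, policy, threshold=0.5):
    """Resample from the longest faithful prefix to get a contrastive rollout."""
    # Find the first step whose faithfulness falls below the threshold.
    cut = next((i for i, r in enumerate(rollout.step_rewards) if r < threshold),
               len(rollout.steps))
    faithful_prefix = rollout.steps[:cut]
    # Continue generation from the faithful prefix; the (original, resampled)
    # pair then serves as an implicit contrastive signal for the policy update.
    continuation = policy.generate(question, prefix=faithful_prefix)
    return faithful_prefix, continuation
```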
Related papers
- Balancing Faithfulness and Performance in Reasoning via Multi-Listener Soft Execution [79.98699884805636]
Reasoning Execution by Multiple Listeners (REMUL) is a multi-party reinforcement learning approach. REMUL builds on the hypothesis that reasoning traces which other parties can follow will be more faithful. Speakers are rewarded for producing reasoning that is clear to listeners.
arXiv Detail & Related papers (2026-02-18T02:55:55Z) - Are Reasoning LLMs Robust to Interventions on Their Chain-of-Thought? [79.86483056611105]
Reasoning LLMs generate step-by-step chains of thought before giving an answer. How robust are these reasoning traces to disruptions that occur within them? We introduce a controlled evaluation framework that perturbs a model's own CoT at fixed timesteps.
arXiv Detail & Related papers (2026-02-07T10:02:58Z) - Learning to Reason Faithfully through Step-Level Faithfulness Maximization [35.23601691819328]
Reinforcement Learning with Verifiable Rewards (RLVR) has markedly improved the performance of Large Language Models (LLMs). Most RLVR pipelines rely on sparse outcome-based rewards, providing little supervision over intermediate steps. We propose FaithRL, a general reinforcement learning framework that directly optimizes reasoning faithfulness.
arXiv Detail & Related papers (2026-02-03T13:28:17Z) - P2S: Probabilistic Process Supervision for General-Domain Reasoning Question Answering [51.04492568024515]
We introduce Probabilistic Process Supervision (P2S), a novel framework for fine-grained process rewards. P2S provides fine-grained process rewards without requiring a separate reward model or human-annotated reasoning steps.
arXiv Detail & Related papers (2026-01-28T14:35:20Z) - Step Potential Advantage Estimation: Harnessing Intermediate Confidence and Correctness for Efficient Mathematical Reasoning [25.562101968892833]
Reinforcement Learning with Verifiable Rewards (RLVR) elicits long chain-of-thought reasoning in large language models (LLMs). Existing approaches improve RLVR via token-level entropy or sequence-level length control, but lack a semantically grounded, step-level measure of reasoning progress. We propose Step Potential Advantage Estimation (SPAE), a fine-grained credit assignment method that amplifies potential gains, penalizes potential drops, and applies a penalty after the potential saturates to encourage timely termination.
arXiv Detail & Related papers (2026-01-07T11:36:01Z) - Journey Before Destination: On the importance of Visual Faithfulness in Slow Thinking [11.763473690046721]
Reasoning-augmented vision language models generate explicit chains of thought that promise greater capability and transparency. Models may reach correct answers via visually unfaithful intermediate steps, or reason faithfully yet fail on the final prediction. We introduce the visual faithfulness of reasoning chains as a distinct evaluation dimension, focusing on whether the perception steps of a reasoning chain are grounded in the image.
arXiv Detail & Related papers (2025-12-13T07:04:42Z) - Efficient Reasoning via Reward Model [24.105621725286497]
Reinforcement learning with verifiable rewards (RLVR) has been shown to enhance the reasoning capabilities of large language models (LLMs). LRMs such as DeepSeek-R1 and OpenAI o1 often generate verbose responses containing redundant or irrelevant reasoning steps, a phenomenon known as overthinking. We introduce a novel reward formulation named Conciseness Reward Function (CRF) with an explicit dependency between the outcome reward and the conciseness score.
arXiv Detail & Related papers (2025-11-12T09:51:07Z) - Beyond Token Length: Step Pruner for Efficient and Accurate Reasoning in Large Language Models [26.88030285500965]
Large Reasoning Models (LRMs) demonstrate strong performance on complex tasks but often suffer from excessive verbosity, known as "overthinking". We introduce Step Pruner (SP), an RL framework that steers LRMs toward more efficient reasoning by favoring compact reasoning steps. Our step-aware reward function prioritizes correctness while imposing penalties for redundant steps, and withholds rewards for incorrect responses to prevent the reinforcement of erroneous reasoning (a minimal sketch of such a step-aware reward appears after this list).
arXiv Detail & Related papers (2025-10-04T13:24:26Z) - Reinforced Latent Reasoning for LLM-based Recommendation [92.56166822197919]
Large Language Models (LLMs) have demonstrated impressive reasoning capabilities in complex problem-solving tasks. Existing methods typically rely on fine-tuning with explicit chain-of-thought (CoT) data. In this work, we explore an alternative approach that shifts from explicit CoT reasoning to compact, information-dense latent reasoning.
arXiv Detail & Related papers (2025-05-25T11:03:45Z) - ConCISE: Confidence-guided Compression in Step-by-step Efficient Reasoning [64.93140713419561]
Large Reasoning Models (LRMs) perform strongly in complex reasoning tasks via Chain-of-Thought (CoT) prompting, but often suffer from verbose outputs. Existing fine-tuning-based compression methods either apply post-hoc pruning, risking disruption to reasoning coherence, or rely on sampling-based selection. We introduce ConCISE, a framework designed to generate concise reasoning chains, integrating Confidence Injection to boost reasoning confidence, and Early Stopping to terminate reasoning when confidence is sufficient.
arXiv Detail & Related papers (2025-05-08T01:40:40Z)
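Several entries above describe step-aware reward shaping (for example, the Step Pruner summary, which prioritizes correctness, penalizes redundant steps, and withholds reward for incorrect responses). The following is a minimal, hypothetical sketch of such a reward; the function name, constants, and redundancy test are illustrative assumptions, not any paper's released code.

```python
# Hypothetical sketch of a step-aware reward in the spirit of the Step Pruner
# summary above: correctness dominates, redundant steps are penalized, and
# incorrect responses receive no reward at all.

def step_aware_reward(steps, answer_correct, is_redundant,
                      correct_reward=1.0, redundancy_penalty=0.1):
    """Return a scalar reward for one rollout split into reasoning steps."""
    if not answer_correct:
        return 0.0  # withhold reward so erroneous reasoning is never reinforced
    redundant = sum(1 for s in steps if is_redundant(s))
    return max(correct_reward - redundancy_penalty * redundant, 0.0)


# Example usage with a trivial, stand-in redundancy check.
reward = step_aware_reward(
    steps=["Compute 2 + 3 = 5.", "So the answer is 5.", "So the answer is 5."],
    answer_correct=True,
    is_redundant=lambda s: s == "So the answer is 5.",
)
```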