Train for Truth, Keep the Skills: Binary Retrieval-Augmented Reward Mitigates Hallucinations
- URL: http://arxiv.org/abs/2510.17733v1
- Date: Mon, 20 Oct 2025 16:45:43 GMT
- Title: Train for Truth, Keep the Skills: Binary Retrieval-Augmented Reward Mitigates Hallucinations
- Authors: Tong Chen, Akari Asai, Luke Zettlemoyer, Hannaneh Hajishirzi, Faeze Brahman
- Abstract summary: We propose an online reinforcement learning method using a novel binary retrieval-augmented reward (RAR). For open-ended generation, binary RAR achieves a 39.3% reduction in hallucination rates. In short-form question answering, the model learns abstention, strategically outputting "I don't know" when faced with insufficient parametric knowledge.
- Score: 103.16279860448874
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language models often generate factually incorrect information unsupported by their training data, a phenomenon known as extrinsic hallucination. Existing mitigation approaches often degrade performance on open-ended generation and downstream tasks, limiting their practical utility. We propose an online reinforcement learning method using a novel binary retrieval-augmented reward (RAR) to address this tradeoff. Unlike continuous reward schemes, our approach assigns a reward of one only when the model's output is entirely factually correct, and zero otherwise. We evaluate our method on Qwen3 reasoning models across diverse tasks. For open-ended generation, binary RAR achieves a 39.3% reduction in hallucination rates, substantially outperforming both supervised training and continuous-reward RL baselines. In short-form question answering, the model learns calibrated abstention, strategically outputting "I don't know" when faced with insufficient parametric knowledge. This yields 44.4% and 21.7% fewer incorrect answers on PopQA and GPQA, respectively. Crucially, these factuality gains come without performance degradation on instruction following, math, or code, whereas continuous-reward RL, despite improving factuality, induces quality regressions.
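To make the reward design concrete, below is a minimal sketch of how a binary retrieval-augmented reward could be computed, assuming a claim extractor and a retrieval-backed verifier are available. The helper names (`extract_claims`, `retrieve`, `verify_claim`) are hypothetical placeholders, not the authors' actual pipeline.

```python
# Sketch of a binary retrieval-augmented reward (RAR), per the abstract:
# reward is 1.0 only when every factual claim in the output verifies against
# retrieved evidence, and 0.0 otherwise. All helpers below are placeholders.
from typing import List


def extract_claims(response: str) -> List[str]:
    """Split a response into atomic factual claims (placeholder heuristic)."""
    return [s.strip() for s in response.split(".") if s.strip()]


def retrieve(claim: str, k: int = 5) -> List[str]:
    """Fetch top-k passages relevant to a claim (placeholder retriever)."""
    return []


def verify_claim(claim: str, passages: List[str]) -> bool:
    """Return True only if the retrieved passages support the claim (placeholder)."""
    return False


def binary_rar(response: str) -> float:
    """Binary RAR: 1.0 only if every extracted claim is verified, else 0.0."""
    claims = extract_claims(response)
    if not claims:
        return 0.0
    for claim in claims:
        if not verify_claim(claim, retrieve(claim)):
            return 0.0
    return 1.0


def continuous_rar(response: str) -> float:
    """Continuous baseline for contrast: fraction of verified claims."""
    claims = extract_claims(response)
    if not claims:
        return 0.0
    verified = sum(verify_claim(c, retrieve(c)) for c in claims)
    return verified / len(claims)
```

Per the abstract, the continuous variant (fraction of verified claims) also improves factuality but induces quality regressions on other tasks, whereas the all-or-nothing binary reward preserves instruction following, math, and code performance.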
Related papers
- What Does Flow Matching Bring To TD Learning? [28.717975688380488]
Flow matching can be effective for scalar Q-value function estimation in reinforcement learning (RL). We show that its success is not explained by distributional RL, as explicitly modeling return distributions can reduce performance. We argue that the use of integration for reading out values, and dense velocity supervision at each step of this integration process during training, improves TD learning via two mechanisms.
arXiv Detail & Related papers (2026-03-04T17:51:30Z)
- From Verifiable Dot to Reward Chain: Harnessing Verifiable Reference-based Rewards for Reinforcement Learning of Open-ended Generation [52.62655622099456]
We propose reinforcement learning with verifiable reference-based rewards (RLVRR). Instead of checking the final answer, RLVRR extracts an ordered linguistic signal from high-quality references (i.e., a reward chain). In this way, RLVRR decomposes rewards into two dimensions: content, which preserves deterministic core concepts, and style, which evaluates adherence to stylistic properties.
arXiv Detail & Related papers (2026-01-26T14:39:58Z)
- Just-In-Time Reinforcement Learning: Continual Learning in LLM Agents Without Gradient Updates [53.3717573880076]
We introduce Just-In-Time Reinforcement Learning (JitRL), a training-free framework that enables test-time policy optimization without any gradient updates. JitRL maintains a dynamic, non-parametric memory of experiences and retrieves relevant trajectories to estimate action advantages on the fly. Experiments on WebArena and Jericho demonstrate that JitRL establishes a new state-of-the-art among training-free methods.
arXiv Detail & Related papers (2026-01-26T14:16:51Z)
- Rethinking Reasoning Quality in Large Language Models through Enhanced Chain-of-Thought via RL [19.659532349434418]
Reinforcement learning (RL) has recently become the dominant paradigm for strengthening the reasoning abilities of large language models. Yet the rule-based reward functions commonly used on mathematical or programming benchmarks assess only answer format and correctness. We propose Dynamic Reasoning Efficiency Reward (DRER), a plug-and-play RL reward framework that reshapes both reward and advantage signals.
arXiv Detail & Related papers (2025-09-07T11:52:18Z)
- Beyond Binary Rewards: Training LMs to Reason About Their Uncertainty [59.97939500426759]
This paper describes RLCR, an approach to training reasoning models that jointly improves accuracy and confidence estimation. We show that across diverse datasets, RLCR substantially improves calibration with no loss in accuracy. We also demonstrate that verbalized confidence can be leveraged at test time to improve accuracy and calibration.
arXiv Detail & Related papers (2025-07-22T17:56:01Z)
- The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning [43.310209758380886]
Reinforcement learning with verifiable rewards (RLVR) is a promising approach for training language models (LMs). We decompose the learning signal into reinforcing correct responses and penalizing incorrect ones, referred to as Positive and Negative Sample Reinforcement (PSR and NSR); a minimal sketch of this decomposition appears after this list. We show that NSR works by suppressing incorrect generations and redistributing probability mass toward other plausible candidates, guided by the model's prior beliefs.
arXiv Detail & Related papers (2025-06-02T06:10:54Z)
- AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning [50.02117478165099]
We show that large-scale reinforcement learning can significantly enhance the reasoning capabilities of strong, small- and mid-sized models. We propose a simple yet effective approach: first training on math-only prompts, then on code-only prompts.
arXiv Detail & Related papers (2025-05-22T08:50:47Z)
- SR-Reward: Taking The Path More Traveled [8.818066308133108]
We propose a novel method for learning reward functions directly from offline demonstrations. Unlike traditional inverse reinforcement learning (IRL), our approach decouples the reward function from the learner's policy. Our reward function, called SR-Reward, leverages the successor representation (SR) to encode a state based on expected future state visitation under the demonstration policy and transition dynamics.
arXiv Detail & Related papers (2025-01-04T16:21:10Z)
- Solving Offline Reinforcement Learning with Decision Tree Regression [0.0]
This study presents a novel approach to addressing offline reinforcement learning problems by reframing them as regression tasks.
We introduce two distinct frameworks: return-conditioned and return-weighted decision tree policies; a minimal sketch of the return-conditioned variant appears after this list.
Despite the simplification inherent in this reformulated approach to offline RL, our agents demonstrate performance that is at least on par with the established methods.
arXiv Detail & Related papers (2024-01-21T23:50:46Z)
- DisCor: Corrective Feedback in Reinforcement Learning via Distribution Correction [96.90215318875859]
We show that bootstrapping-based Q-learning algorithms do not necessarily benefit from corrective feedback.
We propose a new algorithm, DisCor, which computes an approximation to the optimal data distribution and uses it to re-weight the transitions used for training.
arXiv Detail & Related papers (2020-03-16T16:18:52Z)
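As referenced in the entry on "The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning" above, here is a hedged sketch of the PSR/NSR decomposition: per-rollout policy-gradient weights are split into a positive term for correct responses and a negative term for incorrect ones. The function name and the unit weights are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative decomposition of a verifiable-reward learning signal into
# Positive Sample Reinforcement (PSR) and Negative Sample Reinforcement (NSR).
from typing import List


def psr_nsr_weights(correct: List[bool], use_psr: bool = True, use_nsr: bool = True) -> List[float]:
    """Per-rollout weights for a REINFORCE-style update.

    PSR assigns +1 to correct responses (reinforce them); NSR assigns -1 to
    incorrect ones (suppress them). NSR-only training penalizes wrong
    generations while leaving probability mass to redistribute according to
    the model's prior, per the paper's description.
    """
    weights = []
    for is_correct in correct:
        if is_correct:
            weights.append(1.0 if use_psr else 0.0)
        else:
            weights.append(-1.0 if use_nsr else 0.0)
    return weights


# Example: NSR-only signal over three graded rollouts.
print(psr_nsr_weights([True, False, False], use_psr=False))  # [0.0, -1.0, -1.0]
```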
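Likewise, for "Solving Offline Reinforcement Learning with Decision Tree Regression", the return-conditioned framing can be sketched as follows under assumed toy data: actions are regressed from (state, return-to-go) pairs, and acting means querying the fitted tree with a high target return. Shapes and hyperparameters here are illustrative, not those of the paper.

```python
# Return-conditioned decision tree policy for offline RL, sketched with
# synthetic data; the paper's datasets and tree settings may differ.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Toy offline dataset: states, actions, and returns-to-go.
states = rng.normal(size=(1000, 4))                  # (N, state_dim)
actions = rng.normal(size=(1000, 2))                 # (N, action_dim)
returns_to_go = rng.uniform(0.0, 100.0, (1000, 1))   # (N, 1)

# Regression task: predict the action from [state, return-to-go].
X = np.hstack([states, returns_to_go])
policy = DecisionTreeRegressor(max_depth=8).fit(X, actions)

# Acting: condition on a high target return to request near-expert behavior.
state = rng.normal(size=(1, 4))
target_return = np.array([[95.0]])
action = policy.predict(np.hstack([state, target_return]))
print(action.shape)  # (1, 2)
```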
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.