Honesty over Accuracy: Trustworthy Language Models through Reinforced Hesitation
- URL: http://arxiv.org/abs/2511.11500v1
- Date: Fri, 14 Nov 2025 17:20:45 GMT
- Title: Honesty over Accuracy: Trustworthy Language Models through Reinforced Hesitation
- Authors: Mohamad Amin Mohamadi, Tianhao Wang, Zhiyuan Li
- Abstract summary: We show that modern language models produce confident hallucinations even when wrong answers carry catastrophic consequences. We propose Reinforced Hesitation (RH): a modification to Reinforcement Learning from Verifiable Rewards (RLVR) that uses ternary rewards instead of binary ones.
- Score: 12.503662455234954
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern language models fail a fundamental requirement of trustworthy intelligence: knowing when not to answer. Despite achieving impressive accuracy on benchmarks, these models produce confident hallucinations, even when wrong answers carry catastrophic consequences. Our evaluations on GSM8K, MedQA and GPQA show frontier models almost never abstain despite explicit warnings of severe penalties, suggesting that prompts cannot override training that rewards any answer over no answer. As a remedy, we propose Reinforced Hesitation (RH): a modification to Reinforcement Learning from Verifiable Rewards (RLVR) that uses ternary rewards ($+1$ correct, $0$ abstention, $-\lambda$ error) instead of binary ones. Controlled experiments on logic puzzles reveal that varying $\lambda$ produces distinct models along a Pareto frontier, where each training penalty yields the optimal model for its corresponding risk regime: low penalties produce aggressive answerers, high penalties conservative abstainers. We then introduce two inference strategies that exploit trained abstention as a coordination signal: cascading routes queries through models with decreasing risk tolerance, while self-cascading re-queries the same model on abstention. Both outperform majority voting with lower computational cost. These results establish abstention as a first-class training objective that transforms "I don't know" from failure into a coordination signal, enabling models to earn trust through calibrated honesty about their limits.
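To make the reward structure concrete, the following is a minimal sketch of the ternary RLVR reward described above. The `ABSTAIN` marker, the function name, and the exact-match verifier are illustrative assumptions; the abstract fixes only the reward values ($+1$, $0$, $-\lambda$).

```python
# Minimal sketch of the Reinforced Hesitation (RH) ternary reward.
# ABSTAIN and the exact-match verifier are illustrative assumptions;
# the abstract specifies only the reward values (+1, 0, -lambda).

ABSTAIN = "I don't know"  # hypothetical abstention marker


def ternary_reward(response: str, gold: str, lam: float) -> float:
    """Ternary RLVR reward: +1 correct, 0 abstention, -lam on error."""
    answer = response.strip()
    if answer == ABSTAIN:
        return 0.0   # hesitation is neither rewarded nor punished
    if answer == gold.strip():
        return 1.0   # verified correct answer
    return -lam      # confident error, scaled by the risk penalty


# Larger lam trains a conservative abstainer, smaller lam an aggressive
# answerer -- varying lam traces the Pareto frontier the paper reports.
print(ternary_reward("42", "42", lam=2.0))            # 1.0
print(ternary_reward("I don't know", "42", lam=2.0))  # 0.0
print(ternary_reward("41", "42", lam=2.0))            # -2.0
```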
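The two inference strategies admit a similarly small sketch. The model interface (a callable returning either an answer or the abstention marker), the cascade ordering, and the retry budget are assumptions made here for illustration, not the authors' implementation.

```python
# Sketch of cascading and self-cascading over RH-trained models.
# The callable interface and retry budget are assumptions for illustration.
from typing import Callable, Sequence

ABSTAIN = "I don't know"  # same hypothetical marker as above

Model = Callable[[str], str]  # maps a query to an answer or ABSTAIN


def cascade(models: Sequence[Model], query: str) -> str:
    """Route a query through models ordered by risk tolerance; the first
    non-abstaining model answers, later models handle what earlier ones
    declined."""
    answer = ABSTAIN
    for model in models:
        answer = model(query)
        if answer != ABSTAIN:
            return answer  # first confident answer wins
    return answer          # every model abstained


def self_cascade(model: Model, query: str, max_tries: int = 4) -> str:
    """Re-query the same (stochastic) model while it abstains."""
    answer = ABSTAIN
    for _ in range(max_tries):
        answer = model(query)
        if answer != ABSTAIN:
            return answer
    return answer


# Toy usage: a conservative model that abstains, with an aggressive fallback.
conservative: Model = lambda q: ABSTAIN
aggressive: Model = lambda q: "42"
print(cascade([conservative, aggressive], "What is 6 * 7?"))  # "42"
```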
Related papers
- Stop Rewarding Hallucinated Steps: Faithfulness-Aware Step-Level Reinforcement Learning for Small Reasoning Models [59.6715047267181]
Small reasoning models (SRMs) are prone to hallucinations, especially in intermediate reasoning steps. Existing mitigation methods based on online reinforcement learning rely on outcome-based rewards or coarse-grained chain-of-thought evaluation. We propose Faithfulness-Aware Step-Level Reinforcement Learning (FaithRL), introducing step-level supervision via explicit faithfulness rewards from a process reward model.
arXiv Detail & Related papers (2026-02-05T17:15:12Z) - Decomposed Prompting Does Not Fix Knowledge Gaps, But Helps Models Say "I Don't Know" [47.930782177987446]
Large language models often struggle to recognize their knowledge limits in closed-book question answering, leading to confident hallucinations. We evaluate three task-equivalent prompting regimes: Direct, Assistive, and Incremental, across different model scales and multi-hop QA benchmarks. Because factual knowledge is stable across regimes while hallucinations vary, cross-regime agreement provides a precise signal of internal uncertainty.
arXiv Detail & Related papers (2026-02-04T18:39:58Z) - Rewarding Intellectual Humility: Learning When Not To Answer In Large Language Models [0.0]
Large Language Models (LLMs) often produce hallucinated or unverifiable content, undermining their reliability in factual domains. This work investigates Reinforcement Learning with Verifiable Rewards (RLVR) as a training paradigm that explicitly rewards abstention alongside correctness to promote intellectual humility.
arXiv Detail & Related papers (2026-01-27T23:42:07Z) - Train for Truth, Keep the Skills: Binary Retrieval-Augmented Reward Mitigates Hallucinations [103.16279860448874]
We propose an online reinforcement learning method using a novel binary retrieval-augmented reward (RAR). For open-ended generation, binary RAR achieves a 39.3% reduction in hallucination rates. In short-form question answering, the model learns abstention, strategically outputting "I don't know" when faced with insufficient parametric knowledge.
arXiv Detail & Related papers (2025-10-20T16:45:43Z) - Hybrid Reinforcement: When Reward Is Sparse, It's Better to Be Dense [36.71358559780692]
HERO is a reinforcement learning framework that integrates verifier signals with reward-model scores in a structured way. HERO consistently outperforms RM-only and verifier-only baselines, with strong gains on both verifiable and hard-to-verify tasks.
arXiv Detail & Related papers (2025-10-08T17:09:41Z) - Reinforcement Learning with Verifiable yet Noisy Rewards under Imperfect Verifiers [90.50039419576807]
Reinforcement Learning with Verifiable Rewards (RLVR) trains policies against automated verifiers to avoid costly human labeling. To reduce vulnerability to verifier hacking, many RLVR systems collapse rewards to binary $\{0,1\}$ during training. This choice carries a cost: it introduces false negatives (rejecting correct answers, FNs) and false positives (accepting incorrect ones, FPs).
arXiv Detail & Related papers (2025-10-01T13:56:44Z) - Co-rewarding: Stable Self-supervised RL for Eliciting Reasoning in Large Language Models [56.055015597319674]
Reinforcement learning with verifiable rewards (RLVR) is effective at improving the reasoning ability of large language models (LLMs). Recent self-rewarding methods investigate a label-free alternative to unlock the reasoning capabilities of LLMs. We propose Co-rewarding, a novel self-supervised RL framework that improves training stability by seeking complementary supervision from other views.
arXiv Detail & Related papers (2025-08-01T08:09:14Z) - ConfRAG: Confidence-Guided Retrieval-Augmenting Generation [41.78313747240249]
We introduce ConfQA, a fine-tuning strategy that reduces hallucination rates from 20-40% to below 5% across multiple factuality benchmarks. We propose ConfRAG, a triggering strategy that invokes RAG only when the model responds that it is unsure. This framework achieves accuracy above 95% in the ideal case while reducing unnecessary external retrievals by over 30%.
arXiv Detail & Related papers (2025-06-08T22:51:46Z) - The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning [37.13807960501503]
Reinforcement learning with verifiable rewards (RLVR) is a promising approach for training language models (LMs). We decompose the learning signal into reinforcing correct responses and penalizing incorrect ones, referred to as Positive and Negative Sample Reinforcement (PSR and NSR). We show that NSR works by suppressing incorrect generations and redistributing probability mass toward other plausible candidates, guided by the model's prior beliefs.
arXiv Detail & Related papers (2025-06-02T06:10:54Z) - LACIE: Listener-Aware Finetuning for Confidence Calibration in Large Language Models [69.68379406317682]
We introduce a listener-aware finetuning method (LACIE) to calibrate implicit and explicit confidence markers.
We show that LACIE models the listener, considering not only whether an answer is right but also whether it will be accepted by a listener.
We find that training with LACIE results in 47% fewer incorrect answers being accepted while maintaining the same level of acceptance for correct answers.
arXiv Detail & Related papers (2024-05-31T17:16:38Z)