Rewarding Intellectual Humility: Learning When Not to Answer in Large Language Models
- URL: http://arxiv.org/abs/2601.20126v1
- Date: Tue, 27 Jan 2026 23:42:07 GMT
- Title: Rewarding Intellectual Humility: Learning When Not to Answer in Large Language Models
- Authors: Abha Jha, Akanksha Mahajan, Ashwath Vaithinathan Aravindan, Praveen Saravanan, Sai Sailaja Policharla, Sonal Chaturbhuj Gehlot
- Abstract summary: Large Language Models (LLMs) often produce hallucinated or unverifiable content, undermining their reliability in factual domains. This work investigates Reinforcement Learning with Verifiable Rewards (RLVR) as a training paradigm that explicitly rewards abstention alongside correctness to promote intellectual humility.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) often produce hallucinated or unverifiable content, undermining their reliability in factual domains. This work investigates Reinforcement Learning with Verifiable Rewards (RLVR) as a training paradigm that explicitly rewards abstention ("I don't know") alongside correctness to promote intellectual humility. We fine-tune and evaluate Granite-3.3-2B-Instruct and Qwen-3-4B-Instruct on the MedMCQA and Hendrycks Math benchmarks using a ternary reward structure $(-1, r_{\text{abs}}, 1)$ with varying abstention rewards $r_{\text{abs}}$. We further study the effect of combining RLVR with supervised fine-tuning strategies that teach abstention prior to reinforcement learning. Our results show that moderate abstention rewards ($r_{\text{abs}} \approx -0.25$ to $0.3$) consistently reduce incorrect responses without severe accuracy degradation on multiple-choice tasks, with larger models exhibiting greater robustness to abstention incentives. On open-ended question answering, we observe limitations due to insufficient exploration, which can be partially mitigated through supervised abstention training. Overall, these findings demonstrate the feasibility and flexibility of verifiable reward design as a practical approach to hallucination mitigation in language models. Reproducible code for our abstention training framework is available at https://github.com/Mystic-Slice/rl-abstention.
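To make the ternary reward concrete, here is a minimal Python sketch under stated assumptions: abstention is detected by a simple "I don't know" string match, and correctness by exact match against the gold answer. The function name and both heuristics are illustrative, not the authors' released implementation (see the linked repository for that).

```python
# Minimal sketch of a ternary abstention-aware reward: 1 for a correct
# answer, r_abs for abstaining, -1 for an incorrect answer. The string
# matching below is an illustrative assumption, not the paper's verifier.

ABSTENTION_MARKERS = ("i don't know", "i do not know")

def ternary_reward(response: str, gold_answer: str, r_abs: float = 0.0) -> float:
    text = response.strip().lower()
    if any(marker in text for marker in ABSTENTION_MARKERS):
        # The paper reports r_abs roughly in [-0.25, 0.3] working well:
        # abstaining should beat a wrong answer but not a correct one.
        return r_abs
    if text == gold_answer.strip().lower():
        return 1.0   # verifiably correct answer
    return -1.0      # confident but wrong: the worst outcome
```

Under this reward, guessing has expected value $2p - 1$ for correctness probability $p$, so abstaining (worth $r_{\text{abs}}$) is optimal whenever $p < (1 + r_{\text{abs}})/2$; for example, with $r_{\text{abs}} = 0.3$ the model should answer only when it is more than 65% likely to be right.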
Related papers
- From Verifiable Dot to Reward Chain: Harnessing Verifiable Reference-based Rewards for Reinforcement Learning of Open-ended Generation [52.62655622099456]
We propose reinforcement learning with verifiable reference-based rewards (RLVRR). Instead of checking the final answer, RLVRR extracts an ordered linguistic signal from high-quality references (i.e., a reward chain). In this way, RLVRR decomposes rewards into two dimensions: content, which preserves deterministic core concepts, and style, which evaluates adherence to stylistic properties.
arXiv Detail & Related papers (2026-01-26T14:39:58Z)
- GRACE: Reinforcement Learning for Grounded Response and Abstention under Contextual Evidence [9.80421132842862]
Retrieval-Augmented Generation (RAG) integrates external knowledge to enhance Large Language Models (LLMs). RAG is susceptible to two critical flaws: providing correct answers without explicit grounded evidence, and producing fabricated responses when the retrieved context is insufficient. We propose GRACE, a reinforcement-learning framework that simultaneously mitigates both types of flaws.
arXiv Detail & Related papers (2026-01-08T02:47:33Z)
- Honesty over Accuracy: Trustworthy Language Models through Reinforced Hesitation [12.503662455234954]
We show that modern language models produce confident hallucinations even when wrong answers carry catastrophic consequences. We propose Reinforced Hesitation (RH): a modification of Reinforcement Learning from Verifiable Rewards (RLVR) that uses ternary rewards instead of binary ones.
arXiv Detail & Related papers (2025-11-14T17:20:45Z)
- Train for Truth, Keep the Skills: Binary Retrieval-Augmented Reward Mitigates Hallucinations [103.16279860448874]
We propose an online reinforcement learning method using a novel binary retrieval-augmented reward (RAR). For open-ended generation, binary RAR achieves a 39.3% reduction in hallucination rates. In short-form question answering, the model learns abstention, strategically outputting "I don't know" when faced with insufficient parametric knowledge.
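As a rough sketch of how such a binary retrieval-augmented reward could be wired up; the retrieval, claim-extraction, and verification helpers here are hypothetical interfaces, not the paper's actual components:

```python
# Sketch of a binary retrieval-augmented reward (RAR): the response
# earns 1 only if every extracted claim is supported by retrieved
# evidence, otherwise 0. retrieve / extract_claims / is_supported are
# hypothetical stand-ins for the paper's actual verifier pipeline.

from typing import Callable, List

def binary_rar(
    prompt: str,
    response: str,
    retrieve: Callable[[str], List[str]],
    extract_claims: Callable[[str], List[str]],
    is_supported: Callable[[str, List[str]], bool],
) -> float:
    evidence = retrieve(prompt)
    claims = extract_claims(response)
    # All-or-nothing: a single unsupported claim zeroes the reward,
    # which is what makes the signal binary rather than graded.
    if claims and all(is_supported(c, evidence) for c in claims):
        return 1.0
    return 0.0
```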
arXiv Detail & Related papers (2025-10-20T16:45:43Z)
- Learning a Dense Reasoning Reward Model from Expert Demonstration via Inverse Reinforcement Learning [50.20267980386502]
We learn a dense, token-level reward model for process supervision directly from expert demonstrations. The learned reasoning reward serves two complementary roles: (i) it provides step-level feedback to optimise a reasoning policy during training; and (ii) it functions at inference as a critic to rerank sampled traces under fixed compute budgets.
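The second role, reranking under a fixed compute budget, might look roughly like the following best-of-k sketch; sample_trace and token_rewards are assumed interfaces standing in for the paper's policy and learned reward model:

```python
# Sketch of using a learned token-level reward as an inference-time
# critic: sample k traces under a fixed budget, score each by the sum
# of its per-token rewards, and return the highest-scoring one.

from typing import Callable, List

def rerank_best_of_k(
    prompt: str,
    sample_trace: Callable[[str], str],          # assumed policy sampler
    token_rewards: Callable[[str, str], List[float]],  # assumed reward model
    k: int = 8,
) -> str:
    traces = [sample_trace(prompt) for _ in range(k)]
    # Dense rewards let us score whole traces by summing per-token signal.
    return max(traces, key=lambda t: sum(token_rewards(prompt, t)))
```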
arXiv Detail & Related papers (2025-10-02T09:55:26Z)
- Co-rewarding: Stable Self-supervised RL for Eliciting Reasoning in Large Language Models [56.055015597319674]
Reinforcement learning with verifiable rewards (RLVR) is effective for improving the reasoning ability of large language models (LLMs). Recent self-rewarding methods investigate a label-free alternative to unlock the reasoning capabilities of LLMs. We propose Co-rewarding, a novel self-supervised RL framework that improves training stability by seeking complementary supervision from another view.
arXiv Detail & Related papers (2025-08-01T08:09:14Z)
- RULE: Reinforcement UnLearning Achieves Forget-Retain Pareto Optimality [24.299312059430704]
Unlearning is the task of selectively removing specific information from a model without retraining from scratch or degrading overall utility. Existing methods often rely on large-scale forget and retain datasets, and suffer from unnatural responses, poor generalization, or catastrophic utility loss. We propose Reinforcement UnLearning (RULE), an efficient framework that formulates unlearning as a refusal boundary optimization problem.
arXiv Detail & Related papers (2025-06-08T14:38:39Z)
- The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning [37.13807960501503]
Reinforcement learning with verifiable rewards (RLVR) is a promising approach for training language models (LMs). We decompose the learning signal into reinforcing correct responses and penalizing incorrect ones, referred to as Positive and Negative Sample Reinforcement (PSR and NSR). We show that NSR works by suppressing incorrect generations and redistributing probability mass toward other plausible candidates, guided by the model's prior beliefs.
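A hedged sketch of what the PSR/NSR decomposition can look like as a loss over sampled responses; the weighting coefficients are an illustrative assumption (the paper studies the two terms, including each in isolation):

```python
# Sketch of decomposing the RLVR learning signal into Positive Sample
# Reinforcement (PSR: raise log-probs of correct responses) and Negative
# Sample Reinforcement (NSR: lower log-probs of incorrect ones).

import torch

def psr_nsr_loss(
    logprobs: torch.Tensor,   # (batch,) sequence log-probs of sampled responses
    correct: torch.Tensor,    # (batch,) boolean mask from the verifier
    lam_pos: float = 1.0,     # illustrative weights, not from the paper
    lam_neg: float = 1.0,
) -> torch.Tensor:
    psr = -logprobs[correct].sum()   # minimize: push up correct responses
    nsr = logprobs[~correct].sum()   # minimize: push down incorrect responses
    # NSR alone suppresses bad generations; the freed probability mass
    # flows to other candidates the model already considered plausible.
    return lam_pos * psr + lam_neg * nsr
```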
arXiv Detail & Related papers (2025-06-02T06:10:54Z)
- Learning to Reason without External Rewards [100.27210579418562]
Training large language models (LLMs) for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by reliance on costly, domain-specific supervision. We explore Reinforcement Learning from Internal Feedback (RLIF), a framework that enables LLMs to learn from intrinsic signals without external rewards or labeled data. We propose Intuitor, an RLIF method that uses a model's own confidence, termed self-certainty, as its sole reward signal.
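One common way to operationalize a model's own confidence as a scalar reward is the average KL divergence of its next-token distributions from uniform (equivalently, log vocabulary size minus mean token entropy). Whether this matches Intuitor's exact definition of self-certainty is an assumption here; the sketch only illustrates a verifier-free intrinsic reward:

```python
# Sketch of a confidence-based intrinsic reward: average KL divergence
# of each next-token distribution from uniform, i.e. log|V| - H(p).
# Higher values mean the model is more certain of its own generation.
# This exact formula is an assumption, not necessarily Intuitor's.

import math
import torch
import torch.nn.functional as F

def self_certainty(logits: torch.Tensor) -> torch.Tensor:
    """logits: (seq_len, vocab_size) for one sampled response."""
    logp = F.log_softmax(logits, dim=-1)
    entropy = -(logp.exp() * logp).sum(dim=-1)          # per-token entropy
    kl_from_uniform = math.log(logits.size(-1)) - entropy
    return kl_from_uniform.mean()                        # scalar reward
```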
arXiv Detail & Related papers (2025-05-26T07:01:06Z)
- Semi-supervised reward learning for offline reinforcement learning [71.6909757718301]
Training agents usually requires reward functions, but rewards are seldom available in practice and their engineering is challenging and laborious.
We propose semi-supervised learning algorithms that learn from limited annotations and incorporate unlabelled data.
In our experiments with a simulated robotic arm, we greatly improve upon behavioural cloning and closely approach the performance achieved with ground truth rewards.
arXiv Detail & Related papers (2020-12-12T20:06:15Z)