TruthRL: Incentivizing Truthful LLMs via Reinforcement Learning
- URL: http://arxiv.org/abs/2509.25760v1
- Date: Tue, 30 Sep 2025 04:25:17 GMT
- Title: TruthRL: Incentivizing Truthful LLMs via Reinforcement Learning
- Authors: Zhepei Wei, Xiao Yang, Kai Sun, Jiaqi Wang, Rulin Shao, Sean Chen, Mohammad Kachuee, Teja Gollapudi, Tony Liao, Nicolas Scheffer, Rakesh Wanga, Anuj Kumar, Yu Meng, Wen-tau Yih, Xin Luna Dong
- Abstract summary: Large language models (LLMs) are prone to hallucination and untruthful responses. Balancing accuracy with appropriate abstention presents a fundamental challenge for existing methods. We present TruthRL, a general reinforcement learning framework that directly optimizes the truthfulness of LLMs.
- Score: 47.707273133540745
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While large language models (LLMs) have demonstrated strong performance on factoid question answering, they are still prone to hallucination and untruthful responses, particularly when tasks demand information outside their parametric knowledge. Indeed, truthfulness requires more than accuracy -- models must also recognize uncertainty and abstain when unsure to avoid hallucinations. This presents a fundamental challenge for existing methods: approaches that optimize for accuracy often amplify hallucinations, while those that encourage abstention can become overly conservative, sacrificing correct answers. Both extremes ultimately compromise truthfulness. In this work, we present TruthRL, a general reinforcement learning (RL) framework that directly optimizes the truthfulness of LLMs. Specifically, we implement TruthRL using GRPO with a simple yet effective ternary reward that distinguishes correct answers, hallucinations, and abstentions. It incentivizes models to reduce hallucinations not only by providing correct responses, but also by enabling abstention when uncertain, thereby improving truthfulness. Extensive experiments across four knowledge-intensive benchmarks show that, compared to vanilla RL, TruthRL significantly reduces hallucinations by 28.9% and improves truthfulness by 21.1%, with consistent gains across various backbone models (e.g., Qwen, Llama) under both retrieval and non-retrieval setups. An in-depth ablation study demonstrates that vanilla accuracy-driven methods, such as supervised fine-tuning or RL with a binary reward, struggle to balance factual correctness and uncertainty. In contrast, our proposed truthfulness-driven TruthRL achieves strong performance in both accuracy and truthfulness, underscoring the importance of learning objective design for developing truthful LLMs.
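To make the ternary reward concrete, below is a minimal Python sketch of how such a reward and GRPO-style group-normalized advantages could be computed. The reward values (+1 / 0 / -1), the phrase-based abstention detector, and the substring correctness check are illustrative assumptions, not the paper's implementation; in practice the judging would likely use a more robust method such as an LLM judge.

```python
import statistics

def is_abstention(response: str) -> bool:
    # Placeholder abstention detector; a real system might use an
    # LLM judge or richer pattern matching on refusal phrases.
    phrases = ("i don't know", "i am not sure", "cannot answer")
    return any(p in response.lower() for p in phrases)

def is_correct(response: str, gold_answer: str) -> bool:
    # Placeholder correctness check; a real system might use exact
    # match, span matching, or an LLM judge.
    return gold_answer.lower() in response.lower()

def ternary_reward(response: str, gold_answer: str) -> float:
    """Reward correct answers, tolerate abstention, penalize hallucination."""
    if is_abstention(response):
        return 0.0   # abstaining is neutral: no credit, no penalty
    if is_correct(response, gold_answer):
        return 1.0   # a correct answer earns full reward
    return -1.0      # a confident wrong answer (hallucination) is penalized

def grpo_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: for a group of rollouts answering the same
    question, subtract the group mean and divide by the group std."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

# Example: four sampled responses to one question.
rollouts = [
    "The capital of Australia is Canberra.",   # correct
    "The capital of Australia is Sydney.",     # hallucination
    "I don't know the answer to that.",        # abstention
    "It is Canberra.",                         # correct
]
rewards = [ternary_reward(r, "Canberra") for r in rollouts]
print(rewards)                   # [1.0, -1.0, 0.0, 1.0]
print(grpo_advantages(rewards))  # abstention scores above hallucination
```

Note how, after group normalization, the abstention receives a higher advantage than the hallucination; this gap is what discourages confident wrong answers without rewarding blanket refusal.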
Related papers
- From $f(x)$ and $g(x)$ to $f(g(x))$: LLMs Learn New Skills in RL by Composing Old Ones [68.68686526804909]
We show that LLMs can acquire genuinely new skills during RL by composing existing ones. Our experiments show that a compositional skill acquired on a source task transfers to a different target task. This transfer happens even without compositional training on the target, requiring only prior knowledge of the target's atomic skills.
arXiv Detail & Related papers (2025-09-29T17:44:27Z) - Unsupervised Hallucination Detection by Inspecting Reasoning Processes [53.15199932086543]
Unsupervised hallucination detection aims to identify hallucinated content generated by large language models (LLMs) without relying on labeled data. We propose IRIS, an unsupervised hallucination detection framework that leverages internal representations intrinsic to factual correctness. Our approach is fully unsupervised, computationally low-cost, and works well even with little training data, making it suitable for real-time detection.
arXiv Detail & Related papers (2025-09-12T06:58:17Z) - Quantized but Deceptive? A Multi-Dimensional Truthfulness Evaluation of Quantized LLMs [29.9148172868873]
Quantization enables efficient deployment of large language models (LLMs) in resource-constrained environments. We introduce TruthfulnessEval, a comprehensive evaluation framework for assessing the truthfulness of quantized LLMs. We find that while quantized models internally retain truthful representations, they are more susceptible to producing false outputs under misleading prompts.
arXiv Detail & Related papers (2025-08-26T21:01:45Z) - KnowRL: Exploring Knowledgeable Reinforcement Learning for Factuality [66.89190766278183]
Large Language Models (LLMs) often exhibit severe hallucination, outputting incorrect content due to an inability to accurately recognize knowledge boundaries during reasoning. We propose Knowledge-enhanced RL (KnowRL) to address the high hallucination rates in slow-thinking models. KnowRL guides models to perform fact-based slow thinking by integrating a factuality reward, based on knowledge verification, into the RL training process.
arXiv Detail & Related papers (2025-06-24T17:17:17Z) - Beyond Accuracy: Dissecting Mathematical Reasoning for LLMs Under Reinforcement Learning [93.00629872970364]
Reinforcement learning (RL) has become the dominant paradigm for improving the performance of language models on complex reasoning tasks. We introduce SPARKLE, a fine-grained analytic framework to dissect the effects of RL across three key dimensions. We study whether difficult problems -- those yielding no RL signals and mixed-quality reasoning traces -- can still be effectively used for training.
arXiv Detail & Related papers (2025-06-05T07:53:59Z) - The Hallucination Dilemma: Factuality-Aware Reinforcement Learning for Large Reasoning Models [63.98194996746229]
Large language models (LLMs) have significantly advanced in reasoning tasks through reinforcement learning (RL) optimization. However, reasoning-oriented RL fine-tuning significantly increases the prevalence of hallucinations. We propose Factuality-aware Step-wise Policy Optimization (FSPO), an innovative RL fine-tuning algorithm incorporating explicit factuality verification.
arXiv Detail & Related papers (2025-05-30T14:23:32Z) - LEAF: Learning and Evaluation Augmented by Fact-Checking to Improve Factualness in Large Language Models [11.453585039783901]
LEAF (Learning and Evaluation Augmented by Fact-Checking) is a novel approach designed to enhance the factual reliability of large language models (LLMs).
The first strategy, Fact-Check-Then-RAG, improves Retrieval-Augmented Generation (RAG) by incorporating fact-checking results to guide the retrieval process without updating model parameters.
The second strategy, Learning from Fact-Checks via Self-Training, involves supervised fine-tuning (SFT) on fact-checked responses or applying Simple Preference Optimization (SimPO) with fact-checking as a ranking mechanism; a sketch of this ranking step appears after this list.
arXiv Detail & Related papers (2024-10-31T00:18:05Z) - FLAME: Factuality-Aware Alignment for Large Language Models [86.76336610282401]
The conventional alignment process fails to enhance the factual accuracy of large language models (LLMs). We identify factors that lead to hallucination in both alignment steps: supervised fine-tuning (SFT) and reinforcement learning (RL). We propose factuality-aware alignment, comprising factuality-aware SFT and factuality-aware RL through direct preference optimization.
arXiv Detail & Related papers (2024-05-02T17:54:54Z) - Enhanced Language Model Truthfulness with Learnable Intervention and Uncertainty Expression [19.69104070561701]
Large language models (LLMs) can generate long-form and coherent text, yet they often hallucinate facts.
We propose LITO, a Learnable Intervention method for Truthfulness Optimization.
Experiments on multiple LLMs and question-answering datasets demonstrate that LITO improves truthfulness while preserving task accuracy.
arXiv Detail & Related papers (2024-05-01T03:50:09Z) - GRATH: Gradual Self-Truthifying for Large Language Models [63.502835648056305]
GRAdual self-truTHifying (GRATH) is a novel post-processing method to enhance the truthfulness of large language models (LLMs).
GRATH iteratively refines truthfulness data and updates the model, leading to a gradual improvement in model truthfulness in a self-supervised manner.
GRATH achieves state-of-the-art performance on TruthfulQA, with MC1 accuracy of 54.71% and MC2 accuracy of 69.10%, figures that surpass even those of 70B LLMs.
arXiv Detail & Related papers (2024-01-22T19:00:08Z)
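As flagged in the LEAF entry above, here is a minimal sketch of how fact-checking could serve as a ranking mechanism for SimPO-style preference optimization. The pairing helper and the dummy fact-check scores are hypothetical stand-ins for a real fact-checking system; only the SimPO objective itself (a length-normalized log-likelihood margin with target margin gamma and no reference model) follows the published formulation.

```python
import math

def rank_into_pair(responses: list[str], fact_scores: list[float]):
    """Rank sampled responses by fact-check score (from a hypothetical
    external fact-checker) and return the (chosen, rejected) pair."""
    ranked = sorted(zip(fact_scores, responses), reverse=True)
    return ranked[0][1], ranked[-1][1]

def simpo_loss(logps_chosen: list[float], logps_rejected: list[float],
               beta: float = 2.0, gamma: float = 1.0) -> float:
    """SimPO loss on one preference pair.

    `logps_*` are per-token log-probabilities of each response under the
    current policy; SimPO uses their length-normalized sums as implicit
    rewards and asks the chosen response to beat the rejected one by a
    margin gamma.
    """
    r_chosen = beta * sum(logps_chosen) / len(logps_chosen)
    r_rejected = beta * sum(logps_rejected) / len(logps_rejected)
    margin = r_chosen - r_rejected - gamma
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid)

# Example: pick a preference pair from three sampled responses using
# dummy fact-check scores, then score it with dummy per-token log-probs.
responses = ["response A", "response B", "response C"]
chosen, rejected = rank_into_pair(responses, [0.9, 0.4, 0.7])
print(chosen, rejected)  # response A, response B

loss = simpo_loss([-0.2, -0.1, -0.3, -0.2],
                  [-0.9, -1.1, -0.8, -1.0, -0.7, -1.2])
print(round(loss, 3))  # ~0.474: margin is met, so loss is below log(2)
```

Because SimPO drops the reference model, the only training signal per pair is this margin; the design choice in LEAF is simply to let fact-check scores, rather than human preferences, decide which response is chosen and which is rejected.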