Related papers: On-Policy Self-Alignment with Fine-grained Knowledge Feedback for Hallucination Mitigation

On-Policy Self-Alignment with Fine-grained Knowledge Feedback for Hallucination Mitigation

URL: http://arxiv.org/abs/2406.12221v6
Date: Mon, 26 May 2025 06:22:40 GMT
Title: On-Policy Self-Alignment with Fine-grained Knowledge Feedback for Hallucination Mitigation
Authors: Xueru Wen, Jie Lou, Xinyu Lu, Ji Yuqiu, Xinyan Guan, Yaojie Lu, Hongyu Lin, Ben He, Xianpei Han, Debing Zhang, Le Sun,
Abstract summary: Hallucination occurs when large language models exhibit behavior that deviates from the boundaries of their knowledge during response generation.<n>Previous learning-based methods attempt to finetune models but are limited by off-policy sampling and coarse-grained feedback.<n>We present RLFH, an on-policy self-alignment approach that enables LLMs to actively explore their knowledge boundaries and self-correct generation behavior.
Score: 47.35777964373532
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Hallucination occurs when large language models exhibit behavior that deviates from the boundaries of their knowledge during response generation. To address this critical issue, previous learning-based methods attempt to finetune models but are limited by off-policy sampling and coarse-grained feedback. In this paper, we present \textit{\b{R}einforcement \b{L}earning \b{f}or \b{H}allucination} (RLFH), an on-policy self-alignment approach that enables LLMs to actively explore their knowledge boundaries and self-correct generation behavior through fine-grained feedback signals. RLFH introduces a self-assessment framework where the policy serves as its own judge. Through this framework, responses are automatically decomposed into atomic facts and their truthfulness and informativeness are assessed against external knowledge sources. The resulting fine-grained feedback at the statement level are then converted into token-level dense reward signals. This enables online reinforcement learning to achieve precise and timely optimization without human intervention. Comprehensive evaluations on HotpotQA, SQuADv2, and Biography benchmarks validate RLFH's effectiveness in hallucination mitigation.

Related papers

Post-Training Large Language Models via Reinforcement Learning from Self-Feedback [3.73824942136665]
Large Language Models (LLMs) often produce plausible but poorly-calibrated answers.<n>We present Reinforcement Learning from Self-Feedback (RLSF), a post-training stage that uses the model's own confidence as an intrinsic reward.
arXiv Detail & Related papers (2025-07-29T15:46:26Z)
Harnessing RLHF for Robust Unanswerability Recognition and Trustworthy Response Generation in LLMs [2.217239320172707]
This paper introduces Self-Aware LLM for Unanswerability (SALU), a novel approach that deeply integrates unanswerability detection directly within the generative process.<n>SALU is trained using a multi-task learning framework for both standard Question Answering (QA) and explicit abstention generation for unanswerable queries.<n>It consistently outperforms strong baselines, including hybrid LLM-classifier systems, in overall accuracy for correctly answering or abstaining from questions.
arXiv Detail & Related papers (2025-07-22T18:44:18Z)
RLHS: Mitigating Misalignment in RLHF with Hindsight Simulation [3.998312409829935]
We show that Reinforcement Learning from Human Feedback can cause severe, systematic misalignment.<n>We introduce Reinforcement Learning from Hindsight Simulation (RLHS), which presents plausible simulated outcomes to evaluators before eliciting feedback.<n>RLHS substantially improves alignment over RLHF in experiments and human evaluations.
arXiv Detail & Related papers (2025-01-15T06:33:15Z)
Iter-AHMCL: Alleviate Hallucination for Large Language Model via Iterative Model-level Contrastive Learning [16.883679810267342]
Iterative Model-level Contrastive Learning (Iter-AHMCL) to address hallucination. This paper introduces a novel approach called Iterative Model-level Contrastive Learning (Iter-AHMCL) to address hallucination.
arXiv Detail & Related papers (2024-10-16T00:15:40Z)
Understanding Privacy Risks of Embeddings Induced by Large Language Models [75.96257812857554]
Large language models show early signs of artificial general intelligence but struggle with hallucinations. One promising solution is to store external knowledge as embeddings, aiding LLMs in retrieval-augmented generation. Recent studies experimentally showed that the original text can be partially reconstructed from text embeddings by pre-trained language models.
arXiv Detail & Related papers (2024-04-25T13:10:48Z)
Rejection Improves Reliability: Training LLMs to Refuse Unknown Questions Using RL from Knowledge Feedback [14.120154004011084]
Large Language Models (LLMs) often generate erroneous outputs, known as hallucinations. We present a novel alignment framework called Reinforcement Learning from Knowledge Feedback (RLKF)
arXiv Detail & Related papers (2024-03-27T08:39:56Z)
How Can LLM Guide RL? A Value-Based Approach [68.55316627400683]
Reinforcement learning (RL) has become the de facto standard practice for sequential decision-making problems by improving future acting policies with feedback. Recent developments in large language models (LLMs) have showcased impressive capabilities in language understanding and generation, yet they fall short in exploration and self-improvement capabilities. We develop an algorithm named LINVIT that incorporates LLM guidance as a regularization factor in value-based RL, leading to significant reductions in the amount of data needed for learning.
arXiv Detail & Related papers (2024-02-25T20:07:13Z)
Retrieve Only When It Needs: Adaptive Retrieval Augmentation for Hallucination Mitigation in Large Language Models [68.91592125175787]
Hallucinations pose a significant challenge for the practical implementation of large language models (LLMs) We present Rowen, a novel approach that enhances LLMs with a selective retrieval augmentation process tailored to address hallucinations.
arXiv Detail & Related papers (2024-02-16T11:55:40Z)
Self-Alignment for Factuality: Mitigating Hallucinations in LLMs via Self-Evaluation [71.91287418249688]
Large language models (LLMs) often struggle with factual inaccuracies, even when they hold relevant knowledge. We leverage the self-evaluation capability of an LLM to provide training signals that steer the model towards factuality. We show that the proposed self-alignment approach substantially enhances factual accuracy over Llama family models across three key knowledge-intensive tasks.
arXiv Detail & Related papers (2024-02-14T15:52:42Z)
INSIDE: LLMs' Internal States Retain the Power of Hallucination Detection [39.52923659121416]
We propose to explore the dense semantic information retained within textbfINternal textbfStates for halluctextbfInation textbfDEtection. A simple yet effective textbfEigenScore metric is proposed to better evaluate responses' self-consistency. A test time feature clipping approach is explored to truncate extreme activations in the internal states.
arXiv Detail & Related papers (2024-02-06T06:23:12Z)
Knowledge Verification to Nip Hallucination in the Bud [69.79051730580014]
We demonstrate the feasibility of mitigating hallucinations by verifying and minimizing the inconsistency between external knowledge present in the alignment data and the intrinsic knowledge embedded within foundation LLMs. We propose a novel approach called Knowledge Consistent Alignment (KCA), which employs a well-aligned LLM to automatically formulate assessments based on external knowledge. We demonstrate the superior efficacy of KCA in reducing hallucinations across six benchmarks, utilizing foundation LLMs of varying backbones and scales.
arXiv Detail & Related papers (2024-01-19T15:39:49Z)
Mitigating Large Language Model Hallucinations via Autonomous Knowledge Graph-based Retrofitting [51.7049140329611]
This paper proposes Knowledge Graph-based Retrofitting (KGR) to mitigate factual hallucination during the reasoning process. Experiments show that KGR can significantly improve the performance of LLMs on factual QA benchmarks.
arXiv Detail & Related papers (2023-11-22T11:08:38Z)
Towards Mitigating Hallucination in Large Language Models via Self-Reflection [63.2543947174318]
Large language models (LLMs) have shown promise for generative and knowledge-intensive tasks including question-answering (QA) tasks. This paper analyses the phenomenon of hallucination in medical generative QA systems using widely adopted LLMs and datasets.
arXiv Detail & Related papers (2023-10-10T03:05:44Z)
Zero-Resource Hallucination Prevention for Large Language Models [45.4155729393135]
"Hallucination" refers to instances where large language models (LLMs) generate factually inaccurate or ungrounded information. We introduce a novel pre-language self-evaluation technique, referred to as SELF-FAMILIARITY, which focuses on evaluating the model's familiarity with the concepts present in the input instruction. We validate SELF-FAMILIARITY across four different large language models, demonstrating consistently superior performance compared to existing techniques.
arXiv Detail & Related papers (2023-09-06T01:57:36Z)
Stateful Offline Contextual Policy Evaluation and Learning [88.9134799076718]
We study off-policy evaluation and learning from sequential data. We formalize the relevant causal structure of problems such as dynamic personalized pricing. We show improved out-of-sample policy performance in this class of relevant problems.
arXiv Detail & Related papers (2021-10-19T16:15:56Z)
A Controllable Model of Grounded Response Generation [122.7121624884747]
Current end-to-end neural conversation models inherently lack the flexibility to impose semantic control in the response generation process. We propose a framework that we call controllable grounded response generation (CGRG) We show that using this framework, a transformer based model with a novel inductive attention mechanism, trained on a conversation-like Reddit dataset, outperforms strong generation baselines.
arXiv Detail & Related papers (2020-05-01T21:22:08Z)

This list is automatically generated from the titles and abstracts of the papers in this site.