Explainable reinforcement learning from human feedback to improve alignment
- URL: http://arxiv.org/abs/2512.13837v1
- Date: Mon, 15 Dec 2025 19:18:35 GMT
- Title: Explainable reinforcement learning from human feedback to improve alignment
- Authors: Shicheng Liu, Siyuan Xu, Wenjie Qiu, Hangfan Zhang, Minghui Zhu
- Abstract summary: We investigate whether this human improvement strategy can be applied to improving reinforcement learning from human feedback. In particular, it is observed in the literature that LMs tuned by RLHF can still output unsatisfactory responses. This paper proposes a method to improve the unsatisfactory responses by correcting their causes.
- Score: 33.905626357906414
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A common and effective strategy for humans to improve an unsatisfactory outcome in daily life is to find a cause of this outcome and correct the cause. In this paper, we investigate whether this human improvement strategy can be applied to improving reinforcement learning from human feedback (RLHF) for alignment of language models (LMs). In particular, it is observed in the literature that LMs tuned by RLHF can still output unsatisfactory responses. This paper proposes a method to improve the unsatisfactory responses by correcting their causes. Our method has two parts. The first part proposes a post-hoc explanation method to explain why an unsatisfactory response is generated to a prompt by identifying the training data that lead to this response. We formulate this problem as a constrained combinatorial optimization problem where the objective is to find a set of training data closest to this prompt-response pair in a feature representation space, and the constraint is that the prompt-response pair can be decomposed as a convex combination of this set of training data in the feature space. We propose an efficient iterative data selection algorithm to solve this problem. The second part proposes an unlearning method that improves unsatisfactory responses to some prompts by unlearning the training data that lead to these unsatisfactory responses and, meanwhile, does not significantly degrade satisfactory responses to other prompts. Experimental results demonstrate that our algorithm can improve RLHF.
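The first part of the method, as summarized above, reduces to a data-selection problem: pick a small set of training examples whose features are close to the query prompt-response pair while the query feature is (approximately) a convex combination of the selected features. The sketch below shows one way such an iterative selection loop could be organized; it is a simplified illustration under assumptions (plain NumPy feature vectors, a greedy scan over nearby candidates, a projected-gradient solver for the simplex-constrained fit), not the paper's actual algorithm, and the function names `select_explanatory_data`, `convex_fit`, and `project_to_simplex` are hypothetical.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex (Duchi et al.)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def convex_fit(X, z, steps=500, lr=0.1):
    """Simplex-constrained weights w minimizing ||X.T @ w - z||; returns (w, error)."""
    # Note: lr may need tuning depending on the scale of the features.
    w = np.full(X.shape[0], 1.0 / X.shape[0])
    for _ in range(steps):
        w = project_to_simplex(w - lr * (X @ (X.T @ w - z)))
    return w, float(np.linalg.norm(X.T @ w - z))

def select_explanatory_data(features, z, budget=10, tol=1e-3, scan=50):
    """Greedy loop: repeatedly add the nearby training example that most
    reduces the convex-combination reconstruction error of the query feature z."""
    dists = np.linalg.norm(features - z, axis=1)
    order = list(np.argsort(dists))
    selected, candidates = [order[0]], order[1:]   # start from the nearest example
    err = np.inf
    while len(selected) < budget and err > tol and candidates:
        best_i, best_err = None, err
        for i in candidates[:scan]:                # only scan the closest candidates
            _, e = convex_fit(features[selected + [i]], z)
            if e < best_err:
                best_i, best_err = i, e
        if best_i is None:                         # no candidate improves the fit
            break
        selected.append(best_i)
        candidates.remove(best_i)
        err = best_err
    return selected, err
```

Once such a set is identified, the paper's second part unlearns those examples while preserving satisfactory responses to other prompts; the abstract does not spell out the unlearning mechanism, so it is not sketched here.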
Related papers
- TACO: Think-Answer Consistency for Optimized Long-Chain Reasoning and Efficient Data Learning via Reinforcement Learning in LVLMs [50.820065021136024]
DeepSeek R1 has significantly advanced complex reasoning for large language models (LLMs). Recent methods have attempted to replicate R1's reasoning capabilities in multimodal settings. We propose TACO, a novel reinforcement learning algorithm for visual reasoning.
arXiv Detail & Related papers (2025-05-27T06:30:48Z) - Robust Reinforcement Learning from Human Feedback for Large Language Models Fine-Tuning [11.31665596884142]
Reinforcement learning from human feedback (RLHF) has emerged as a key technique for aligning the output of large language models with human preferences. Most existing RLHF algorithms use the Bradley-Terry model, which relies on assumptions about human preferences that may not reflect the complexity and variability of real-world judgments (a minimal sketch of the Bradley-Terry preference probability appears after this list). We propose a robust algorithm to enhance the performance of existing approaches under such reward model misspecifications.
arXiv Detail & Related papers (2025-04-03T16:16:35Z) - Exploring Data Scaling Trends and Effects in Reinforcement Learning from Human Feedback [12.7099489697479]
We introduce a hybrid reward system combining reasoning task verifiers (RTV) and a generative reward model (GenRM) to mitigate reward hacking. We also propose a novel prompt-selection method, Pre-PPO, to maintain response diversity and enhance learning effectiveness.
arXiv Detail & Related papers (2025-03-28T08:26:41Z) - In-context Demonstration Matters: On Prompt Optimization for Pseudo-Supervision Refinement [71.60563181678323]
Large language models (LLMs) have achieved great success across diverse tasks, and fine-tuning is sometimes needed to further enhance generation quality. To handle these challenges, a direct solution is to generate "high-confidence" data from unsupervised downstream tasks. We propose a novel approach, the pseudo-supervised demonstrations aligned prompt optimization (PAPO) algorithm, which jointly refines both the prompt and the overall pseudo-supervision.
arXiv Detail & Related papers (2024-10-04T03:39:28Z) - REAL: Response Embedding-based Alignment for LLMs [1.9513983244114355]
We propose a strategy for constructing a high-quality training dataset that focuses on acquiring the less ambiguous preference pairs. Experiments show that choosing dissimilar response pairs enhances the direct alignment of LLMs. Findings suggest that focusing on distinct pairs can reduce the label error and improve LLM alignment efficiency.
arXiv Detail & Related papers (2024-09-17T22:40:54Z) - Recursive Chain-of-Feedback Prevents Performance Degradation from Redundant Prompting [0.4662017507844857]
This paper studies such repetitive behavior and its effect by defining a novel setting, Chain-of-Feedback (CoF).
To alleviate these issues, we propose a novel method, Recursive Chain-of-Feedback (R-CoF).
arXiv Detail & Related papers (2024-02-05T00:44:28Z) - Beyond Imitation: Leveraging Fine-grained Quality Signals for Alignment [105.34140537748546]
We propose an improved alignment approach named FIGA. Different from prior methods, we incorporate fine-grained quality signals that are derived by contrasting good and bad responses.
Our approach makes two major contributions. First, we curate a refined alignment dataset that pairs initial responses with the corresponding revised ones.
Second, we devise a new loss function that leverages fine-grained quality signals to instruct the learning of LLMs for alignment.
arXiv Detail & Related papers (2023-11-07T15:36:40Z) - Constructive Large Language Models Alignment with Diverse Feedback [76.9578950893839]
We introduce Constructive and Diverse Feedback (CDF) as a novel method to enhance large language model alignment.
We exploit critique feedback for easy problems, refinement feedback for medium problems, and preference feedback for hard problems.
By training our model with this diversified feedback, we achieve enhanced alignment performance while using less training data.
arXiv Detail & Related papers (2023-10-10T09:20:14Z) - Enabling Language Models to Implicitly Learn Self-Improvement [49.16868302881804]
Large Language Models (LLMs) have demonstrated remarkable capabilities in open-ended text generation tasks.
We propose an ImPlicit Self-ImprovemenT (PIT) framework that implicitly learns the improvement goal from human preference data.
arXiv Detail & Related papers (2023-10-02T04:29:40Z) - Query-Dependent Prompt Evaluation and Optimization with Offline Inverse RL [62.824464372594576]
We aim to enhance the arithmetic reasoning ability of Large Language Models (LLMs) through zero-shot prompt optimization.
We identify a previously overlooked objective of query dependency in such optimization.
We introduce Prompt-OIRL, which harnesses offline inverse reinforcement learning to draw insights from offline prompting demonstration data.
arXiv Detail & Related papers (2023-09-13T01:12:52Z) - Provable Reward-Agnostic Preference-Based Reinforcement Learning [61.39541986848391]
Preference-based Reinforcement Learning (PbRL) is a paradigm in which an RL agent learns to optimize a task using pair-wise preference-based feedback over trajectories.
We propose a theoretical reward-agnostic PbRL framework that acquires exploratory trajectories enabling accurate learning of the hidden reward functions.
arXiv Detail & Related papers (2023-05-29T15:00:09Z) - Query-Policy Misalignment in Preference-Based Reinforcement Learning [21.212703100030478]
We show that the seemingly informative queries selected to improve the overall quality of the reward model may not actually align with the RL agents' interests.
We show that this issue can be effectively addressed via near on-policy queries and a specially designed hybrid experience replay.
Our method achieves substantial gains in both human feedback and RL sample efficiency.
arXiv Detail & Related papers (2023-05-27T07:55:17Z)
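Several of the related papers above build on or question the Bradley-Terry preference model used by most RLHF reward models. As a reference point, the minimal sketch below (not taken from any of the listed papers; the function names are illustrative) shows the standard Bradley-Terry preference probability and the corresponding negative log-likelihood used to fit a reward model to pairwise preference data.

```python
import math

def bradley_terry_prob(reward_chosen: float, reward_rejected: float) -> float:
    """P(chosen preferred over rejected) = sigmoid(r_chosen - r_rejected)."""
    return 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))

def bt_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood of an observed preference under the BT model."""
    return -math.log(bradley_terry_prob(reward_chosen, reward_rejected))
```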