Direct Advantage Regression: Aligning LLMs with Online AI Reward
- URL: http://arxiv.org/abs/2504.14177v1
- Date: Sat, 19 Apr 2025 04:44:32 GMT
- Title: Direct Advantage Regression: Aligning LLMs with Online AI Reward
- Authors: Li He, He Zhao, Stephen Wan, Dadong Wang, Lina Yao, Tongliang Liu
- Abstract summary: Online AI Feedback (OAIF) presents a promising alternative to Reinforcement Learning from Human Feedback (RLHF). We propose Direct Advantage Regression (DAR) to optimize policy improvement through weighted supervised fine-tuning. Our empirical results underscore that AI reward is a better form of AI supervision than AI preference, consistently achieving higher human-AI agreement.
- Score: 59.78549819431632
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Online AI Feedback (OAIF) presents a promising alternative to Reinforcement Learning from Human Feedback (RLHF) by utilizing online AI preference to align large language models (LLMs). However, simply replacing humans with AI deprives LLMs of more fine-grained AI supervision beyond binary signals. In this paper, we propose Direct Advantage Regression (DAR), a simple alignment algorithm that uses online AI reward to optimize policy improvement through weighted supervised fine-tuning. As an RL-free approach, DAR maintains theoretical consistency with online RLHF pipelines while significantly reducing implementation complexity and improving learning efficiency. Our empirical results underscore that AI reward is a better form of AI supervision than AI preference, consistently achieving higher human-AI agreement. Additionally, evaluations using GPT-4-Turbo and MT-bench show that DAR outperforms both OAIF and online RLHF baselines.
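Since DAR is RL-free and frames policy improvement as weighted supervised fine-tuning on online AI reward, the core update can be pictured as an advantage-weighted SFT loss. The sketch below is an illustrative assumption: the `dar_style_loss` name, the batch-mean baseline, and the softmax-over-batch weighting are not taken from the paper and the authors' exact objective may differ.

```python
import torch

def dar_style_loss(response_logprobs, ai_rewards, beta=1.0):
    """Illustrative advantage-weighted SFT loss (a sketch, not DAR's exact objective).

    response_logprobs: (batch,) summed token log-probs of sampled responses under the policy
    ai_rewards:        (batch,) scalar online AI rewards for those responses
    beta:              temperature; smaller values up-weight high-reward responses more sharply
    """
    # Center rewards with a simple batch-mean baseline to get an advantage estimate (assumption).
    advantages = ai_rewards - ai_rewards.mean()
    # Exponentiated-advantage weights, detached so gradients flow only through the log-probs.
    weights = torch.softmax(advantages / beta, dim=0).detach()
    # Weighted negative log-likelihood: reduces to plain SFT when all weights are equal.
    return -(weights * response_logprobs).sum()
```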
Related papers
- REAL: Response Embedding-based Alignment for LLMs [1.9513983244114355]
Response Embedding-based Alignment for LLMs is a strategy for constructing a high-quality training dataset. We show that choosing dissimilar response pairs enhances the direct alignment of LLMs while reducing inherited labeling errors. Our findings suggest that focusing on distinct pairs can reduce label error and improve the efficiency of LLM alignment, saving up to 65% of annotators' work.
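As a rough illustration of the pair-selection idea, the sketch below embeds candidate responses and keeps the most dissimilar pair by cosine similarity. The `embed` helper and this specific selection rule are assumptions for illustration, not the paper's exact procedure.

```python
import itertools
import torch.nn.functional as F

def most_dissimilar_pair(responses, embed):
    """Pick the two responses whose embeddings are least similar (illustrative sketch).

    responses: list[str] of candidate responses to one prompt
    embed:     callable str -> 1-D torch.Tensor, an assumed sentence-embedding helper
    """
    vectors = [embed(r) for r in responses]
    best_pair, best_sim = None, float("inf")
    for i, j in itertools.combinations(range(len(responses)), 2):
        sim = F.cosine_similarity(vectors[i], vectors[j], dim=0).item()
        if sim < best_sim:  # lower similarity = more dissimilar pair
            best_pair, best_sim = (responses[i], responses[j]), sim
    return best_pair
```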
arXiv Detail & Related papers (2024-09-17T22:40:54Z)
- Direct Language Model Alignment from Online AI Feedback [78.40436231613754]
Direct alignment from preferences (DAP) methods have recently emerged as efficient alternatives to reinforcement learning from human feedback (RLHF).
In this study, we posit that online feedback is key and improves DAP methods.
Our method, online AI feedback (OAIF), uses an LLM as annotator: at each training iteration, we sample two responses from the current model and prompt the LLM annotator to choose which one is preferred, thus providing online feedback.
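Read concretely, one OAIF step looks roughly like the sketch below; `policy_sample` and `annotate` are assumed helper functions standing in for sampling from the current model and prompting the LLM annotator.

```python
def oaif_step(prompt, policy_sample, annotate):
    """One online AI-feedback step (illustrative sketch of the loop described above).

    policy_sample(prompt) -> str     : draws a response from the current policy (assumed helper)
    annotate(prompt, a, b) -> 0 or 1 : LLM annotator returns the index of the preferred response
    """
    response_a = policy_sample(prompt)
    response_b = policy_sample(prompt)
    preferred = annotate(prompt, response_a, response_b)
    chosen, rejected = (response_a, response_b) if preferred == 0 else (response_b, response_a)
    # The online (chosen, rejected) pair then feeds a direct-alignment objective such as DPO.
    return chosen, rejected
```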
arXiv Detail & Related papers (2024-02-07T12:31:13Z)
- REBEL: Reward Regularization-Based Approach for Robotic Reinforcement Learning from Human Feedback [61.54791065013767]
A misalignment between the reward function and human preferences can lead to catastrophic outcomes in the real world. Recent methods aim to mitigate misalignment by learning reward functions from human preferences. We propose a novel concept of reward regularization within the robotic RLHF framework.
arXiv Detail & Related papers (2023-12-22T04:56:37Z)
- AI Alignment and Social Choice: Fundamental Limitations and Policy Implications [0.0]
Reinforcement learning with human feedback (RLHF) has emerged as the key framework for AI alignment.
In this paper, we investigate a specific challenge in building RLHF systems that respect democratic norms.
We show that aligning AI agents with the values of all individuals will always violate certain private ethical preferences of an individual user.
arXiv Detail & Related papers (2023-10-24T17:59:04Z)
- Contrastive Preference Learning: Learning from Human Feedback without RL [71.77024922527642]
We introduce Contrastive Preference Learning (CPL), an algorithm for learning optimal policies from preferences without learning reward functions.
CPL is fully off-policy, uses only a simple contrastive objective, and can be applied to arbitrary MDPs.
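A minimal sketch of a contrastive objective of this kind is shown below: the preferred segment's summed log-probabilities under the policy are contrasted against the dispreferred segment's, with no reward model or reference model involved. The `alpha` temperature and the omission of discounting are simplifying assumptions, not the paper's exact formulation.

```python
import torch.nn.functional as F

def contrastive_preference_loss(chosen_logprobs, rejected_logprobs, alpha=0.1):
    """Contrastive preference loss over policy log-probs (simplified sketch).

    chosen_logprobs / rejected_logprobs: (batch,) summed log pi(a|s) over the
    preferred / dispreferred behavior segments; no reward or reference model is used.
    """
    return -F.logsigmoid(alpha * (chosen_logprobs - rejected_logprobs)).mean()
```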
arXiv Detail & Related papers (2023-10-20T16:37:56Z)
- SALMON: Self-Alignment with Instructable Reward Models [80.83323636730341]
This paper presents a novel approach, namely SALMON, to align base language models with minimal human supervision.
We develop an AI assistant named Dromedary-2 with only 6 exemplars for in-context learning and 31 human-defined principles.
arXiv Detail & Related papers (2023-10-09T17:56:53Z)
- RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback [5.3113139864044046]
Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models (LLMs) with human preferences, but gathering high-quality preference labels is expensive.
RLAIF offers a promising alternative that trains the reward model (RM) on preferences generated by an off-the-shelf LLM.
Our results suggest that RLAIF can achieve performance on-par with using human feedback, offering a potential solution to the scalability limitations of RLHF.
arXiv Detail & Related papers (2023-09-01T05:53:33Z)
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model [119.65409513119963]
We introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form.
The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight.
Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods.
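The closed-form reparameterization yields the published DPO objective, sketched below; the inputs are summed per-response token log-probabilities under the policy being trained and the frozen reference model.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO objective: -log sigmoid(beta * margin of policy-vs-reference log-ratios)."""
    chosen_ratio = policy_chosen_lp - ref_chosen_lp        # log pi(y_w|x) - log pi_ref(y_w|x)
    rejected_ratio = policy_rejected_lp - ref_rejected_lp  # log pi(y_l|x) - log pi_ref(y_l|x)
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```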
arXiv Detail & Related papers (2023-05-29T17:57:46Z)
- Constitutional AI: Harmlessness from AI Feedback [19.964791766072132]
We experiment with methods for training a harmless AI assistant through self-improvement.
The only human oversight is provided through a list of rules or principles.
We are able to train a harmless but non-evasive AI assistant that engages with harmful queries by explaining its objections to them.
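The self-improvement recipe can be sketched as a critique-then-revise loop driven only by the written principles; `generate` is an assumed text-generation helper and the prompt wording here is illustrative rather than the paper's exact templates.

```python
import random

def constitutional_revision(prompt, principles, generate, n_rounds=2):
    """Critique-and-revise loop (illustrative sketch of the self-improvement phase).

    generate(text) -> str : assumed helper that samples a completion from the model
    principles            : list of human-written rules, the only human oversight
    """
    response = generate(prompt)
    for _ in range(n_rounds):
        principle = random.choice(principles)
        critique = generate(
            f"Critique the response to '{prompt}' according to this principle: {principle}\n"
            f"Response: {response}"
        )
        response = generate(
            f"Revise the response to address the critique.\n"
            f"Critique: {critique}\nOriginal response: {response}"
        )
    # Revised responses are collected and used for supervised fine-tuning.
    return response
```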
arXiv Detail & Related papers (2022-12-15T06:19:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.