Online Merging Optimizers for Boosting Rewards and Mitigating Tax in Alignment
- URL: http://arxiv.org/abs/2405.17931v1
- Date: Tue, 28 May 2024 07:53:40 GMT
- Title: Online Merging Optimizers for Boosting Rewards and Mitigating Tax in Alignment
- Authors: Keming Lu, Bowen Yu, Fei Huang, Yang Fan, Runji Lin, Chang Zhou
- Abstract summary: Aligning Large Language Models (LLMs) with human-centric values while preventing the degradation of abilities acquired through Pre-training and Supervised Fine-tuning (SFT) is a central challenge in RLHF.
In this paper, we show that interpolating RLHF and SFT model parameters can adjust the trade-off between human preference and basic capabilities, reducing the alignment tax at the cost of alignment reward.
Building on this, we propose the Online Merging Optimizer, which integrates the RL policy and SFT models at each optimization step. It significantly enhances alignment reward while mitigating alignment tax, achieving higher overall performance across 14 benchmarks.
- Score: 47.682736928029996
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Effectively aligning Large Language Models (LLMs) with human-centric values while preventing the degradation of abilities acquired through Pre-training and Supervised Fine-tuning (SFT) poses a central challenge in Reinforcement Learning from Human Feedback (RLHF). In this paper, we first discover that interpolating RLHF and SFT model parameters can adjust the trade-off between human preference and basic capabilities, thereby reducing the alignment tax at the cost of alignment reward. Inspired by this, we propose integrating the RL policy and SFT models at each optimization step in RLHF to continuously regulate the training direction, introducing the Online Merging Optimizer. Specifically, we merge gradients with the parameter differences between SFT and pretrained models, effectively steering the gradient towards maximizing rewards in the direction of SFT optimization. We demonstrate that our optimizer works well with different LLM families, such as Qwen and LLaMA, across various model sizes ranging from 1.8B to 8B, various RLHF algorithms like DPO and KTO, and existing model merging methods. It significantly enhances alignment reward while mitigating alignment tax, achieving higher overall performance across 14 benchmarks.
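The abstract describes the Online Merging Optimizer only at a high level, so the following is a minimal sketch of the update it suggests: blend the RLHF gradient with the "delta" direction given by the SFT parameters minus the pretrained parameters. The convex combination controlled by `merge_weight`, the norm matching, and the function name `online_merging_step` are illustrative assumptions for this sketch, not the paper's exact formulation.

```python
# Illustrative sketch only: the abstract says gradients are merged with the
# parameter difference between the SFT and pretrained models, but does not
# spell out the exact rule. The blend below and its scaling are assumptions.
import torch


@torch.no_grad()
def online_merging_step(policy, sft_params, pre_params, lr=1e-6, merge_weight=0.5):
    """One RLHF update whose direction is blended with the SFT "delta".

    policy       -- torch.nn.Module with gradients populated by the RLHF loss
    sft_params   -- dict: parameter name -> tensor from the SFT checkpoint
    pre_params   -- dict: parameter name -> tensor from the pretrained checkpoint
    merge_weight -- hypothetical knob: 0 is plain gradient descent, 1 follows
                    only the SFT optimization direction
    """
    for name, p in policy.named_parameters():
        if p.grad is None:
            continue
        # Direction taken during SFT, i.e. the "delta parameters".
        delta = sft_params[name] - pre_params[name]
        # Match scales so the blend is meaningful (assumption, not from the paper).
        delta = delta * (p.grad.norm() / (delta.norm() + 1e-12))
        # Move against the gradient (reward maximization) while being pulled
        # along the SFT optimization direction.
        step_dir = (1.0 - merge_weight) * (-p.grad) + merge_weight * delta
        p.add_(step_dir, alpha=lr)
```

In an RLHF (or DPO/KTO) training loop, a step of this form would take the place of the usual optimizer update after the policy loss has been backpropagated.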
Related papers
- SALSA: Soup-based Alignment Learning for Stronger Adaptation in RLHF [22.88031166401938]
This paper presents SALSA, a novel approach designed to overcome the limitations of a fixed reference model by creating a more flexible and better-located reference model.
We show that SALSA fosters better exploration, achieving higher rewards and improving model robustness, out-of-distribution generalization, and performance.
arXiv Detail & Related papers (2024-11-04T04:53:43Z) - SAIL: Self-Improving Efficient Online Alignment of Large Language Models [56.59644677997827]
Reinforcement Learning from Human Feedback is a key method for aligning large language models with human preferences.
Recent literature has focused on designing online RLHF methods but still lacks a unified conceptual formulation.
Our approach significantly improves alignment performance on open-sourced datasets with minimal computational overhead.
arXiv Detail & Related papers (2024-06-21T18:05:35Z) - Memory-Efficient Optimization with Factorized Hamiltonian Descent [11.01832755213396]
We introduce H-Fac, a novel adaptive optimizer that incorporates a memory-efficient factorization approach to address the memory overhead of adaptive optimization.
By employing a rank-1 parameterization for both momentum and scaling parameter estimators, H-Fac reduces memory costs to a sublinear level.
We develop our algorithms based on principles derived from Hamiltonian dynamics, providing robust theoretical underpinnings in optimization dynamics and convergence guarantees.
arXiv Detail & Related papers (2024-06-14T12:05:17Z) - Joint Demonstration and Preference Learning Improves Policy Alignment with Human Feedback [58.049113055986375]
We develop a single-stage approach, Alignment with Integrated Human Feedback (AIHF), to jointly train reward models and the policy.
The proposed approach admits a suite of efficient algorithms, which can easily reduce to, and leverage, popular alignment algorithms.
We demonstrate the efficiency of the proposed solutions with extensive experiments involving alignment problems in LLMs and robotic control problems in MuJoCo.
arXiv Detail & Related papers (2024-06-11T01:20:53Z) - Adaptive Preference Scaling for Reinforcement Learning with Human Feedback [103.36048042664768]
Reinforcement learning from human feedback (RLHF) is a prevalent approach to align AI systems with human values.
We propose a novel adaptive preference loss, underpinned by distributionally robust optimization (DRO).
Our method is versatile and can be readily adapted to various preference optimization frameworks.
arXiv Detail & Related papers (2024-06-04T20:33:22Z) - Mitigating the Alignment Tax of RLHF [76.4300447532456]
Aligning LLMs under Reinforcement Learning with Human Feedback can lead to forgetting pretrained abilities, also known as the alignment tax.
We propose model averaging, in the form of Heterogeneous Model Averaging (HMA), to maximize alignment performance while incurring minimal alignment tax (see the weight-averaging sketch after this list).
We validate HMA's performance across a range of RLHF algorithms over OpenLLaMA-3B and further extend our findings to Mistral-7B.
arXiv Detail & Related papers (2023-09-12T14:16:54Z) - Accelerated Federated Learning with Decoupled Adaptive Optimization [53.230515878096426]
The federated learning (FL) framework enables clients to collaboratively learn a shared model while keeping the privacy of training data on clients.
Recently, many efforts have been made to generalize centralized adaptive optimization methods, such as SGDM, Adam, AdaGrad, etc., to federated settings.
This work aims to develop novel adaptive optimization methods for FL from the perspective of the dynamics of ordinary differential equations (ODEs).
arXiv Detail & Related papers (2022-07-14T22:46:43Z)
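The model-averaging recipe in the "Mitigating the Alignment Tax of RLHF" entry is the offline counterpart of the interpolation finding in the main abstract. The sketch below is a minimal illustration under the assumption that averaging is done checkpoint to checkpoint with optional per-module ratios; the `ratios` argument and the `average_checkpoints` name are hypothetical, and the paper's actual HMA procedure and ratio selection are not reproduced here.

```python
# Illustrative sketch of offline model averaging between an RLHF checkpoint and
# its SFT starting point. Allowing different parameter-name prefixes to use
# different ratios hints at the heterogeneous (per-module) variant, but the
# exact HMA procedure from the paper is not reproduced.
import torch


@torch.no_grad()
def average_checkpoints(rlhf_params, sft_params, ratios=None, default_ratio=0.5):
    """Blend RLHF and SFT weights parameter by parameter.

    ratios -- optional dict mapping a parameter-name prefix
              (e.g. "model.layers.0.") to its own averaging ratio;
              unmatched parameters use `default_ratio`.
    """
    ratios = ratios or {}
    merged = {}
    for name, w_rlhf in rlhf_params.items():
        r = default_ratio
        for prefix, value in ratios.items():
            if name.startswith(prefix):
                r = value
                break
        merged[name] = r * w_rlhf + (1.0 - r) * sft_params[name]
    return merged
```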