Language Model Alignment with Elastic Reset
- URL: http://arxiv.org/abs/2312.07551v1
- Date: Wed, 6 Dec 2023 22:53:34 GMT
- Title: Language Model Alignment with Elastic Reset
- Authors: Michael Noukhovitch, Samuel Lavoie, Florian Strub, Aaron Courville
- Abstract summary: We argue that commonly-used test metrics are insufficient and instead measure how different algorithms trade off between reward and drift.
We propose Elastic Reset, a new algorithm that achieves higher reward with less drift without explicitly modifying the training objective.
We demonstrate that fine-tuning language models with Elastic Reset leads to state-of-the-art performance on a small-scale pivot-translation benchmark.
- Score: 8.503863369800191
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Finetuning language models with reinforcement learning (RL), e.g. from human
feedback (HF), is a prominent method for alignment. But optimizing against a
reward model can improve on reward while degrading performance in other areas,
a phenomenon known as reward hacking, alignment tax, or language drift. First,
we argue that commonly-used test metrics are insufficient and instead measure
how different algorithms trade off between reward and drift. The standard method
modifies the reward with a Kullback-Leibler (KL) penalty between the online and
initial model. We propose Elastic Reset, a new algorithm that achieves higher
reward with less drift without explicitly modifying the training objective. We
periodically reset the online model to an exponential moving average (EMA) of
itself, then reset the EMA model to the initial model. Through the use of an
EMA, our model recovers quickly after resets and achieves higher reward with
less drift in the same number of steps. We demonstrate that fine-tuning
language models with Elastic Reset leads to state-of-the-art performance on a
small-scale pivot-translation benchmark, outperforms all baselines in a
medium-scale RLHF-like IMDB mock sentiment task and leads to a more performant
and more aligned technical QA chatbot with LLaMA-7B. Code available at
github.com/mnoukhov/elastic-reset.
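As a rough illustration of the abstract above, here is a minimal, hypothetical PyTorch-style sketch of the KL-penalized baseline and of the Elastic Reset schedule (periodically reset the online model to its EMA, then reset the EMA to the initial model). It is reconstructed from the abstract alone, not the authors' implementation (that is at github.com/mnoukhov/elastic-reset); `make_optimizer`, `rl_update`, and all hyperparameter values are illustrative placeholders.

```python
import copy
import torch

def kl_penalized_reward(reward, logp_online, logp_init, beta=0.1):
    # Standard baseline from the abstract: subtract a KL-style penalty
    # (per-token log-prob difference between online and initial model).
    # beta is an illustrative coefficient, not a value from the paper.
    return reward - beta * (logp_online - logp_init)

def elastic_reset_loop(online_model, make_optimizer, rl_update,
                       total_steps, reset_every=1000, ema_decay=0.995):
    # Hypothetical training loop sketching Elastic Reset as described in the
    # abstract. `make_optimizer` builds an optimizer for a given model and
    # `rl_update` performs one RL step (e.g. PPO against a reward model).
    init_state = copy.deepcopy(online_model.state_dict())  # frozen initial model
    ema_model = copy.deepcopy(online_model)                 # EMA of the online model
    optimizer = make_optimizer(online_model)

    for step in range(1, total_steps + 1):
        rl_update(online_model, optimizer)

        # Track an exponential moving average of the online parameters.
        with torch.no_grad():
            for ema_p, p in zip(ema_model.parameters(), online_model.parameters()):
                ema_p.mul_(ema_decay).add_(p, alpha=1.0 - ema_decay)

        if step % reset_every == 0:
            # 1) Reset the online model to the EMA of itself ...
            online_model.load_state_dict(ema_model.state_dict())
            # 2) ... then reset the EMA model to the initial model.
            ema_model.load_state_dict(init_state)
            # Assumption: optimizer state is rebuilt after a reset.
            optimizer = make_optimizer(online_model)
```

In this sketch the reset schedule adds only two knobs, the EMA decay and the reset interval, on top of whatever RL fine-tuning setup is already in use.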
Related papers
- Sequence to Sequence Reward Modeling: Improving RLHF by Language Feedback [8.601283886845664]
Reinforcement learning from human feedback (RLHF) aligns large language models (LLMs) with human intentions and values.
Despite its effectiveness and popularity, RLHF is prone to biased local optimization.
We propose a novel sequence-to-sequence (seq2seq) reward modeling method.
arXiv Detail & Related papers (2024-08-30T16:14:35Z)
- Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms [50.808123629394245]
Direct Alignment Algorithms (DAAs) like Direct Preference Optimization have emerged as alternatives to the classical RLHF pipeline.
This work formulates and formalizes the reward over-optimization or hacking problem for DAAs and explores its consequences across objectives, training regimes, and model scales.
arXiv Detail & Related papers (2024-06-05T03:41:37Z)
- Self-Exploring Language Models: Active Preference Elicitation for Online Alignment [88.56809269990625]
We propose a bilevel objective optimistically biased towards potentially high-reward responses to actively explore out-of-distribution regions.
Our experimental results demonstrate that when fine-tuned on Zephyr-7B-SFT and Llama-3-8B-Instruct models, Self-Exploring Language Models (SELM) significantly boosts the performance on instruction-following benchmarks.
arXiv Detail & Related papers (2024-05-29T17:59:07Z)
- Getting More Juice Out of the SFT Data: Reward Learning from Human Demonstration Improves SFT for LLM Alignment [65.15914284008973]
We propose to leverage an Inverse Reinforcement Learning (IRL) technique to simultaneously build a reward model and a policy model.
We show that the proposed algorithms converge to the stationary solutions of the IRL problem.
Our results indicate that it is beneficial to leverage reward learning throughout the entire alignment process.
arXiv Detail & Related papers (2024-05-28T07:11:05Z)
- RewardBench: Evaluating Reward Models for Language Modeling [100.28366840977966]
We present RewardBench, a benchmark dataset and code-base for evaluation of reward models.
The dataset is a collection of prompt-chosen-rejected trios spanning chat, reasoning, and safety.
On the RewardBench leaderboard, we evaluate reward models trained with a variety of methods.
arXiv Detail & Related papers (2024-03-20T17:49:54Z)
- Rethinking Model Re-Basin and Linear Mode Connectivity [1.1510009152620668]
We decompose re-normalization into rescaling and reshift, uncovering that rescaling plays a crucial role in re-normalization.
We identify that the merged model suffers from the issue of activation collapse and magnitude collapse.
We propose a new perspective to unify the re-basin and pruning, under which a lightweight yet effective post-pruning technique is derived.
arXiv Detail & Related papers (2024-02-05T17:06:26Z)
- Iterative Data Smoothing: Mitigating Reward Overfitting and Overoptimization in RLHF [79.98542868281471]
Reinforcement Learning from Human Feedback (RLHF) is a technique that aligns language models closely with human-centric values.
It is observed that the performance of the reward model degrades after one epoch of training, and optimizing too much against the learned reward model eventually hinders the true objective.
This paper delves into these issues, leveraging the theoretical insights to design an improved reward learning algorithm termed 'Iterative Data Smoothing' (IDS).
arXiv Detail & Related papers (2024-01-29T17:43:42Z)
- Scaling Laws for Reward Model Overoptimization [19.93331579503503]
We study how the gold reward model score changes as we optimize against the proxy reward model using either reinforcement learning or best-of-$n$ sampling (a brief best-of-$n$ sketch follows this list).
We also study the effect on this relationship of the size of the reward model dataset, the number of reward model and policy parameters, and the coefficient of the KL penalty added to the reward in the reinforcement learning setup.
arXiv Detail & Related papers (2022-10-19T17:56:10Z)
- Aligned Cross Entropy for Non-Autoregressive Machine Translation [120.15069387374717]
We propose aligned cross entropy (AXE) as an alternative loss function for training of non-autoregressive models.
AXE-based training of conditional masked language models (CMLMs) substantially improves performance on major WMT benchmarks.
arXiv Detail & Related papers (2020-04-03T16:24:47Z)
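As referenced in the Scaling Laws for Reward Model Overoptimization entry above, best-of-$n$ sampling draws $n$ candidate completions and keeps the one the (proxy) reward model scores highest. Below is a minimal hypothetical sketch, not code from any of the listed papers; `generate` and `proxy_reward` are assumed stand-ins for a language model sampler and a learned reward model.

```python
from typing import Callable, List

def best_of_n(prompt: str,
              generate: Callable[[str], str],
              proxy_reward: Callable[[str, str], float],
              n: int = 16) -> str:
    # Sample n completions and return the one the proxy reward model prefers.
    # `generate` and `proxy_reward` are hypothetical placeholders, not APIs
    # from the papers above.
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    scores = [proxy_reward(prompt, c) for c in candidates]
    return candidates[scores.index(max(scores))]
```

Increasing $n$ optimizes harder against the proxy reward model, which is what makes this sampler a convenient testbed for studying overoptimization against a gold reward model.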