Related papers: SimPO: Simple Preference Optimization with a Reference-Free Reward

SimPO: Simple Preference Optimization with a Reference-Free Reward

URL: http://arxiv.org/abs/2405.14734v3
Date: Fri, 01 Nov 2024 20:05:19 GMT
Title: SimPO: Simple Preference Optimization with a Reference-Free Reward
Authors: Yu Meng, Mengzhou Xia, Danqi Chen,
Abstract summary: Direct Preference Optimization (DPO) is a widely used offline preference optimization algorithm. We propose SimPO, a simpler yet more effective approach to DPO. SimPO consistently and significantly outperforms DPO without substantially increasing response length.
Score: 43.136307294076545
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Direct Preference Optimization (DPO) is a widely used offline preference optimization algorithm that reparameterizes reward functions in reinforcement learning from human feedback (RLHF) to enhance simplicity and training stability. In this work, we propose SimPO, a simpler yet more effective approach. The effectiveness of SimPO is attributed to a key design: using the average log probability of a sequence as the implicit reward. This reward formulation better aligns with model generation and eliminates the need for a reference model, making it more compute and memory efficient. Additionally, we introduce a target reward margin to the Bradley-Terry objective to encourage a larger margin between the winning and losing responses, further improving the algorithm's performance. We compare SimPO to DPO and its latest variants across various state-of-the-art training setups, including both base and instruction-tuned models such as Mistral, Llama 3, and Gemma 2. We evaluate on extensive chat-based evaluation benchmarks, including AlpacaEval 2, MT-Bench, and Arena-Hard. Our results demonstrate that SimPO consistently and significantly outperforms existing approaches without substantially increasing response length. Specifically, SimPO outperforms DPO by up to 6.4 points on AlpacaEval 2 and by up to 7.5 points on Arena-Hard. Our top-performing model, built on Gemma-2-9B-it, achieves a 72.4% length-controlled win rate on AlpacaEval 2, a 59.1% win rate on Arena-Hard, and ranks 1st on Chatbot Arena among <10B models with real user votes.

Related papers

Finding the Sweet Spot: Preference Data Construction for Scaling Preference Optimization [66.67988187816185]
We aim to emphscale up the number of on-policy samples via repeated random sampling to improve alignment performance. Our experiments reveal that this strategy leads to a emphdecline in performance as the sample size increases. We introduce a scalable preference data construction strategy that consistently enhances model performance as the sample scale increases.
arXiv Detail & Related papers (2025-02-24T04:22:57Z)
Earlier Tokens Contribute More: Learning Direct Preference Optimization From Temporal Decay Perspective [22.248134630764497]
We propose an enhanced preference optimization method that incorporates a temporal decay factor controlled by a gamma parameter. Our approach mitigates overfitting to less pertinent data and remains responsive to evolving human preferences.
arXiv Detail & Related papers (2025-02-20T07:53:11Z)
R.I.P.: Better Models by Survival of the Fittest Prompts [51.2293437372642]
We introduce a method for evaluating data integrity based on the assumption that low-quality input prompts result in high variance and low quality responses. This is achieved by measuring the rejected response quality and the reward gap between the chosen and rejected preference pair.
arXiv Detail & Related papers (2025-01-30T18:50:25Z)
AlphaPO -- Reward shape matters for LLM alignment [8.753297661521007]
We introduce textbfAlphaPO, a new DAA that helps change the shape of the reward function beyond the standard log reward. Compared to SimPO, one of the best performing DAAs, AlphaPO leads to about 7% to 10% relative improvement in alignment performance.
arXiv Detail & Related papers (2025-01-07T15:46:42Z)
Evolving Alignment via Asymmetric Self-Play [52.3079697845254]
We introduce a general open-ended RLHF framework that casts alignment as an asymmetric game between two players. This framework of Evolving Alignment via Asymmetric Self-Play (eva) results in a simple and efficient approach that can utilize any existing RLHF algorithm for scalable alignment.
arXiv Detail & Related papers (2024-10-31T08:15:32Z)
General Preference Modeling with Preference Representations for Aligning Language Models [51.14207112118503]
We introduce preference representation learning, an approach that embeds responses into a latent space to capture intricate preference structures efficiently. We also propose preference score-based General Preference Optimization (GPO), which generalizes reward-based reinforcement learning from human feedback. Our method may enhance the alignment of foundation models with nuanced human values.
arXiv Detail & Related papers (2024-10-03T04:22:55Z)
Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning [55.65738319966385]
We propose a novel online algorithm, iterative Nash policy optimization (INPO) Unlike previous methods, INPO bypasses the need for estimating the expected win rate for individual responses. With an LLaMA-3-8B-based SFT model, INPO achieves a 42.6% length-controlled win rate on AlpacaEval 2.0 and a 37.8% win rate on Arena-Hard.
arXiv Detail & Related papers (2024-06-30T08:00:34Z)
Iterative Length-Regularized Direct Preference Optimization: A Case Study on Improving 7B Language Models to GPT-4 Level [50.897438358317686]
We show that iLR-DPO can enhance a 7B model to perform on par with GPT-4 without increasing verbosity. Specifically, our 7B model achieves a $50.5%$ length-controlled win rate against $texttGPT-4 Preview$ on AlpacaEval 2.0.
arXiv Detail & Related papers (2024-06-17T17:55:38Z)
Bootstrapping Language Models with DPO Implicit Rewards [45.68366127605774]
Direct preference optimization (DPO) has greatly simplified the process from past work in reinforcement learning from human feedback. In this work, we make a novel observation that this implicit reward model can by itself be used in a bootstrapping fashion to further align the LLM. Our approach, named self-alignment with DPO ImpliCit rEwards (DICE), shows great improvements in alignment and achieves superior performance.
arXiv Detail & Related papers (2024-06-14T06:57:18Z)
Self-Play Preference Optimization for Language Model Alignment [75.83359213697854]
Recent advancements suggest that directly working with preference probabilities can yield a more accurate reflection of human preferences. We propose a self-play-based method for language model alignment, which treats the problem as a constant-sum two-player game. Our approach, dubbed Self-Play Preference Optimization (SPPO), utilizes iterative policy updates to provably approximate the Nash equilibrium.
arXiv Detail & Related papers (2024-05-01T17:59:20Z)
Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences [21.5605000515622]
This paper studies post-training large language models (LLMs) using preference feedback from an oracle to help a model iteratively improve over itself. We introduce Direct Nash Optimization (DNO), a provable and efficient algorithm that marries the simplicity and stability of contrastive learning with theoretical generality from optimizing general preferences. In our experiments, a resulting 7B parameter Orca-2.5 model achieves the state-of-the-art win-rate against GPT-4-Turbo of 33% on AlpacaE 2.0 (even after controlling for response length), an absolute gain of 26% (7% to 33%) over the initializing model
arXiv Detail & Related papers (2024-04-04T17:56:41Z)

This list is automatically generated from the titles and abstracts of the papers in this site.