Related papers: Earlier Tokens Contribute More: Learning Direct Preference Optimization From Temporal Decay Perspective

Earlier Tokens Contribute More: Learning Direct Preference Optimization From Temporal Decay Perspective

URL: http://arxiv.org/abs/2502.14340v1
Date: Thu, 20 Feb 2025 07:53:11 GMT
Title: Earlier Tokens Contribute More: Learning Direct Preference Optimization From Temporal Decay Perspective
Authors: Ruichen Shao, Bei Li, Gangao Liu, Yang Chen, Xiang Zhou, Jingang Wang, Xunliang Cai, Peng Li,
Abstract summary: We propose an enhanced preference optimization method that incorporates a temporal decay factor controlled by a gamma parameter.<n>Our approach mitigates overfitting to less pertinent data and remains responsive to evolving human preferences.
Score: 22.248134630764497
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Direct Preference Optimization (DPO) has gained attention as an efficient alternative to reinforcement learning from human feedback (RLHF) for aligning large language models (LLMs) with human preferences. Despite its advantages, DPO suffers from a length bias, generating responses longer than those from the reference model. Existing solutions like SimPO and SamPO address this issue but uniformly treat the contribution of rewards across sequences, overlooking temporal dynamics. To this end, we propose an enhanced preference optimization method that incorporates a temporal decay factor controlled by a gamma parameter. This dynamic weighting mechanism adjusts the influence of each reward based on its position in the sequence, prioritizing earlier tokens that are more critical for alignment. By adaptively focusing on more relevant feedback, our approach mitigates overfitting to less pertinent data and remains responsive to evolving human preferences. Experimental results on several benchmarks show that our approach consistently outperforms vanilla DPO by 5.9-8.8 points on AlpacaEval 2 and 3.3-9.7 points on Arena-Hard across different model architectures and sizes. Furthermore, additional experiments on mathematical and reasoning benchmarks (MMLU, GSM8K, and MATH) confirm that our method enhances performance without compromising general capabilities. Our codebase would be available at \url{https://github.com/LotuSrc/D2PO}.

Related papers

What Makes LLMs Effective Sequential Recommenders? A Study on Preference Intensity and Temporal Context [56.590259941275434]
RecPO is a preference optimization framework for sequential recommendation.<n>It exploits adaptive reward margins based on inferred preference hierarchies and temporal signals.<n>It mirrors key characteristics of human decision-making: favoring timely satisfaction, maintaining coherent preferences, and exercising discernment under shifting contexts.
arXiv Detail & Related papers (2025-06-02T21:09:29Z)
ASPO: Adaptive Sentence-Level Preference Optimization for Fine-Grained Multimodal Reasoning [14.034412856423529]
Direct Preference Optimization (DPO) has gained attention for its simplicity and computational efficiency in aligning large language models (LLMs)<n>Recent advancements have extended DPO to multimodal scenarios, achieving strong performance.<n>Traditional DPO relies on binary preference optimization, rewarding or penalizing entire responses without considering fine-grained segment correctness.<n>We propose Adaptive Sentence-level Preference Optimization (ASPO), which evaluates individual sentences for more precise preference optimization.
arXiv Detail & Related papers (2025-05-25T11:33:08Z)
Optimal Transport-Based Token Weighting scheme for Enhanced Preference Optimization [17.801062522027266]
Direct Preference Optimization (DPO) has emerged as a promising framework for aligning Large Language Models with human preferences.<n>Existing methods assign equal importance to all tokens in the response, while humans focus on more meaningful parts.<n>We propose textbfOptimal textbfTransport-based token weighting scheme for enhancing direct textbfPreference textbfOptimization (OTPO)
arXiv Detail & Related papers (2025-05-24T14:44:15Z)
Active Learning for Direct Preference Optimization [59.84525302418018]
Direct preference optimization (DPO) is a form of reinforcement learning from human feedback. We propose an active learning framework for DPO, which can be applied to collect human feedback online or to choose the most informative subset of already collected feedback offline.
arXiv Detail & Related papers (2025-03-03T00:36:31Z)
AlphaPO - Reward shape matters for LLM alignment [8.688476316386176]
We introduce textbfAlphaPO, a new DAA that helps change the shape of the reward function beyond the standard log reward.<n>Compared to SimPO, one of the best performing DAAs, AlphaPO leads to about 7% to 10% relative improvement in alignment performance.
arXiv Detail & Related papers (2025-01-07T15:46:42Z)
Accelerated Preference Optimization for Large Language Model Alignment [60.22606527763201]
Reinforcement Learning from Human Feedback (RLHF) has emerged as a pivotal tool for aligning large language models (LLMs) with human preferences. Direct Preference Optimization (DPO) formulates RLHF as a policy optimization problem without explicitly estimating the reward function. We propose a general Accelerated Preference Optimization (APO) framework, which unifies many existing preference optimization algorithms.
arXiv Detail & Related papers (2024-10-08T18:51:01Z)
Ordinal Preference Optimization: Aligning Human Preferences via NDCG [28.745322441961438]
We develop an end-to-end preference optimization algorithm by approxing NDCG with a differentiable surrogate loss. OPO outperforms existing pairwise and listwise approaches on evaluation sets and general benchmarks like AlpacaEval.
arXiv Detail & Related papers (2024-10-06T03:49:28Z)
Self-supervised Preference Optimization: Enhance Your Language Model with Preference Degree Awareness [27.43137305486112]
We propose a novel Self-supervised Preference Optimization (SPO) framework, which constructs a self-supervised preference degree loss combined with the alignment loss. The results demonstrate that SPO can be seamlessly integrated with existing preference optimization methods to achieve state-of-the-art performance.
arXiv Detail & Related papers (2024-09-26T12:37:26Z)
Bridging and Modeling Correlations in Pairwise Data for Direct Preference Optimization [75.1240295759264]
We propose an effective framework for Bridging and Modeling Correlations in pairwise data, named BMC.<n>We increase the consistency and informativeness of the pairwise preference signals through targeted modifications.<n>We identify that DPO alone is insufficient to model these correlations and capture nuanced variations.
arXiv Detail & Related papers (2024-08-14T11:29:47Z)
Eliminating Biased Length Reliance of Direct Preference Optimization via Down-Sampled KL Divergence [31.03305638930844]
Direct Preference Optimization (DPO) has emerged as a prominent algorithm for the direct and robust alignment of Large Language Models with human preferences.<n>Despite its promising efficacy, DPO faces a notable drawback: "verbosity"<n>We propose that the issue also stems from an inherent algorithmic length reliance in DPO.
arXiv Detail & Related papers (2024-06-16T14:24:30Z)
Adaptive Preference Scaling for Reinforcement Learning with Human Feedback [103.36048042664768]
Reinforcement learning from human feedback (RLHF) is a prevalent approach to align AI systems with human values. We propose a novel adaptive preference loss, underpinned by distributionally robust optimization (DRO) Our method is versatile and can be readily adapted to various preference optimization frameworks.
arXiv Detail & Related papers (2024-06-04T20:33:22Z)
Self-Play Preference Optimization for Language Model Alignment [75.83359213697854]
Recent advancements suggest that directly working with preference probabilities can yield a more accurate reflection of human preferences. We propose a self-play-based method for language model alignment, which treats the problem as a constant-sum two-player game. Our approach, dubbed Self-Play Preference Optimization (SPPO), utilizes iterative policy updates to provably approximate the Nash equilibrium.
arXiv Detail & Related papers (2024-05-01T17:59:20Z)
Disentangling Length from Quality in Direct Preference Optimization [93.74831404396174]
Reinforcement Learning from Human Feedback (RLHF) has been a crucial component in the recent success of Large Language Models. RLHF is know to exploit biases in human preferences, such as verbosity. We develop a principled but simple regularization strategy that prevents length exploitation, while still maintaining improvements in model quality.
arXiv Detail & Related papers (2024-03-28T06:03:47Z)
Relative Preference Optimization: Enhancing LLM Alignment through Contrasting Responses across Identical and Diverse Prompts [95.09994361995389]
Relative Preference Optimization (RPO) is designed to discern between more and less preferred responses derived from both identical and related prompts. RPO has demonstrated a superior ability to align large language models with user preferences and to improve their adaptability during the training process.
arXiv Detail & Related papers (2024-02-12T22:47:57Z)

This list is automatically generated from the titles and abstracts of the papers in this site.