Related papers: Understanding the Impact of Sampling Quality in Direct Preference Optimization

Related papers

How Sampling Shapes LLM Alignment: From One-Shot Optima to Iterative Dynamics [65.67654005892469]
We show that proper instance-dependent sampling can yield stronger ranking guarantees, while skewed on-policy sampling can induce excessive concentration under structured preferences.<n>We then analyze iterative alignment dynamics in which the learned policy feeds back into future sampling and reference policies.<n>Our theoretical insights extend to Direct Preference Optimization, indicating the phenomena we captured are common to a broader class of preference-alignment methods.
arXiv Detail & Related papers (2026-02-12T17:11:08Z)
AEGPO: Adaptive Entropy-Guided Policy Optimization for Diffusion Models [54.56296715999545]
Reinforcement learning from human feedback shows promise for aligning diffusion and flow models.<n>Policy optimization methods such as GRPO suffer from inefficient and static sampling strategies.<n>We propose Adaptive Entropy-Guided Policy Optimization (AEGPO), a novel dual-signal, dual-level adaptive optimization strategy.
arXiv Detail & Related papers (2026-02-06T16:09:50Z)
BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping [69.74252624161652]
We propose BAlanced Policy Optimization with Adaptive Clipping (BAPO)<n>BAPO dynamically adjusts clipping bounds to adaptively re-balance positive and negative contributions, preserve entropy, and stabilize RL optimization.<n>On AIME 2024 and AIME 2025 benchmarks, our 7B BAPO model surpasses open-source counterparts such as SkyWork-OR1-7B.
arXiv Detail & Related papers (2025-10-21T12:55:04Z)
Stabilizing Policy Gradients for Sample-Efficient Reinforcement Learning in LLM Reasoning [77.92320830700797]
Reinforcement Learning has played a central role in enabling reasoning capabilities of Large Language Models.<n>We propose a tractable computational framework that tracks and leverages curvature information during policy updates.<n>The algorithm, Curvature-Aware Policy Optimization (CAPO), identifies samples that contribute to unstable updates and masks them out.
arXiv Detail & Related papers (2025-10-01T12:29:32Z)
What Matters in Data for DPO? [6.208229499655634]
Direct Preference Optimization (DPO) has emerged as a simple and effective approach for aligning large language models with human preferences.<n>We provide a systematic study of how preference data distribution influences DPO, from both theoretical and empirical perspectives.<n>We show that the quality of chosen responses plays a dominant role in optimizing the DPO objective, while the quality of rejected responses may have relatively limited impact.
arXiv Detail & Related papers (2025-08-23T16:00:30Z)
Improving DAPO from a Mixed-Policy Perspective [0.0]
This paper introduces two novel modifications to the Dynamic sAmpling Policy Optimization (DAPO) algorithm.<n>We first propose a method that incorporates a pre-trained, stable guiding policy to provide off-policy experience.<n>We then extend this idea to re-utilize zero-reward samples, which are often discarded by dynamic sampling strategies.
arXiv Detail & Related papers (2025-07-17T09:12:09Z)
Stable Preference Optimization for LLMs: A Bilevel Approach Beyond Direct Preference Optimization [2.384797824772941]
We present a comprehensive analysis of DPO's dynamics from a probability evolution perspective.<n>We propose a theoretically grounded bilevel optimization framework that tightly integrate supervised fine-tuning with an enhanced DPO objective a.k.a. stable preference optimization.
arXiv Detail & Related papers (2025-07-10T12:57:39Z)
Differential Information Distribution: A Bayesian Perspective on Direct Preference Optimization [35.335072390336855]
We study the goal of preference optimization as learning the differential information required to update a reference policy into a target policy.<n>First, we find that DPO's log-ratio reward is uniquely justified when preferences encode the Differential Information needed to update a reference policy into the target policy.<n>Second, we discuss how commonly observed training dynamics in DPO, including changes in log-likelihood and policy exploration, stem from a power-law DID relationship.
arXiv Detail & Related papers (2025-05-29T17:59:50Z)
Understanding the Performance Gap in Preference Learning: A Dichotomy of RLHF and DPO [51.22869332661607]
We decompose the performance gap between reinforcement learning from human feedback and direct preference optimization under a representation gap.<n>We show that RLHF, DPO, or online DPO can outperform one another depending on the type of model mis-specifications.
arXiv Detail & Related papers (2025-05-26T09:54:02Z)
SGDPO: Self-Guided Direct Preference Optimization for Language Model Alignment [46.55132297735257]
We propose a novel Self-Guided Direct Preference Optimization algorithm, i.e., SGDPO, which incorporates a pilot term to steer the gradient flow during the optimization process.<n>We provide a detailed theoretical analysis of our proposed method and elucidate its operational mechanism.
arXiv Detail & Related papers (2025-05-18T14:19:23Z)
A Survey of Direct Preference Optimization [103.59317151002693]
Large Language Models (LLMs) have demonstrated unprecedented generative capabilities.<n>Their alignment with human values remains critical for ensuring helpful and harmless deployments.<n>Direct Preference Optimization (DPO) has recently gained prominence as a streamlined alternative.
arXiv Detail & Related papers (2025-03-12T08:45:15Z)
Gradient Imbalance in Direct Preference Optimization [26.964127989679596]
We propose Balanced-DPO, a simple yet effective modification to the DPO objective that introduces a computationally efficient gradient reweighting mechanism.<n>Our experiments demonstrate the effectiveness of Balanced-DPO, validating the theoretical findings and confirming that addressing gradient imbalance is key to improving DPO's performance.
arXiv Detail & Related papers (2025-02-28T08:47:03Z)
Refining Alignment Framework for Diffusion Models with Intermediate-Step Preference Ranking [50.325021634589596]
We propose a Tailored Optimization Preference (TailorPO) framework for aligning diffusion models with human preference.<n>Our approach directly ranks intermediate noisy samples based on their step-wise reward, and effectively resolves the gradient direction issues.<n> Experimental results demonstrate that our method significantly improves the model's ability to generate aesthetically pleasing and human-preferred images.
arXiv Detail & Related papers (2025-02-01T16:08:43Z)
Bridging and Modeling Correlations in Pairwise Data for Direct Preference Optimization [75.1240295759264]
We propose an effective framework for Bridging and Modeling Correlations in pairwise data, named BMC.<n>We increase the consistency and informativeness of the pairwise preference signals through targeted modifications.<n>We identify that DPO alone is insufficient to model these correlations and capture nuanced variations.
arXiv Detail & Related papers (2024-08-14T11:29:47Z)
3D-Properties: Identifying Challenges in DPO and Charting a Path Forward [17.27880657597116]
We revisit DPO, analyzing its theoretical foundations and empirical performance.<n>We identify three key properties, termed 3D properties, that emerge from DPO's learning process.<n>We propose simple regularization techniques that improve training stability and performance.
arXiv Detail & Related papers (2024-06-11T14:59:24Z)
Diffusion-based Reinforcement Learning via Q-weighted Variational Policy Optimization [55.97310586039358]
Diffusion models have garnered widespread attention in Reinforcement Learning (RL) for their powerful expressiveness and multimodality.<n>We propose a novel model-free diffusion-based online RL algorithm, Q-weighted Variational Policy Optimization (QVPO)<n>Specifically, we introduce the Q-weighted variational loss, which can be proved to be a tight lower bound of the policy objective in online RL under certain conditions.<n>We also develop an efficient behavior policy to enhance sample efficiency by reducing the variance of the diffusion policy during online interactions.
arXiv Detail & Related papers (2024-05-25T10:45:46Z)
Overcoming Reward Overoptimization via Adversarial Policy Optimization with Lightweight Uncertainty Estimation [46.61909578101735]
Adversarial Policy Optimization (AdvPO) is a novel solution to the pervasive issue of reward over-optimization in Reinforcement Learning from Human Feedback. In this paper, we introduce a lightweight way to quantify uncertainties in rewards, relying solely on the last layer embeddings of the reward model.
arXiv Detail & Related papers (2024-03-08T09:20:12Z)

This list is automatically generated from the titles and abstracts of the papers in this site.