How Sampling Shapes LLM Alignment: From One-Shot Optima to Iterative Dynamics
- URL: http://arxiv.org/abs/2602.12180v1
- Date: Thu, 12 Feb 2026 17:11:08 GMT
- Title: How Sampling Shapes LLM Alignment: From One-Shot Optima to Iterative Dynamics
- Authors: Yurong Chen, Yu He, Michael I. Jordan, Fan Yao
- Abstract summary: We show that proper instance-dependent sampling can yield stronger ranking guarantees, while skewed on-policy sampling can induce excessive concentration under structured preferences. We then analyze iterative alignment dynamics in which the learned policy feeds back into future sampling and reference policies. Our theoretical insights extend to Direct Preference Optimization, indicating that the phenomena we capture are common to a broader class of preference-alignment methods.
- Score: 65.67654005892469
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Standard methods for aligning large language models with human preferences learn from pairwise comparisons among sampled candidate responses and regularize toward a reference policy. Despite their effectiveness, the effects of sampling and reference choices are poorly understood theoretically. We investigate these effects through Identity Preference Optimization, a widely used preference alignment framework, and show that proper instance-dependent sampling can yield stronger ranking guarantees, while skewed on-policy sampling can induce excessive concentration under structured preferences. We then analyze iterative alignment dynamics in which the learned policy feeds back into future sampling and reference policies, reflecting a common practice of model-generated preference data. We prove that these dynamics can exhibit persistent oscillations or entropy collapse for certain parameter choices, and characterize regimes that guarantee stability. Our theoretical insights extend to Direct Preference Optimization, indicating that the phenomena we capture are common to a broader class of preference-alignment methods. Experiments on real-world preference data validate our findings.
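For context, the abstract references two one-shot objectives without stating them. The forms below are the standard ones from the literature (IPO: Azar et al., 2023; DPO: Rafailov et al., 2023), reproduced in conventional notation; the symbols π_θ (learned policy), π_ref (reference policy), D (preference data), τ and β (regularization strengths) follow those papers and are not necessarily this paper's notation.

```latex
% One-shot preference-optimization objectives as commonly stated in the
% literature (IPO: Azar et al., 2023; DPO: Rafailov et al., 2023).
\mathcal{L}_{\mathrm{IPO}}(\theta)
  = \mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
    \left[\left(
      \log\frac{\pi_\theta(y_w\mid x)\,\pi_{\mathrm{ref}}(y_l\mid x)}
               {\pi_\theta(y_l\mid x)\,\pi_{\mathrm{ref}}(y_w\mid x)}
      \;-\; \frac{1}{2\tau}
    \right)^{\!2}\right]
\qquad
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
    \left[\log\sigma\!\left(
      \beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}
      \;-\;\beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}
    \right)\right]
```

The iterative dynamics studied in the abstract arise when the learned policy is fed back as the next round's sampling and reference policy. The toy sketch below illustrates that feedback loop under stated assumptions: a single prompt with K candidate responses, a softmax policy over logits, on-policy pair sampling labeled by a fixed ground-truth preference matrix, and plain SGD on the IPO squared loss. All names and settings (ipo_round, TRUE_PREF, the chosen τ, learning rate, and round counts) are illustrative choices, not the paper's experimental protocol.

```python
import numpy as np

# Toy illustration of the iterative IPO feedback loop: everything here
# (K, TRUE_PREF, ipo_round, step sizes) is assumed for illustration only;
# it is not the paper's setup.
rng = np.random.default_rng(0)
K = 5                                           # candidate responses for a single prompt

# Ground-truth pairwise preferences: TRUE_PREF[i, j] = P(i preferred over j).
A = rng.uniform(0.2, 0.8, size=(K, K))
TRUE_PREF = np.triu(A, 1) + np.tril(1.0 - A.T, -1)
np.fill_diagonal(TRUE_PREF, 0.5)

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def ipo_round(logits, ref_logits, tau=0.1, lr=0.2, n_pairs=200):
    """One SGD step on the IPO squared loss, with pairs sampled on-policy from
    the current learner and labeled according to TRUE_PREF."""
    pi = softmax(logits)
    log_pi = np.log(pi)
    log_ref = np.log(softmax(ref_logits))
    grad = np.zeros(K)
    for _ in range(n_pairs):
        i, j = rng.choice(K, size=2, replace=False, p=pi)        # on-policy pair
        w, l = (i, j) if rng.random() < TRUE_PREF[i, j] else (j, i)
        h = (log_pi[w] - log_ref[w]) - (log_pi[l] - log_ref[l])  # implicit-reward margin
        g = 2.0 * (h - 1.0 / (2.0 * tau))                        # d(loss)/dh
        grad[w] += g                                             # dh/dlogits = e_w - e_l
        grad[l] -= g
    return logits - lr * grad / n_pairs

logits = np.zeros(K)                    # start from the uniform policy
ref_logits = logits.copy()
for t in range(20):
    for _ in range(25):                 # inner optimization against a fixed reference
        logits = ipo_round(logits, ref_logits)
    ref_logits = logits.copy()          # learned policy becomes next round's reference
    pi = softmax(logits)
    print(f"round {t:2d}  entropy {-(pi * np.log(pi)).sum():.3f}  pi {np.round(pi, 3)}")
```

Varying τ, the learning rate, or the number of inner steps in this sketch changes how quickly the policy concentrates; any connection to the oscillation and entropy-collapse regimes proved in the paper should be taken from the paper itself rather than from this toy.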
Related papers
- Not All Preferences Are Created Equal: Stability-Aware and Gradient-Efficient Alignment for Reasoning Models [52.48582333951919]
We propose a dynamic framework designed to enhance alignment reliability by maximizing the Signal-to-Noise Ratio of policy updates. SAGE (Stability-Aware Gradient Efficiency) integrates a coarse-grained curriculum mechanism that refreshes candidate pools based on model competence. Experiments on multiple mathematical reasoning benchmarks demonstrate that SAGE significantly accelerates convergence and outperforms static baselines.
arXiv Detail & Related papers (2026-02-01T12:56:10Z) - HEAL: A Hypothesis-Based Preference-Aware Analysis Framework [32.45006553398745]
This paper presents a Hypothesis-based Preference-aware Analysis Framework (HEAL). It formulates preference alignment as a re-ranking process within hypothesis spaces. The framework incorporates two complementary metrics: ranking accuracy for evaluating ordinal consistency and preference strength correlation for assessing continuous alignment (a minimal sketch of such metrics appears after this list).
arXiv Detail & Related papers (2025-08-27T14:30:08Z) - Diversity First, Quality Later: A Two-Stage Assumption for Language Model Alignment [16.059172179404467]
The alignment of language models (LMs) with human preferences is critical for building reliable AI systems. Recently, Direct Preference Optimization (DPO) was proposed as an LM alignment method that directly optimizes the policy from static preference data. We show that on-policy data is not always optimal, with systematic effectiveness differences emerging between static and on-policy preference candidates.
arXiv Detail & Related papers (2025-08-14T11:05:18Z) - Beyond Single: A Data Selection Principle for LLM Alignment via Fine-Grained Preference Signals [46.58760908162995]
We propose a novel, theoretically grounded data selection principle for large language models. We prove the optimality of this strategy by analyzing the loss bounds of the Direct Preference Optimization objective. Our strategy achieves over 10% relative improvement against both the standard holistic preference and a stronger oracle.
arXiv Detail & Related papers (2025-08-11T05:43:02Z) - What Makes LLMs Effective Sequential Recommenders? A Study on Preference Intensity and Temporal Context [56.590259941275434]
RecPO is a preference optimization framework for sequential recommendation. It exploits adaptive reward margins based on inferred preference hierarchies and temporal signals. It mirrors key characteristics of human decision-making: favoring timely satisfaction, maintaining coherent preferences, and exercising discernment under shifting contexts.
arXiv Detail & Related papers (2025-06-02T21:09:29Z) - On Symmetric Losses for Robust Policy Optimization with Noisy Preferences [55.8615920580824]
This work focuses on reward modeling, a core component in reinforcement learning from human feedback. We propose a principled framework for robust policy optimization under noisy preferences. We prove that symmetric losses enable successful policy optimization even under noisy labels.
arXiv Detail & Related papers (2025-05-30T15:30:43Z) - Learning from negative feedback, or positive feedback or both [21.95277469346728]
We introduce a novel approach that decouples learning from positive and negative feedback. A key contribution is demonstrating stable learning from negative feedback alone.
arXiv Detail & Related papers (2024-10-05T14:04:03Z) - Bridging and Modeling Correlations in Pairwise Data for Direct Preference Optimization [75.1240295759264]
We propose an effective framework for Bridging and Modeling Correlations in pairwise data, named BMC. We increase the consistency and informativeness of the pairwise preference signals through targeted modifications. We identify that DPO alone is insufficient to model these correlations and capture nuanced variations.
arXiv Detail & Related papers (2024-08-14T11:29:47Z) - Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences. To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model. Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z) - Statistical Rejection Sampling Improves Preference Optimization [42.57245965632205]
We introduce a novel approach to source preference data from the target optimal policy using rejection sampling.
We also propose a unified framework that enhances the loss functions used in both Sequence Likelihood Calibration (SLiC) and Direct Preference Optimization (DPO) from a preference modeling standpoint.
arXiv Detail & Related papers (2023-09-13T01:07:25Z)