A Statistical Framework for Alignment with Biased AI Feedback
- URL: http://arxiv.org/abs/2602.08259v1
- Date: Mon, 09 Feb 2026 04:37:10 GMT
- Title: A Statistical Framework for Alignment with Biased AI Feedback
- Authors: Xintao Xia, Zhiqiu Xia, Linjun Zhang, Zhanrui Cai
- Abstract summary: AI labels can be systematically biased compared to high-quality human feedback datasets. We develop two debiased alignment methods that accommodate heterogeneous prompt-response distributions and external human feedback sources. Empirical studies on sentiment generation, summarization, and single-turn dialogue demonstrate that the proposed methods substantially improve alignment efficiency and recover performance close to that of an oracle trained on fully human-labeled data.
- Score: 20.653424560119554
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern alignment pipelines are increasingly replacing expensive human preference labels with evaluations from large language models (LLM-as-Judge). However, AI labels can be systematically biased compared to high-quality human feedback datasets. In this paper, we develop two debiased alignment methods within a general framework that accommodates heterogeneous prompt-response distributions and external human feedback sources. Debiased Direct Preference Optimization (DDPO) augments standard DPO with a residual-based correction and density-ratio reweighting to mitigate systematic bias, while retaining DPO's computational efficiency. Debiased Identity Preference Optimization (DIPO) directly estimates human preference probabilities without imposing a parametric reward model. We provide theoretical guarantees for both methods: DDPO offers a practical and computationally efficient solution for large-scale alignment, whereas DIPO serves as a robust, statistically optimal alternative that attains the semiparametric efficiency bound. Empirical studies on sentiment generation, summarization, and single-turn dialogue demonstrate that the proposed methods substantially improve alignment efficiency and recover performance close to that of an oracle trained on fully human-labeled data.
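The abstract describes DDPO only at a high level. The sketch below is a hypothetical illustration, not the paper's objective: it shows one plausible way a density-ratio-reweighted DPO loss on AI-labeled pairs could be combined with a residual correction estimated on a small set of pairs carrying both AI and human labels. The function names, inputs such as `density_ratio_ai`, and the prediction-powered-style form of the correction are all assumptions made for illustration.

```python
# Hypothetical sketch only: the exact DDPO objective is defined in the paper.
# This illustrates combining (i) a density-ratio-reweighted DPO loss on a large
# AI-labeled set with (ii) a residual correction estimated on a small subset
# that has both AI and human labels for the same preference pairs.
import torch
import torch.nn.functional as F


def dpo_logistic_loss(margin, label):
    # margin = beta * (implicit reward of response A - implicit reward of response B)
    # label  = 1.0 if the annotator prefers A, 0.0 if the annotator prefers B
    return F.binary_cross_entropy_with_logits(margin, label, reduction="none")


def debiased_dpo_loss(margin_ai, label_ai, density_ratio_ai,
                      margin_dual, label_ai_dual, label_human_dual):
    """Assumed prediction-powered-style debiasing, not the paper's formula.

    margin_ai, label_ai, density_ratio_ai: large AI-labeled set, with density
        ratios w(x, y) correcting for prompt-response distribution shift.
    margin_dual, label_ai_dual, label_human_dual: small doubly-labeled set
        carrying both the AI label and the human label for each pair.
    """
    # Reweighted loss on the cheap AI labels.
    ai_term = (density_ratio_ai * dpo_logistic_loss(margin_ai, label_ai)).mean()
    # Residual correction: the average gap between the human-label loss and the
    # AI-label loss on the doubly-labeled pairs offsets systematic label bias.
    residual = (dpo_logistic_loss(margin_dual, label_human_dual)
                - dpo_logistic_loss(margin_dual, label_ai_dual)).mean()
    return ai_term + residual
```

Under this assumed form, the residual term uses the doubly-labeled pairs to cancel the systematic AI-versus-human label bias, while the density ratios reweight the AI-labeled pairs toward the target prompt-response distribution, mirroring the two ingredients named in the abstract.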
Related papers
- Provably avoiding over-optimization in Direct Preference Optimization without knowing the data distribution [47.604070468150844]
We introduce PEPO, a single-step Direct Preference Optimization-like algorithm to mitigate the well-known over-optimization issue in preference learning. PEPO achieves pessimism via an ensemble of preference-optimized policies trained on disjoint data subsets.
arXiv Detail & Related papers (2026-02-05T22:31:07Z)
- DeDPO: Debiased Direct Preference Optimization for Diffusion Models [13.068043495097378]
We propose a semi-supervised framework augmenting limited human data with a large corpus of unlabeled pairs annotated via cost-effective synthetic AI feedback. Our paper introduces Debiased DPO (DeDPO), which uniquely integrates a debiased estimation technique from causal inference into the DPO objective. Experiments demonstrate that DeDPO is robust to variations in synthetic labeling methods, achieving performance that matches and occasionally exceeds the theoretical upper bound of models trained on fully human-labeled data.
arXiv Detail & Related papers (2026-02-05T21:11:00Z)
- Latent Collective Preference Optimization: A General Framework for Robust LLM Alignment [7.1259212876994695]
We introduce Latent Collective Preference Optimization (LCPO) to learn the latent collective consensus from noisy data. Our experiments demonstrate LCPO's effectiveness as a general framework, consistently enhancing four state-of-the-art alignment algorithms. When applied to Mistral and Llama 3 models, LCPO-enhanced methods achieve substantial win rate gains on AlpacaEval 2 and Arena-Hard, with improvements of up to 7.0% on both benchmarks.
arXiv Detail & Related papers (2025-09-29T01:17:49Z)
- Stable Preference Optimization for LLMs: A Bilevel Approach Beyond Direct Preference Optimization [2.384797824772941]
We present a comprehensive analysis of DPO's dynamics from a probability evolution perspective. We propose a theoretically grounded bilevel optimization framework that tightly integrates supervised fine-tuning with an enhanced DPO objective, termed stable preference optimization.
arXiv Detail & Related papers (2025-07-10T12:57:39Z)
- Mitigating Reward Over-optimization in Direct Alignment Algorithms with Importance Sampling [13.917799959981185]
Direct Alignment Algorithms (DAAs) have emerged as alternatives to the standard Reinforcement Learning from Human Feedback (RLHF). These methods are more susceptible to over-optimization, in which the model drifts away from the reference policy, leading to degraded performance as training progresses. This paper proposes a novel importance-sampling approach to mitigate the over-optimization problem of offline DAAs.
arXiv Detail & Related papers (2025-06-10T10:45:26Z)
- Leveraging Robust Optimization for LLM Alignment under Distribution Shifts [51.74394601039711]
Preference alignment methods are increasingly critical for steering large language models to generate outputs consistent with human values. We propose a novel distribution-aware optimization framework that improves preference alignment despite such shifts.
arXiv Detail & Related papers (2025-04-08T09:14:38Z)
- A Survey of Direct Preference Optimization [103.59317151002693]
Large Language Models (LLMs) have demonstrated unprecedented generative capabilities. Their alignment with human values remains critical for ensuring helpful and harmless deployments. Direct Preference Optimization (DPO) has recently gained prominence as a streamlined alternative.
arXiv Detail & Related papers (2025-03-12T08:45:15Z)
- Uncertainty-Penalized Direct Preference Optimization [52.387088396044206]
We develop a pessimistic framework for DPO by introducing preference uncertainty penalization schemes.
The penalization serves as a correction to the loss which attenuates the loss gradient for uncertain samples.
We show improved overall performance compared to vanilla DPO, as well as better completions on prompts whose chosen/rejected responses carry high uncertainty.
arXiv Detail & Related papers (2024-10-26T14:24:37Z)
- Bridging and Modeling Correlations in Pairwise Data for Direct Preference Optimization [75.1240295759264]
We propose an effective framework for Bridging and Modeling Correlations in pairwise data, named BMC. We increase the consistency and informativeness of the pairwise preference signals through targeted modifications. We identify that DPO alone is insufficient to model these correlations and capture nuanced variations.
arXiv Detail & Related papers (2024-08-14T11:29:47Z)
- Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences. To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model. Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z)
- Towards Efficient Exact Optimization of Language Model Alignment [93.39181634597877]
Direct preference optimization (DPO) was proposed to directly optimize the policy from preference data.
We show that DPO, derived from the optimal solution of the problem, leads in practice to a compromised mean-seeking approximation of that optimal solution.
We propose efficient exact optimization (EXO) of the alignment objective.
arXiv Detail & Related papers (2024-02-01T18:51:54Z)