Small-Margin Preferences Still Matter-If You Train Them Right
- URL: http://arxiv.org/abs/2602.00954v1
- Date: Sun, 01 Feb 2026 01:15:55 GMT
- Title: Small-Margin Preferences Still Matter-If You Train Them Right
- Authors: Jinlong Pang, Zhaowei Zhu, Na Di, Yichi Zhang, Yaxuan Wang, Chen Qian, Yang Liu
- Abstract summary: We show that pair difficulty interacts strongly with the optimization objective. We propose MixDPO, a simple yet effective difficulty-aware training strategy, and show that it consistently improves alignment over DPO and a range of widely-used variants.
- Score: 24.058773077803895
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Preference optimization methods such as DPO align large language models (LLMs) using paired comparisons, but their effectiveness can be highly sensitive to the quality and difficulty of preference pairs. A common heuristic treats small-margin (ambiguous) pairs as noisy and filters them out. In this paper, we revisit this assumption and show that pair difficulty interacts strongly with the optimization objective: when trained with preference-based losses, difficult pairs can destabilize training and harm alignment, yet these same pairs still contain useful supervision signals when optimized with supervised fine-tuning (SFT). Motivated by this observation, we propose MixDPO, a simple yet effective difficulty-aware training strategy that (i) orders preference data from easy to hard (a curriculum over margin-defined difficulty), and (ii) routes difficult pairs to an SFT objective while applying a preference loss to easy pairs. This hybrid design provides a practical mechanism to leverage ambiguous pairs without incurring the optimization failures often associated with preference losses on low-margin data. Across three LLM-judge benchmarks, MixDPO consistently improves alignment over DPO and a range of widely-used variants, with particularly strong gains on AlpacaEval 2 length-controlled (LC) win rate.
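A minimal sketch of the two MixDPO ingredients described in the abstract, assuming each pair carries a precomputed difficulty margin (e.g., a reward-model score gap); the threshold `tau`, the batch layout, and all function names are illustrative, not the authors' code:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_c, logp_r, ref_c, ref_r, beta=0.1):
    """Standard DPO loss from sequence log-probs under policy and reference."""
    return -F.logsigmoid(beta * ((logp_c - ref_c) - (logp_r - ref_r)))

def mixdpo_batch_loss(batch, tau=0.5, beta=0.1):
    """Difficulty-aware routing: easy (high-margin) pairs get the preference
    loss, hard (low-margin) pairs get SFT, i.e. NLL on the chosen response."""
    easy = batch["margin"] >= tau                       # per-pair routing mask
    pref = dpo_loss(batch["logp_chosen"], batch["logp_rejected"],
                    batch["ref_chosen"], batch["ref_rejected"], beta)
    sft = -batch["logp_chosen"]                         # SFT branch
    return torch.where(easy, pref, sft).mean()

def curriculum_order(pairs):
    """Curriculum: present pairs from easy (large margin) to hard (small)."""
    return sorted(pairs, key=lambda p: -p["margin"])
```

The routing mirrors the abstract's design: ambiguous pairs still supply supervision through the SFT branch instead of being filtered out.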
Related papers
- DA-DPO: Cost-efficient Difficulty-aware Preference Optimization for Reducing MLLM Hallucinations [22.299736215070343]
Multimodal Large Language Models (MLLMs) tend to overemphasize easily distinguishable preference pairs. We propose Difficulty-Aware Direct Preference Optimization (DA-DPO), a cost-effective framework designed to balance the learning process.
arXiv Detail & Related papers (2026-01-02T09:41:54Z)
- AMaPO: Adaptive Margin-attached Preference Optimization for Language Model Alignment [25.526336903358757]
Offline preference optimization offers a simpler and more stable alternative to RLHF for aligning language models. We propose Adaptive Margin-attached Preference Optimization (AMaPO), a simple yet principled algorithm. AMaPO employs an instance-wise adaptive margin, refined by Z-normalization and exponential scaling, which dynamically reallocates learning effort by amplifying gradients for misranked samples and suppressing them for correct ones.
arXiv Detail & Related papers (2025-11-12T14:51:59Z)
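A rough sketch of the adaptive-margin idea in AMaPO, assuming the margin is attached inside the logistic loss in the style of a target margin; the sign convention and how the scaled margin enters the objective are guesses from the summary, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def amapo_style_loss(gap, beta=0.1):
    """`gap` holds the per-instance implicit reward difference
    r(chosen) - r(rejected); negative values mean a misranked pair."""
    z = (gap - gap.mean()) / (gap.std() + 1e-8)   # Z-normalization over batch
    margin = torch.exp(-z).detach()               # exponential scaling: misranked
                                                  # (low-z) pairs get a larger
                                                  # margin, hence larger gradients
    return -F.logsigmoid(beta * gap - margin).mean()
```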
- Margin Adaptive DPO: Leveraging Reward Model for Granular Control in Preference Optimization [0.0]
Margin-Adaptive Direct Preference Optimization (MADPO) provides a stable, data-preserving, and instance-level solution. We provide a comprehensive theoretical analysis, proving that MADPO has a well-behaved optimization landscape. It achieves performance gains of up to +33.3% on High Quality data and +10.5% on Low Quality data over the next-best method.
arXiv Detail & Related papers (2025-10-06T20:09:37Z)
- From Noisy Traces to Stable Gradients: Bias-Variance Optimized Preference Optimization for Aligning Large Reasoning Models [90.45197506653341]
Large reasoning models (LRMs) generate intermediate reasoning traces before producing final answers. Aligning LRMs with human preferences, a crucial prerequisite for model deployment, remains underexplored. A common workaround optimizes a single sampled trajectory, which introduces substantial gradient variance from trace sampling.
arXiv Detail & Related papers (2025-10-06T17:58:01Z)
- ConfPO: Exploiting Policy Model Confidence for Critical Token Selection in Preference Optimization [48.50761200321113]
We introduce ConfPO, a method for preference learning in Large Language Models (LLMs). It identifies and optimizes preference-critical tokens based solely on the training policy's confidence, without requiring any auxiliary models or compute. Experimental results on challenging alignment benchmarks, including AlpacaEval 2 and Arena-Hard, demonstrate that ConfPO consistently outperforms uniform direct alignment algorithms (DAAs).
arXiv Detail & Related papers (2025-06-10T11:54:22Z)
- Principled Data Selection for Alignment: The Hidden Risks of Difficult Examples [38.79705507444374]
We show that preference data vary in difficulty, and overly difficult examples hinder alignment. We introduce Selective DPO, which filters out overly difficult examples. This simple adjustment improves alignment performance by 9-16% in win rates on the AlpacaEval 2 benchmark.
arXiv Detail & Related papers (2025-02-11T17:01:11Z)
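For contrast with MixDPO's routing above, the filtering strategy studied here reduces to a preprocessing step; `difficulty` is a placeholder for whatever margin-based measure the paper uses:

```python
def selective_dpo_filter(pairs, difficulty, tau):
    """Keep only pairs at or below a difficulty threshold before running DPO.
    `difficulty` is a hypothetical callable, e.g. a negated reward-margin."""
    return [p for p in pairs if difficulty(p) <= tau]
```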
- A Systematic Examination of Preference Learning through the Lens of Instruction-Following [83.71180850955679]
We use a novel synthetic data generation pipeline to generate 48,000 unique instruction-following prompts. With our synthetic prompts, we use two preference dataset curation methods: rejection sampling (RS) and Monte Carlo Tree Search (MCTS). Experiments reveal that shared prefixes in preference pairs, as generated by MCTS, provide marginal but consistent improvements. High-contrast preference pairs generally outperform low-contrast pairs; however, combining both often yields the best performance.
arXiv Detail & Related papers (2024-12-18T15:38:39Z)
- Constrain Alignment with Sparse Autoencoders [45.131670081186]
Feature-level constrained Preference Optimization is a novel method designed to simplify the alignment process while ensuring stability. Our approach gains efficiency by using sparse features activated in a well-trained sparse autoencoder, and maintains quality through a sequential KL divergence constraint.
arXiv Detail & Related papers (2024-11-12T07:54:13Z)
- TIS-DPO: Token-level Importance Sampling for Direct Preference Optimization With Estimated Weights [73.9088920210495]
We propose a token-level importance sampling DPO objective named TIS-DPO that assigns importance weights to each token based on its reward. TIS-DPO significantly outperforms various baseline methods on harmlessness and helpfulness alignment and summarization tasks.
arXiv Detail & Related papers (2024-10-06T04:03:00Z)
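A sketch of what token-level importance weighting could look like inside the DPO objective; the tensor layout and the source of the weights (the summary only says they come from token rewards) are assumptions:

```python
import torch
import torch.nn.functional as F

def tis_dpo_style_loss(lp_c, ref_c, w_c, mask_c,
                       lp_r, ref_r, w_r, mask_r, beta=0.1):
    """All tensors are (batch, seq_len): per-token log-probs under the policy
    (lp_*) and reference (ref_*), importance weights (w_*), padding masks."""
    rew_c = (w_c * (lp_c - ref_c) * mask_c).sum(-1)   # weighted chosen reward
    rew_r = (w_r * (lp_r - ref_r) * mask_r).sum(-1)   # weighted rejected reward
    return -F.logsigmoid(beta * (rew_c - rew_r)).mean()
```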
- Geometric-Averaged Preference Optimization for Soft Preference Labels [78.2746007085333]
Many algorithms for aligning LLMs with human preferences assume that human preferences are binary and deterministic. In this work, we introduce distributional soft preference labels and improve Direct Preference Optimization (DPO) with a weighted geometric average of the LLM output likelihood in the loss function.
arXiv Detail & Related papers (2024-09-10T17:54:28Z)
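One way to read the weighted geometric average: if the chosen and rejected likelihoods are replaced by pi_w**p * pi_l**(1-p) and pi_w**(1-p) * pi_l**p for a soft label p, the usual DPO logit is scaled by (2p - 1). The sketch below implements that reading, which is an assumption rather than the paper's exact derivation:

```python
import torch
import torch.nn.functional as F

def soft_label_dpo_loss(logratio_w, logratio_l, p, beta=0.1):
    """`logratio_*` are policy-vs-reference log-ratios per response and
    `p` is the soft preference label in [0, 1]. Ambiguous pairs (p near 0.5)
    contribute almost no gradient; confident pairs keep full strength."""
    scale = 2.0 * p - 1.0
    return -F.logsigmoid(beta * scale * (logratio_w - logratio_l)).mean()
```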
- Bridging and Modeling Correlations in Pairwise Data for Direct Preference Optimization [75.1240295759264]
We propose an effective framework for Bridging and Modeling Correlations in pairwise data, named BMC. We increase the consistency and informativeness of the pairwise preference signals through targeted modifications. We identify that DPO alone is insufficient to model these correlations and capture nuanced variations.
arXiv Detail & Related papers (2024-08-14T11:29:47Z)