SLIME: Stabilized Likelihood Implicit Margin Enforcement for Preference Optimization
- URL: http://arxiv.org/abs/2602.02383v2
- Date: Tue, 03 Feb 2026 13:58:22 GMT
- Title: SLIME: Stabilized Likelihood Implicit Margin Enforcement for Preference Optimization
- Authors: Maksim Afanasyev, Illarion Iov
- Abstract summary: We introduce SLIME, a reference-free alignment objective designed to decouple preference learning from generation quality. Our results demonstrate that SLIME achieves superior performance compared to state-of-the-art baselines.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Direct preference optimization methods have emerged as a computationally efficient alternative to Reinforcement Learning from Human Feedback (RLHF) for aligning Large Language Models (LLMs). Latest approaches have streamlined the alignment process by deriving implicit reward functions, yet they often suffer from a critical objective mismatch: optimizing the relative margin between chosen and rejected responses does not guarantee the preservation of the chosen response's absolute likelihood. This can lead to unlearning, where the model degrades the probability of high-quality outputs to satisfy margin constraints, and formatting collapse caused by the over-penalization of rejected sequences. In this work, we introduce SLIME (Stabilized Likelihood Implicit Margin Enforcement), a reference-free alignment objective designed to decouple preference learning from generation quality. SLIME incorporates a three-pronged objective: (1) an anchoring term to maximize the likelihood of preferred responses; (2) a stabilizing penalty that prevents the probabilities of rejected tokens from collapsing to zero; and (3) a dual-margin mechanism that combines hard and soft constraints for precise boundary shaping. Our results demonstrate that SLIME achieves superior performance compared to state-of-the-art baselines while maintaining higher generation stability.
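The abstract describes the objective only at the level of its three terms, so the following is a minimal sketch of how they could fit together, assuming per-sequence log-probabilities of the chosen and rejected responses under the current policy. The function name, functional forms, and coefficients are illustrative assumptions, not the authors' published loss.
```python
import torch
import torch.nn.functional as F


def slime_loss_sketch(
    logp_chosen: torch.Tensor,    # summed log-probs of chosen responses under the policy, shape (B,)
    logp_rejected: torch.Tensor,  # summed log-probs of rejected responses, shape (B,)
    lambda_anchor: float = 1.0,   # weight of the chosen-likelihood anchoring term
    lambda_floor: float = 0.1,    # weight of the stabilizing penalty on rejected responses
    log_floor: float = -200.0,    # floor below which rejected log-probs stop being pushed down
    hard_margin: float = 2.0,     # minimum margin enforced by a hinge (hard constraint)
    beta: float = 0.5,            # temperature of the soft log-sigmoid margin term
) -> torch.Tensor:
    """Illustrative reference-free objective assembled from the three terms the
    abstract describes; functional forms and coefficients are assumptions."""
    margin = logp_chosen - logp_rejected

    # (1) Anchoring term: keep the absolute likelihood of the chosen response high.
    anchor = -logp_chosen

    # (2) Stabilizing penalty: stop rejected log-probs from collapsing toward -inf,
    #     which the abstract links to formatting collapse.
    floor_penalty = F.relu(log_floor - logp_rejected)

    # (3) Dual margin: a hard hinge enforcing a minimum margin plus a soft
    #     log-sigmoid term that keeps shaping the decision boundary.
    hard = F.relu(hard_margin - margin)
    soft = -F.logsigmoid(beta * margin)

    loss = lambda_anchor * anchor + lambda_floor * floor_penalty + hard + soft
    return loss.mean()
```
Matching the reference-free setting, no reference-model log-probabilities appear; logp_chosen and logp_rejected would be obtained by summing the token log-probs of each response in a standard preference-pair batch.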
Related papers
- Not All Preferences Are Created Equal: Stability-Aware and Gradient-Efficient Alignment for Reasoning Models [52.48582333951919]
We propose a dynamic framework designed to enhance alignment reliability by maximizing the Signal-to-Noise Ratio of policy updates. SAGE (Stability-Aware Gradient Efficiency) integrates a coarse-grained curriculum mechanism that refreshes candidate pools based on model competence. Experiments on multiple mathematical reasoning benchmarks demonstrate that SAGE significantly accelerates convergence and outperforms static baselines.
arXiv Detail & Related papers (2026-02-01T12:56:10Z)
- Hard Constraints Meet Soft Generation: Guaranteed Feasibility for LLM-based Combinatorial Optimization [14.17648636921649]
We introduce FALCON, a framework that ensures 100% feasibility through three key innovations. FALCON achieves perfect feasibility while matching or exceeding the solution quality of state-of-the-art neural and LLM-based solvers.
arXiv Detail & Related papers (2026-02-01T08:09:06Z)
- Optimistic Feasible Search for Closed-Loop Fair Threshold Decision-Making [0.0]
We study online learning of a one-dimensional threshold policy from bandit feedback. We propose Optimistic Feasible Search (OFS), a simple grid-based method that maintains confidence bounds for reward and constraint residuals.
arXiv Detail & Related papers (2025-12-26T10:44:40Z)
- Certifiable Safe RLHF: Fixed-Penalty Constraint Optimization for Safer Language Models [7.422627253922975]
We introduce Certifiable Safe-RLHF, a cost model trained on a large-scale corpus to assign semantically grounded safety scores. With an appropriately scaled penalty, feasibility of the safety constraints can be guaranteed, eliminating the need for dual-variable updates. Empirical evaluation demonstrates that CS-RLHF outperforms state-of-the-art model responses, proving at least 5 times more efficient against nominal and jail-breaking prompts.
arXiv Detail & Related papers (2025-10-03T21:24:41Z)
- Linear Preference Optimization: Decoupled Gradient Control via Absolute Regularization [13.97375970293678]
DPO (Direct Preference Optimization) has become a widely used offline preference optimization algorithm due to its simplicity and training stability. We propose Linear Preference Optimization (LPO), a novel alignment framework featuring three key innovations. First, we introduce gradient decoupling by replacing the log-sigmoid function with an absolute difference loss, thereby isolating the optimization dynamics. Second, we improve stability through an offset constraint combined with a positive regularization term to preserve the chosen response quality. Third, we implement controllable rejection suppression using gradient separation with straightforward estimation and a tunable coefficient that linearly regulates the descent of the rejection probability.
arXiv Detail & Related papers (2025-08-20T10:17:29Z)
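The LPO summary above is concrete enough that a rough sketch may help; this is only one possible reading of its three components, and all names, the detach-based gradient separation, and the coefficients are assumptions rather than the paper's actual formulation.
```python
import torch


def lpo_loss_sketch(
    logp_chosen: torch.Tensor,    # summed log-probs of chosen responses, shape (B,)
    logp_rejected: torch.Tensor,  # summed log-probs of rejected responses, shape (B,)
    offset: float = 1.0,          # offset constraint on the chosen/rejected gap
    alpha: float = 0.1,           # positive regularization preserving chosen quality
    gamma: float = 0.05,          # tunable coefficient regulating rejection descent
) -> torch.Tensor:
    """One possible reading of the LPO summary; not the published objective."""
    # (1) Gradient decoupling: absolute difference to an offset instead of log-sigmoid.
    margin_term = torch.abs(logp_chosen - logp_rejected - offset)

    # (2) Positive regularization term that keeps the chosen likelihood from eroding.
    chosen_reg = -alpha * logp_chosen

    # (3) Rejection suppression with gradient separation: detaching the chosen side
    #     means gamma only controls how fast the rejected probability descends.
    rejection_term = gamma * (logp_rejected - logp_chosen.detach())

    return (margin_term + chosen_reg + rejection_term).mean()
```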
- A Common Pitfall of Margin-based Language Model Alignment: Gradient Entanglement [47.95776810771774]
Reinforcement Learning from Human Feedback (RLHF) has become the predominant approach for language model alignment. In this paper, we identify a common pitfall of margin-based methods. We demystify the reasons behind these problematic behaviors.
arXiv Detail & Related papers (2024-10-17T17:52:01Z)
- Correcting the Mythos of KL-Regularization: Direct Alignment without Overoptimization via Chi-Squared Preference Optimization [78.82586283794886]
$\chi^2$-Preference Optimization ($\chi$PO) is an efficient offline alignment algorithm provably robust to overoptimization. $\chi$PO implements the principle of pessimism in the face of uncertainty via regularization. $\chi$PO's simplicity and strong guarantees make it the first practical and general-purpose offline alignment algorithm provably robust to overoptimization.
arXiv Detail & Related papers (2024-07-18T11:08:40Z)
- One-Shot Safety Alignment for Large Language Models via Optimal Dualization [64.52223677468861]
This paper presents a perspective of dualization that reduces constrained alignment to an equivalent unconstrained alignment problem.
We do so by pre-optimizing a smooth and convex dual function that has a closed form.
Our strategy leads to two practical algorithms in model-based and preference-based settings.
arXiv Detail & Related papers (2024-05-29T22:12:52Z)
- Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences. To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model. Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z)
- High Probability Complexity Bounds for Non-Smooth Stochastic Optimization with Heavy-Tailed Noise [51.31435087414348]
It is essential to theoretically guarantee that algorithms provide a small objective residual with high probability.
Existing methods for non-smooth convex optimization have complexity bounds that depend on the confidence level.
We propose novel stepsize rules for two methods with gradient clipping.
arXiv Detail & Related papers (2021-06-10T17:54:21Z)