Why Alignment Must Precede Distillation: A Minimal Working Explanation
- URL: http://arxiv.org/abs/2509.23667v1
- Date: Sun, 28 Sep 2025 06:12:19 GMT
- Title: Why Alignment Must Precede Distillation: A Minimal Working Explanation
- Authors: Sungmin Cha, Kyunghyun Cho
- Abstract summary: We show that the standard KD -> Align workflow diminishes the model's capacity to align rare yet desirable behaviors. We demonstrate that alignment must first be performed on a high-recall reference before distillation.
- Score: 50.784080714897776
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: For efficiency, preference alignment is often performed on compact, knowledge-distilled (KD) models. We argue this common practice introduces a significant limitation by overlooking a key property of the alignment's reference model: its distributional recall. We show that the standard KD -> Align workflow diminishes the model's capacity to align rare yet desirable behaviors, even under strong preference signals. We instead demonstrate that reversing the pipeline (i.e., Align -> KD) is essential: alignment must first be performed on a high-recall reference before distillation. Our contributions are threefold. First, we provide a minimal working explanation of how the reference model constrains preference alignment objectives at a fundamental level. Second, we validate this theory in a controllable Mixture-of-Gaussians experiment, where low-recall anchoring consistently results in suboptimal model performance. Finally, we demonstrate that the same phenomenon holds in LLM alignment with the SmolLM2 family: models aligned after KD fail to effectively align target behaviors, resulting in substantially lower reward and target precision. In contrast, our proposed Align -> KD pipeline robustly aligns these behaviors, yielding models with superior target-oriented metrics and lower variance. Together, these results establish reference-model recall as a first-order design choice in alignment, offering a clear principle: alignment must precede distillation.
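To make the recall constraint concrete: the KL-anchored alignment objective max_pi E_pi[r] - beta * KL(pi || pi_ref) has the standard closed-form optimum pi*(y) ∝ pi_ref(y) · exp(r(y)/beta), so the reference model enters multiplicatively, and any behavior the reference has nearly forgotten stays improbable at any reward strength. The toy sketch below illustrates this mechanism; the specific probabilities and rewards are illustrative, not taken from the paper:

```python
import numpy as np

# Three "behaviors"; the third is rare but desirable, with a strong reward.
beta = 1.0
reward = np.array([0.0, 0.0, 5.0])

def aligned_policy(p_ref, reward, beta):
    """Closed-form optimum of max_pi E_pi[r] - beta * KL(pi || p_ref)."""
    logits = np.log(p_ref) + reward / beta
    w = np.exp(logits - logits.max())  # subtract max for numerical stability
    return w / w.sum()

# High-recall reference: the rare behavior keeps non-negligible mass.
p_high = np.array([0.495, 0.495, 0.01])
# Low-recall reference (e.g., after distillation): rare mode nearly erased.
p_low = np.array([0.49999, 0.49999, 2e-5])

print(aligned_policy(p_high, reward, beta))  # rare mode lifted to ~0.60
print(aligned_policy(p_low, reward, beta))   # rare mode stuck near ~0.003
```

Since distillation typically trades recall for precision, running it before alignment (KD -> Align) anchors the alignment objective to a reference that has already discarded the rare target behaviors, which is the failure mode the abstract describes.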
Related papers
- Rethinking Preference Alignment for Diffusion Models with Classifier-Free Guidance [8.038055165320195]
We propose a simple method that improves alignment without retraining the base model. To further enhance generalization, we decouple preference learning into two modules trained on positive and negative data. We evaluate on Stable Diffusion 1.5 and Stable Diffusion XL with Pick-a-Pic v2 and HPDv3, showing consistent quantitative and qualitative gains.
arXiv Detail & Related papers (2026-02-21T11:18:52Z)
- Positive-Unlabeled Reinforcement Learning Distillation for On-Premise Small Models [130.8912476550625]
We propose a positive-unlabeled (PU) reinforcement learning distillation method for on-premise small-model deployment. Our method distills the teacher's preference-optimization capability from black-box generations into a locally trainable student. Experiments demonstrate that our method achieves consistently strong performance under a low-cost setting.
arXiv Detail & Related papers (2026-01-28T15:14:50Z)
- EpiCaR: Knowing What You Don't Know Matters for Better Reasoning in LLMs [9.412828452977553]
Existing approaches reinforce successful reasoning paths, incurring a substantial calibration cost. This failure has been characterized as a form of model collapse in alignment. We propose EpiCaR as a training objective that jointly optimizes reasoning performance and calibration.
arXiv Detail & Related papers (2026-01-11T06:21:13Z)
- Navigating the Alignment-Calibration Trade-off: A Pareto-Superior Frontier via Model Merging [35.958192369444056]
"alignment tax" of post-training is typically framed as a drop in task accuracy.<n>We show it also involves a severe loss of calibration, making models overconfident, less reliable, and model outputs less diverse.<n>We show that this trade-off can be navigated effectively via a simple post-hoc intervention: interpolating between a model's weights before and after alignment.
arXiv Detail & Related papers (2025-10-20T11:12:41Z)
- Preference Trajectory Modeling via Flow Matching for Sequential Recommendation [50.077447974294586]
Sequential recommendation predicts each user's next item based on their historical interaction sequence. FlowRec is a simple yet effective sequential recommendation framework. We construct a personalized behavior-based prior distribution to replace Gaussian noise and learn a vector field to model user preference trajectories.
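A hedged sketch of the kind of conditional flow-matching objective this abstract points at, assuming the standard linear interpolation path; the behavior-based prior is represented abstractly by the `x0` samples, and none of the names below come from the paper:

```python
import torch

def flow_matching_loss(vector_field, x0, x1):
    """Conditional flow-matching regression along a straight-line path.

    x0: samples from the prior (per the abstract, a behavior-based prior
        built from the user's history instead of Gaussian noise).
    x1: target item embeddings.
    The linear path and its constant velocity (x1 - x0) are the standard
    CFM choice; they are an assumption of this sketch."""
    t = torch.rand(x0.size(0), 1)       # random time in [0, 1) per sample
    xt = (1 - t) * x0 + t * x1          # point on the interpolation path
    target_velocity = x1 - x0           # d(xt)/dt along that path
    pred = vector_field(xt, t)
    return torch.mean((pred - target_velocity) ** 2)
```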
arXiv Detail & Related papers (2025-08-25T02:55:42Z)
- Convergent Linear Representations of Emergent Misalignment [1.3286418032136589]
Fine-tuning large language models can cause them to develop broadly misaligned behaviours. We study a minimal model organism which uses just 9 rank-1 adapters to emergently misalign Qwen2.5-14B-Instruct.
arXiv Detail & Related papers (2025-06-13T09:39:54Z)
- Alignment as Distribution Learning: Your Preference Model is Explicitly a Language Model [12.063078727764045]
We argue that alignment via reinforcement learning from human feedback lacks theoretical justification and incentivizes deterministic solutions. We propose three principled learning objectives: preference maximum likelihood estimation, preference distillation, and reverse KL minimization. We empirically demonstrate that our distribution learning framework, especially preference distillation, consistently outperforms or matches the performance of RLHF and DPO.
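The abstract names but does not specify the three objectives; as an assumption-level sketch, two of them are commonly written as explicit distribution-matching losses against a target distribution (the paper's exact parameterization of that target may differ):

```python
import torch
import torch.nn.functional as F

def forward_kl_loss(model_logits, target_probs):
    """MLE/distillation-style objective: cross-entropy of the model
    against an explicit target distribution over next tokens."""
    logp = F.log_softmax(model_logits, dim=-1)
    return -(target_probs * logp).sum(-1).mean()

def reverse_kl_loss(model_logits, target_logp):
    """Reverse KL: KL(pi_model || p_target), mode-seeking rather than
    mass-covering. target_logp holds log-probs of the target."""
    logp = F.log_softmax(model_logits, dim=-1)
    return (logp.exp() * (logp - target_logp)).sum(-1).mean()
```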
arXiv Detail & Related papers (2025-06-02T10:36:31Z)
- Normalized Attention Guidance: Universal Negative Guidance for Diffusion Models [57.20761595019967]
We present Normalized Attention Guidance (NAG), an efficient, training-free mechanism that applies extrapolation in attention space with L1-based normalization and refinement. NAG restores effective negative guidance where CFG collapses while maintaining fidelity. NAG generalizes across architectures (UNet, DiT), sampling regimes (few-step, multi-step), and modalities (image, video).
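A sketch of the mechanism as the abstract states it (extrapolation in attention space plus L1-based normalization and a refinement blend); the guidance scale, the clamp ratio `tau`, and the blend weight `alpha` are assumptions of this illustration, not the paper's values:

```python
import torch

def nag_guidance(z_pos, z_neg, scale=3.0, tau=2.5, alpha=0.5):
    """Extrapolate attention features away from the negative branch, cap
    the L1-norm growth, then blend back toward the positive branch."""
    z = z_pos + scale * (z_pos - z_neg)            # extrapolation in attention space
    norm_pos = z_pos.abs().sum(-1, keepdim=True)   # per-token L1 norms
    norm_z = z.abs().sum(-1, keepdim=True)
    ratio = (norm_z / (norm_pos + 1e-8)).clamp(max=tau)
    z = z * (ratio * norm_pos) / (norm_z + 1e-8)   # L1-based renormalization
    return alpha * z + (1 - alpha) * z_pos         # refinement blend
```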
arXiv Detail & Related papers (2025-05-27T13:30:46Z)
- Selective Learning: Towards Robust Calibration with Dynamic Regularization [79.92633587914659]
Miscalibration in deep learning refers to a discrepancy between a model's predicted confidence and its actual performance.
We introduce Dynamic Regularization (DReg), which aims to learn what should be learned during training, thereby circumventing the confidence-adjustment trade-off.
arXiv Detail & Related papers (2024-02-13T11:25:20Z)
- Curvature-Informed SGD via General Purpose Lie-Group Preconditioners [6.760212042305871]
We present a novel approach to accelerate stochastic gradient descent (SGD) by utilizing curvature information.
Our approach involves two preconditioners: a matrix-free preconditioner and a low-rank approximation preconditioner.
We demonstrate that Preconditioned SGD (PSGD) outperforms state-of-the-art (SoTA) optimizers on vision, NLP, and RL tasks.
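As an assumption-level sketch of what a preconditioned update looks like, using a dense `P` for clarity (the paper's matrix-free and low-rank preconditioners exist precisely to avoid materializing such a matrix):

```python
import torch

def psgd_step(params, grads, P, lr=1e-3):
    """One preconditioned step over flattened parameters:
    theta <- theta - lr * (P @ g). Dense P is illustrative only."""
    g = torch.cat([gr.reshape(-1) for gr in grads])
    step = P @ g
    new_params, i = [], 0
    for p in params:
        n = p.numel()
        new_params.append(p - lr * step[i:i + n].reshape(p.shape))
        i += n
    return new_params
```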
arXiv Detail & Related papers (2024-02-07T03:18:00Z)
- Towards Calibrated Robust Fine-Tuning of Vision-Language Models [97.19901765814431]
This work proposes a robust fine-tuning method that simultaneously improves OOD accuracy and confidence calibration in vision-language models.
We show that both the OOD classification error and the OOD calibration error share an upper bound consisting of two terms involving ID data.
Based on this insight, we design a novel framework that conducts fine-tuning with a constrained multimodal contrastive loss enforcing a larger smallest singular value.
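A sketch of one way the singular-value constraint could enter a loss; treating it as a penalty that rewards a larger smallest singular value of the batch feature matrix is an assumption of this illustration, not the paper's exact formulation:

```python
import torch

def sigma_min_penalty(features):
    """Penalty encouraging a larger smallest singular value of the
    (batch x dim) feature matrix; would be added to the contrastive
    loss with some weight. The hinge-free form is an illustrative choice."""
    s = torch.linalg.svdvals(features)  # singular values, descending order
    return -s[-1]                       # minimizing this maximizes sigma_min
```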
arXiv Detail & Related papers (2023-11-03T05:41:25Z)