Alignment-Aware Model Adaptation via Feedback-Guided Optimization
- URL: http://arxiv.org/abs/2602.02258v1
- Date: Mon, 02 Feb 2026 16:03:16 GMT
- Title: Alignment-Aware Model Adaptation via Feedback-Guided Optimization
- Authors: Gaurav Bhatt, Aditya Chinchure, Jiawei Zhou, Leonid Sigal
- Abstract summary: Fine-tuning is the primary mechanism for adapting foundation models to downstream tasks. We propose an alignment-aware fine-tuning framework that integrates feedback from an external alignment signal through policy-gradient-based regularization.
- Score: 27.93864970404945
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Fine-tuning is the primary mechanism for adapting foundation models to downstream tasks; however, standard approaches largely optimize task objectives in isolation and do not account for secondary yet critical alignment objectives (e.g., safety and hallucination avoidance). As a result, downstream fine-tuning can degrade alignment and fail to correct pre-existing misaligned behavior. We propose an alignment-aware fine-tuning framework that integrates feedback from an external alignment signal through policy-gradient-based regularization. Our method introduces an adaptive gating mechanism that dynamically balances supervised and alignment-driven gradients on a per-sample basis, prioritizing uncertain or misaligned cases while allowing well-aligned examples to follow standard supervised updates. The framework further learns abstention behavior for fully misaligned inputs, incorporating conservative responses directly into the fine-tuned model. Experiments on general and domain-specific instruction-tuning benchmarks demonstrate consistent reductions in harmful and hallucinated outputs without sacrificing downstream task performance. Additional analyses show robustness to adversarial fine-tuning, prompt-based attacks, and unsafe initializations, establishing adaptively gated alignment optimization as an effective approach for alignment-preserving and alignment-recovering model adaptation.
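The abstract describes the gating mechanism only in prose. As a minimal sketch of the per-sample idea, assuming PyTorch and invented names throughout (the sigmoid gate, the score-to-reward mapping, and all signatures are illustrative assumptions, not the authors' implementation):

```python
import torch
import torch.nn.functional as F

def gated_finetune_loss(logits, labels, log_probs, align_scores, tau=0.1):
    """logits/labels: supervised task data; log_probs: per-sample log-prob of
    the sampled responses; align_scores in [0, 1]: external alignment feedback
    (1 = well aligned). Returns a scalar loss."""
    # Per-sample supervised loss.
    sup = F.cross_entropy(logits, labels, reduction="none")
    # REINFORCE-style alignment regularizer: a reward in [-1, 1] derived from
    # the external signal pushes the response log-prob up or down.
    reward = 2.0 * align_scores - 1.0
    align = -(reward.detach() * log_probs)
    # Adaptive gate: low alignment scores (uncertain or misaligned samples)
    # shift weight toward the alignment gradient; high scores recover the
    # standard supervised update.
    gate = torch.sigmoid((0.5 - align_scores) / tau)
    return ((1.0 - gate) * sup + gate * align).mean()
```

Well-aligned samples (align_scores near 1) drive the gate toward 0 and follow the plain supervised update, matching the per-sample balancing the abstract describes.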
Related papers
- Revisiting Robustness for LLM Safety Alignment via Selective Geometry Control [55.366871033602145]
We argue that robustness failures cannot be addressed by data-centric methods alone. We propose ShaPO, a geometry-aware preference optimization framework. ShaPO enforces worst-case alignment objectives via selective geometry control over an alignment-critical parameter subspace.
arXiv Detail & Related papers (2026-02-07T03:46:33Z)
- Learning Where It Matters: Geometric Anchoring for Robust Preference Alignment [6.428964221372943]
We propose Geometric Anchor Preference Optimization (GAPO), which replaces the fixed reference with a dynamic, geometry-aware anchor. GAPO consistently improves robustness while matching or improving performance on standard LLM alignment and reasoning benchmarks.
arXiv Detail & Related papers (2026-02-04T00:40:21Z)
- AdaptNC: Adaptive Nonconformity Scores for Uncertainty-Aware Autonomous Systems in Dynamic Environments [7.201566646241765]
Conformal Prediction methods maintain target coverage by adaptively scaling the conformal threshold. We show that this fixed geometry leads to highly conservative, volume-inefficient prediction regions when environments undergo structural shifts. We propose AdaptNC, a framework for the joint online adaptation of both the nonconformity score parameters and the conformal threshold.
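As a rough illustration of what joint online adaptation might look like, the snippet below pairs the standard ACI threshold update from adaptive conformal inference with a hypothetical gradient step on the score parameters; AdaptNC's actual update rules are not given here and may differ:

```python
def adaptnc_step(alpha_t, alpha_target, theta, covered, grad_volume,
                 gamma=0.01, eta=1e-3):
    """covered: did the last prediction region contain the true outcome;
    grad_volume: gradient of region volume w.r.t. theta (same shape)."""
    err = 0.0 if covered else 1.0
    # ACI-style threshold update: a miss lowers alpha_t (wider regions),
    # a hit raises it (tighter regions), tracking 1 - alpha_target coverage.
    alpha_t = alpha_t + gamma * (alpha_target - err)
    # Hypothetical joint step on the nonconformity-score parameters,
    # reshaping region geometry to reduce volume under structural shift.
    theta = [t - eta * g for t, g in zip(theta, grad_volume)]
    return alpha_t, theta
```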
arXiv Detail & Related papers (2026-02-02T04:41:35Z)
- Not All Preferences Are Created Equal: Stability-Aware and Gradient-Efficient Alignment for Reasoning Models [52.48582333951919]
We propose a dynamic framework designed to enhance alignment reliability by maximizing the Signal-to-Noise Ratio of policy updates. SAGE (Stability-Aware Gradient Efficiency) integrates a coarse-grained curriculum mechanism that refreshes candidate pools based on model competence. Experiments on multiple mathematical reasoning benchmarks demonstrate that SAGE significantly accelerates convergence and outperforms static baselines.
arXiv Detail & Related papers (2026-02-01T12:56:10Z)
- Anchoring Values in Temporal and Group Dimensions for Flow Matching Model Alignment [61.80228667422234]
VGPO redefines value estimation across both temporal and group dimensions. It transforms the sparse terminal reward into dense, process-aware value estimates. It replaces standard group normalization with a novel process enhanced by absolute values to maintain a stable optimization signal.
arXiv Detail & Related papers (2025-12-13T16:31:26Z)
- GRPO-Guard: Mitigating Implicit Over-Optimization in Flow Matching via Regulated Clipping [63.33669214116784]
GRPO-Guard is a simple yet effective enhancement to existing GRPO frameworks. It restores a balanced and step-consistent importance ratio, ensuring that PPO clipping properly constrains harmful updates. It substantially mitigates implicit over-optimization without relying on heavy KL regularization.
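The summary does not state the regulation rule itself. One plausible sketch mean-centers the per-step log importance ratios before the PPO clip so the clip binds symmetrically; the centering scheme is an assumption, not GRPO-Guard's published rule:

```python
import torch

def regulated_ppo_loss(logp_new, logp_old, adv, eps=0.2):
    """logp_*: (batch, steps) log-probs; adv: (batch,) group-relative advantages."""
    log_ratio = logp_new - logp_old.detach()
    # Regulate: remove per-step drift so ratios are step-consistent and
    # centered near 1, letting the clip constrain genuinely harmful updates.
    log_ratio = log_ratio - log_ratio.mean(dim=0, keepdim=True)
    ratio = log_ratio.exp()
    a = adv.unsqueeze(1)
    unclipped = ratio * a
    clipped = ratio.clamp(1 - eps, 1 + eps) * a
    # Standard PPO surrogate: maximize the pessimistic (clipped) objective.
    return -torch.minimum(unclipped, clipped).mean()
```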
arXiv Detail & Related papers (2025-10-25T14:51:17Z)
- Linear Preference Optimization: Decoupled Gradient Control via Absolute Regularization [13.97375970293678]
DPO (Direct Preference Optimization) has become a widely used offline preference optimization algorithm due to its simplicity and training stability. We propose Linear Preference Optimization (LPO), a novel alignment framework featuring three key innovations. First, we introduce gradient decoupling by replacing the log-sigmoid function with an absolute difference loss, thereby isolating the optimization dynamics. Second, we improve stability through an offset constraint combined with a positive regularization term to preserve the chosen response quality. Third, we implement controllable rejection suppression using gradient separation with straightforward estimation and a tunable coefficient that linearly regulates the descent of the rejection probability.
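The three components map naturally onto a loss sketch. The functional forms below are guesses consistent with the summary (an absolute-difference margin in place of DPO's log-sigmoid, a hinge keeping the chosen log-ratio up, a linear term on the rejected log-ratio); the paper's exact losses may differ:

```python
import torch

def lpo_loss(dw, dl, beta=0.1, offset=1.0, lam=0.1, mu=0.05):
    """dw/dl: log pi(y|x) - log pi_ref(y|x) for chosen/rejected responses."""
    margin = (beta * (dw - dl) - offset).abs()  # gradient decoupling via |.|
    keep_chosen = torch.relu(-dw)               # positive reg. on chosen quality
    suppress_rejected = mu * dl                 # linear, tunable rejection descent
    return (margin + lam * keep_chosen + suppress_rejected).mean()
```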
arXiv Detail & Related papers (2025-08-20T10:17:29Z)
- NPO: Learning Alignment and Meta-Alignment through Structured Human Feedback [0.0]
We present NPO, an alignment-aware learning framework that operationalizes feedback-driven adaptation in human-in-the-loop decision systems. NPO introduces a formalization of alignment loss that is measurable, supervisable, and reducible under structured feedback.
arXiv Detail & Related papers (2025-07-22T11:23:18Z)
- LookAhead Tuning: Safer Language Models via Partial Answer Previews [62.529794567687354]
Fine-tuning enables large language models to adapt to specific domains, but often compromises their previously established safety alignment. We introduce LookAhead Tuning, a lightweight and effective data-driven approach that preserves safety during fine-tuning.
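Since the approach is described as data-driven, one way to picture it is as a preprocessing step that previews the first answer tokens inside the prompt, so fine-tuning perturbs the safety-critical initial-token distribution less. The template string and preview length below are assumptions:

```python
def add_preview(example, n_preview=6):
    """Augment a training prompt with the first tokens of its own answer."""
    preview = " ".join(example["answer"].split()[:n_preview])
    prompt = (f"{example['instruction']}\n"
              f"(The answer begins with: \"{preview} ...\")")
    return {"instruction": prompt, "answer": example["answer"]}
```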
arXiv Detail & Related papers (2025-03-24T18:11:42Z)
- Enhancing Robustness of Vision-Language Models through Orthogonality Learning and Self-Regularization [77.62516752323207]
We introduce an orthogonal fine-tuning method for efficiently fine-tuning pretrained weights and enabling enhanced robustness and generalization.
A self-regularization strategy is further exploited to maintain stability in terms of the zero-shot generalization of VLMs; the method is dubbed OrthSR.
For the first time, we revisit CLIP and CoOp with our method to effectively improve the models in the few-shot image classification scenario.
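A common way to realize orthogonal fine-tuning is a Cayley-parameterized rotation of the frozen pretrained weight, which preserves pairwise neuron angles; whether OrthSR uses this exact parameterization is an assumption:

```python
import torch
import torch.nn as nn

class OrthTune(nn.Module):
    """Rotate a frozen pretrained weight by a learned orthogonal matrix."""
    def __init__(self, weight):                   # weight: frozen (out, in)
        super().__init__()
        d = weight.shape[0]
        self.register_buffer("w0", weight)
        self.a = nn.Parameter(torch.zeros(d, d))  # unconstrained parameters

    def forward(self, x):
        q = self.a - self.a.T                     # skew-symmetric
        eye = torch.eye(q.shape[0], device=q.device)
        r = torch.linalg.solve(eye + q, eye - q)  # Cayley transform: orthogonal
        return x @ (r @ self.w0).T                # y = x (R W)^T
```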
arXiv Detail & Related papers (2024-07-11T10:35:53Z)
- Target-Embedding Autoencoders for Supervised Representation Learning [111.07204912245841]
This paper analyzes a framework for improving generalization in a purely supervised setting, where the target space is high-dimensional.
We motivate and formalize the general framework of target-embedding autoencoders (TEA) for supervised prediction, learning intermediate latent representations jointly optimized to be both predictable from features as well as predictive of targets.
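The joint objective can be sketched directly from that description: one term makes the latent code reconstruct the high-dimensional target, the other makes it predictable from the features. Network definitions and the weighting below are placeholders:

```python
import torch.nn.functional as F

def tea_loss(x, y, encode, decode, predict, lam=1.0):
    z = encode(y)                              # embed the high-dim target
    recon = F.mse_loss(decode(z), y)           # latent must reconstruct target
    pred = F.mse_loss(predict(x), z)           # and be predictable from x
    return recon + lam * pred                  # both terms shape the latent
```

At test time the target is unavailable, so prediction composes the two learned maps as decode(predict(x)).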
arXiv Detail & Related papers (2020-01-23T02:37:10Z)