Alignment-Aware Model Adaptation via Feedback-Guided Optimization
- URL: http://arxiv.org/abs/2602.02258v1
- Date: Mon, 02 Feb 2026 16:03:16 GMT
- Title: Alignment-Aware Model Adaptation via Feedback-Guided Optimization
- Authors: Gaurav Bhatt, Aditya Chinchure, Jiawei Zhou, Leonid Sigal
- Abstract summary: Fine-tuning is the primary mechanism for adapting foundation models to downstream tasks. We propose an alignment-aware fine-tuning framework that integrates feedback from an external alignment signal through policy-gradient-based regularization.
- Score: 27.93864970404945
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Fine-tuning is the primary mechanism for adapting foundation models to downstream tasks; however, standard approaches largely optimize task objectives in isolation and do not account for secondary yet critical alignment objectives (e.g., safety and hallucination avoidance). As a result, downstream fine-tuning can degrade alignment and fail to correct pre-existing misaligned behavior. We propose an alignment-aware fine-tuning framework that integrates feedback from an external alignment signal through policy-gradient-based regularization. Our method introduces an adaptive gating mechanism that dynamically balances supervised and alignment-driven gradients on a per-sample basis, prioritizing uncertain or misaligned cases while allowing well-aligned examples to follow standard supervised updates. The framework further learns abstention behavior for fully misaligned inputs, incorporating conservative responses directly into the fine-tuned model. Experiments on general and domain-specific instruction-tuning benchmarks demonstrate consistent reductions in harmful and hallucinated outputs without sacrificing downstream task performance. Additional analyses show robustness to adversarial fine-tuning, prompt-based attacks, and unsafe initializations, establishing adaptively gated alignment optimization as an effective approach for alignment-preserving and alignment-recovering model adaptation.
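The abstract describes the gating mechanism only in prose. As a minimal sketch of the per-sample idea, assuming PyTorch and invented names throughout (the sigmoid gate, the score-to-reward mapping, and all signatures are illustrative assumptions, not the authors' implementation):

```python
import torch
import torch.nn.functional as F

def gated_finetune_loss(logits, labels, log_probs, align_scores, tau=0.1):
    """logits/labels: supervised task data; log_probs: per-sample log-prob of
    the sampled responses; align_scores in [0, 1]: external alignment feedback
    (1 = well aligned). Returns a scalar loss."""
    # Per-sample supervised loss.
    sup = F.cross_entropy(logits, labels, reduction="none")
    # REINFORCE-style alignment regularizer: a reward in [-1, 1] derived from
    # the external signal pushes the response log-prob up or down.
    reward = 2.0 * align_scores - 1.0
    align = -(reward.detach() * log_probs)
    # Adaptive gate: low alignment scores (uncertain or misaligned samples)
    # shift weight toward the alignment gradient; high scores recover the
    # standard supervised update.
    gate = torch.sigmoid((0.5 - align_scores) / tau)
    return ((1.0 - gate) * sup + gate * align).mean()
```

Well-aligned samples (align_scores near 1) drive the gate toward 0 and follow the plain supervised update, matching the per-sample balancing the abstract describes.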
Related papers
- Revisiting Robustness for LLM Safety Alignment via Selective Geometry Control [55.366871033602145]
We argue that robustness failures cannot be addressed by data-centric methods alone. We propose ShaPO, a geometry-aware preference optimization framework. ShaPO enforces worst-case alignment objectives via selective geometry control over an alignment-critical parameter subspace.
arXiv Detail & Related papers (2026-02-07T03:46:33Z)
- Learning Where It Matters: Geometric Anchoring for Robust Preference Alignment [6.428964221372943]
We propose Geometric Anchor Preference Optimization (GAPO), which replaces the fixed reference with a dynamic, geometry-aware anchor. GAPO consistently improves robustness while matching or improving performance on standard LLM alignment and reasoning benchmarks.
arXiv Detail & Related papers (2026-02-04T00:40:21Z)
- AdaptNC: Adaptive Nonconformity Scores for Uncertainty-Aware Autonomous Systems in Dynamic Environments [7.201566646241765]
Conformal Prediction methods maintain target coverage by adaptively scaling the conformal threshold. We show that this fixed geometry leads to highly conservative, volume-inefficient prediction regions when environments undergo structural shifts. We propose AdaptNC, a framework for the joint online adaptation of both the nonconformity score parameters and the conformal threshold.
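As a rough illustration of what joint online adaptation might look like, the snippet below pairs the standard ACI threshold update from adaptive conformal inference with a hypothetical gradient step on the score parameters; AdaptNC's actual update rules are not given here and may differ:

```python
def adaptnc_step(alpha_t, alpha_target, theta, covered, grad_volume,
                 gamma=0.01, eta=1e-3):
    """covered: did the last prediction region contain the true outcome;
    grad_volume: gradient of region volume w.r.t. theta (same shape)."""
    err = 0.0 if covered else 1.0
    # ACI-style threshold update: a miss lowers alpha_t (wider regions),
    # a hit raises it (tighter regions), tracking 1 - alpha_target coverage.
    alpha_t = alpha_t + gamma * (alpha_target - err)
    # Hypothetical joint step on the nonconformity-score parameters,
    # reshaping region geometry to reduce volume under structural shift.
    theta = [t - eta * g for t, g in zip(theta, grad_volume)]
    return alpha_t, theta
```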
arXiv Detail & Related papers (2026-02-02T04:41:35Z)
- Not All Preferences Are Created Equal: Stability-Aware and Gradient-Efficient Alignment for Reasoning Models [52.48582333951919]
We propose a dynamic framework designed to enhance alignment reliability by maximizing the Signal-to-Noise Ratio of policy updates. SAGE (Stability-Aware Gradient Efficiency) integrates a coarse-grained curriculum mechanism that refreshes candidate pools based on model competence. Experiments on multiple mathematical reasoning benchmarks demonstrate that SAGE significantly accelerates convergence and outperforms static baselines.
arXiv Detail & Related papers (2026-02-01T12:56:10Z)
- Anchoring Values in Temporal and Group Dimensions for Flow Matching Model Alignment [61.80228667422234]
VGPO redefines value estimation across both temporal and group dimensions. It transforms the sparse terminal reward into dense, process-aware value estimates. It replaces standard group normalization with a novel process enhanced by absolute values to maintain a stable optimization signal.
arXiv Detail & Related papers (2025-12-13T16:31:26Z)
- GRPO-Guard: Mitigating Implicit Over-Optimization in Flow Matching via Regulated Clipping [63.33669214116784]
GRPO-Guard is a simple yet effective enhancement to existing GRPO frameworks. It restores a balanced and step-consistent importance ratio, ensuring that PPO clipping properly constrains harmful updates. It substantially mitigates implicit over-optimization without relying on heavy KL regularization.
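The summary does not state the regulation rule itself. One plausible sketch mean-centers the per-step log importance ratios before the PPO clip so the clip binds symmetrically; the centering scheme is an assumption, not GRPO-Guard's published rule:

```python
import torch

def regulated_ppo_loss(logp_new, logp_old, adv, eps=0.2):
    """logp_*: (batch, steps) log-probs; adv: (batch,) group-relative advantages."""
    log_ratio = logp_new - logp_old.detach()
    # Regulate: remove per-step drift so ratios are step-consistent and
    # centered near 1, letting the clip constrain genuinely harmful updates.
    log_ratio = log_ratio - log_ratio.mean(dim=0, keepdim=True)
    ratio = log_ratio.exp()
    a = adv.unsqueeze(1)
    unclipped = ratio * a
    clipped = ratio.clamp(1 - eps, 1 + eps) * a
    # Standard PPO surrogate: maximize the pessimistic (clipped) objective.
    return -torch.minimum(unclipped, clipped).mean()
```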
arXiv Detail & Related papers (2025-10-25T14:51:17Z)
- Linear Preference Optimization: Decoupled Gradient Control via Absolute Regularization [13.97375970293678]
DPO (Direct Preference Optimization) has become a widely used offline preference optimization algorithm due to its simplicity and training stability. We propose Linear Preference Optimization (LPO), a novel alignment framework featuring three key innovations. First, we introduce gradient decoupling by replacing the log-sigmoid function with an absolute difference loss, thereby isolating the optimization dynamics. Second, we improve stability through an offset constraint combined with a positive regularization term to preserve the chosen response quality. Third, we implement controllable rejection suppression using gradient separation with straightforward estimation and a tunable coefficient that linearly regulates the descent of the rejection probability.
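The three components map naturally onto a loss sketch. The functional forms below are guesses consistent with the summary (an absolute-difference margin in place of DPO's log-sigmoid, a hinge keeping the chosen log-ratio up, a linear term on the rejected log-ratio); the paper's exact losses may differ:

```python
import torch

def lpo_loss(dw, dl, beta=0.1, offset=1.0, lam=0.1, mu=0.05):
    """dw/dl: log pi(y|x) - log pi_ref(y|x) for chosen/rejected responses."""
    margin = (beta * (dw - dl) - offset).abs()  # gradient decoupling via |.|
    keep_chosen = torch.relu(-dw)               # positive reg. on chosen quality
    suppress_rejected = mu * dl                 # linear, tunable rejection descent
    return (margin + lam * keep_chosen + suppress_rejected).mean()
```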
arXiv Detail & Related papers (2025-08-20T10:17:29Z)
- NPO: Learning Alignment and Meta-Alignment through Structured Human Feedback [0.0]
We present NPO, an alignment-aware learning framework that operationalizes feedback-driven adaptation in human-in-the-loop decision systems. NPO introduces a formalization of alignment loss that is measurable, supervisable, and reducible under structured feedback.
arXiv Detail & Related papers (2025-07-22T11:23:18Z)
- LookAhead Tuning: Safer Language Models via Partial Answer Previews [62.529794567687354]
Fine-tuning enables large language models to adapt to specific domains, but often compromises their previously established safety alignment. We introduce LookAhead Tuning, a lightweight and effective data-driven approach that preserves safety during fine-tuning.
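Since the approach is described as data-driven, one way to picture it is as a preprocessing step that previews the first answer tokens inside the prompt, so fine-tuning perturbs the safety-critical initial-token distribution less. The template string and preview length below are assumptions:

```python
def add_preview(example, n_preview=6):
    """Augment a training prompt with the first tokens of its own answer."""
    preview = " ".join(example["answer"].split()[:n_preview])
    prompt = (f"{example['instruction']}\n"
              f"(The answer begins with: \"{preview} ...\")")
    return {"instruction": prompt, "answer": example["answer"]}
```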
arXiv Detail & Related papers (2025-03-24T18:11:42Z)
- Enhancing Robustness of Vision-Language Models through Orthogonality Learning and Self-Regularization [77.62516752323207]
We introduce an orthogonal fine-tuning method for efficiently fine-tuning pretrained weights and enabling enhanced robustness and generalization.
A self-regularization strategy is further exploited to maintain stability in terms of the zero-shot generalization of VLMs; the method is dubbed OrthSR.
For the first time, we revisit CLIP and CoOp with our method to effectively improve the models in the few-shot image classification scenario.
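A common way to realize orthogonal fine-tuning is a Cayley-parameterized rotation of the frozen pretrained weight, which preserves pairwise neuron angles; whether OrthSR uses this exact parameterization is an assumption:

```python
import torch
import torch.nn as nn

class OrthTune(nn.Module):
    """Rotate a frozen pretrained weight by a learned orthogonal matrix."""
    def __init__(self, weight):                   # weight: frozen (out, in)
        super().__init__()
        d = weight.shape[0]
        self.register_buffer("w0", weight)
        self.a = nn.Parameter(torch.zeros(d, d))  # unconstrained parameters

    def forward(self, x):
        q = self.a - self.a.T                     # skew-symmetric
        eye = torch.eye(q.shape[0], device=q.device)
        r = torch.linalg.solve(eye + q, eye - q)  # Cayley transform: orthogonal
        return x @ (r @ self.w0).T                # y = x (R W)^T
```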
arXiv Detail & Related papers (2024-07-11T10:35:53Z)
- Target-Embedding Autoencoders for Supervised Representation Learning [111.07204912245841]
This paper analyzes a framework for improving generalization in a purely supervised setting, where the target space is high-dimensional.
We motivate and formalize the general framework of target-embedding autoencoders (TEA) for supervised prediction, learning intermediate latent representations jointly optimized to be both predictable from features as well as predictive of targets.
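The joint objective can be sketched directly from that description: one term makes the latent code reconstruct the high-dimensional target, the other makes it predictable from the features. Network definitions and the weighting below are placeholders:

```python
import torch.nn.functional as F

def tea_loss(x, y, encode, decode, predict, lam=1.0):
    z = encode(y)                              # embed the high-dim target
    recon = F.mse_loss(decode(z), y)           # latent must reconstruct target
    pred = F.mse_loss(predict(x), z)           # and be predictable from x
    return recon + lam * pred                  # both terms shape the latent
```

At test time the target is unavailable, so prediction composes the two learned maps as decode(predict(x)).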
arXiv Detail & Related papers (2020-01-23T02:37:10Z)