Narrow fine-tuning erodes safety alignment in vision-language agents
- URL: http://arxiv.org/abs/2602.16931v1
- Date: Wed, 18 Feb 2026 22:47:28 GMT
- Title: Narrow fine-tuning erodes safety alignment in vision-language agents
- Authors: Idhant Gulati, Shivam Raval,
- Abstract summary: Lifelong multimodal agents must continuously adapt to new tasks through post-training. We demonstrate that fine-tuning aligned vision-language models on narrow-domain harmful datasets induces severe emergent misalignment.
- Score: 0.12441041004077093
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Lifelong multimodal agents must continuously adapt to new tasks through post-training, but this creates fundamental tension between acquiring capabilities and preserving safety alignment. We demonstrate that fine-tuning aligned vision-language models on narrow-domain harmful datasets induces severe emergent misalignment that generalizes broadly across unrelated tasks and modalities. Through experiments on Gemma3-4B, we show that misalignment scales monotonically with LoRA rank, and that multimodal evaluation reveals substantially higher misalignment ($70.71 \pm 1.22$ at $r=128$) than text-only evaluation ($41.19 \pm 2.51$), suggesting that unimodal safety benchmarks may underestimate alignment degradation in vision-language models. Critically, even 10\% harmful data in the training mixture induces substantial alignment degradation. Geometric analysis reveals that harmful behaviors occupy a remarkably low-dimensional subspace, with the majority of misalignment information captured in 10 principal components. To mitigate misalignment, we evaluate two strategies: benign narrow fine-tuning and activation-based steering. While both approaches substantially reduce misalignment, neither completely removes the learned harmful behaviors. Our findings highlight the need for robust continual learning frameworks, as current post-training paradigms may not sufficiently preserve alignment in post-deployment settings.
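The rank-scaling experiment described above follows a standard LoRA post-training recipe. The snippet below is a minimal sketch of that setup, assuming a Hugging Face Gemma 3 checkpoint id ("google/gemma-3-4b-it") and a hypothetical narrow_domain.jsonl file with a "text" field; it is not the authors' training code, and the hyperparameters are illustrative.

```python
# Minimal sketch of narrow-domain LoRA fine-tuning at several ranks.
# Assumptions (not from the paper): the "google/gemma-3-4b-it" checkpoint id,
# a hypothetical narrow_domain.jsonl file with a "text" field, and illustrative
# hyperparameters. Loading details for the multimodal checkpoint are simplified;
# this sketch fine-tunes the text pathway only.
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

MODEL_ID = "google/gemma-3-4b-it"  # assumed checkpoint id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
dataset = load_dataset("json", data_files="narrow_domain.jsonl", split="train")
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
    remove_columns=dataset.column_names,
)

for rank in (8, 32, 128):  # misalignment is reported to grow monotonically with rank
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
    lora_cfg = LoraConfig(
        r=rank, lora_alpha=2 * rank, lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_cfg)
    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir=f"lora_r{rank}", per_device_train_batch_size=4,
                               num_train_epochs=1, learning_rate=2e-4, logging_steps=50),
        train_dataset=dataset,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()
    model.save_pretrained(f"lora_r{rank}")
```

Each saved adapter would then be scored with both text-only and multimodal safety evaluations to trace misalignment as a function of rank.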
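The geometric analysis and the activation-steering mitigation can likewise be sketched: collect hidden-state activations for misaligned and aligned prompt sets, take the top principal components of their difference, and project that subspace out of the hidden states at inference. The layer index, prompt sets, and the choice of k=10 below are placeholders, not values or code from the paper.

```python
# Illustrative reconstruction of the low-dimensional subspace analysis and
# projection-based steering. The probed layer, the prompt sets, and k=10 are
# placeholders; this is not the authors' code.
import numpy as np
import torch

def collect_activations(model, tokenizer, prompts, layer_idx):
    """Mean-pooled hidden states from one decoder layer, one row per prompt."""
    feats = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
        hidden = out.hidden_states[layer_idx][0]            # (seq_len, d_model)
        feats.append(hidden.mean(dim=0).float().cpu().numpy())
    return np.stack(feats)                                   # (n_prompts, d_model)

def misalignment_subspace(harmful_acts, benign_acts, k=10):
    """Top-k principal directions of the harmful-minus-benign activation shift."""
    diffs = harmful_acts - benign_acts                        # paired prompts assumed
    diffs = diffs - diffs.mean(axis=0, keepdims=True)
    _, s, vt = np.linalg.svd(diffs, full_matrices=False)
    explained = float((s[:k] ** 2).sum() / (s ** 2).sum())    # variance in top-k PCs
    return vt[:k], explained

def project_out(hidden, basis):
    """Remove the identified subspace from a hidden-state tensor (steering)."""
    b = torch.as_tensor(basis, dtype=hidden.dtype, device=hidden.device)  # (k, d_model)
    return hidden - (hidden @ b.T) @ b
```

In practice the projection would be attached as a forward hook on the chosen decoder layer, so that generation runs with the identified misalignment directions removed.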
Related papers
- Position: General Alignment Has Hit a Ceiling; Edge Alignment Must Be Taken Seriously [51.03213216886717]
We take the position that the dominant paradigm of General Alignment reaches a structural ceiling in settings with conflicting values. We introduce Edge Alignment as a distinct approach in which systems preserve a multi-dimensional value structure.
arXiv Detail & Related papers (2026-02-23T16:51:43Z) - OmniVL-Guard: Towards Unified Vision-Language Forgery Detection and Grounding via Balanced RL [63.388513841293616]
Existing forgery detection methods fail to handle the interleaved text, images, and videos prevalent in real-world misinformation. To bridge this gap, this paper aims to develop a unified framework for omnibus vision-language forgery detection and grounding. We propose OmniVL-Guard, a balanced reinforcement learning framework for omnibus vision-language forgery detection and grounding.
arXiv Detail & Related papers (2026-02-11T09:41:36Z) - Understanding Degradation with Vision Language Model [56.09241449206817]
Understanding visual degradations is a critical yet challenging problem in computer vision. We introduce DU-VLM, a multimodal chain-of-thought model trained with supervised fine-tuning and reinforcement learning. We also introduce DU-110k, a large-scale dataset comprising 110,000 clean-degraded pairs with grounded physical annotations.
arXiv Detail & Related papers (2026-02-04T13:51:15Z) - Character as a Latent Variable in Large Language Models: A Mechanistic Account of Emergent Misalignment and Conditional Safety Failures [70.48661957773449]
Emergent Misalignment refers to a failure mode in which fine-tuning large language models on narrowly scoped data induces broadly misaligned behavior. Across multiple domains and model families, we find that fine-tuning models on data exhibiting specific character-level dispositions induces substantially stronger and more transferable misalignment than incorrect-advice fine-tuning.
arXiv Detail & Related papers (2026-01-30T15:28:42Z) - The Realignment Problem: When Right becomes Wrong in LLMs [6.8304813545377]
The alignment of Large Language Models with human values is central to their safe deployment, yet current models fail to keep pace with evolving norms and policies. Existing unlearning methods act as blunt instruments that erode utility rather than enable precise policy updates. We introduce TRACE, a framework for principled unlearning that reconceives realignment as a programmatic policy problem.
arXiv Detail & Related papers (2025-11-04T14:52:58Z) - DUAL-Bench: Measuring Over-Refusal and Robustness in Vision-Language Models [59.45605332033458]
Safety mechanisms can backfire, causing over-refusal, where models decline benign requests out of excessive caution. No existing benchmark has systematically addressed over-refusal in the visual modality. This setting introduces unique challenges, such as dual-use cases where an instruction is harmless, but the accompanying image contains harmful content.
arXiv Detail & Related papers (2025-10-12T23:21:34Z) - Re-Emergent Misalignment: How Narrow Fine-Tuning Erodes Safety Alignment in LLMs [0.0]
We show that fine-tuning on insecure code induces internal changes that oppose alignment. We identify a shared latent dimension in the model's activation space that governs alignment behavior.
arXiv Detail & Related papers (2025-07-04T15:36:58Z) - Probing the Robustness of Large Language Models Safety to Latent Perturbations [30.16804362984161]
Safety alignment is a key requirement for building reliable Artificial General Intelligence. We observe that minor latent shifts can still trigger unsafe responses in aligned models. We introduce Layer-wise Adversarial Patch Training (LAPT), a fine-tuning strategy that injects controlled perturbations into hidden representations during training.
arXiv Detail & Related papers (2025-06-19T07:03:05Z) - Why LLM Safety Guardrails Collapse After Fine-tuning: A Similarity Analysis Between Alignment and Fine-tuning Datasets [64.96967819446553]
This paper investigates the degradation of safety guardrails through the lens of representation similarity between upstream alignment datasets and downstream fine-tuning tasks. High similarity between these datasets significantly weakens safety guardrails, making models more susceptible to jailbreaks. Low similarity between these two types of datasets yields substantially more robust models and thus reduces the harmfulness score by up to 10.33%. (A rough similarity sketch is given after this list.)
arXiv Detail & Related papers (2025-06-05T17:59:55Z) - Do We Really Need Curated Malicious Data for Safety Alignment in Multi-modal Large Language Models? [83.53005932513155]
Multi-modal large language models (MLLMs) have made significant progress, yet their safety alignment remains limited. We propose fine-tuning MLLMs on a small set of benign instruction-following data with responses replaced by simple, clear rejection sentences.
arXiv Detail & Related papers (2025-04-14T09:03:51Z)
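Several of the entries above describe diagnostics that are straightforward to prototype. For example, the dataset-similarity analysis from the guardrail-collapse paper can be approximated by comparing pooled embeddings of an alignment set and a fine-tuning set; the sketch below uses an off-the-shelf sentence encoder and plain cosine similarity, which are illustrative stand-ins rather than the paper's actual metric.

```python
# Rough sketch of a representation-similarity check between an upstream alignment
# dataset and a downstream fine-tuning dataset. The sentence encoder and the plain
# cosine measure are illustrative choices, not the cited paper's exact metric.
import numpy as np
from sentence_transformers import SentenceTransformer

def mean_embedding(texts, encoder):
    """Average of L2-normalized sentence embeddings for a list of strings."""
    return encoder.encode(texts, normalize_embeddings=True).mean(axis=0)

def dataset_similarity(alignment_texts, finetune_texts,
                       model_name="sentence-transformers/all-MiniLM-L6-v2"):
    encoder = SentenceTransformer(model_name)
    a = mean_embedding(alignment_texts, encoder)
    b = mean_embedding(finetune_texts, encoder)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Per the cited finding, a higher score would flag a fine-tuning set that is more
# likely to weaken the model's safety guardrails.
```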
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.