Related papers: Probing the Robustness of Large Language Models Safety to Latent Perturbations

Probing the Robustness of Large Language Models Safety to Latent Perturbations

URL: http://arxiv.org/abs/2506.16078v1
Date: Thu, 19 Jun 2025 07:03:05 GMT
Title: Probing the Robustness of Large Language Models Safety to Latent Perturbations
Authors: Tianle Gu, Kexin Huang, Zongqi Wang, Yixu Wang, Jie Li, Yuanqi Yao, Yang Yao, Yujiu Yang, Yan Teng, Yingchun Wang,
Abstract summary: Safety alignment is a key requirement for building reliable Artificial General Intelligence.<n>We observe that minor latent shifts can still trigger unsafe responses in aligned models.<n>We introduce Layer-wise Adversarial Patch Training(LAPT), a fine-tuning strategy that injects controlled perturbations into hidden representations during training.
Score: 30.16804362984161
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Safety alignment is a key requirement for building reliable Artificial General Intelligence. Despite significant advances in safety alignment, we observe that minor latent shifts can still trigger unsafe responses in aligned models. We argue that this stems from the shallow nature of existing alignment methods, which focus on surface-level refusal behaviors without sufficiently altering internal representations. Consequently, small shifts in hidden activations can re-trigger harmful behaviors embedded in the latent space. To explore the robustness of safety alignment to latent perturbations, we introduce a probing method that measures the Negative Log-Likelihood of the original response generated by the model. This probe quantifies local sensitivity in the latent space, serving as a diagnostic tool for identifying vulnerable directions. Based on this signal, we construct effective jailbreak trajectories, giving rise to the Activation Steering Attack (ASA). More importantly, these insights offer a principled foundation for improving alignment robustness. To this end, we introduce Layer-wise Adversarial Patch Training~(LAPT), a fine-tuning strategy that inject controlled perturbations into hidden representations during training. Experimental results highlight that LAPT strengthen alignment robustness without compromising general capabilities. Our findings reveal fundamental flaws in current alignment paradigms and call for representation-level training strategies that move beyond surface-level behavior supervision. Codes and results are available at https://github.com/Carol-gutianle/LatentSafety.

Related papers

Steering Externalities: Benign Activation Steering Unintentionally Increases Jailbreak Risk for Large Language Models [62.16655896700062]
Activation steering is a technique to enhance the utility of Large Language Models (LLMs)<n>We show that it unintentionally introduces critical and under-explored safety risks.<n>Experiments reveal that these interventions act as a force multiplier, creating new vulnerabilities to jailbreaks and increasing attack success rates to over 80% on standard benchmarks.
arXiv Detail & Related papers (2026-02-03T12:32:35Z)
LatentGuard: Controllable Latent Steering for Robust Refusal of Attacks and Reliable Response Generation [4.29885665563186]
LATENTGUARD is a framework that combines behavioral alignment with supervised latent space control for interpretable and precise safety steering.<n>Our results show significant improvements in both safety controllability and response interpretability without compromising utility.
arXiv Detail & Related papers (2025-09-24T07:31:54Z)
Embedding Poisoning: Bypassing Safety Alignment via Embedding Semantic Shift [23.0914017433021]
This work identifies a novel class of deployment phase attacks that exploit a vulnerability by injecting imperceptible perturbations directly into the embedding layer outputs without modifying model weights or input text.<n>We propose Search based Embedding Poisoning, a practical, model agnostic framework that introduces carefully optimized perturbations into embeddings associated with high risk tokens.
arXiv Detail & Related papers (2025-09-08T05:00:58Z)
SafeCtrl: Region-Based Safety Control for Text-to-Image Diffusion via Detect-Then-Suppress [48.20360860166279]
We introduce SafeCtrl, a lightweight, non-intrusive plugin that first precisely localizes unsafe content.<n>Instead of performing a hard A-to-B substitution, SafeCtrl then suppresses the harmful semantics, allowing the generative process to naturally and coherently resolve into a safe, context-aware alternative.
arXiv Detail & Related papers (2025-08-16T04:28:52Z)
Re-Emergent Misalignment: How Narrow Fine-Tuning Erodes Safety Alignment in LLMs [0.0]
We show that fine tuning on insecure code induces internal changes that oppose alignment.<n>We identify a shared latent dimension in the model's activation space that governs alignment behavior.
arXiv Detail & Related papers (2025-07-04T15:36:58Z)
Shape it Up! Restoring LLM Safety during Finetuning [66.46166656543761]
Finetuning large language models (LLMs) enables user-specific customization but introduces critical safety risks.<n>We propose dynamic safety shaping (DSS), a framework that uses fine-grained safety signals to reinforce learning from safe segments of a response while suppressing unsafe content.<n>We present STAR-DSS, guided by STAR scores, that robustly mitigates finetuning risks and delivers substantial safety improvements across diverse threats, datasets, and model families.
arXiv Detail & Related papers (2025-05-22T18:05:16Z)
Safety Subspaces are Not Distinct: A Fine-Tuning Case Study [4.724646466332421]
We study whether safety-relevant behavior is concentrated in specific subspaces.<n>We find no evidence of a subspace that selectively governs safety.<n>This suggests that subspace-based defenses may face fundamental limitations.
arXiv Detail & Related papers (2025-05-20T10:41:49Z)
One Trigger Token Is Enough: A Defense Strategy for Balancing Safety and Usability in Large Language Models [20.42976162135529]
Large Language Models (LLMs) have been extensively used across diverse domains, including virtual assistants, automated code generation, and scientific research.<n>We propose textttD-STT, a simple yet effective defense algorithm that identifies and explicitly decodes safety trigger tokens of the given safety-aligned LLM.
arXiv Detail & Related papers (2025-05-12T01:26:50Z)
AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender [73.09848497762667]
We propose AdaSteer, an adaptive activation steering method that adjusts model behavior based on input characteristics.<n>AdaSteer steers input representations along both the Rejection Direction (RD) and Harmfulness Direction (HD)<n>Our results highlight the potential of interpretable model internals for real-time, flexible safety enforcement in LLMs.
arXiv Detail & Related papers (2025-04-13T07:39:17Z)
Probing Latent Subspaces in LLM for AI Security: Identifying and Manipulating Adversarial States [0.0]
Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks.<n>Yet they remain vulnerable to adversarial manipulations such as jailbreaking via prompt injection attacks.<n>We investigated the underlying latent subspaces of safe and jailbroken states by extracting hidden activations from a LLM.
arXiv Detail & Related papers (2025-03-12T04:59:22Z)
Vulnerability Mitigation for Safety-Aligned Language Models via Debiasing [12.986006070964772]
Safety alignment is an essential research topic for real-world AI applications.<n>Our study first identified the difficulty of eliminating such vulnerabilities without sacrificing the model's helpfulness.<n>Our method could enhance the model's helpfulness while maintaining safety, thus improving the trade-off-front.
arXiv Detail & Related papers (2025-02-04T09:31:54Z)
SCANS: Mitigating the Exaggerated Safety for LLMs via Safety-Conscious Activation Steering [56.92068213969036]
Safety alignment is indispensable for Large Language Models (LLMs) to defend threats from malicious instructions.<n>Recent researches reveal safety-aligned LLMs prone to reject benign queries due to the exaggerated safety issue.<n>We propose a Safety-Conscious Activation Steering (SCANS) method to mitigate the exaggerated safety concerns.
arXiv Detail & Related papers (2024-08-21T10:01:34Z)
BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Language Models [57.5404308854535]
Safety backdoor attacks in large language models (LLMs) enable the stealthy triggering of unsafe behaviors while evading detection during normal interactions. We present BEEAR, a mitigation approach leveraging the insight that backdoor triggers induce relatively uniform drifts in the model's embedding space. Our bi-level optimization method identifies universal embedding perturbations that elicit unwanted behaviors and adjusts the model parameters to reinforce safe behaviors against these perturbations.
arXiv Detail & Related papers (2024-06-24T19:29:47Z)
Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications [69.13807233595455]
Large language models (LLMs) show inherent brittleness in their safety mechanisms. This study explores this brittleness of safety alignment by leveraging pruning and low-rank modifications. We show that LLMs remain vulnerable to low-cost fine-tuning attacks even when modifications to the safety-critical regions are restricted.
arXiv Detail & Related papers (2024-02-07T18:34:38Z)
Policy Smoothing for Provably Robust Reinforcement Learning [109.90239627115336]
We study the provable robustness of reinforcement learning against norm-bounded adversarial perturbations of the inputs. We generate certificates that guarantee that the total reward obtained by the smoothed policy will not fall below a certain threshold under a norm-bounded adversarial of perturbation the input.
arXiv Detail & Related papers (2021-06-21T21:42:08Z)

This list is automatically generated from the titles and abstracts of the papers in this site.