Circumventing Safety Alignment in Large Language Models Through Embedding Space Toxicity Attenuation
- URL: http://arxiv.org/abs/2507.08020v1
- Date: Tue, 08 Jul 2025 03:01:00 GMT
- Title: Circumventing Safety Alignment in Large Language Models Through Embedding Space Toxicity Attenuation
- Authors: Zhibo Zhang, Yuxi Li, Kailong Wang, Shuai Yuan, Ling Shi, Haoyu Wang
- Abstract summary: Large Language Models (LLMs) have achieved remarkable success across domains such as healthcare, education, and cybersecurity. Embedding space poisoning is a subtle attack vector where adversaries manipulate the internal semantic representations of input data to bypass safety alignment mechanisms. We propose ETTA, a novel framework that identifies and attenuates toxicity-sensitive dimensions in embedding space via linear transformations.
- Score: 13.971909819796762
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have achieved remarkable success across domains such as healthcare, education, and cybersecurity. However, the openness of these models also introduces significant security risks, particularly through embedding space poisoning, a subtle attack vector in which adversaries manipulate the internal semantic representations of input data to bypass safety alignment mechanisms. While previous research has investigated universal perturbation methods, the dynamics of LLM safety alignment at the embedding level remain insufficiently understood, and more targeted, precise adversarial perturbation techniques, which pose a significant threat, have not been adequately studied. In this work, we propose ETTA (Embedding Transformation Toxicity Attenuation), a novel framework that identifies and attenuates toxicity-sensitive dimensions in embedding space via linear transformations. ETTA bypasses model refusal behaviors while preserving linguistic coherence, without requiring model fine-tuning or access to training data. Evaluated on five representative open-source LLMs using the AdvBench benchmark, ETTA achieves a high average attack success rate of 88.61%, outperforming the best baseline by 11.34%, and generalizes to safety-enhanced models (e.g., 77.39% ASR on instruction-tuned defenses). These results highlight a critical vulnerability in current alignment strategies and underscore the need for embedding-aware defenses.
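The abstract describes ETTA only at a high level. As a rough illustration of the general idea (not the authors' implementation), the sketch below estimates a toxicity-sensitive direction as the difference of mean embeddings between harmful and benign prompts, then builds a linear map that attenuates that direction; the estimation method, the attenuation factor `alpha`, and all function names are assumptions made for illustration only.

```python
# Rough sketch of embedding-space toxicity attenuation (illustrative only).
import torch

def toxicity_direction(harmful_embs: torch.Tensor, benign_embs: torch.Tensor) -> torch.Tensor:
    """Estimate a unit-norm 'toxicity-sensitive' direction from (n, d) prompt
    embeddings as the difference of the two class means (an assumption, not
    necessarily how ETTA identifies its dimensions)."""
    v = harmful_embs.mean(dim=0) - benign_embs.mean(dim=0)
    return v / v.norm()

def attenuation_matrix(direction: torch.Tensor, alpha: float = 0.1) -> torch.Tensor:
    """Linear map W = I - (1 - alpha) * v v^T: scales the component of a hidden
    state along `direction` by alpha, leaving orthogonal components untouched."""
    d = direction.numel()
    v = direction.view(d, 1)
    return torch.eye(d) - (1.0 - alpha) * (v @ v.T)

# Usage with random stand-ins for real prompt embeddings (hypothetical):
harmful = torch.randn(32, 768)
benign = torch.randn(32, 768)
W = attenuation_matrix(toxicity_direction(harmful, benign), alpha=0.1)
hidden = torch.randn(5, 768)        # hidden states of one prompt (seq_len, d)
hidden_attenuated = hidden @ W.T    # apply at a chosen layer before generation
```

Because `W` only rescales a single direction, the transformation leaves the rest of the representation (and hence linguistic coherence) largely intact, which matches the behavior the abstract claims.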
Related papers
- Probing the Robustness of Large Language Models Safety to Latent Perturbations [30.16804362984161]
Safety alignment is a key requirement for building reliable Artificial General Intelligence. We observe that minor latent shifts can still trigger unsafe responses in aligned models. We introduce Layer-wise Adversarial Patch Training (LAPT), a fine-tuning strategy that injects controlled perturbations into hidden representations during training.
arXiv Detail & Related papers (2025-06-19T07:03:05Z) - LoX: Low-Rank Extrapolation Robustifies LLM Safety Against Fine-tuning [61.594212398272184]
Low-Rank Extrapolation (LoX) improves robustness against benign and malicious fine-tuning attacks. LoX leads to 11% to 54% absolute reductions in attack success rates.
arXiv Detail & Related papers (2025-06-18T16:30:02Z) - Learning Safety Constraints for Large Language Models [41.95596134688853]
Large language models (LLMs) pose significant safety risks through harmful outputs and vulnerability to adversarial attacks. We propose SaP, a geometric approach to safety that learns and enforces multiple safety constraints directly in the model's representation space. We develop a framework that identifies safe and unsafe regions via the polytope's facets, enabling both detection and correction of unsafe outputs.
arXiv Detail & Related papers (2025-05-30T10:30:24Z) - Shape it Up! Restoring LLM Safety during Finetuning [66.46166656543761]
Finetuning large language models (LLMs) enables user-specific customization but introduces critical safety risks. We propose dynamic safety shaping (DSS), a framework that uses fine-grained safety signals to reinforce learning from safe segments of a response while suppressing unsafe content. We present STAR-DSS, guided by STAR scores, which robustly mitigates finetuning risks and delivers substantial safety improvements across diverse threats, datasets, and model families.
arXiv Detail & Related papers (2025-05-22T18:05:16Z) - Safety Alignment Can Be Not Superficial With Explicit Safety Signals [8.297367440457508]
Recent studies on the safety alignment of large language models (LLMs) have revealed that existing approaches often operate superficially. This paper identifies a fundamental cause of this superficiality: existing alignment approaches presume that models can implicitly learn a safety-related reasoning task during the alignment process. By explicitly introducing a safety-related binary classification task and integrating its signals with our attention and decoding strategies, we eliminate this ambiguity.
arXiv Detail & Related papers (2025-05-19T20:40:46Z) - Safety Pretraining: Toward the Next Generation of Safe AI [61.2816320807586]
We present a data-centric pretraining framework that builds safety into the model from the start. Our contributions include: (i) a safety classifier trained on 10,000 GPT-4 labeled examples, used to filter 600B tokens; (ii) the largest synthetic safety dataset to date, generated via recontextualization of harmful web data; and (iv) Harmfulness-Tag annotations injected during pretraining to flag unsafe content.
arXiv Detail & Related papers (2025-04-23T17:58:08Z) - Representation Bending for Large Language Model Safety [27.842146980762934]
Large Language Models (LLMs) have emerged as powerful tools, but their inherent safety risks pose significant challenges. This paper introduces RepBend, a novel approach that fundamentally disrupts the representations underlying harmful behaviors in LLMs. RepBend achieves state-of-the-art performance, outperforming prior methods such as Circuit Breaker, RMU, and NPO, with up to 95% reduction in attack success rates.
arXiv Detail & Related papers (2025-04-02T09:47:01Z) - Model-Editing-Based Jailbreak against Safety-aligned Large Language Models [13.887770576598646]
Large Language Models (LLMs) have transformed numerous fields by enabling advanced natural language interactions. This paper presents Targeted Model Editing (TME), a novel white-box approach that bypasses safety filters. TME identifies and removes safety-critical transformations (SCTs) embedded in model matrices, enabling malicious queries to bypass restrictions.
arXiv Detail & Related papers (2024-12-11T08:44:15Z) - Transferable Adversarial Attacks on SAM and Its Downstream Models [87.23908485521439]
This paper explores the feasibility of adversarially attacking various downstream models fine-tuned from the segment anything model (SAM). To enhance the effectiveness of the adversarial attack towards models fine-tuned on unknown datasets, we propose a universal meta-initialization (UMI) algorithm.
arXiv Detail & Related papers (2024-10-26T15:04:04Z) - Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training [67.30423823744506]
We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse compliance to harmful prompts at any response position. DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by appending a segment of harmful response to the beginning of a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to safety refusal consistently throughout the harmful response sequence.
arXiv Detail & Related papers (2024-07-12T09:36:33Z) - BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Language Models [57.5404308854535]
Safety backdoor attacks in large language models (LLMs) enable the stealthy triggering of unsafe behaviors while evading detection during normal interactions.
We present BEEAR, a mitigation approach leveraging the insight that backdoor triggers induce relatively uniform drifts in the model's embedding space.
Our bi-level optimization method identifies universal embedding perturbations that elicit unwanted behaviors and adjusts the model parameters to reinforce safe behaviors against these perturbations.
arXiv Detail & Related papers (2024-06-24T19:29:47Z)
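The BEEAR entry above describes a bi-level optimization over universal embedding-space perturbations. The following is a hypothetical schematic of such a loop, not the paper's code: `unsafe_loss` and `safe_loss` are placeholder callables that are assumed to inject the perturbation `delta` into the model's token embeddings internally, and all hyperparameters are invented for illustration.

```python
# Hypothetical schematic of a BEEAR-style bi-level loop (not the authors' code).
import torch

def bi_level_mitigation(model, batches, unsafe_loss, safe_loss, embed_dim,
                        inner_steps=5, inner_lr=1e-2, outer_lr=1e-5):
    # Universal embedding-space perturbation shared across inputs.
    delta = torch.zeros(embed_dim, requires_grad=True)
    opt_model = torch.optim.AdamW(model.parameters(), lr=outer_lr)
    opt_delta = torch.optim.SGD([delta], lr=inner_lr)
    for batch in batches:
        # Inner step: optimize delta so that it elicits the unwanted behaviour
        # (unsafe_loss is assumed to add delta to the token embeddings and to
        # be low when the model misbehaves).
        for _ in range(inner_steps):
            opt_delta.zero_grad()
            unsafe_loss(model, batch, delta).backward()
            opt_delta.step()
        # Outer step: update the model to behave safely even under delta
        # (safe_loss is low when the model refuses or answers safely).
        opt_model.zero_grad()
        safe_loss(model, batch, delta.detach()).backward()
        opt_model.step()
    return model
```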