Related papers: Semantic Representation Attack against Aligned Large Language Models

Semantic Representation Attack against Aligned Large Language Models

URL: http://arxiv.org/abs/2509.19360v2
Date: Mon, 20 Oct 2025 09:45:40 GMT
Title: Semantic Representation Attack against Aligned Large Language Models
Authors: Jiawei Lian, Jianhong Pan, Lefan Wang, Yi Wang, Shaohui Mei, Lap-Pui Chau,
Abstract summary: Large Language Models (LLMs) increasingly employ alignment techniques to prevent harmful outputs.<n>Current methods typically target exact affirmative responses, suffering from limited convergence, unnatural prompts, and high computational costs.<n>We introduce Semantic Representation Attack, a novel paradigm that fundamentally reconceptualizes adversarial objectives against aligned LLMs.
Score: 18.13997425681567
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) increasingly employ alignment techniques to prevent harmful outputs. Despite these safeguards, attackers can circumvent them by crafting prompts that induce LLMs to generate harmful content. Current methods typically target exact affirmative responses, such as ``Sure, here is...'', suffering from limited convergence, unnatural prompts, and high computational costs. We introduce Semantic Representation Attack, a novel paradigm that fundamentally reconceptualizes adversarial objectives against aligned LLMs. Rather than targeting exact textual patterns, our approach exploits the semantic representation space comprising diverse responses with equivalent harmful meanings. This innovation resolves the inherent trade-off between attack efficacy and prompt naturalness that plagues existing methods. The Semantic Representation Heuristic Search algorithm is proposed to efficiently generate semantically coherent and concise adversarial prompts by maintaining interpretability during incremental expansion. We establish rigorous theoretical guarantees for semantic convergence and demonstrate that our method achieves unprecedented attack success rates (89.41\% averaged across 18 LLMs, including 100\% on 11 models) while maintaining stealthiness and efficiency. Comprehensive experimental results confirm the overall superiority of our Semantic Representation Attack. The code will be publicly available.

Related papers

PromptSleuth: Detecting Prompt Injection via Semantic Intent Invariance [10.105673138616483]
Large Language Models (LLMs) are increasingly integrated into real-world applications, from virtual assistants to autonomous agents.<n>As attackers evolve with paraphrased, obfuscated, and even multi-task injection strategies, existing benchmarks are no longer sufficient to capture the full spectrum of emerging threats.<n>We propose PromptSleuth, a semantic-oriented defense framework that detects prompt injection by reasoning over task-level intent rather than surface features.
arXiv Detail & Related papers (2025-08-28T15:19:07Z)
Mitigating Jailbreaks with Intent-Aware LLMs [42.48292327349576]
Large language models (LLMs) remain vulnerable to jailbreak attacks via adversarially crafted instructions.<n>We propose Intent-FT, a simple and lightweight fine-tuning approach that explicitly trains LLMs to infer the underlying intent of an instruction before responding.<n> Empirically, Intent-FT consistently mitigates all evaluated attack categories, with no single attack exceeding a 50% success rate.
arXiv Detail & Related papers (2025-08-16T15:03:33Z)
REINFORCE Adversarial Attacks on Large Language Models: An Adaptive, Distributional, and Semantic Objective [57.57786477441956]
We propose an adaptive and semantic optimization problem over the population of responses.<n>Our objective doubles the attack success rate (ASR) on Llama3 and increases the ASR from 2% to 50% with circuit breaker defense.
arXiv Detail & Related papers (2025-02-24T15:34:48Z)
Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models [53.580928907886324]
Reasoning-Augmented Conversation is a novel multi-turn jailbreak framework.<n>It reformulates harmful queries into benign reasoning tasks.<n>We show that RACE achieves state-of-the-art attack effectiveness in complex conversational scenarios.
arXiv Detail & Related papers (2025-02-16T09:27:44Z)
White-box Multimodal Jailbreaks Against Large Vision-Language Models [61.97578116584653]
We propose a more comprehensive strategy that jointly attacks both text and image modalities to exploit a broader spectrum of vulnerability within Large Vision-Language Models. Our attack method begins by optimizing an adversarial image prefix from random noise to generate diverse harmful responses in the absence of text input. An adversarial text suffix is integrated and co-optimized with the adversarial image prefix to maximize the probability of eliciting affirmative responses to various harmful instructions.
arXiv Detail & Related papers (2024-05-28T07:13:30Z)
Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing [107.97160023681184]
Aligned large language models (LLMs) are vulnerable to jailbreaking attacks. We propose SEMANTICSMOOTH, a smoothing-based defense that aggregates predictions of semantically transformed copies of a given input prompt.
arXiv Detail & Related papers (2024-02-25T20:36:03Z)
ASETF: A Novel Method for Jailbreak Attack on LLMs through Translate Suffix Embeddings [58.82536530615557]
We propose an Adversarial Suffix Embedding Translation Framework (ASETF) to transform continuous adversarial suffix embeddings into coherent and understandable text. Our method significantly reduces the computation time of adversarial suffixes and achieves a much better attack success rate to existing techniques.
arXiv Detail & Related papers (2024-02-25T06:46:27Z)
Adversarial Text Purification: A Large Language Model Approach for Defense [25.041109219049442]
Adversarial purification is a defense mechanism for safeguarding classifiers against adversarial attacks. We propose a novel adversarial text purification that harnesses the generative capabilities of Large Language Models. Our proposed method demonstrates remarkable performance over various classifiers, improving their accuracy under the attack by over 65% on average.
arXiv Detail & Related papers (2024-02-05T02:36:41Z)
Semantic-Preserving Adversarial Code Comprehension [75.76118224437974]
We propose Semantic-Preserving Adversarial Code Embeddings (SPACE) to find the worst-case semantic-preserving attacks. Experiments and analysis demonstrate that SPACE can stay robust against state-of-the-art attacks while boosting the performance of PrLMs for code.
arXiv Detail & Related papers (2022-09-12T10:32:51Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.