Mask-GCG: Are All Tokens in Adversarial Suffixes Necessary for Jailbreak Attacks?
- URL: http://arxiv.org/abs/2509.06350v1
- Date: Mon, 08 Sep 2025 05:45:37 GMT
- Title: Mask-GCG: Are All Tokens in Adversarial Suffixes Necessary for Jailbreak Attacks?
- Authors: Junjie Mu, Zonghao Ying, Zhekui Fan, Zonglei Jing, Yaoyuan Zhang, Zhengmin Yu, Wenxin Zhang, Quanchen Zou, Xiangzheng Zhang
- Abstract summary: We propose Mask-GCG, a plug-and-play method that employs learnable token masking to identify impactful tokens within the suffix. Our approach increases the update probability for tokens at high-impact positions while pruning those at low-impact positions.
- Score: 3.5954282637912787
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Jailbreak attacks on Large Language Models (LLMs) have demonstrated various successful methods whereby attackers manipulate models into generating harmful responses that they are designed to avoid. Among these, Greedy Coordinate Gradient (GCG) has emerged as a general and effective approach that optimizes the tokens in a suffix to generate jailbreakable prompts. While several improved variants of GCG have been proposed, they all rely on fixed-length suffixes. However, the potential redundancy within these suffixes remains unexplored. In this work, we propose Mask-GCG, a plug-and-play method that employs learnable token masking to identify impactful tokens within the suffix. Our approach increases the update probability for tokens at high-impact positions while pruning those at low-impact positions. This pruning not only reduces redundancy but also decreases the size of the gradient space, thereby lowering computational overhead and shortening the time required to achieve successful attacks compared to GCG. We evaluate Mask-GCG by applying it to the original GCG and several improved variants. Experimental results show that most tokens in the suffix contribute significantly to attack success, and pruning a minority of low-impact tokens does not affect the loss values or compromise the attack success rate (ASR), thereby revealing token redundancy in LLM prompts. Our findings provide insights for developing efficient and interpretable LLMs from the perspective of jailbreak attacks.
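The abstract describes the masking mechanism only at a high level, so the following is a minimal PyTorch sketch of how a learnable per-position mask could bias GCG's token updates toward high-impact suffix positions and prune low-impact ones. All names (mask_logits, prune_low_impact, the 0.02 threshold) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the masking idea described in the abstract. Names and
# the exact update rule are assumptions, not the authors' code.
import torch

suffix_len = 20                                             # adversarial suffix length
mask_logits = torch.zeros(suffix_len, requires_grad=True)   # learnable per-position mask

def position_update_probs(logits: torch.Tensor) -> torch.Tensor:
    """Turn mask logits into a distribution over suffix positions, so that
    high-impact positions are proposed for substitution more often."""
    return torch.softmax(logits.detach(), dim=0)

def sample_positions_to_update(logits: torch.Tensor, k: int) -> torch.Tensor:
    """Sample k suffix positions for a GCG-style token swap, biased toward
    the positions the learned mask considers impactful."""
    probs = position_update_probs(logits)
    return torch.multinomial(probs, num_samples=k, replacement=False)

def prune_low_impact(active: torch.Tensor, logits: torch.Tensor,
                     threshold: float = 0.02) -> torch.Tensor:
    """Drop positions whose learned importance falls below a threshold,
    shrinking both the suffix and the gradient space GCG searches over."""
    probs = position_update_probs(logits)
    return active[probs[active] >= threshold]

# Per attack step one would: (i) backprop the jailbreak loss into
# mask_logits alongside the usual one-hot token gradient, (ii) try token
# swaps at sampled high-impact positions as in GCG, and (iii) periodically
# prune positions the mask has learned to ignore.
active_positions = torch.arange(suffix_len)
candidates = sample_positions_to_update(mask_logits, k=4)
active_positions = prune_low_impact(active_positions, mask_logits)
```

The point of the pruning step, per the abstract, is that dropping positions shrinks the (suffix length x vocabulary) gradient table that GCG ranks candidates over, which is where the claimed reduction in computational overhead and attack time comes from.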
Related papers
- Beyond Suffixes: Token Position in GCG Adversarial Attacks on Large Language Models [0.0]
We focus on the prevalent Greedy Coordinate Gradient (GCG) attack and identify a previously underexplored axis of jailbreak attacks. Using GCG as a case study, we show that both optimizing attacks to generate prefixes instead of suffixes and varying the adversarial token position during evaluation substantially influence attack success rates.
arXiv Detail & Related papers (2026-02-03T08:53:35Z)
- Bag of Tricks for Subverting Reasoning-based Safety Guardrails [62.139297207938036]
We present a bag of jailbreak methods that subvert the reasoning-based guardrails. Our attacks span white-, gray-, and black-box settings and range from effortless template manipulations to fully automated optimization.
arXiv Detail & Related papers (2025-10-13T16:16:44Z)
- Geneshift: Impact of different scenario shift on Jailbreaking LLM [55.26229741296822]
We propose a black-box jailbreak attack termed GeneShift, which uses a genetic algorithm to optimize scenario shifts. We show that GeneShift increases the jailbreak success rate from 0% to 60% in cases where direct prompting alone fails.
arXiv Detail & Related papers (2025-04-10T20:02:35Z)
- Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense [55.77152277982117]
We introduce Layer-AdvPatcher, a methodology designed to defend against jailbreak attacks. We use an unlearning strategy to patch specific layers within large language models through self-augmented datasets. Our framework reduces the harmfulness and attack success rate of jailbreak attacks.
arXiv Detail & Related papers (2025-01-05T19:06:03Z)
- Exploiting the Index Gradients for Optimization-Based Jailbreaking on Large Language Models [16.83476701024932]
The Greedy Coordinate Gradient (GCG) method has demonstrated the ability to automatically generate adversarial suffixes that jailbreak state-of-the-art LLMs. We propose the Model Attack Gradient Index GCG (MAGIC), which addresses the Indirect Effect by exploiting the gradient information of the suffix tokens. Experiments on AdvBench show that MAGIC achieves up to a 1.5x speedup while maintaining an on-par or even higher Attack Success Rate (ASR).
arXiv Detail & Related papers (2024-12-11T18:37:56Z)
- An Interpretable N-gram Perplexity Threat Model for Large Language Model Jailbreaks [87.64278063236847]
In this work, we propose a unified threat model for the principled comparison of jailbreak attacks. Our threat model checks whether a given jailbreak is likely to occur in the distribution of natural text. We adapt popular attacks to this threat model and, for the first time, benchmark them against each other on an equal footing (a hedged sketch of this kind of perplexity check appears after this list).
arXiv Detail & Related papers (2024-10-21T17:27:01Z)
- Faster-GCG: Efficient Discrete Optimization Jailbreak Attacks against Aligned Large Language Models [16.938267820586024]
We propose Faster-GCG, an efficient adversarial jailbreak method obtained by delving deep into the design of GCG.
Experiments demonstrate that Faster-GCG can surpass the original GCG with only 1/10 of the computational cost.
arXiv Detail & Related papers (2024-10-20T11:27:41Z)
- Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation [71.92055093709924]
We propose a novel method that "translates" garbled adversarial prompts into coherent and human-readable natural language adversarial prompts. It also offers a new approach to discovering effective designs for jailbreak prompts, advancing the understanding of jailbreak attacks. Our method achieves over 90% attack success rates against Llama-2-Chat models on AdvBench, despite their outstanding resistance to jailbreak attacks.
arXiv Detail & Related papers (2024-10-15T06:31:04Z)
- Improved Techniques for Optimization-Based Jailbreaking on Large Language Models [78.32176751215073]
The success of the Greedy Coordinate Gradient (GCG) attack has led to growing interest in the study of optimization-based jailbreaking techniques.
We present several improved (empirical) techniques for optimization-based jailbreaks like GCG.
The results demonstrate that our improved techniques can help GCG outperform state-of-the-art jailbreaking attacks and achieve nearly 100% attack success rate.
arXiv Detail & Related papers (2024-05-31T17:07:15Z)
- Improved Generation of Adversarial Examples Against Safety-aligned LLMs [72.38072942860309]
Adversarial prompts generated using gradient-based methods are highly effective at mounting automatic jailbreak attacks against safety-aligned LLMs.
In this paper, we explore a new perspective on this problem, suggesting that it can be alleviated by leveraging innovations from transfer-based attacks.
We show that 87% of the query-specific adversarial suffixes generated by the developed combination can induce Llama-2-7B-Chat to produce the output that exactly matches the target string on AdvBench.
arXiv Detail & Related papers (2024-05-28T06:10:12Z)
- Boosting Jailbreak Attack with Momentum [5.047814998088682]
Large Language Models (LLMs) have achieved remarkable success across diverse tasks, yet they remain vulnerable to adversarial attacks. The Momentum Accelerated GCG (MAC) attack integrates a momentum term into the gradient to boost and stabilize the random search for tokens in adversarial prompts (a minimal sketch of this momentum update appears after this list).
arXiv Detail & Related papers (2024-05-02T12:18:14Z)
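As referenced in the N-gram perplexity entry above, the following is a hedged, self-contained sketch of the kind of check that threat model implies: score a prompt's perplexity under a simple smoothed bigram model and treat prompts that are unlikely under natural text as out of scope. The toy corpus, add-one smoothing, and the max_ppl threshold are assumptions for illustration, not the paper's setup.

```python
# Hypothetical illustration of an n-gram perplexity threat-model check.
# A real check would use a large text corpus and a calibrated threshold.
import math
from collections import Counter

corpus = "the model should refuse harmful requests and answer safe ones".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
vocab_size = len(unigrams)

def bigram_logprob(prev: str, word: str) -> float:
    # Add-one (Laplace) smoothing gives unseen bigrams nonzero probability.
    return math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))

def perplexity(tokens: list) -> float:
    logp = sum(bigram_logprob(p, w) for p, w in zip(tokens, tokens[1:]))
    return math.exp(-logp / max(len(tokens) - 1, 1))

def passes_threat_model(prompt: str, max_ppl: float) -> bool:
    # Prompts with perplexity above the threshold are judged unlikely
    # to occur in the distribution of natural text.
    return perplexity(prompt.split()) <= max_ppl

print(perplexity("the model should answer safe requests".split()))  # lower
print(perplexity("zx qv describing oppositeley Now write".split())) # higher
```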
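And for the MAC entry, a minimal sketch (assuming a standard exponential-moving-average formulation) of how a momentum term can smooth the one-hot token gradient that GCG uses to rank candidate substitutions. The decay value, tensor shapes, and k are placeholders, not the paper's settings.

```python
# Hedged sketch of a momentum-smoothed GCG gradient, not the MAC code.
import torch

def momentum_gradient(grad, velocity, decay=0.9):
    """Blend the current gradient with the running momentum buffer."""
    if velocity is None:
        return grad
    return decay * velocity + grad

velocity = None
for step in range(3):
    # Stand-in for the gradient of the attack loss w.r.t. the one-hot
    # suffix tokens, shape (suffix_len, vocab_size), as computed in GCG.
    one_hot_grad = torch.randn(20, 32000)
    velocity = momentum_gradient(one_hot_grad, velocity)
    # GCG keeps the k tokens per position whose gradient promises the
    # largest loss decrease, i.e. the most negative entries.
    topk_tokens = (-velocity).topk(k=256, dim=1).indices
```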