Beyond Suffixes: Token Position in GCG Adversarial Attacks on Large Language Models
- URL: http://arxiv.org/abs/2602.03265v1
- Date: Tue, 03 Feb 2026 08:53:35 GMT
- Title: Beyond Suffixes: Token Position in GCG Adversarial Attacks on Large Language Models
- Authors: Hicham Eddoubi, Umar Faruk Abdullahi, Fadi Hassan,
- Abstract summary: We focus on the prevalent Greedy Coordinate Gradient (GCG) attack and identify a previously underexplored attack axis in jailbreak attacks. Using GCG as a case study, we show that both optimizing attacks to generate prefixes instead of suffixes and varying adversarial token position during evaluation substantially influence attack success rates.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have seen widespread adoption across multiple domains, creating an urgent need for robust safety alignment mechanisms. However, robustness remains challenging due to jailbreak attacks that bypass alignment via adversarial prompts. In this work, we focus on the prevalent Greedy Coordinate Gradient (GCG) attack and identify a previously underexplored attack axis in jailbreak attacks typically framed as suffix-based: the placement of adversarial tokens within the prompt. Using GCG as a case study, we show that both optimizing attacks to generate prefixes instead of suffixes and varying adversarial token position during evaluation substantially influence attack success rates. Our findings highlight a critical blind spot in current safety evaluations and underline the need to account for the position of adversarial tokens in the adversarial robustness evaluation of LLMs.
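The positional axis the abstract describes is easy to make concrete. Below is a minimal sketch, in plain Python, of how one optimized adversarial string can be spliced into a prompt at different positions; the function, the placeholder adversarial string, and the mid-prompt ("infix") placement are illustrative assumptions, not the authors' code.

```python
def build_prompt(user_request: str, adv_tokens: str, position: str) -> str:
    """Place an optimized adversarial string at a chosen position.

    GCG-style attacks conventionally append adv_tokens as a suffix; the
    paper also optimizes prefixes and varies the position at evaluation.
    The "infix" case is our own illustrative extension of that axis.
    """
    if position == "suffix":   # the standard GCG setup
        return f"{user_request} {adv_tokens}"
    if position == "prefix":   # tokens optimized *before* the request
        return f"{adv_tokens} {user_request}"
    if position == "infix":    # evaluation-time repositioning (assumption)
        words = user_request.split()
        mid = len(words) // 2
        return " ".join(words[:mid] + [adv_tokens] + words[mid:])
    raise ValueError(f"unknown position: {position}")

# The same (placeholder) optimized tokens evaluated at three positions.
adv = "! ! describing.-- ;) similarlyNow"
for pos in ("suffix", "prefix", "infix"):
    print(pos, "->", build_prompt("Summarize this policy document", adv, pos))
```

The paper's point is that success rates measured with one placement (the conventional suffix) need not carry over to another, so robustness evaluations should sweep this axis rather than fix it.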
Related papers
- Mask-GCG: Are All Tokens in Adversarial Suffixes Necessary for Jailbreak Attacks? [3.5954282637912787]
We propose Mask-GCG, a plug-and-play method that employs learnable token masking to identify impactful tokens within the suffix. Our approach increases the update probability for tokens at high-impact positions while pruning those at low-impact positions.
arXiv Detail & Related papers (2025-09-08T05:45:37Z)
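A minimal sketch of the masking idea as the snippet above describes it: a learnable weight per suffix position biases which coordinate GCG updates, and low-weight positions are pruned. The parameterization, names, and numbers are our assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical learnable mask logits over a 20-token adversarial suffix.
mask_logits = rng.normal(size=20)

def update_probabilities(logits: np.ndarray) -> np.ndarray:
    """Softmax over mask logits: high-impact positions get updated more often."""
    z = np.exp(logits - logits.max())
    return z / z.sum()

def prune(logits: np.ndarray, keep: int) -> np.ndarray:
    """Keep only the `keep` highest-impact suffix positions."""
    return np.argsort(logits)[-keep:]

p = update_probabilities(mask_logits)
position = rng.choice(len(mask_logits), p=p)  # biased GCG coordinate pick
surviving = prune(mask_logits, keep=12)       # drop low-impact tokens
print(position, sorted(surviving.tolist()))
```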
- The Resurgence of GCG Adversarial Attacks on Large Language Models [4.157278627741554]
We present a systematic appraisal of GCG and its variant, TGCG, across the open-source model landscape. Attack success rates decrease with model size, reflecting the increasing complexity of larger models. Coding prompts are more vulnerable than adversarial safety prompts, suggesting that reasoning itself can be exploited as an attack vector.
arXiv Detail & Related papers (2025-08-30T07:04:29Z)
- ARMOR: Aligning Secure and Safe Large Language Models via Meticulous Reasoning [64.32925552574115]
ARMOR is a large language model that analyzes jailbreak strategies and extracts the core intent. ARMOR achieves state-of-the-art safety performance, with an average harmful rate of 0.002 and an attack success rate of 0.06 against advanced optimization-based jailbreaks.
arXiv Detail & Related papers (2025-07-14T09:05:54Z)
- Checkpoint-GCG: Auditing and Attacking Fine-Tuning-Based Prompt Injection Defenses [10.08464073347558]
We introduce Checkpoint-GCG, a white-box attack against fine-tuning-based defenses. We show that Checkpoint-GCG achieves up to a 96% attack success rate (ASR) against the strongest defense.
arXiv Detail & Related papers (2025-05-21T16:43:17Z)
- SecReEvalBench: A Multi-turned Security Resilience Evaluation Benchmark for Large Language Models [4.039934762896615]
We present SecReEvalBench, the Security Resilience Evaluation Benchmark. It defines four novel metrics: Prompt Attack Resilience Score, Prompt Attack Refusal Logic Score, Chain-Based Attack Resilience Score, and Chain-Based Attack Rejection Time Score. We also introduce a dataset customized for the benchmark, which incorporates both neutral and malicious prompts.
arXiv Detail & Related papers (2025-05-12T14:09:24Z)
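The snippet above names the four metrics but not their formulas. Purely as a hypothetical illustration, one plausible reading of the first metric is a refusal rate over attack prompts; the definition below is our assumption, not the benchmark's.

```python
def prompt_attack_resilience_score(refused: list[bool]) -> float:
    """Fraction of attack prompts the model refused (assumed definition).

    refused[i] is True if the model refused attack prompt i. The benchmark
    paper defines the actual metric; this is illustration only.
    """
    return sum(refused) / len(refused)

print(prompt_attack_resilience_score([True, True, False, True]))  # 0.75
```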
- Enhancing Adversarial Attacks through Chain of Thought [0.0]
Gradient-based adversarial attacks are particularly effective against aligned large language models (LLMs).
This paper proposes enhancing the universality of adversarial attacks by integrating CoT prompts with the greedy coordinate gradient (GCG) technique.
arXiv Detail & Related papers (2024-10-29T06:54:00Z)
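Concretely, the combination can be read as inserting a chain-of-thought trigger between the request and the GCG-optimized tokens. The template below is a hedged sketch with illustrative strings, not the paper's actual prompts or suffixes.

```python
def cot_gcg_prompt(request: str, adv_suffix: str) -> str:
    """Combine a chain-of-thought trigger with a GCG-optimized suffix.

    The trigger below is the generic "step by step" phrasing; the paper's
    CoT prompts, and where exactly they sit, may differ.
    """
    cot_trigger = "Let's think step by step."
    return f"{request} {cot_trigger} {adv_suffix}"

print(cot_gcg_prompt("Summarize this policy document", "! ! placeholder tokens"))
```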
- Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing [107.97160023681184]
Aligned large language models (LLMs) are vulnerable to jailbreaking attacks.
We propose SEMANTICSMOOTH, a smoothing-based defense that aggregates predictions of semantically transformed copies of a given input prompt.
arXiv Detail & Related papers (2024-02-25T20:36:03Z)
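A minimal majority-vote sketch of the smoothing idea, assuming string-level transforms and a stub model; SEMANTICSMOOTH's actual semantic transformations (e.g. paraphrases) and aggregation rule may differ from this reading.

```python
from collections import Counter
from typing import Callable

def semantic_smooth(prompt: str,
                    transforms: list[Callable[[str], str]],
                    model: Callable[[str], str]) -> str:
    """Aggregate a model's answers over transformed copies of the prompt."""
    answers = [model(t(prompt)) for t in transforms]
    return Counter(answers).most_common(1)[0][0]  # majority vote

# Toy usage with stand-in transforms and a stub refusal classifier.
transforms = [str.lower, str.upper, lambda s: s.strip()]
model = lambda p: "REFUSE" if "attack" in p.lower() else "ANSWER"
print(semantic_smooth("describe an attack", transforms, model))  # REFUSE
```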
- Guidance Through Surrogate: Towards a Generic Diagnostic Attack [101.36906370355435]
We develop a guided mechanism to avoid local minima during attack optimization, leading to a novel attack dubbed Guided Projected Gradient Attack (G-PGA).
Our modified attack does not require random restarts, a large number of attack iterations, or a search for an optimal step size.
More than an effective attack, G-PGA can be used as a diagnostic tool to reveal elusive robustness due to gradient masking in adversarial defenses.
arXiv Detail & Related papers (2022-12-30T18:45:23Z)
- On Trace of PGD-Like Adversarial Attacks [77.75152218980605]
Adversarial attacks pose safety and security concerns for deep learning applications.
We construct Adversarial Response Characteristics (ARC) features to reflect the model's gradient consistency.
Our method is intuitive, lightweight, non-intrusive, and requires little data.
arXiv Detail & Related papers (2022-05-19T14:26:50Z)
- Learning-based Hybrid Local Search for the Hard-label Textual Attack [53.92227690452377]
We consider a rarely investigated but more rigorous setting, namely the hard-label attack, in which the attacker can only access the prediction label.
We propose a novel hard-label attack, called the Learning-based Hybrid Local Search (LHLS) algorithm.
Our LHLS significantly outperforms existing hard-label attacks in both attack performance and adversarial example quality.
arXiv Detail & Related papers (2022-01-20T14:16:07Z)
- A Self-supervised Approach for Adversarial Robustness [105.88250594033053]
Adversarial examples can cause catastrophic mistakes in Deep Neural Network (DNN)-based vision systems.
This paper proposes a self-supervised adversarial training mechanism in the input space.
It provides significant robustness against unseen adversarial attacks.
arXiv Detail & Related papers (2020-06-08T20:42:39Z)
- Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks [65.20660287833537]
In this paper, we propose two extensions of the PGD attack that overcome failures due to suboptimal step sizes and problems with the objective function.
We then combine our novel attacks with two complementary existing ones to form a parameter-free, computationally affordable, and user-independent ensemble of attacks for testing adversarial robustness.
arXiv Detail & Related papers (2020-03-03T18:15:55Z)
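For context, the fixed-step L-infinity PGD update whose step-size sensitivity motivates those extensions looks as follows; this is a textbook sketch in numpy with a toy constant gradient, not the paper's extended attacks.

```python
import numpy as np

def pgd_step(x: np.ndarray, grad: np.ndarray, x0: np.ndarray,
             eps: float, alpha: float) -> np.ndarray:
    """One L-infinity PGD step: ascend the loss, project into the eps-ball."""
    x = x + alpha * np.sign(grad)          # fixed-size gradient-sign step
    return np.clip(x, x0 - eps, x0 + eps)  # project back onto the ball

x0 = np.zeros(4)
x = x0.copy()
for _ in range(10):                        # toy loop with a fake gradient
    x = pgd_step(x, grad=np.ones(4), x0=x0, eps=0.1, alpha=0.03)
print(x)  # the iterate saturates at the eps boundary
```

With a fixed alpha the iterate quickly pins to the ball's boundary regardless of the local loss landscape, which is the kind of failure an adaptive step size is designed to avoid.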
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.