The Resurgence of GCG Adversarial Attacks on Large Language Models
- URL: http://arxiv.org/abs/2509.00391v1
- Date: Sat, 30 Aug 2025 07:04:29 GMT
- Title: The Resurgence of GCG Adversarial Attacks on Large Language Models
- Authors: Yuting Tan, Xuying Li, Zhuo Li, Huizhen Shu, Peikang Hu
- Abstract summary: We present a systematic appraisal of GCG and its variant, T-GCG, across open-source landscapes. Attack success rates decrease with model size, reflecting increasing complexity. Coding prompts are more vulnerable than adversarial safety prompts, suggesting that reasoning itself can be exploited as an attack vector.
- Score: 4.157278627741554
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Gradient-based adversarial prompting, such as the Greedy Coordinate Gradient (GCG) algorithm, has emerged as a powerful method for jailbreaking large language models (LLMs). In this paper, we present a systematic appraisal of GCG and its annealing-augmented variant, T-GCG, across open-source LLMs of varying scales. Using Qwen2.5-0.5B, LLaMA-3.2-1B, and GPT-OSS-20B, we evaluate attack effectiveness on both safety-oriented prompts (AdvBench) and reasoning-intensive coding prompts. Our study reveals three key findings: (1) attack success rates (ASR) decrease with model size, reflecting the increasing complexity and non-convexity of larger models' loss landscapes; (2) prefix-based heuristics substantially overestimate attack effectiveness compared to GPT-4o semantic judgments, which provide a stricter and more realistic evaluation; and (3) coding-related prompts are significantly more vulnerable than adversarial safety prompts, suggesting that reasoning itself can be exploited as an attack vector. In addition, preliminary results with T-GCG show that simulated annealing can diversify adversarial search and achieve competitive ASR under prefix evaluation, though its benefits under semantic judgment remain limited. Together, these findings highlight the scalability limits of GCG, expose overlooked vulnerabilities in reasoning tasks, and motivate further development of annealing-inspired strategies for more robust adversarial evaluation.
Related papers
- CoG: Controllable Graph Reasoning via Relational Blueprints and Failure-Aware Refinement over Knowledge Graphs [53.199517625701475]
CoG is a training-free framework inspired by Dual-Process Theory that mimics the interplay between intuition and deliberation. CoG significantly outperforms state-of-the-art approaches in both accuracy and efficiency.
arXiv Detail & Related papers (2026-01-16T07:27:40Z) - Joint-GCG: Unified Gradient-Based Poisoning Attacks on Retrieval-Augmented Generation Systems [11.300387488829035]
Retrieval-Augmented Generation (RAG) systems enhance Large Language Models (LLMs) by retrieving relevant documents from external corpora before generating responses. Existing poisoning attack strategies treat the retrieval and generation stages as disjoint, limiting their effectiveness. Joint-GCG is the first framework to unify gradient-based attacks across both retriever and generator models.
arXiv Detail & Related papers (2025-06-06T15:12:06Z) - Explainer-guided Targeted Adversarial Attacks against Binary Code Similarity Detection Models [12.524811181751577]
We propose a novel optimization for adversarial attacks against BCSD models. In particular, we aim to improve the attacks in a challenging scenario, where the attack goal is to limit the model predictions to a specific range. Our attack leverages the superior capability of black-box, model-agnostic explainers in interpreting the model decision boundaries.
arXiv Detail & Related papers (2025-06-05T08:29:19Z) - SafeGenes: Evaluating the Adversarial Robustness of Genomic Foundation Models [8.019763193322298]
We propose SafeGenes: a framework for Secure analysis of genomic foundation models. We assess the adversarial vulnerabilities of GFMs using two approaches: the Fast Gradient Sign Method and a soft prompt attack. Targeted soft prompt attacks led to substantial performance degradation, even in large models such as ESM1b and ESM1v.
arXiv Detail & Related papers (2025-06-01T03:54:03Z) - The Silent Saboteur: Imperceptible Adversarial Attacks against Black-Box Retrieval-Augmented Generation Systems [101.68501850486179]
We explore adversarial attacks against retrieval-augmented generation (RAG) systems to identify their vulnerabilities. This task aims to find imperceptible perturbations that retrieve a target document, originally excluded from the initial top-$k$ candidate set. We propose ReGENT, a reinforcement learning-based framework that tracks interactions between the attacker and the target RAG.
arXiv Detail & Related papers (2025-05-24T08:19:25Z) - Investigating the Vulnerability of LLM-as-a-Judge Architectures to Prompt-Injection Attacks [0.0]
Large Language Models (LLMs) are increasingly employed as evaluators (LLM-as-a-Judge) for assessing the quality of machine-generated text. This paper investigates the vulnerability of LLM-as-a-Judge architectures to prompt-injection attacks.
arXiv Detail & Related papers (2025-05-19T16:51:12Z) - Enhancing Adversarial Attacks through Chain of Thought [0.0]
Gradient-based adversarial attacks are particularly effective against aligned large language models (LLMs).
This paper proposes enhancing the universality of adversarial attacks by integrating CoT prompts with the greedy coordinate gradient (GCG) technique.
arXiv Detail & Related papers (2024-10-29T06:54:00Z) - Relaxing Graph Transformers for Adversarial Attacks [49.450581960551276]
While Graph Transformers (GTs) have surpassed Message-Passing GNNs on several benchmarks, their adversarial robustness properties remain unexplored.
We overcome these challenges by targeting three representative architectures based on (1) random-walk PEs, (2) pair-wise-short-paths, and (3) spectral perturbations.
Our evaluation reveals that they can be catastrophically fragile and underlines our work's importance and the necessity for adaptive attacks.
arXiv Detail & Related papers (2024-07-16T14:24:58Z) - Model Stealing Attack against Graph Classification with Authenticity, Uncertainty and Diversity [80.16488817177182]
GNNs are vulnerable to model stealing attacks, a nefarious endeavor geared towards duplicating the target model via query permissions.
We introduce three model stealing attacks to adapt to different actual scenarios.
arXiv Detail & Related papers (2023-12-18T05:42:31Z) - Guidance Through Surrogate: Towards a Generic Diagnostic Attack [101.36906370355435]
We develop a guided mechanism to avoid local minima during attack optimization, leading to a novel attack dubbed Guided Projected Gradient Attack (G-PGA).
Our modified attack does not require random restarts, a large number of attack iterations, or a search for an optimal step size.
More than an effective attack, G-PGA can be used as a diagnostic tool to reveal elusive robustness due to gradient masking in adversarial defenses.
arXiv Detail & Related papers (2022-12-30T18:45:23Z) - Graph Backdoor [53.70971502299977]
We present GTA, the first backdoor attack on graph neural networks (GNNs).
GTA departs in significant ways: it defines triggers as specific subgraphs, including both topological structures and descriptive features.
It can be instantiated for both transductive (e.g., node classification) and inductive (e.g., graph classification) tasks.
arXiv Detail & Related papers (2020-06-21T19:45:30Z)