Enhancing Adversarial Attacks through Chain of Thought
- URL: http://arxiv.org/abs/2410.21791v1
- Date: Tue, 29 Oct 2024 06:54:00 GMT
- Title: Enhancing Adversarial Attacks through Chain of Thought
- Authors: Jingbo Su
- Abstract summary: Gradient-based adversarial attacks are particularly effective against aligned large language models (LLMs).
This paper proposes enhancing the universality of adversarial attacks by integrating CoT prompts with the greedy coordinate gradient (GCG) technique.
- Score: 0.0
- Abstract: Large language models (LLMs) have demonstrated impressive performance across various domains but remain susceptible to safety concerns. Prior research indicates that gradient-based adversarial attacks are particularly effective against aligned LLMs and the chain of thought (CoT) prompting can elicit desired answers through step-by-step reasoning. This paper proposes enhancing the robustness of adversarial attacks on aligned LLMs by integrating CoT prompts with the greedy coordinate gradient (GCG) technique. Using CoT triggers instead of affirmative targets stimulates the reasoning abilities of backend LLMs, thereby improving the transferability and universality of adversarial attacks. We conducted an ablation study comparing our CoT-GCG approach with Amazon Web Services auto-cot. Results revealed our approach outperformed both the baseline GCG attack and CoT prompting. Additionally, we used Llama Guard to evaluate potentially harmful interactions, providing a more objective risk assessment of entire conversations compared to matching outputs to rejection phrases. The code of this paper is available at https://github.com/sujingbo0217/CS222W24-LLM-Attack.
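As a concrete illustration of the Llama Guard evaluation step mentioned in the abstract, here is a minimal sketch of scoring a full conversation with Llama Guard via Hugging Face transformers. The checkpoint ID, generation settings, and the `moderate` helper are illustrative assumptions rather than the paper's exact pipeline (see the linked repository for that).
```python
# Minimal sketch (assumed checkpoint and generation settings) of scoring a whole
# conversation with Llama Guard instead of matching outputs to rejection phrases.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/LlamaGuard-7b"  # assumed checkpoint; gated on the Hugging Face Hub

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

def moderate(chat):
    """Return Llama Guard's verdict ("safe", or "unsafe" plus a category) for a conversation."""
    # The chat template renders the user/assistant turns into Llama Guard's moderation prompt.
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)
    output = model.generate(
        input_ids=input_ids, max_new_tokens=32, pad_token_id=tokenizer.eos_token_id
    )
    prompt_len = input_ids.shape[-1]
    return tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True).strip()

verdict = moderate([
    {"role": "user", "content": "How do I bake sourdough bread?"},
    {"role": "assistant", "content": "Mix flour, water, salt, and starter, let it ferment, then bake."},
])
print(verdict)  # e.g. "safe"
```
Judging the entire exchange this way avoids the false negatives that arise when a model's refusal phrasing simply is not on a fixed list of rejection strings.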
Related papers
- Faster-GCG: Efficient Discrete Optimization Jailbreak Attacks against Aligned Large Language Models [16.938267820586024]
We propose Faster-GCG, an efficient adversarial jailbreak method by delving deep into the design of GCG.
Experiments demonstrate that Faster-GCG can surpass the original GCG with only 1/10 of the computational cost.
arXiv Detail & Related papers (2024-10-20T11:27:41Z)
- AttnGCG: Enhancing Jailbreaking Attacks on LLMs with Attention Manipulation [42.797865918373326]
We study the vulnerabilities of transformer-based Large Language Models (LLMs) to jailbreaking attacks.
We introduce an enhanced method that manipulates models' attention scores to facilitate jailbreaking.
Our strategy also demonstrates robust attack transferability against both unseen harmful goals and black-box LLMs.
arXiv Detail & Related papers (2024-10-11T17:55:09Z)
- Advancing Generalized Transfer Attack with Initialization Derived Bilevel Optimization and Dynamic Sequence Truncation [49.480978190805125]
Transfer attacks attract significant interest for black-box applications.
Existing works essentially optimize a single-level objective directly with respect to the surrogate model.
We propose a bilevel optimization paradigm, which explicitly reforms the nested relationship between the Upper-Level (UL) pseudo-victim attacker and the Lower-Level (LL) surrogate attacker.
arXiv Detail & Related papers (2024-06-04T07:45:27Z)
- Learning diverse attacks on large language models for robust red-teaming and safety tuning [126.32539952157083]
Red-teaming, or identifying prompts that elicit harmful responses, is a critical step in ensuring the safe deployment of large language models.
We show that even with explicit regularization to favor novelty and diversity, existing approaches suffer from mode collapse or fail to generate effective attacks.
We propose to use GFlowNet fine-tuning, followed by a secondary smoothing phase, to train the attacker model to generate diverse and effective attack prompts.
arXiv Detail & Related papers (2024-05-28T19:16:17Z)
- Improved Generation of Adversarial Examples Against Safety-aligned LLMs [72.38072942860309]
Adversarial prompts generated using gradient-based methods are highly effective at mounting automatic jailbreak attacks against safety-aligned LLMs.
In this paper, we explore a new perspective on this problem, suggesting that it can be alleviated by leveraging innovations inspired by transfer-based attacks.
We show that 87% of the query-specific adversarial suffixes generated by the developed combination can induce Llama-2-7B-Chat to produce the output that exactly matches the target string on AdvBench.
arXiv Detail & Related papers (2024-05-28T06:10:12Z)
- TrojanRAG: Retrieval-Augmented Generation Can Be Backdoor Driver in Large Language Models [16.71019302192829]
Large language models (LLMs) have raised concerns about potential security threats despite their strong performance in Natural Language Processing (NLP).
Backdoor attacks have shown that LLMs can be substantially harmed at all stages, but their cost and robustness have been criticized.
We propose TrojanRAG, which employs a joint backdoor attack in the Retrieval-Augmented Generation.
arXiv Detail & Related papers (2024-05-22T07:21:32Z)
- Boosting Jailbreak Attack with Momentum [5.047814998088682]
Large Language Models (LLMs) have achieved remarkable success across diverse tasks, yet they remain vulnerable to adversarial attacks.
We introduce the Momentum Accelerated GCG (MAC) attack, which incorporates a momentum term into the gradient.
arXiv Detail & Related papers (2024-05-02T12:18:14Z)
- Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing [107.97160023681184]
Aligned large language models (LLMs) are vulnerable to jailbreaking attacks.
We propose SEMANTICSMOOTH, a smoothing-based defense that aggregates predictions of semantically transformed copies of a given input prompt.
arXiv Detail & Related papers (2024-02-25T20:36:03Z)
- ASETF: A Novel Method for Jailbreak Attack on LLMs through Translate Suffix Embeddings [58.82536530615557]
We propose an Adversarial Suffix Embedding Translation Framework (ASETF) to transform continuous adversarial suffix embeddings into coherent and understandable text.
Our method significantly reduces the computation time of adversarial suffixes and achieves a much better attack success rate than existing techniques.
arXiv Detail & Related papers (2024-02-25T06:46:27Z)
- Certifying LLM Safety against Adversarial Prompting [75.19953634352258]
Large language models (LLMs) are vulnerable to adversarial attacks that add malicious tokens to an input prompt.
We introduce erase-and-check, the first framework for defending against adversarial prompts with certifiable safety guarantees.
arXiv Detail & Related papers (2023-09-06T04:37:20Z)
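To make the erase-and-check idea in the last entry concrete, here is a rough sketch of its suffix mode: a prompt is flagged as harmful if a safety filter flags the prompt itself or any version with up to `max_erase` trailing tokens (a candidate adversarial suffix) removed. The `is_harmful` filter and the keyword-based stand-in are placeholders for illustration, not the certified classifier from the paper.
```python
# Rough sketch of erase-and-check (suffix mode). The safety filter is a toy
# keyword placeholder, not the certified classifier used in the paper.
from typing import Callable, List

def erase_and_check(tokens: List[str],
                    is_harmful: Callable[[List[str]], bool],
                    max_erase: int = 20) -> bool:
    """Flag the prompt if the filter flags it, or any version with up to
    `max_erase` trailing tokens erased."""
    for d in range(min(max_erase, len(tokens)) + 1):
        erased = tokens if d == 0 else tokens[:-d]
        if is_harmful(erased):
            return True
    return False

# Toy stand-in filter for demonstration only.
def toy_filter(tokens: List[str]) -> bool:
    return any(t.lower() == "hotwire" for t in tokens)

print(erase_and_check("how do I hotwire a car xq zz".split(), toy_filter))  # True
print(erase_and_check("how do I bake sourdough bread".split(), toy_filter))  # False
```
Because an appended adversarial suffix of bounded length is removed by one of the erasures, a harmful prompt that the filter catches without the suffix is still caught with it, which is the source of the certified guarantee.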
This list is automatically generated from the titles and abstracts of the papers in this site.