Attacking Large Language Models with Projected Gradient Descent
- URL: http://arxiv.org/abs/2402.09154v1
- Date: Wed, 14 Feb 2024 13:13:26 GMT
- Title: Attacking Large Language Models with Projected Gradient Descent
- Authors: Simon Geisler, Tom Wollschläger, M. H. I. Abdalla, Johannes
Gasteiger, Stephan Günnemann
- Abstract summary: Projected Gradient Descent (PGD) on the continuously relaxed input prompt crafts adversarial prompts up to one order of magnitude faster than state-of-the-art discrete optimization while achieving the same devastating attack results.
- Score: 12.130638442765857
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current LLM alignment methods are readily broken through specifically crafted
adversarial prompts. While crafting adversarial prompts using discrete
optimization is highly effective, such attacks typically use more than 100,000
LLM calls. This high computational cost makes them unsuitable for, e.g.,
quantitative analyses and adversarial training. To remedy this, we revisit
Projected Gradient Descent (PGD) on the continuously relaxed input prompt.
Although previous attempts with ordinary gradient-based attacks largely failed,
we show that carefully controlling the error introduced by the continuous
relaxation tremendously boosts their efficacy. Our PGD for LLMs is up to one
order of magnitude faster than state-of-the-art discrete optimization to
achieve the same devastating attack results.
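The core idea admits a short sketch: represent each prompt token as a point on the probability simplex over the vocabulary, differentiate the loss through a soft embedding lookup, take a gradient step, and project back onto the simplex before eventually discretizing. The snippet below is a minimal, self-contained illustration of that loop, not the paper's implementation: the toy embedding matrix, output head, `project_simplex` helper, hyperparameters, and target token are all illustrative assumptions, and the paper's additional mechanisms for controlling the relaxation error are omitted.

```python
import torch

def project_simplex(v: torch.Tensor) -> torch.Tensor:
    """Standard sort-based Euclidean projection of each row onto the probability simplex."""
    d = v.size(-1)
    u, _ = torch.sort(v, dim=-1, descending=True)
    cssv = u.cumsum(dim=-1) - 1.0
    k = torch.arange(1, d + 1, device=v.device, dtype=v.dtype)
    rho = ((u - cssv / k) > 0).sum(dim=-1, keepdim=True)   # size of the support
    theta = cssv.gather(-1, rho - 1) / rho.to(v.dtype)     # shift that makes rows sum to 1
    return torch.clamp(v - theta, min=0.0)

# Toy stand-in for an LLM: soft embedding lookup followed by a linear next-token head.
torch.manual_seed(0)
vocab, dim, suffix_len = 50, 16, 8
embed = torch.randn(vocab, dim)   # hypothetical embedding matrix
head = torch.randn(dim, vocab)    # hypothetical output head
target_token = 3                  # token the attack tries to force

# Relaxed one-hot prompt: each row lives on the probability simplex.
x = torch.full((suffix_len, vocab), 1.0 / vocab, requires_grad=True)

lr = 0.5
for step in range(200):
    soft_emb = x @ embed                          # (suffix_len, dim) soft embeddings
    logits = soft_emb.mean(dim=0) @ head          # toy "next-token" logits
    loss = torch.nn.functional.cross_entropy(
        logits.unsqueeze(0), torch.tensor([target_token]))
    loss.backward()
    with torch.no_grad():
        x -= lr * x.grad                          # gradient step on the relaxation
        x.copy_(project_simplex(x))               # project back onto the simplex
        x.grad.zero_()

# Discretize: pick the most likely token per position.
adv_tokens = x.argmax(dim=-1)
print("loss:", float(loss), "adversarial suffix token ids:", adv_tokens.tolist())
```

In an actual attack, the toy model would be replaced by the target LLM's forward pass over soft embeddings, and the projection step would be combined with the paper's control of the relaxation error before the relaxed prompt is discretized.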
Related papers
- Fine-tuned Large Language Models (LLMs): Improved Prompt Injection Attacks Detection [6.269725911814401]
Large language models (LLMs) are becoming a popular tool as they have significantly advanced in their capability to tackle a wide range of language-based tasks.
However, LLM applications are highly vulnerable to prompt injection attacks, which pose a critical security problem.
This project explores the security vulnerabilities in relation to prompt injection attacks.
arXiv Detail & Related papers (2024-10-28T00:36:21Z)
- Adversarial Attacks on Large Language Models Using Regularized Relaxation [1.042748558542389]
Large Language Models (LLMs) are used for numerous practical applications.
Adversarial attack methods are extensively used to study and understand their vulnerabilities.
We propose a novel technique for adversarial attacks that overcomes these limitations by leveraging regularized gradients with continuous optimization methods.
arXiv Detail & Related papers (2024-10-24T21:01:45Z)
- Aligning LLMs to Be Robust Against Prompt Injection [55.07562650579068]
We show that alignment can be a powerful tool to make LLMs more robust against prompt injection attacks.
Our method -- SecAlign -- first builds an alignment dataset by simulating prompt injection attacks.
Our experiments show that SecAlign substantially robustifies the LLM with negligible loss in model utility.
arXiv Detail & Related papers (2024-10-07T19:34:35Z)
- Efficient Adversarial Training in LLMs with Continuous Attacks [99.5882845458567]
Large language models (LLMs) are vulnerable to adversarial attacks that can bypass their safety guardrails.
We propose a fast adversarial training algorithm (C-AdvUL) composed of two losses.
C-AdvIPO is an adversarial variant of IPO that does not require utility data for adversarially robust alignment.
arXiv Detail & Related papers (2024-05-24T14:20:09Z)
- Boosting Jailbreak Attack with Momentum [5.047814998088682]
Large Language Models (LLMs) have achieved remarkable success across diverse tasks, yet they remain vulnerable to adversarial attacks.
We introduce the Momentum Accelerated GCG (MAC) attack, which incorporates a momentum term into the gradient.
arXiv Detail & Related papers (2024-05-02T12:18:14Z)
- FFN-SkipLLM: A Hidden Gem for Autoregressive Decoding with Adaptive Feed Forward Skipping [49.66872823080736]
Autoregressive Large Language Models (e.g., LLaMa, GPTs) are omnipresent, achieving remarkable success in language understanding and generation.
To mitigate the computational overhead incurred during generation, several early-exit and layer-dropping strategies have been proposed.
We propose FFN-SkipLLM, which is an input-adaptive feed-forward skipping strategy.
arXiv Detail & Related papers (2024-04-05T02:35:43Z)
- ASETF: A Novel Method for Jailbreak Attack on LLMs through Translate Suffix Embeddings [58.82536530615557]
We propose an Adversarial Suffix Embedding Translation Framework (ASETF) to transform continuous adversarial suffix embeddings into coherent and understandable text.
Our method significantly reduces the computation time of adversarial suffixes and achieves a much better attack success rate than existing techniques.
arXiv Detail & Related papers (2024-02-25T06:46:27Z)
- InferAligner: Inference-Time Alignment for Harmlessness through Cross-Model Guidance [56.184255657175335]
We develop InferAligner, a novel inference-time alignment method that utilizes cross-model guidance for harmlessness alignment.
Experimental results show that our method can be very effectively applied to domain-specific models in finance, medicine, and mathematics.
It significantly diminishes the Attack Success Rate (ASR) of both harmful instructions and jailbreak attacks, while maintaining almost unchanged performance in downstream tasks.
arXiv Detail & Related papers (2024-01-20T10:41:03Z)
- Accelerated Convergence for Counterfactual Learning to Rank [65.63997193915257]
We show that the convergence rate of SGD approaches with IPS-weighted gradients suffers from the large variance introduced by the IPS weights.
We propose a novel learning algorithm, called CounterSample, that has provably better convergence than standard IPS-weighted gradient descent methods.
We prove that CounterSample converges faster and complement our theoretical findings with empirical results.
arXiv Detail & Related papers (2020-05-21T12:53:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.