Adversarial Attack on Large Language Models using Exponentiated Gradient Descent
- URL: http://arxiv.org/abs/2505.09820v1
- Date: Wed, 14 May 2025 21:50:46 GMT
- Title: Adversarial Attack on Large Language Models using Exponentiated Gradient Descent
- Authors: Sajib Biswas, Mao Nishino, Samuel Jacob Chacko, Xiuwen Liu,
- Abstract summary: Large Language Models are vulnerable to jailbreaking attacks.<n>We develop an intrinsic optimization technique using exponentiated gradient descent.<n>We show that the technique achieves a higher success rate with great efficiency compared to three other state-of-the-art jailbreaking techniques.
- Score: 1.1187085721899017
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As Large Language Models (LLMs) are widely used, understanding them systematically is key to improving their safety and realizing their full potential. Although many models are aligned using techniques such as reinforcement learning from human feedback (RLHF), they are still vulnerable to jailbreaking attacks. Some of the existing adversarial attack methods search for discrete tokens that may jailbreak a target model while others try to optimize the continuous space represented by the tokens of the model's vocabulary. While techniques based on the discrete space may prove to be inefficient, optimization of continuous token embeddings requires projections to produce discrete tokens, which might render them ineffective. To fully utilize the constraints and the structures of the space, we develop an intrinsic optimization technique using exponentiated gradient descent with the Bregman projection method to ensure that the optimized one-hot encoding always stays within the probability simplex. We prove the convergence of the technique and implement an efficient algorithm that is effective in jailbreaking several widely used LLMs. We demonstrate the efficacy of the proposed technique using five open-source LLMs on four openly available datasets. The results show that the technique achieves a higher success rate with great efficiency compared to three other state-of-the-art jailbreaking techniques. The source code for our implementation is available at: https://github.com/sbamit/Exponentiated-Gradient-Descent-LLM-Attack
Related papers
- Using Mechanistic Interpretability to Craft Adversarial Attacks against Large Language Models [1.6112718683989882]
We introduce a novel white-box approach for creating adversarial perturbations against LLMs.<n>We first identify acceptance subspaces - sets of feature vectors that do not trigger the model's refusal mechanisms.<n>We then use gradient-based optimization to reroute embeddings from refusal subspaces to acceptance subspaces, effectively achieving jailbreaks.
arXiv Detail & Related papers (2025-03-08T16:29:45Z) - Adversarial Attacks on Large Language Models Using Regularized Relaxation [1.042748558542389]
Large Language Models (LLMs) are used for numerous practical applications.
adversarial attack methods are extensively used to study and understand these vulnerabilities.
We propose a novel technique for adversarial attacks that overcomes these limitations by leveraging regularized gradients with continuous optimization methods.
arXiv Detail & Related papers (2024-10-24T21:01:45Z) - Language Models as Zero-shot Lossless Gradient Compressors: Towards General Neural Parameter Prior Models [56.00251589760559]
Large language models (LLMs) can act as gradient priors in a zero-shot setting.<n>We introduce LM-GC, a novel method that integrates LLMs with arithmetic coding.<n>Experiments indicate that LM-GC surpasses existing state-of-the-art lossless compression methods.
arXiv Detail & Related papers (2024-09-26T13:38:33Z) - VeLoRA: Memory Efficient Training using Rank-1 Sub-Token Projections [35.133698935322634]
Large language models (LLMs) have recently emerged as powerful tools for tackling many language-processing tasks.
We identify and characterise the important components needed for effective model convergence using gradient descent.
This result leads us to a cheap and memory-efficient algorithm for both fine-tuning and pre-training LLMs.
arXiv Detail & Related papers (2024-05-28T09:23:14Z) - Efficient LLM Jailbreak via Adaptive Dense-to-sparse Constrained Optimization [46.98249466236357]
Large language models (LLMs) are susceptible to jailbreaking attacks that can generate harmful content.<n>This paper introduces a novel token-level attack method, Adaptive-to-Sparse Constrained Optimization (ADC), which has been shown to successfully jailbreak multiple open-source LLMs.
arXiv Detail & Related papers (2024-05-15T06:11:24Z) - Advancing the Robustness of Large Language Models through Self-Denoised Smoothing [50.54276872204319]
Large language models (LLMs) have achieved significant success, but their vulnerability to adversarial perturbations has raised considerable concerns.
We propose to leverage the multitasking nature of LLMs to first denoise the noisy inputs and then to make predictions based on these denoised versions.
Unlike previous denoised smoothing techniques in computer vision, which require training a separate model to enhance the robustness of LLMs, our method offers significantly better efficiency and flexibility.
arXiv Detail & Related papers (2024-04-18T15:47:00Z) - Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration [54.897493351694195]
We propose a novel parallel decoding approach, namely textithidden transfer, which decodes multiple successive tokens simultaneously in a single forward pass.
In terms of acceleration metrics, we outperform all the single-model acceleration techniques, including Medusa and Self-Speculative decoding.
arXiv Detail & Related papers (2024-04-18T09:17:06Z) - Distributed Dynamic Safe Screening Algorithms for Sparse Regularization [73.85961005970222]
We propose a new distributed dynamic safe screening (DDSS) method for sparsity regularized models and apply it on shared-memory and distributed-memory architecture respectively.
We prove that the proposed method achieves the linear convergence rate with lower overall complexity and can eliminate almost all the inactive features in a finite number of iterations almost surely.
arXiv Detail & Related papers (2022-04-23T02:45:55Z) - BERT-ATTACK: Adversarial Attack Against BERT Using BERT [77.82947768158132]
Adrial attacks for discrete data (such as texts) are more challenging than continuous data (such as images)
We propose textbfBERT-Attack, a high-quality and effective method to generate adversarial samples.
Our method outperforms state-of-the-art attack strategies in both success rate and perturb percentage.
arXiv Detail & Related papers (2020-04-21T13:30:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.