Tail-aware Adversarial Attacks: A Distributional Approach to Efficient LLM Jailbreaking
- URL: http://arxiv.org/abs/2507.04446v2
- Date: Wed, 09 Jul 2025 11:52:25 GMT
- Title: Tail-aware Adversarial Attacks: A Distributional Approach to Efficient LLM Jailbreaking
- Authors: Tim Beyer, Yan Scholten, Leo Schwinn, Stephan Günnemann,
- Abstract summary: Existing adversarial attacks typically target harmful responses in single-point, greedy generations.<n>We propose a novel framework for adversarial evaluation that explicitly models the entire output distribution, including tail-risks.<n>Our framework also enables us to analyze how different attack algorithms affect output harm distributions.
- Score: 44.8238758047607
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To guarantee safe and robust deployment of large language models (LLMs) at scale, it is critical to accurately assess their adversarial robustness. Existing adversarial attacks typically target harmful responses in single-point, greedy generations, overlooking the inherently stochastic nature of LLMs. In this paper, we propose a novel framework for adversarial robustness evaluation that explicitly models the entire output distribution, including tail-risks, providing better estimates for model robustness at scale. By casting the attack process as a resource allocation problem between optimization and sampling, we determine compute-optimal tradeoffs and show that integrating sampling into existing attacks boosts ASR by up to 48% and improves efficiency by up to two orders of magnitude. Our framework also enables us to analyze how different attack algorithms affect output harm distributions. Surprisingly, we find that most optimization strategies have little effect on output harmfulness. Finally, we introduce a data-free proof-of-concept objective based on entropy-maximization to demonstrate how our tail-aware perspective enables new optimization targets. Overall, our findings highlight the importance of tail-aware attacks and evaluation protocols to accurately assess and strengthen LLM safety.
Related papers
- Leveraging Robust Optimization for LLM Alignment under Distribution Shifts [52.983390470606146]
Preference alignment methods are increasingly critical for steering large language models to generate outputs consistent with human values.<n>We propose a novel distribution-aware optimization framework that improves preference alignment despite such shifts.
arXiv Detail & Related papers (2025-04-08T09:14:38Z) - Efficient Safety Alignment of Large Language Models via Preference Re-ranking and Representation-based Reward Modeling [84.00480999255628]
Reinforcement Learning algorithms for safety alignment of Large Language Models (LLMs) encounter the challenge of distribution shift.<n>Current approaches typically address this issue through online sampling from the target policy.<n>We propose a new framework that leverages the model's intrinsic safety judgment capability to extract reward signals.
arXiv Detail & Related papers (2025-03-13T06:40:34Z) - Transferable Adversarial Attacks on SAM and Its Downstream Models [87.23908485521439]
This paper explores the feasibility of adversarial attacking various downstream models fine-tuned from the segment anything model (SAM)<n>To enhance the effectiveness of the adversarial attack towards models fine-tuned on unknown datasets, we propose a universal meta-initialization (UMI) algorithm.
arXiv Detail & Related papers (2024-10-26T15:04:04Z) - Self-Evolutionary Large Language Models through Uncertainty-Enhanced Preference Optimization [9.618391485742968]
Iterative preference optimization has recently become one of the de-facto training paradigms for large language models (LLMs)
We present an uncertainty-enhanced textbfPreference textbfOptimization framework to make the LLM self-evolve with reliable feedback.
Our framework substantially alleviates the noisy problem and improves the performance of iterative preference optimization.
arXiv Detail & Related papers (2024-09-17T14:05:58Z) - Jailbreaking as a Reward Misspecification Problem [80.52431374743998]
We propose a novel perspective that attributes this vulnerability to reward misspecification during the alignment process.<n>We introduce a metric ReGap to quantify the extent of reward misspecification and demonstrate its effectiveness.<n>We present ReMiss, a system for automated red teaming that generates adversarial prompts in a reward-misspecified space.
arXiv Detail & Related papers (2024-06-20T15:12:27Z) - MirrorCheck: Efficient Adversarial Defense for Vision-Language Models [55.73581212134293]
We propose a novel, yet elegantly simple approach for detecting adversarial samples in Vision-Language Models.
Our method leverages Text-to-Image (T2I) models to generate images based on captions produced by target VLMs.
Empirical evaluations conducted on different datasets validate the efficacy of our approach.
arXiv Detail & Related papers (2024-06-13T15:55:04Z) - Defending Large Language Models Against Attacks With Residual Stream Activation Analysis [0.0]
Large Language Models (LLMs) are vulnerable to adversarial threats.<n>This paper presents an innovative defensive strategy, given white box access to an LLM.<n>We apply a novel methodology for analyzing distinctive activation patterns in the residual streams for attack prompt classification.
arXiv Detail & Related papers (2024-06-05T13:06:33Z) - Adaptive importance sampling for heavy-tailed distributions via
$\alpha$-divergence minimization [2.879807093604632]
We propose an AIS algorithm that approximates the target by Student-t proposal distributions.
We adapt location and scale parameters by matching the escort moments of the target and the proposal.
These updates minimize the $alpha$-divergence between the target and the proposal, thereby connecting with variational inference.
arXiv Detail & Related papers (2023-10-25T14:07:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.