AdvWave: Stealthy Adversarial Jailbreak Attack against Large Audio-Language Models
- URL: http://arxiv.org/abs/2412.08608v1
- Date: Wed, 11 Dec 2024 18:30:57 GMT
- Title: AdvWave: Stealthy Adversarial Jailbreak Attack against Large Audio-Language Models
- Authors: Mintong Kang, Chejian Xu, Bo Li
- Abstract summary: Recent advancements in large audio-language models (LALMs) have enabled speech-based user interactions. However, ensuring the safety of LALMs is crucial to prevent risky outputs that may raise societal concerns or violate AI regulations.
- Score: 13.807596637437808
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advancements in large audio-language models (LALMs) have enabled speech-based user interactions, significantly enhancing user experience and accelerating the deployment of LALMs in real-world applications. However, ensuring the safety of LALMs is crucial to prevent risky outputs that may raise societal concerns or violate AI regulations. Despite the importance of this issue, research on jailbreaking LALMs remains limited due to their recent emergence and the additional technical challenges they present compared to attacks on DNN-based audio models. Specifically, the audio encoders in LALMs, which involve discretization operations, often lead to gradient shattering, hindering the effectiveness of attacks relying on gradient-based optimizations. The behavioral variability of LALMs further complicates the identification of effective (adversarial) optimization targets. Moreover, enforcing stealthiness constraints on adversarial audio waveforms introduces a reduced, non-convex feasible solution space, further intensifying the challenges of the optimization process. To overcome these challenges, we develop AdvWave, the first jailbreak framework against LALMs. We propose a dual-phase optimization method that addresses gradient shattering, enabling effective end-to-end gradient-based optimization. Additionally, we develop an adaptive adversarial target search algorithm that dynamically adjusts the adversarial optimization target based on the response patterns of LALMs for specific queries. To ensure that adversarial audio remains perceptually natural to human listeners, we design a classifier-guided optimization approach that generates adversarial noise resembling common urban sounds. Extensive evaluations on multiple advanced LALMs demonstrate that AdvWave outperforms baseline methods, achieving a 40% higher average jailbreak attack success rate.
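The two optimization ideas in the abstract can be illustrated with a minimal PyTorch sketch: a straight-through estimator is one standard way to restore gradients through a discretizing audio encoder, and a classifier-guided loss term steers the perturbation toward sounding like ambient noise. Everything below (the toy codebook, framing "encoder", stealth classifier, and loss weight) is an illustrative assumption, not AdvWave's actual implementation.

```python
import torch

# Toy sketch of the abstract's two ideas. All names and shapes here
# (codebook, framing "encoder", stealth classifier, loss weight) are
# illustrative assumptions, not AdvWave's released code.
codebook = torch.randn(256, 32)               # stand-in discrete audio codebook

def quantize_ste(z):
    """Nearest-codebook quantization with a straight-through estimator:
    the forward pass is discrete, the backward pass treats quantization
    as identity, which is one standard way around gradient shattering."""
    idx = torch.cdist(z, codebook).argmin(dim=-1)
    return z + (codebook[idx] - z).detach()

audio = torch.randn(1024)                      # benign carrier waveform (toy)
delta = torch.zeros(1024, requires_grad=True)  # adversarial perturbation
stealth_clf = torch.nn.Linear(1024, 2)         # toy "urban sound" classifier
target_emb = torch.randn(32)                   # stand-in adversarial target

opt = torch.optim.Adam([delta], lr=1e-2)
for _ in range(100):
    opt.zero_grad()
    frames = (audio + delta).unfold(0, 32, 32)   # toy framing "encoder"
    zq = quantize_ste(frames)                    # gradients now flow end-to-end
    jailbreak = ((zq.mean(0) - target_emb) ** 2).mean()
    # Classifier-guided stealth: push delta toward the "urban sound" class 0.
    stealth = -torch.log_softmax(stealth_clf(delta.view(1, -1)), -1)[0, 0]
    (jailbreak + 0.1 * stealth).backward()
    opt.step()
```

The STE trick is the pivotal design choice: keeping the forward pass discrete while the backward pass sees an identity map is what makes end-to-end gradient-based optimization feasible at all.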
Related papers
- $\nabla$-Reasoner: LLM Reasoning via Test-Time Gradient Descent in Latent Space [71.23672814629448]
$\nabla$-Reasoner is an iterative generation framework that integrates differentiable optimization over token logits into the decoding loop. $\nabla$-Reasoner achieves over 20% accuracy improvement on a challenging mathematical reasoning benchmark.
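A minimal sketch of what "differentiable optimization over token logits in the decoding loop" could look like, assuming a toy vocabulary and a stand-in differentiable scorer; the paper's actual objective and decoding loop are not reproduced here.

```python
import torch

# Test-time gradient descent over token logits (toy sketch; the embedding
# table and "verifier" network are assumptions for illustration).
vocab, dim = 100, 16
emb = torch.randn(vocab, dim)                   # toy token embedding table
scorer = torch.nn.Linear(dim, 1)                # stand-in differentiable scorer

logits = torch.zeros(8, vocab, requires_grad=True)  # 8 positions to optimize
opt = torch.optim.Adam([logits], lr=0.1)
for _ in range(50):
    opt.zero_grad()
    probs = logits.softmax(-1)          # relax discrete tokens onto the simplex
    loss = -scorer(probs @ emb).mean()  # ascend the score of expected embeddings
    loss.backward()
    opt.step()
tokens = logits.argmax(-1)              # discretize once optimization is done
```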
arXiv Detail & Related papers (2026-03-05T08:42:54Z) - Scores Know Bob's Voice: Speaker Impersonation Attack [8.404098071525473]
We propose an inversion-based generative attack framework that aligns the latent space of a synthesis model with the discriminative feature space of SRSs. Experiments show that our method significantly improves query efficiency, achieving competitive attack success rates with on average 10x fewer queries than prior approaches.
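A white-box toy analogue of the latent-alignment idea; the paper's query-efficient black-box machinery is omitted, and `synth` and `srs_encoder` are hypothetical linear stand-ins for real models.

```python
import torch

# Optimize a synthesizer latent so the generated audio's speaker embedding
# matches a target speaker's embedding (toy white-box sketch).
synth = torch.nn.Linear(64, 1024)         # latent -> waveform (toy)
srs_encoder = torch.nn.Linear(1024, 128)  # waveform -> speaker embedding (toy)
target_emb = torch.randn(128)             # victim speaker's embedding

z = torch.randn(64, requires_grad=True)
opt = torch.optim.Adam([z], lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    emb = srs_encoder(synth(z))
    loss = 1 - torch.cosine_similarity(emb, target_emb, dim=0)
    loss.backward()
    opt.step()                            # z now synthesizes a matching voice
```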
arXiv Detail & Related papers (2026-03-03T09:20:48Z) - AsynDBT: Asynchronous Distributed Bilevel Tuning for efficient In-Context Learning with Large Language Models [4.4866154758274375]
In-context learning (ICL) has emerged as a promising paradigm that enables LLMs to adapt to new tasks using examples provided within the input. Previous FL approaches that incorporate ICL have struggled with severe straggler problems and challenges associated with heterogeneous, non-identically distributed data. We propose an asynchronous distributed bilevel tuning (AsynDBT) algorithm that optimizes both in-context learning samples and prompt fragments based on feedback from the LLM.
arXiv Detail & Related papers (2026-02-06T13:07:49Z) - MAESTRO: Meta-learning Adaptive Estimation of Scalarization Trade-offs for Reward Optimization [56.074760766965085]
Group-Relative Policy Optimization has emerged as an efficient paradigm for aligning Large Language Models (LLMs). We propose MAESTRO, which treats reward scalarization as a dynamic latent policy, leveraging the model's terminal hidden states as a semantic bottleneck. We formulate this as a contextual bandit problem within a bi-level optimization framework, where a lightweight Conductor network co-evolves with the policy by utilizing group-relative advantages as a meta-reward signal.
arXiv Detail & Related papers (2026-01-12T05:02:48Z) - Rethinking Prompt Optimization: Reinforcement, Diversification, and Migration in Blackbox LLMs [10.434732630519377]
We propose a novel Automatic Prompt Optimization (APO) framework centered on enhancing the feedback mechanism. To mitigate the noise inherent in LLM-generated feedback, we introduce a technique called feedback diversification. Our approach consistently outperforms strong baselines, achieving significant accuracy improvements, faster convergence, and lower computational costs.
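A minimal sketch of what feedback diversification could look like, assuming a hypothetical `ask_llm` client: sample several independent critiques and keep only themes that recur.

```python
from collections import Counter

def ask_llm(prompt: str) -> str:
    """Hypothetical LLM call; swap in a real client to run this."""
    raise NotImplementedError

def diversified_feedback(task: str, failures: str, n: int = 5) -> list[str]:
    # Sample several independent critiques and keep only recurring themes,
    # damping the noise of any single LLM-generated feedback.
    critiques = [ask_llm(f"{task}\nFailed cases:\n{failures}\nCritique:")
                 for _ in range(n)]
    themes = Counter(line.strip() for c in critiques
                     for line in c.splitlines() if line.strip())
    return [theme for theme, count in themes.most_common() if count >= 2]
```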
arXiv Detail & Related papers (2025-07-14T00:20:14Z) - Federated Learning-Enabled Hybrid Language Models for Communication-Efficient Token Transmission [87.68447072141402]
Hybrid Language Models (HLMs) combine the low-latency efficiency of Small Language Models (SLMs) on edge devices with the high accuracy of Large Language Models (LLMs) on centralized servers. We propose FedHLM, a communication-efficient HLM framework that integrates uncertainty-aware inference with Federated Learning (FL).
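A sketch of the uncertainty-aware routing half of the idea, under assumed interfaces (`slm`, `llm` as stand-in callables returning logits, and an assumed threshold `tau`); the federated aggregation component is not shown.

```python
import torch

def route(prompt_ids, slm, llm, tau: float = 0.8):
    """Answer on-device when the small model is confident, else transmit
    to the server LLM (toy routing sketch; interfaces are hypothetical)."""
    logits = slm(prompt_ids)
    confidence = logits.softmax(-1).max(-1).values
    if confidence.min() >= tau:
        return "slm", logits              # no uplink token transmission
    return "llm", llm(prompt_ids)         # escalate to the server model
```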
arXiv Detail & Related papers (2025-06-30T02:56:11Z) - Semantic-Preserving Adversarial Attacks on LLMs: An Adaptive Greedy Binary Search Approach [15.658579092368981]
Large Language Models (LLMs) increasingly rely on automatic prompt engineering in graphical user interfaces (GUIs) to refine user inputs and enhance response accuracy. We propose the Adaptive Greedy Binary Search (AGBS) method, which simulates common prompt optimization mechanisms while preserving semantic stability.
arXiv Detail & Related papers (2025-05-26T15:41:06Z) - Reshaping Representation Space to Balance the Safety and Over-rejection in Large Audio Language Models [50.89022445197919]
Large Audio Language Models (LALMs) have extended the capabilities of Large Language Models (LLMs). Recent research has revealed that LALMs remain vulnerable to harmful queries due to insufficient safety alignment.
arXiv Detail & Related papers (2025-05-26T08:25:25Z) - LARGO: Latent Adversarial Reflection through Gradient Optimization for Jailbreaking LLMs [13.432303050813864]
We introduce LARGO, a novel latent self-reflection attack that generates fluent jailbreaking prompts. On benchmarks like AdvBench and JailbreakBench, LARGO surpasses leading jailbreaking techniques, including AutoDAN, by 44 points in attack success rate.
arXiv Detail & Related papers (2025-05-16T04:12:16Z) - DARS: Dynamic Action Re-Sampling to Enhance Coding Agent Performance by Adaptive Tree Traversal [55.13854171147104]
Large Language Models (LLMs) have revolutionized various domains, including natural language processing, data analysis, and software development.
We present Dynamic Action Re-Sampling (DARS), a novel inference time compute scaling approach for coding agents.
We evaluate our approach on the SWE-Bench Lite benchmark, demonstrating that this scaling strategy achieves a pass@k score of 55% with Claude 3.5 Sonnet V2.
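A greedy stand-in for the re-sampling idea, assuming hypothetical `ask_agent` and `score` callables; the paper's adaptive tree traversal is more selective about where extra samples are spent.

```python
def ask_agent(state: str) -> str:
    """Hypothetical coding-agent step; swap in a real agent to run this."""
    raise NotImplementedError

def resample_rollout(task: str, score, width: int = 3, depth: int = 4) -> str:
    # Inference-time compute scaling sketch: at each step, re-sample several
    # candidate actions and expand only the best-scoring one.
    state = task
    for _ in range(depth):
        candidates = [ask_agent(state) for _ in range(width)]
        state = max(candidates, key=score)
    return state
```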
arXiv Detail & Related papers (2025-03-18T14:02:59Z) - PEARL: Towards Permutation-Resilient LLMs [29.55886726376898]
In-context learning (ICL) capability of large language models (LLMs) enables them to perform challenging tasks using provided demonstrations.
ICL is highly sensitive to the ordering of demonstrations, leading to instability in predictions.
This paper shows that this vulnerability can be exploited to design a natural attack that achieves a success rate of nearly 80% on LLaMA-3.
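The vulnerability is easy to probe: the sketch below exhaustively reorders a small demonstration set, assuming a hypothetical `ask_llm` client and an `is_wrong` oracle.

```python
from itertools import permutations

def ask_llm(prompt: str) -> str:
    """Hypothetical LLM call; swap in a real client to run this."""
    raise NotImplementedError

def permutation_attack(demos: list[str], query: str, is_wrong):
    # The demonstrations never change; only their order does. Any ordering
    # that flips the prediction is a successful "natural" attack.
    # Exhaustive search is factorial, so this only suits small demo sets.
    for order in permutations(demos):
        if is_wrong(ask_llm("\n".join(order) + "\n" + query)):
            return order
    return None
```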
arXiv Detail & Related papers (2025-02-20T15:07:02Z) - CAMEL: Continuous Action Masking Enabled by Large Language Models for Reinforcement Learning [3.602902292270654]
Reinforcement learning (RL) in continuous action spaces encounters persistent challenges, such as inefficient exploration and convergence to suboptimal solutions.
We propose CAMEL, a novel framework integrating LLM-generated suboptimal policies into the RL training pipeline.
arXiv Detail & Related papers (2025-02-17T15:22:19Z) - Adversarial Reasoning at Jailbreaking Time [49.70772424278124]
We develop an adversarial reasoning approach to automatic jailbreaking via test-time computation.
Our approach introduces a new paradigm in understanding LLM vulnerabilities, laying the foundation for the development of more robust and trustworthy AI systems.
arXiv Detail & Related papers (2025-02-03T18:59:01Z) - GASP: Efficient Black-Box Generation of Adversarial Suffixes for Jailbreaking LLMs [3.096869664709865]
We introduce Generative Adversarial Suffix Prompter (GASP) to improve adversarial suffix creation in a fully black-box setting.
Our experiments show that GASP can generate natural jailbreak prompts, significantly improving attack success rates, reducing training times, and accelerating inference speed.
arXiv Detail & Related papers (2024-11-21T14:00:01Z) - Less is More: Extreme Gradient Boost Rank-1 Adaption for Efficient Finetuning of LLMs [75.11449420928139]
Fine-tuning Large Language Models (LLMs) has become a crucial technique for adapting pre-trained models to downstream tasks.
Low-Rank Adaptation (LoRA) has emerged as a promising solution, but there exists a gap between the practical performance of low-rank adaptation and its theoretical optimum.
We propose eXtreme Gradient Boosting LoRA, a novel framework that bridges this gap by leveraging the power of ensemble learning.
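A toy sketch of the boosting intuition: greedily train and freeze a sequence of rank-1 updates on top of a frozen weight. The task, shapes, and training loop are illustrative assumptions, not the paper's recipe.

```python
import torch

d_in, d_out, rounds = 64, 64, 8
W = torch.randn(d_out, d_in)               # frozen pretrained weight (toy)
boosters = []
for _ in range(rounds):
    a = torch.randn(d_out, 1, requires_grad=True)
    b = torch.randn(1, d_in, requires_grad=True)
    opt = torch.optim.Adam([a, b], lr=1e-2)
    for _ in range(100):
        opt.zero_grad()
        W_eff = W + sum(boosters) + a @ b  # earlier boosters stay frozen
        x = torch.randn(16, d_in)
        loss = ((x @ W_eff.T).relu().mean() - 1.0) ** 2   # toy objective
        loss.backward()
        opt.step()
    boosters.append((a @ b).detach())      # freeze this round's rank-1 update
```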
arXiv Detail & Related papers (2024-10-25T17:07:13Z) - QROA: A Black-Box Query-Response Optimization Attack on LLMs [2.7624021966289605]
Large Language Models (LLMs) have surged in popularity in recent months, yet they possess capabilities for generating harmful content when manipulated.
This study introduces the Query-Response Optimization Attack (QROA), an optimization-based strategy designed to exploit LLMs through a black-box, query-only interaction.
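A hill-climbing sketch in the spirit of query-only optimization, assuming a hypothetical black-box `ask_llm` interface and a `reward` scorer; QROA's learned surrogate model is omitted.

```python
import random

def ask_llm(prompt: str) -> str:
    """Hypothetical black-box, query-only interface to the target LLM."""
    raise NotImplementedError

def query_only_attack(query: str, reward, vocab: list[str],
                      length: int = 10, iters: int = 200):
    # Mutate one suffix token at a time, keeping a mutation only when the
    # black-box response's reward improves.
    suffix = [random.choice(vocab) for _ in range(length)]
    best = reward(ask_llm(query + " " + " ".join(suffix)))
    for _ in range(iters):
        i = random.randrange(length)
        old, suffix[i] = suffix[i], random.choice(vocab)
        score = reward(ask_llm(query + " " + " ".join(suffix)))
        if score > best:
            best = score
        else:
            suffix[i] = old                # revert a non-improving mutation
    return suffix, best
```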
arXiv Detail & Related papers (2024-06-04T07:27:36Z) - LLM as a Complementary Optimizer to Gradient Descent: A Case Study in Prompt Tuning [69.95292905263393]
In this paper, we show that gradient-based optimization and high-level LLM-based optimizers are complementary to each other and can effectively collaborate in a combined optimization framework.
arXiv Detail & Related papers (2024-05-30T06:24:14Z) - Improved Generation of Adversarial Examples Against Safety-aligned LLMs [72.38072942860309]
Adversarial prompts generated using gradient-based methods exhibit outstanding performance in performing automatic jailbreak attacks against safety-aligned LLMs.
In this paper, we explore a new perspective on this problem, suggesting that it can be alleviated by leveraging innovations inspired by transfer-based attacks.
We show that 87% of the query-specific adversarial suffixes generated by the developed combination can induce Llama-2-7B-Chat to produce output that exactly matches the target string on AdvBench.
arXiv Detail & Related papers (2024-05-28T06:10:12Z) - Advancing the Robustness of Large Language Models through Self-Denoised Smoothing [50.54276872204319]
Large language models (LLMs) have achieved significant success, but their vulnerability to adversarial perturbations has raised considerable concerns.
We propose to leverage the multitasking nature of LLMs to first denoise the noisy inputs and then make predictions based on these denoised versions.
Unlike previous denoised smoothing techniques in computer vision, which require training a separate model to enhance the robustness of LLMs, our method offers significantly better efficiency and flexibility.
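A minimal sketch of the denoise-then-predict loop with majority voting, assuming a hypothetical `ask_llm` client; the prompts and mask rate are placeholders, not the paper's settings.

```python
import random
from collections import Counter

def ask_llm(prompt: str) -> str:
    """Hypothetical LLM call; swap in a real client to run this."""
    raise NotImplementedError

def self_denoised_predict(text: str, n: int = 10, p: float = 0.3) -> str:
    # Randomized smoothing with the LLM as its own denoiser: randomly mask
    # words, ask the model to reconstruct the input, predict on the cleaned
    # version, then take a majority vote over the n smoothed predictions.
    votes = []
    for _ in range(n):
        masked = " ".join(w if random.random() > p else "[MASK]"
                          for w in text.split())
        denoised = ask_llm("Fill in the masked words: " + masked)
        votes.append(ask_llm("Classify this text: " + denoised))
    return Counter(votes).most_common(1)[0][0]
```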
arXiv Detail & Related papers (2024-04-18T15:47:00Z) - Optimization-based Prompt Injection Attack to LLM-as-a-Judge [78.20257854455562]
LLM-as-a-Judge uses a large language model (LLM) to select the best response from a set of candidates for a given question.
We propose JudgeDeceiver, an optimization-based prompt injection attack to LLM-as-a-Judge.
Our evaluation shows that JudgeDeceiver is highly effective, and is much more effective than existing prompt injection attacks.
arXiv Detail & Related papers (2024-03-26T13:58:00Z) - RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content [62.685566387625975]
Current mitigation strategies, while effective, are not resilient under adversarial attacks.
This paper introduces Resilient Guardrails for Large Language Models (RigorLLM), a novel framework designed to efficiently moderate harmful and unsafe inputs.
arXiv Detail & Related papers (2024-03-19T07:25:02Z)