"I am bad": Interpreting Stealthy, Universal and Robust Audio Jailbreaks in Audio-Language Models
- URL: http://arxiv.org/abs/2502.00718v1
- Date: Sun, 02 Feb 2025 08:36:23 GMT
- Title: "I am bad": Interpreting Stealthy, Universal and Robust Audio Jailbreaks in Audio-Language Models
- Authors: Isha Gupta, David Khachaturov, Robert Mullins
- Abstract summary: This paper explores audio jailbreaks targeting Audio-Language Models (ALMs). We construct adversarial perturbations that generalize across prompts, tasks, and even base audio samples. We analyze how ALMs interpret these audio adversarial examples and reveal them to encode imperceptible first-person toxic speech.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rise of multimodal large language models has introduced innovative human-machine interaction paradigms but also significant challenges in machine learning safety. Audio-Language Models (ALMs) are especially relevant due to the intuitive nature of spoken communication, yet little is known about their failure modes. This paper explores audio jailbreaks targeting ALMs, focusing on their ability to bypass alignment mechanisms. We construct adversarial perturbations that generalize across prompts, tasks, and even base audio samples, demonstrating the first universal jailbreaks in the audio modality, and show that these remain effective in simulated real-world conditions. Beyond demonstrating attack feasibility, we analyze how ALMs interpret these audio adversarial examples and reveal them to encode imperceptible first-person toxic speech - suggesting that the most effective perturbations for eliciting toxic outputs specifically embed linguistic features within the audio signal. These results have important implications for understanding the interactions between different modalities in multimodal models, and offer actionable insights for enhancing defenses against adversarial audio attacks.
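As a rough illustration of what optimizing such a universal perturbation involves, the sketch below learns one short waveform perturbation across a pool of base audio clips and text prompts. The differentiable wrapper alm_loss, the data, and every hyperparameter are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of a universal audio-jailbreak optimization loop. It assumes a
# differentiable wrapper alm_loss(audio, prompt, target_text) that returns the
# cross-entropy of the desired (jailbroken) response under the ALM; wrapper,
# data, and hyperparameters are illustrative placeholders.
import torch
import torch.nn.functional as F

def optimize_universal_perturbation(alm_loss, audio_clips, prompts, target_text,
                                     epsilon=0.01, step_size=1e-3, num_steps=500,
                                     sample_rate=16000, delta_seconds=1.0):
    """Learn one perturbation that transfers across prompts and base audio clips."""
    delta = torch.zeros(int(sample_rate * delta_seconds), requires_grad=True)
    for _ in range(num_steps):
        total_loss = 0.0
        for audio in audio_clips:  # each clip: 1-D waveform tensor, at least delta_seconds long
            # Pad the perturbation to the clip length and overlay it at the start.
            padded = F.pad(delta, (0, audio.numel() - delta.numel()))
            perturbed = audio + padded
            for prompt in prompts:
                total_loss = total_loss + alm_loss(perturbed, prompt, target_text)
        total_loss.backward()
        with torch.no_grad():
            # Signed-gradient step inside an L-infinity ball, keeping the
            # perturbation quiet relative to the carrier audio ("stealthy").
            delta -= step_size * delta.grad.sign()
            delta.clamp_(-epsilon, epsilon)
            delta.grad.zero_()
    return delta.detach()
```

In practice one would batch clips and prompts and restart from several random initializations, but the structure above captures the essence of searching for an input-agnostic, universal perturbation.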
Related papers
- Multilingual and Multi-Accent Jailbreaking of Audio LLMs
Multi-AudioJail is the first systematic framework to exploit multilingual and multi-accent audio jailbreaks.
We show how acoustic perturbations interact with cross-lingual phonetics to cause jailbreak success rates to surge.
We plan to release our dataset to spur research into cross-modal defenses.
arXiv Detail & Related papers (2025-04-01T18:12:23Z) - Exploiting Vulnerabilities in Speech Translation Systems through Targeted Adversarial Attacks [59.87470192277124]
This paper explores methods of compromising speech translation systems through imperceptible audio manipulations.
We present two innovative approaches: (1) the injection of perturbations into source audio, and (2) the generation of adversarial music designed to guide targeted translation.
Our experiments reveal that carefully crafted audio perturbations can mislead translation models into producing targeted, harmful outputs, while adversarial music achieves this goal more covertly.
The implications of this research extend beyond immediate security concerns, shedding light on the interpretability and robustness of neural speech processing systems.
arXiv Detail & Related papers (2025-03-02T16:38:16Z) - Tune In, Act Up: Exploring the Impact of Audio Modality-Specific Edits on Large Audio Language Models in Jailbreak [35.62727804915181]
This paper investigates how audio-specific edits influence the inference of Large Audio-Language Models (LALMs) in jailbreak settings. We introduce the Audio Editing Toolbox (AET), which enables audio-modality edits such as tone adjustment, word emphasis, and noise injection. We also conduct extensive evaluations of state-of-the-art LALMs to assess their robustness under different audio edits.
arXiv Detail & Related papers (2025-01-23T15:51:38Z) - Where are we in audio deepfake detection? A systematic analysis over generative and detection models [59.09338266364506]
SONAR is a synthetic AI-Audio Detection Framework and Benchmark.
It provides a comprehensive evaluation for distinguishing cutting-edge AI-synthesized auditory content.
It is the first framework to uniformly benchmark AI-audio detection across both traditional and foundation model-based detection systems.
arXiv Detail & Related papers (2024-10-06T01:03:42Z) - Self-Powered LLM Modality Expansion for Large Speech-Text Models [62.27700381806554]
Large language models (LLMs) exhibit remarkable performance across diverse tasks.
This study aims to refine the use of speech datasets for LSM training by addressing the limitations of vanilla instruction tuning.
We introduce a self-powered LSM that leverages augmented automatic speech recognition data generated by the model itself for more effective instruction tuning.
arXiv Detail & Related papers (2024-10-04T04:34:24Z) - DeSTA2: Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data [84.01401439030265]
Recent end-to-end speech language models (SLMs) have expanded upon the capabilities of large language models (LLMs). We present a simple yet effective automatic process for creating speech-text pair data. Our model demonstrates general capabilities for speech-related tasks without the need for speech instruction-tuning data.
arXiv Detail & Related papers (2024-09-30T07:01:21Z) - Controlling Whisper: Universal Acoustic Adversarial Attacks to Control Speech Foundation Models [3.1511847280063696]
Speech-enabled foundation models can perform tasks other than automatic speech recognition when given an appropriate prompt.
With the development of audio-prompted large language models there is the potential for even greater control options.
We demonstrate that with this greater flexibility the systems can be susceptible to model-control adversarial attacks.
arXiv Detail & Related papers (2024-07-05T13:04:31Z) - Muting Whisper: A Universal Acoustic Adversarial Attack on Speech Foundation Models [5.942307521138583]
We show that special tokens can be exploited by adversarial attacks to manipulate the model's behavior.
We propose a simple yet effective method to learn a universal acoustic realization of Whisper's <|endoftext|> token.
Experiments demonstrate that the same, universal 0.64-second adversarial audio segment can successfully mute a target Whisper ASR model for over 97% of speech samples.
arXiv Detail & Related papers (2024-05-09T22:59:23Z) - Paralinguistics-Aware Speech-Empowered Large Language Models for Natural Conversation [46.93969003104427]
This paper introduces an extensive speech-text LLM framework, the Unified Spoken Dialog Model (USDM). USDM is designed to generate coherent spoken responses with naturally occurring prosodic features relevant to the given input speech. Our approach effectively generates natural-sounding spoken responses, surpassing previous and cascaded baselines.
arXiv Detail & Related papers (2024-02-08T14:35:09Z) - Membership Inference Attacks Against Self-supervised Speech Models [62.73937175625953]
Self-supervised learning (SSL) on continuous speech has started gaining attention.
We present the first privacy analysis on several SSL speech models using Membership Inference Attacks (MIA) under black-box access.
arXiv Detail & Related papers (2021-11-09T13:00:24Z) - Multi-task self-supervised learning for Robust Speech Recognition [75.11748484288229]
This paper proposes PASE+, an improved version of PASE for robust speech recognition in noisy and reverberant environments.
We employ an online speech distortion module that contaminates the input signals with a variety of random disturbances.
We then propose a revised encoder that better learns short- and long-term speech dynamics with an efficient combination of recurrent and convolutional networks.
arXiv Detail & Related papers (2020-01-25T00:24:45Z)
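For context on the augmentation idea in the PASE+ entry above, here is a toy, self-contained sketch of an online speech distortion module: each waveform is contaminated on the fly with a random mix of disturbances. The specific distortions and parameter ranges are illustrative assumptions, not the paper's exact recipe.

```python
# Toy online speech distortion module: randomly applies additive noise,
# crude synthetic reverberation, and hard clipping to a 1-D waveform.
# All distortion types and parameter ranges are illustrative assumptions.
import numpy as np

def distort(waveform: np.ndarray, sample_rate: int = 16000, rng=None) -> np.ndarray:
    """Apply a random mix of disturbances to a 1-D float waveform."""
    rng = rng or np.random.default_rng()
    out = waveform.astype(np.float32).copy()

    # Additive white noise at a random SNR between 5 and 20 dB.
    if rng.random() < 0.5:
        snr_db = rng.uniform(5.0, 20.0)
        signal_power = float(np.mean(out ** 2)) + 1e-12
        noise_power = signal_power / (10.0 ** (snr_db / 10.0))
        out = out + rng.normal(0.0, np.sqrt(noise_power), size=out.shape).astype(np.float32)

    # Crude reverberation: convolve with an exponentially decaying noise tail.
    if rng.random() < 0.5:
        tail = rng.normal(0.0, 1.0, int(0.2 * sample_rate)).astype(np.float32)
        tail *= np.exp(-np.linspace(0.0, 8.0, tail.size)).astype(np.float32)
        impulse = np.concatenate(([1.0], 0.1 * tail)).astype(np.float32)
        out = np.convolve(out, impulse, mode="full")[: out.size]

    # Hard clipping to mimic an overdriven microphone.
    if rng.random() < 0.3:
        threshold = rng.uniform(0.3, 0.8) * (np.max(np.abs(out)) + 1e-12)
        out = np.clip(out, -threshold, threshold)

    return out.astype(np.float32)
```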