Related papers: SoundBreak: A Systematic Study of Audio-Only Adversarial Attacks on Trimodal Models

URL: http://arxiv.org/abs/2601.16231v1
Date: Tue, 20 Jan 2026 18:53:29 GMT
Title: SoundBreak: A Systematic Study of Audio-Only Adversarial Attacks on Trimodal Models
Authors: Aafiya Hussain, Gaurav Srivastava, Alvi Ishmam, Zaber Hakim, Chris Thomas
Abstract summary: Multimodal foundation models that integrate audio, vision, and language achieve strong performance on reasoning and generation tasks. We study a realistic and underexplored threat model: audio-only adversarial attacks on trimodal audio-video-language models. We show that audio-only perturbations can induce severe multimodal failures, achieving up to 96% attack success rate.
Score: 1.7424550973815194
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multimodal foundation models that integrate audio, vision, and language achieve strong performance on reasoning and generation tasks, yet their robustness to adversarial manipulation remains poorly understood. We study a realistic and underexplored threat model: untargeted, audio-only adversarial attacks on trimodal audio-video-language models. We analyze six complementary attack objectives that target different stages of multimodal processing, including audio encoder representations, cross-modal attention, hidden states, and output likelihoods. Across three state-of-the-art models and multiple benchmarks, we show that audio-only perturbations can induce severe multimodal failures, achieving up to 96% attack success rate. We further show that attacks can be successful at low perceptual distortions (LPIPS <= 0.08, SI-SNR >= 0) and benefit more from extended optimization than increased data scale. Transferability across models and encoders remains limited, while speech recognition systems such as Whisper primarily respond to perturbation magnitude, achieving >97% attack success under severe distortion. These results expose a previously overlooked single-modality attack surface in multimodal systems and motivate defenses that enforce cross-modal consistency.
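The SI-SNR >= 0 bound quoted in the abstract can be made concrete with a small sketch. The code below assumes the commonly used definition of scale-invariant signal-to-noise ratio (zero-mean signals, projection of the perturbed audio onto the clean audio); the paper's exact computation is not given in this listing and may differ. The function name `si_snr_db` and the `eps` parameter are illustrative, not taken from the paper.

```python
import math


def si_snr_db(reference, estimate, eps=1e-12):
    """Scale-invariant SNR in dB (assumed standard definition, not the paper's code).

    reference: clean audio samples; estimate: perturbed audio samples.
    """
    n = len(reference)
    if n == 0 or n != len(estimate):
        raise ValueError("signals must be non-empty and equally long")
    # Remove the mean so a constant offset does not affect the score.
    ref_mean = sum(reference) / n
    est_mean = sum(estimate) / n
    ref = [x - ref_mean for x in reference]
    est = [x - est_mean for x in estimate]
    # Project the estimate onto the reference to get the target component.
    dot = sum(r * e for r, e in zip(ref, est))
    ref_energy = sum(r * r for r in ref)
    scale = dot / (ref_energy + eps)
    target = [scale * r for r in ref]
    # Everything not explained by the reference counts as noise.
    noise = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(v * v for v in noise)
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))
```

Under this definition, 0 dB corresponds to a perturbation that is uncorrelated with the clean audio and carries the same energy as it, so SI-SNR >= 0 means the perturbation energy does not exceed that of the clean signal.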

Related papers

Breaking Audio Large Language Models by Attacking Only the Encoder: A Universal Targeted Latent-Space Audio Attack [0.0]
We propose a universal targeted latent space attack on audio-language models. Our approach learns a universal perturbation that generalizes across inputs and speakers and does not require access to the language model.
arXiv Detail & Related papers (2025-12-29T21:56:13Z)
Beyond Text: Multimodal Jailbreaking of Vision-Language and Audio Models through Perceptually Simple Transformations [0.0]
Multimodal large language models (MLLMs) have achieved remarkable progress, yet remain critically vulnerable to adversarial attacks. We present a systematic study of multimodal jailbreaks targeting both vision-language and audio-language models. Our evaluation spans 1,900 adversarial prompts across three high-risk safety categories.
arXiv Detail & Related papers (2025-10-23T05:16:33Z)
Backdoor Attacks Against Speech Language Models [63.07317091368079]
We present the first systematic study of audio backdoor attacks against speech language models. We demonstrate its effectiveness across four speech encoders and three datasets, covering four tasks. We propose a fine-tuning-based defense that mitigates the threat of poisoned pretrained encoders.
arXiv Detail & Related papers (2025-10-01T17:45:04Z)
Measuring the Robustness of Audio Deepfake Detectors [59.09338266364506]
This work systematically evaluates the robustness of 10 audio deepfake detection models against 16 common corruptions. Using both traditional deep learning models and state-of-the-art foundation models, we make four unique observations.
arXiv Detail & Related papers (2025-03-21T23:21:17Z)
"I am bad": Interpreting Stealthy, Universal and Robust Audio Jailbreaks in Audio-Language Models [0.9480364746270077]
This paper explores audio jailbreaks targeting Audio-Language Models (ALMs). We construct adversarial perturbations that generalize across prompts, tasks, and even base audio samples. We analyze how ALMs interpret these audio adversarial examples and reveal them to encode imperceptible first-person toxic speech.
arXiv Detail & Related papers (2025-02-02T08:36:23Z)
Can Large Audio-Language Models Truly Hear? Tackling Hallucinations with Multi-Task Assessment and Stepwise Audio Reasoning [55.2480439325792]
Large audio-language models (LALMs) have shown impressive capabilities in understanding and reasoning about audio and speech information. These models still face challenges, including hallucinating non-existent sound events, misidentifying the order of sound events, and incorrectly attributing sound sources.
arXiv Detail & Related papers (2024-10-21T15:55:27Z)
Multi-granular Adversarial Attacks against Black-box Neural Ranking Models [111.58315434849047]
We create high-quality adversarial examples by incorporating multi-granular perturbations. We transform the multi-granular attack into a sequential decision-making process. Our attack method surpasses prevailing baselines in both attack effectiveness and imperceptibility.
arXiv Detail & Related papers (2024-04-02T02:08:29Z)
Push-Pull: Characterizing the Adversarial Robustness for Audio-Visual Active Speaker Detection [88.74863771919445]
We reveal the vulnerability of AVASD models under audio-only, visual-only, and audio-visual adversarial attacks. We also propose a novel audio-visual interaction loss (AVIL) that makes it difficult for attackers to find feasible adversarial examples.
arXiv Detail & Related papers (2022-10-03T08:10:12Z)
Can audio-visual integration strengthen robustness under multimodal attacks? [47.791552254215745]
We use the audio-visual event recognition task against multimodal adversarial attacks as a proxy to investigate the robustness of audio-visual learning. We attack audio, visual, and both modalities to explore whether audio-visual integration still strengthens perception. For interpreting the multimodal interactions under attacks, we learn a weakly-supervised sound source visual localization model.
arXiv Detail & Related papers (2021-04-05T16:46:45Z)
