Boosting the Transferability of Audio Adversarial Examples with Acoustic Representation Optimization
- URL: http://arxiv.org/abs/2503.19591v1
- Date: Tue, 25 Mar 2025 12:14:10 GMT
- Title: Boosting the Transferability of Audio Adversarial Examples with Acoustic Representation Optimization
- Authors: Weifei Jin, Junjie Su, Hejia Wang, Yulin Ye, Jie Hao
- Abstract summary: We propose a technique that aligns adversarial perturbations with low-level acoustic characteristics derived from speech representation models. Our method is plug-and-play and can be integrated with any existing attack method.
- Score: 4.720552406377147
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the widespread application of automatic speech recognition (ASR) systems, their vulnerability to adversarial attacks has been extensively studied. However, most existing adversarial examples are generated on specific individual models, resulting in a lack of transferability. In real-world scenarios, attackers often cannot access detailed information about the target model, making query-based attacks infeasible. To address this challenge, we propose a technique called Acoustic Representation Optimization that aligns adversarial perturbations with low-level acoustic characteristics derived from speech representation models. Rather than relying on model-specific, higher-layer abstractions, our approach leverages fundamental acoustic representations that remain consistent across diverse ASR architectures. By enforcing an acoustic representation loss to guide perturbations toward these robust, lower-level representations, we enhance the cross-model transferability of adversarial examples without degrading audio quality. Our method is plug-and-play and can be integrated with any existing attack method. We evaluate our approach on three modern ASR models, and the experimental results demonstrate that our method significantly improves the transferability of adversarial examples generated by previous methods while preserving the audio quality.
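As a rough illustration of how an acoustic representation loss might be attached to an existing attack objective, the Python sketch below adds a weighted feature-matching term to a scalar attack loss. Everything named here is hypothetical and not taken from the paper: `frame_log_energy` is a toy stand-in for the low-level features of a speech representation model, the mean-squared form of `representation_loss`, the choice of `reference` waveform, and the `weight` value are all assumptions made so the example runs without any model weights.

```python
import math
from typing import Callable, List, Sequence

Waveform = Sequence[float]
Features = List[float]


def frame_log_energy(audio: Waveform, frame_len: int = 4) -> Features:
    """Toy low-level feature (assumption): log energy of each non-overlapping frame.

    A real setup would instead read early-layer features from a
    speech representation model.
    """
    feats = []
    for start in range(0, len(audio) - frame_len + 1, frame_len):
        frame = audio[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        feats.append(math.log(energy + 1e-8))  # small constant avoids log(0)
    return feats


def representation_loss(
    adv: Waveform,
    reference: Waveform,
    extractor: Callable[[Waveform], Features] = frame_log_energy,
) -> float:
    """Mean squared distance between the two feature sequences (assumed form)."""
    feats_adv = extractor(adv)
    feats_ref = extractor(reference)
    if not feats_adv or len(feats_adv) != len(feats_ref):
        raise ValueError("feature sequences must be non-empty and equal in length")
    squared = [(a - r) ** 2 for a, r in zip(feats_adv, feats_ref)]
    return sum(squared) / len(squared)


def combined_objective(
    attack_loss: float,
    adv: Waveform,
    reference: Waveform,
    weight: float = 0.5,
) -> float:
    """Plug-in style objective: existing attack loss plus a weighted feature term."""
    return attack_loss + weight * representation_loss(adv, reference)


if __name__ == "__main__":
    reference = [0.1, -0.2, 0.3, -0.1, 0.2, 0.0, -0.3, 0.1]
    adversarial = [2.0 * x for x in reference]
    print(combined_objective(1.0, adversarial, reference))
```

Because the extra term is only added to whatever scalar loss an attack already minimizes, the base attack does not need to change, which is one way to read the "plug-and-play" claim in the abstract.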
Related papers
- $C^2$AV-TSE: Context and Confidence-aware Audio Visual Target Speaker Extraction [80.57232374640911]
We propose a model-agnostic strategy called Mask-And-Recover (MAR).
MAR integrates both inter- and intra-modality contextual correlations to enable global inference within extraction modules.
To better target challenging parts within each sample, we introduce a Fine-grained Confidence Score (FCS) model.
arXiv Detail & Related papers (2025-04-01T13:01:30Z)
- Adversarial Transferability in Deep Denoising Models: Theoretical Insights and Robustness Enhancement via Out-of-Distribution Typical Set Sampling [6.189440665620872]
Deep learning-based image denoising models demonstrate remarkable performance, but their lack of robustness analysis remains a significant concern. A major issue is that these models are susceptible to adversarial attacks, where small, carefully crafted perturbations to input data can cause them to fail. We propose a novel adversarial defense method: the Out-of-Distribution Typical Set Sampling Training strategy.
arXiv Detail & Related papers (2024-12-08T13:47:57Z)
- Robust VAEs via Generating Process of Noise Augmented Data [9.366139389037489]
This paper introduces a novel framework that enhances robustness by regularizing the latent space divergence between original and noise-augmented data.
Our empirical evaluations demonstrate that this approach, termed Robust Augmented Variational Auto-ENcoder (RAVEN), yields superior performance in resisting adversarial inputs.
arXiv Detail & Related papers (2024-07-26T09:55:34Z)
- Advancing Test-Time Adaptation in Wild Acoustic Test Settings [26.05732574338255]
Speech signals follow short-term consistency, requiring specialized adaptation strategies.
We propose a novel wild acoustic TTA method tailored for ASR fine-tuned acoustic foundation models.
Our approach outperforms existing baselines under various wild acoustic test settings.
arXiv Detail & Related papers (2023-10-14T06:22:08Z)
- High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
Non-autoregressive framework enhances controllability, and duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z)
- DDTSE: Discriminative Diffusion Model for Target Speech Extraction [62.422291953387955]
We introduce the Discriminative Diffusion model for Target Speech Extraction (DDTSE).
We apply the same forward process as diffusion models and utilize the reconstruction loss similar to discriminative methods.
We devise a two-stage training strategy to emulate the inference process during model training.
arXiv Detail & Related papers (2023-09-25T04:58:38Z)
- Universal Adversarial Defense in Remote Sensing Based on Pre-trained Denoising Diffusion Models [17.283914361697818]
Deep neural networks (DNNs) have risen to prominence as key solutions in numerous AI applications for earth observation (AI4EO).
This paper presents a novel Universal Adversarial Defense approach in Remote Sensing Imagery (UAD-RS).
arXiv Detail & Related papers (2023-07-31T17:21:23Z)
- Self-attention fusion for audiovisual emotion recognition with incomplete data [103.70855797025689]
We consider the problem of multimodal data analysis with a use case of audiovisual emotion recognition.
We propose an architecture capable of learning from raw data and describe three variants of it with distinct modality fusion mechanisms.
arXiv Detail & Related papers (2022-01-26T18:04:29Z)
- Discretization and Re-synthesis: an alternative method to solve the Cocktail Party Problem [65.25725367771075]
This study demonstrates, for the first time, that the synthesis-based approach can also perform well on this problem.
Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols.
By utilizing the synthesis model with the input of discrete symbols, after the prediction of discrete symbol sequence, each target speech could be re-synthesized.
arXiv Detail & Related papers (2021-12-17T08:35:40Z)
- Generalized Real-World Super-Resolution through Adversarial Robustness [107.02188934602802]
We present Robust Super-Resolution, a method that leverages the generalization capability of adversarial attacks to tackle real-world SR.
Our novel framework poses a paradigm shift in the development of real-world SR methods.
By using a single robust model, we outperform state-of-the-art specialized methods on real-world benchmarks.
arXiv Detail & Related papers (2021-08-25T22:43:20Z)
- Towards Robust Speech-to-Text Adversarial Attack [78.5097679815944]
This paper introduces a novel adversarial algorithm for attacking the state-of-the-art speech-to-text systems, namely DeepSpeech, Kaldi, and Lingvo.
Our approach is based on developing an extension for the conventional distortion condition of the adversarial optimization formulation.
Minimizing over this metric, which measures the discrepancies between original and adversarial samples' distributions, contributes to crafting signals very close to the subspace of legitimate speech recordings.
arXiv Detail & Related papers (2021-03-15T01:51:41Z)