Transferable Adversarial Attacks against ASR
- URL: http://arxiv.org/abs/2411.09220v1
- Date: Thu, 14 Nov 2024 06:32:31 GMT
- Title: Transferable Adversarial Attacks against ASR
- Authors: Xiaoxue Gao, Zexin Li, Yiming Chen, Cong Liu, Haizhou Li
- Abstract summary: We study the vulnerability of practical black-box attacks in cutting-edge automatic speech recognition models.
We propose a speech-aware gradient optimization approach (SAGO) for ASR, which forces mistranscription while keeping the perturbation barely perceptible to human listeners.
Our comprehensive experiments show consistent gains over baseline attacks across five models on two databases.
- Score: 43.766547483367795
- Abstract: Given the extensive research and real-world applications of automatic speech recognition (ASR), ensuring the robustness of ASR models against minor input perturbations is crucial for maintaining their effectiveness in real-time scenarios. Previous explorations of ASR robustness have predominantly evaluated accuracy in white-box settings with full access to the ASR model. However, full model details are often unavailable in real-world applications, so evaluating the robustness of black-box ASR models is essential for a comprehensive understanding of their resilience. To this end, we thoroughly study the vulnerability of cutting-edge ASR models to practical black-box attacks and propose two advanced time-domain transferable attacks built on our differentiable feature extractor. We also propose a speech-aware gradient optimization approach (SAGO) for ASR, which forces mistranscription while keeping the perturbation barely perceptible to humans, using a voice activity detection rule and a speech-aware gradient-oriented optimizer. Comprehensive experiments show consistent gains over baseline attacks across five models on two databases.
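The abstract describes SAGO only at a high level. As a rough illustration of its central idea, confining adversarial gradient updates to voice-active regions, a minimal PyTorch sketch might look as follows; the energy-based VAD, step size, perturbation budget, and loss choice are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F

def energy_vad_mask(wave: torch.Tensor, frame: int = 400, thresh: float = 0.01) -> torch.Tensor:
    """Crude energy-based VAD: 1 where a frame's RMS energy exceeds thresh, else 0."""
    pad = (-wave.numel()) % frame
    frames = F.pad(wave, (0, pad)).view(-1, frame)
    active = (frames.pow(2).mean(dim=1).sqrt() > thresh).float()
    return active.repeat_interleave(frame)[: wave.numel()]

def sago_like_attack(model, wave, mistranscription_loss, steps=100, alpha=1e-4, eps=8e-3):
    """Iterative sign-gradient attack whose perturbation is confined to speech regions."""
    mask = energy_vad_mask(wave)                        # speech-aware constraint
    delta = torch.zeros_like(wave, requires_grad=True)
    for _ in range(steps):
        loss = mistranscription_loss(model(wave + delta * mask))
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign() * mask   # ascend: push toward a wrong transcript
            delta.clamp_(-eps, eps)                     # keep the perturbation small
        delta.grad.zero_()
    return (wave + delta.detach() * mask).clamp(-1.0, 1.0)
```

Here `mistranscription_loss` stands in for whatever objective the attacker maximizes, e.g. the CTC loss of the correct transcript.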
Related papers
- Enhanced ASR Robustness to Packet Loss with a Front-End Adaptation Network [23.034147003704483]
This study focuses on recovering from packet loss to reduce the word error rate (WER) of ASR models.
We propose using a front-end adaptation network connected to a frozen ASR model.
Experiments demonstrate that the adaptation network, trained on Whisper's criteria, notably reduces word error rates across domains and languages.
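The summary points to a common pattern: a small trainable front-end prepended to a frozen recognizer and trained with the frozen model's own criterion. A minimal sketch of that pattern (layer sizes and module names are illustrative, not from the paper):

```python
import torch
import torch.nn as nn

class FrontEndAdapter(nn.Module):
    """Small trainable network that repairs the waveform before a frozen ASR model."""
    def __init__(self, channels: int = 64, kernel: int = 9):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, channels, kernel, padding=kernel // 2), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel, padding=kernel // 2), nn.ReLU(),
            nn.Conv1d(channels, 1, kernel, padding=kernel // 2),
        )

    def forward(self, wave: torch.Tensor) -> torch.Tensor:
        # Residual connection: the adapter predicts a correction to the lossy input.
        return wave + self.net(wave.unsqueeze(1)).squeeze(1)

def training_step(adapter, frozen_asr_loss, lossy_wave, transcript, optimizer):
    """Backpropagate the frozen model's own loss through the adapter alone."""
    optimizer.zero_grad()
    loss = frozen_asr_loss(adapter(lossy_wave), transcript)  # ASR weights stay frozen
    loss.backward()
    optimizer.step()
    return loss.item()
```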
arXiv Detail & Related papers (2024-06-27T06:40:01Z)
- Exploiting Self-Supervised Constraints in Image Super-Resolution [72.35265021054471]
This paper introduces a novel self-supervised constraint for single image super-resolution, termed SSC-SR.
SSC-SR uniquely addresses the divergence in image complexity by employing a dual asymmetric paradigm and a target model updated via exponential moving average to enhance stability.
Empirical evaluations reveal that our SSC-SR framework delivers substantial enhancements on a variety of benchmark datasets, achieving an average increase of 0.1 dB over EDSR and 0.06 dB over SwinIR.
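The one concrete mechanism named here is the target model updated via exponential moving average. A minimal sketch of such an update (the decay value is an assumption):

```python
import torch

@torch.no_grad()
def ema_update(target_model, online_model, decay: float = 0.999):
    """target <- decay * target + (1 - decay) * online, parameter by parameter."""
    for t, o in zip(target_model.parameters(), online_model.parameters()):
        t.mul_(decay).add_(o, alpha=1.0 - decay)
```

Calling `ema_update(target, online)` after each optimizer step keeps a smoothed copy of the online model, which is what lends the scheme its stability.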
arXiv Detail & Related papers (2024-03-30T06:18:50Z)
- Speech Robust Bench: A Robustness Benchmark For Speech Recognition [20.758654420612793]
Speech Robust Bench (SRB) is a benchmark for evaluating the robustness of ASR models to diverse corruptions.
SRB comprises 114 input perturbations that simulate a heterogeneous range of corruptions ASR models may encounter when deployed in the wild.
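Operationally, a benchmark like this reduces to sweeping a perturbation set and measuring WER per corruption. A self-contained sketch of that evaluation loop (the `asr` and `perturbations` interfaces are placeholders, not SRB's actual API):

```python
def word_error_rate(ref: str, hyp: str) -> float:
    """Word-level Levenshtein distance, normalized by reference length."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)

def robustness_report(asr, dataset, perturbations):
    """Mean WER per perturbation: how much each corruption degrades the model."""
    report = {}
    for name, perturb in perturbations.items():
        errors = [word_error_rate(ref, asr(perturb(wave))) for wave, ref in dataset]
        report[name] = sum(errors) / len(errors)
    return report
```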
arXiv Detail & Related papers (2024-03-08T08:10:29Z)
- On the Efficacy and Noise-Robustness of Jointly Learned Speech Emotion and Automatic Speech Recognition [6.006652562747009]
We investigate a joint ASR-SER learning approach in a low-resource setting.
Joint learning can improve ASR word error rate (WER) and SER classification accuracy by 10.7% and 2.3%, respectively.
Overall, the joint ASR-SER approach yielded more noise-resistant models than the independent ASR and SER approaches.
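A common way to realize such joint learning is a shared encoder with a CTC head for ASR and a classification head for SER; the sketch below assumes that layout and an equal loss weighting, neither of which is confirmed by the summary.

```python
import torch
import torch.nn as nn

class JointASRSER(nn.Module):
    """Shared encoder feeding a CTC head (ASR) and a classifier head (SER)."""
    def __init__(self, encoder: nn.Module, dim: int, vocab_size: int, n_emotions: int):
        super().__init__()
        self.encoder = encoder
        self.asr_head = nn.Linear(dim, vocab_size)   # per-frame token logits for CTC
        self.ser_head = nn.Linear(dim, n_emotions)   # utterance-level emotion logits

    def forward(self, feats: torch.Tensor):
        h = self.encoder(feats)                      # (batch, time, dim)
        return self.asr_head(h), self.ser_head(h.mean(dim=1))

# Joint objective, with an assumed 50/50 weighting:
# loss = 0.5 * ctc_loss(asr_logits, ...) + 0.5 * cross_entropy(ser_logits, emotion_labels)
```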
arXiv Detail & Related papers (2023-05-21T18:52:21Z)
- On the Robustness of Aspect-based Sentiment Analysis: Rethinking Model, Data, and Training [109.9218185711916]
Aspect-based sentiment analysis (ABSA) aims to automatically infer the sentiment polarity toward specific aspects of products or services mentioned in social media texts or reviews.
We propose to enhance the ABSA robustness by systematically rethinking the bottlenecks from all possible angles, including model, data, and training.
arXiv Detail & Related papers (2023-04-19T11:07:43Z)
- End-to-End Speech Recognition: A Survey [68.35707678386949]
The goal of this survey is to provide a taxonomy of E2E ASR models and corresponding improvements.
All relevant aspects of E2E ASR are covered in this work, accompanied by discussions of performance and deployment opportunities.
arXiv Detail & Related papers (2023-03-03T01:46:41Z)
- Watch What You Pretrain For: Targeted, Transferable Adversarial Examples on Self-Supervised Speech Recognition models [27.414693266500603]
A targeted adversarial attack produces audio samples that can force an Automatic Speech Recognition system to output attacker-chosen text.
Recent work has shown that transferability against large ASR models is very difficult.
We show that modern ASR architectures, specifically ones based on Self-Supervised Learning, are in fact vulnerable to transferable attacks.
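The summary gives no attack procedure; a generic sketch of a targeted transfer attack, averaging gradients over several surrogate models and descending toward an attacker-chosen transcript, might look like this (the surrogate-ensemble recipe and step sizes are assumptions):

```python
import torch

def targeted_transfer_attack(surrogates, wave, target_loss, steps=200, alpha=5e-5, eps=5e-3):
    """Craft one perturbation that pushes several surrogate ASR models toward the
    attacker-chosen transcript, in the hope it transfers to an unseen victim."""
    delta = torch.zeros_like(wave, requires_grad=True)
    for _ in range(steps):
        loss = sum(target_loss(m(wave + delta)) for m in surrogates) / len(surrogates)
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()  # descend: move toward the target text
            delta.clamp_(-eps, eps)
        delta.grad.zero_()
    return (wave + delta.detach()).clamp(-1.0, 1.0)
```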
arXiv Detail & Related papers (2022-09-17T15:01:26Z)
- Fine-tuning of Pre-trained End-to-end Speech Recognition with Generative Adversarial Networks [10.723935272906461]
Adversarial training of end-to-end (E2E) ASR systems using generative adversarial networks (GANs) has recently been explored.
We introduce a novel framework for fine-tuning a pre-trained ASR model using the GAN objective.
Our proposed approach outperforms baselines and conventional GAN-based adversarial models.
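One plausible reading of this setup, sketched below with illustrative interfaces: alternate a discriminator step that separates reference transcripts from ASR hypotheses with a generator step that fine-tunes the pre-trained recognizer against both its supervised loss and the adversarial term.

```python
import torch

def gan_finetune_step(asr, disc, wave, transcript, opt_g, opt_d, asr_loss, bce):
    """One alternating GAN update around a pre-trained ASR model."""
    hyp = asr(wave)  # model output, e.g. per-frame token distributions
    # Discriminator: reference transcripts -> 1, ASR hypotheses -> 0.
    real, fake = disc(transcript), disc(hyp.detach())
    d_loss = bce(real, torch.ones_like(real)) + bce(fake, torch.zeros_like(fake))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()
    # Generator (the ASR model): supervised loss plus a term for fooling the critic.
    g_loss = asr_loss(hyp, transcript) + bce(disc(hyp), torch.ones_like(fake))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```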
arXiv Detail & Related papers (2021-03-10T17:40:48Z)
- Dual-mode ASR: Unify and Improve Streaming ASR with Full-context Modeling [76.43479696760996]
We propose a unified framework, Dual-mode ASR, to train a single end-to-end ASR model with shared weights for both streaming and full-context speech recognition.
We show that the latency and accuracy of streaming ASR benefit significantly from weight sharing and joint training with full-context ASR.
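In practice, weight sharing across the two modes often comes down to one encoder that toggles a causal attention mask; a minimal sketch of such a layer follows (dimensions are illustrative). During joint training, each batch would typically be run in both modes and the two losses summed.

```python
import torch
import torch.nn as nn

class DualModeEncoderLayer(nn.Module):
    """Self-attention layer whose weights serve both a streaming (causal-masked)
    mode and a full-context mode."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, streaming: bool) -> torch.Tensor:
        mask = None
        if streaming:  # causal mask: each frame attends only to the past
            t = x.size(1)
            mask = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), diagonal=1)
        a, _ = self.attn(x, x, x, attn_mask=mask)
        x = self.norm1(x + a)
        return self.norm2(x + self.ff(x))
```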
arXiv Detail & Related papers (2020-10-12T21:12:56Z)
- Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches that perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and downstream LU systems can be reduced by a significant 14% relative with joint models trained on small amounts of in-domain data.
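A plausible shape for such a joint model is a shared encoder over the ASR hypothesis with a token-level correction head and an utterance-level understanding head; the sketch below assumes that layout (the LSTM and head choices are illustrative, not from the paper).

```python
import torch
import torch.nn as nn

class CorrectionLUModel(nn.Module):
    """Multi-task model over ASR output tokens: one head rewrites tokens
    (correction), the other predicts the utterance intent (understanding)."""
    def __init__(self, vocab_size: int, n_intents: int, dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)
        self.correct_head = nn.Linear(2 * dim, vocab_size)  # per-token corrected word
        self.intent_head = nn.Linear(2 * dim, n_intents)    # utterance-level intent

    def forward(self, asr_tokens: torch.Tensor):
        h, _ = self.encoder(self.embed(asr_tokens))         # (batch, seq, 2 * dim)
        return self.correct_head(h), self.intent_head(h.mean(dim=1))
```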
arXiv Detail & Related papers (2020-01-28T22:09:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.