Whispering Under the Eaves: Protecting User Privacy Against Commercial and LLM-powered Automatic Speech Recognition Systems
- URL: http://arxiv.org/abs/2504.00858v1
- Date: Tue, 01 Apr 2025 14:49:39 GMT
- Title: Whispering Under the Eaves: Protecting User Privacy Against Commercial and LLM-powered Automatic Speech Recognition Systems
- Authors: Weifei Jin, Yuxin Cao, Junjie Su, Derui Wang, Yedi Zhang, Minhui Xue, Jie Hao, Jin Song Dong, Yixian Yang
- Abstract summary: We propose a novel framework, AudioShield, to protect speech against automatic speech recognition systems. By transferring the perturbations to the latent space, the audio quality is preserved to a large extent. AudioShield shows high effectiveness in real-time end-to-end scenarios, and demonstrates strong resilience against adaptive countermeasures.
- Score: 20.45938874279563
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The widespread application of automatic speech recognition (ASR) supports large-scale voice surveillance, raising privacy concerns among users. In this paper, we focus on using adversarial examples to thwart the unauthorized disclosure of speech privacy to potential eavesdroppers in speech communications. While audio adversarial examples have demonstrated the capability to mislead ASR models or evade ASR surveillance, they are typically constructed through time-intensive offline optimization, restricting their practicality in real-time voice communication. Recent work overcame this limitation by generating universal adversarial perturbations (UAPs) and enhancing their transferability for black-box scenarios. However, these methods introduce excessive noise that significantly degrades audio quality and affects human perception, thereby limiting their effectiveness in practical scenarios. To address this limitation and protect live users' speech against ASR systems, we propose a novel framework, AudioShield. Central to this framework is the concept of Transferable Universal Adversarial Perturbations in the Latent Space (LS-TUAP). By transferring the perturbations to the latent space, the audio quality is preserved to a large extent. Additionally, we propose target feature adaptation to enhance the transferability of UAPs by embedding target text features into the perturbations. A comprehensive evaluation on four commercial ASR APIs (Google, Amazon, iFlytek, and Alibaba), three voice assistants, two LLM-powered ASR systems, and one NN-based ASR system demonstrates that AudioShield offers superior protection over existing competitors, and both objective and subjective evaluations indicate that AudioShield significantly improves the audio quality. Moreover, AudioShield shows high effectiveness in real-time end-to-end scenarios and demonstrates strong resilience against adaptive countermeasures.
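For illustration only, a minimal PyTorch sketch of the LS-TUAP idea described above: optimize a single universal perturbation in the latent space of an audio autoencoder, with an extra term pulling it toward target text features ("target feature adaptation"). The `encoder`, `decoder`, `surrogate_asr`, and `target_text_feats` objects are hypothetical placeholders, not the authors' implementation.

```python
import torch

def train_ls_tuap(encoder, decoder, surrogate_asr, target_text_feats,
                  speech_batches, steps=1000, lr=1e-3, alpha=0.1):
    """Optimize one universal perturbation in an autoencoder's latent space.

    encoder/decoder, surrogate_asr, and target_text_feats are placeholders:
    a pretrained audio autoencoder, a differentiable ASR surrogate used for
    transferability, and an embedding of the target transcription.
    """
    delta = torch.zeros_like(target_text_feats, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)

    for _ in range(steps):
        for wav, target_text in speech_batches:
            z = encoder(wav)                  # speech -> latent space
            adv_wav = decoder(z + delta)      # perturb in the latent space only

            # Steer the surrogate ASR toward the target transcription ...
            asr_loss = surrogate_asr.loss(adv_wav, target_text)
            # ... and pull the perturbation toward the target text features.
            feat_loss = torch.nn.functional.mse_loss(delta, target_text_feats)

            loss = asr_loss + alpha * feat_loss
            opt.zero_grad()
            loss.backward()
            opt.step()

    # At run time the fixed delta would be applied as decoder(encoder(wav) + delta).
    return delta.detach()
```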
Related papers
- SafeSpeech: Robust and Universal Voice Protection Against Malicious Speech Synthesis [8.590034271906289]
Speech synthesis technology has brought great convenience, but the widespread use of realistic deepfake audio has also introduced hazards.
Malicious adversaries may collect victims' speech without authorization and clone a similar voice for illegal exploitation.
We propose a framework, SafeSpeech, which protects users' audio before uploading by embedding imperceptible perturbations into the original speech.
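A rough sketch of the general "perturb before upload" idea (not SafeSpeech's actual objective): bound an additive waveform perturbation by a small L-infinity budget while maximizing the loss of a placeholder surrogate model, so audio cloned from the protected recording degrades.

```python
import torch

def protect_before_upload(wav, surrogate_model, eps=0.002, steps=50, lr=1e-3):
    """Embed an imperceptible perturbation into a waveform (illustrative sketch).

    surrogate_model is a placeholder whose loss is maximized; eps caps the
    perturbation so the edit stays inaudible.
    """
    delta = torch.zeros_like(wav, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        loss = -surrogate_model.loss(wav + delta)   # maximize the surrogate's loss
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)                 # enforce the imperceptibility budget
    return (wav + delta).detach()
```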
arXiv Detail & Related papers (2025-04-14T03:21:23Z)
- Enhancing Audiovisual Speech Recognition through Bifocal Preference Optimization [59.1277150358203]
We propose using a preference optimization strategy to improve speech recognition accuracy for real-world videos. First, we create preference data by simulating common errors that occur in AV-ASR from two focal perspectives. Second, we propose BPO-AVASR, a Bifocal Preference Optimization method that improves AV-ASR models by leveraging both input-side and output-side preferences.
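As a hedged illustration of preference optimization over transcriptions, a generic DPO-style loss is sketched below; BPO-AVASR's exact bifocal formulation may differ, and all argument names are placeholders.

```python
import torch.nn.functional as F

def preference_loss(logp_preferred, logp_rejected,
                    ref_logp_preferred, ref_logp_rejected, beta=0.1):
    """Generic DPO-style preference loss (a stand-in, not BPO-AVASR's exact form).

    Inputs are sequence log-probabilities of a preferred (correct) and a
    rejected (simulated-error) transcription under the trained model and a
    frozen reference model.
    """
    margin = (logp_preferred - ref_logp_preferred) - (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(beta * margin).mean()
```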
arXiv Detail & Related papers (2024-12-26T00:26:45Z)
- Towards Evaluating the Robustness of Automatic Speech Recognition Systems via Audio Style Transfer [8.948537516293328]
We propose an attack on Automatic Speech Recognition (ASR) systems based on user-customized style transfer.
Our method meets the need for user-customized styles and achieves an attack success rate of 82%.
arXiv Detail & Related papers (2024-05-15T16:05:24Z)
- High-Quality Automatic Voice Over with Accurate Alignment: Supervision through Self-Supervised Discrete Speech Units [69.06657692891447]
We propose a novel AVO method leveraging the learning objective of self-supervised discrete speech unit prediction.
Experimental results show that our proposed method achieves remarkable lip-speech synchronization and high speech quality.
arXiv Detail & Related papers (2023-06-29T15:02:22Z)
- Improving Fairness and Robustness in End-to-End Speech Recognition through unsupervised clustering [49.069298478971696]
We present a privacy-preserving approach to improve the fairness and robustness of end-to-end ASR.
We extract utterance level embeddings using a speaker ID model trained on a public dataset.
We use cluster IDs instead of speaker utterance embeddings as extra features during model training.
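A rough illustration of the described recipe using scikit-learn; the clustering algorithm, cluster count, and synthetic embeddings are assumptions rather than the paper's configuration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Utterance-level embeddings from a speaker-ID model trained on a public
# dataset (random placeholders here), shape (num_utterances, dim).
utterance_embeddings = np.random.randn(10_000, 256)

# Cluster the embeddings; only the discrete cluster ID is kept as a feature,
# so the ASR model never sees the raw, more identifying speaker embedding.
kmeans = KMeans(n_clusters=100, random_state=0).fit(utterance_embeddings)
cluster_ids = kmeans.labels_

def cluster_feature(embedding: np.ndarray) -> int:
    """Map a new utterance embedding to its cluster ID for use as an extra
    input feature during ASR training or inference."""
    return int(kmeans.predict(embedding.reshape(1, -1))[0])
```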
arXiv Detail & Related papers (2023-06-06T21:13:08Z)
- Privacy against Real-Time Speech Emotion Detection via Acoustic Adversarial Evasion of Machine Learning [7.387631194438338]
DARE-GP is a solution that creates additive noise to mask users' emotional information while preserving the transcription-relevant portions of their speech.
Unlike existing works, DARE-GP provides real-time protection of previously unheard utterances against previously unseen black-box SER classifiers, while preserving speech transcription and operating in a realistic acoustic environment.
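A schematic of the kind of fitness such a search could use: reward fooling a speech emotion recognition (SER) classifier while penalizing transcription drift. The interfaces and weighting are placeholders, not DARE-GP's actual genetic-programming setup.

```python
from jiwer import wer

def fitness(noise, speech, ser_model, asr_model, clean_transcript, beta=1.0):
    """Score one candidate additive perturbation (illustrative only).

    ser_model.predict_emotion and asr_model.transcribe are placeholder
    interfaces; the metric and weighting are assumptions.
    """
    protected = speech + noise

    # Emotion privacy: reward flipping the SER classifier's prediction.
    emotion_fooled = float(ser_model.predict_emotion(protected)
                           != ser_model.predict_emotion(speech))

    # Utility: penalize any drift in the transcription of the protected audio.
    transcription_drift = wer(clean_transcript, asr_model.transcribe(protected))

    return emotion_fooled - beta * transcription_drift
```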
arXiv Detail & Related papers (2022-11-17T00:25:05Z)
- Push-Pull: Characterizing the Adversarial Robustness for Audio-Visual Active Speaker Detection [88.74863771919445]
We reveal the vulnerability of AVASD models under audio-only, visual-only, and audio-visual adversarial attacks.
We also propose a novel audio-visual interaction loss (AVIL) to make it difficult for attackers to find feasible adversarial examples.
arXiv Detail & Related papers (2022-10-03T08:10:12Z)
- Dual Causal/Non-Causal Self-Attention for Streaming End-to-End Speech Recognition [58.69803243323346]
Attention-based end-to-end automatic speech recognition (ASR) systems have recently demonstrated state-of-the-art results for numerous tasks.
However, the application of self-attention and attention-based encoder-decoder models remains challenging for streaming ASR.
We present the dual causal/non-causal self-attention architecture, which, in contrast to restricted self-attention, prevents the overall context from growing beyond the look-ahead of a single layer.
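For intuition only, a small sketch of the masks involved: restricted self-attention allows a per-layer look-ahead that accumulates across layers, whereas a causal mask never looks ahead; the dual architecture maintains both streams so the total look-ahead stays bounded. This is not the paper's implementation, just the masking idea.

```python
import torch

def causal_mask(T: int) -> torch.Tensor:
    """True where attention is allowed: query position t may attend to keys <= t."""
    return torch.tril(torch.ones(T, T, dtype=torch.bool))

def restricted_mask(T: int, lookahead: int) -> torch.Tensor:
    """Restricted self-attention: position t may attend up to t + lookahead.
    Stacking L such layers yields an effective look-ahead of L * lookahead,
    which is the growth the dual causal/non-causal design avoids."""
    idx = torch.arange(T)
    return (idx.unsqueeze(0) - idx.unsqueeze(1)) <= lookahead

# Example: 6 frames, per-layer look-ahead of 2 future frames.
print(causal_mask(6).int())
print(restricted_mask(6, lookahead=2).int())
```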
arXiv Detail & Related papers (2021-07-02T20:56:13Z)
- Configurable Privacy-Preserving Automatic Speech Recognition [5.730142956540673]
We investigate whether modular automatic speech recognition can improve privacy in voice assistive systems.
We examine privacy concerns and the effects of applying various state-of-the-art techniques to each stage of the system.
We argue this presents new opportunities for privacy-preserving applications incorporating ASR.
arXiv Detail & Related papers (2021-04-01T21:03:49Z)
- WaveGuard: Understanding and Mitigating Audio Adversarial Examples [12.010555227327743]
We introduce WaveGuard: a framework for detecting adversarial inputs crafted to attack ASR systems.
Our framework incorporates audio transformation functions and analyses the ASR transcriptions of the original and transformed audio to detect adversarial inputs.
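A minimal sketch of this detection recipe: apply a transformation that barely affects natural speech, transcribe both versions, and flag inputs whose transcriptions diverge. The transformation, threshold, and `asr` interface are assumptions, not WaveGuard's exact configuration.

```python
import librosa
from jiwer import cer

def is_adversarial(wav, sr, asr, threshold=0.5):
    """Flag an input as adversarial if a benign-preserving transformation
    changes its transcription far more than it would for natural speech."""
    # Example transformation: downsample and restore (a lossy, ASR-benign edit).
    transformed = librosa.resample(
        librosa.resample(wav, orig_sr=sr, target_sr=sr // 2),
        orig_sr=sr // 2, target_sr=sr)

    original_text = asr.transcribe(wav)          # placeholder ASR interface
    transformed_text = asr.transcribe(transformed)

    # A large character error rate between the two transcriptions suggests the
    # input relied on a fragile adversarial perturbation.
    return cer(original_text, transformed_text) > threshold
```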
arXiv Detail & Related papers (2021-03-04T21:44:37Z)
- Characterizing Speech Adversarial Examples Using Self-Attention U-Net Enhancement [102.48582597586233]
We present a U-Net-based attention model, U-Net$_{At}$, to enhance adversarial speech signals.
We conduct experiments on the automatic speech recognition (ASR) task with adversarial audio attacks.
arXiv Detail & Related papers (2020-03-31T02:16:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.