ClearMask: Noise-Free and Naturalness-Preserving Protection Against Voice Deepfake Attacks
- URL: http://arxiv.org/abs/2508.17660v1
- Date: Mon, 25 Aug 2025 04:46:35 GMT
- Title: ClearMask: Noise-Free and Naturalness-Preserving Protection Against Voice Deepfake Attacks
- Authors: Yuanda Wang, Bocheng Chen, Hanqing Guo, Guangjing Wang, Weikang Ding, Qiben Yan
- Abstract summary: We propose ClearMask, a noise-free defense mechanism against voice deepfake attacks. Unlike traditional approaches, ClearMask modifies the audio mel-spectrogram by selectively filtering certain frequencies. We also develop LiveMask to protect streaming speech in real-time through a universal frequency filter and reverberation generator.
- Score: 10.688345955363973
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Voice deepfake attacks, which artificially impersonate human speech for malicious purposes, have emerged as a severe threat. Existing defenses typically inject noise into human speech to compromise voice encoders in speech synthesis models. However, these methods degrade audio quality and require prior knowledge of the attack approaches, limiting their effectiveness in diverse scenarios. Moreover, real-time audios, such as speech in virtual meetings and voice messages, are still exposed to voice deepfake threats. To overcome these limitations, we propose ClearMask, a noise-free defense mechanism against voice deepfake attacks. Unlike traditional approaches, ClearMask modifies the audio mel-spectrogram by selectively filtering certain frequencies, inducing a transferable voice feature loss without injecting noise. We then apply audio style transfer to further deceive voice decoders while preserving perceived sound quality. Finally, optimized reverberation is introduced to disrupt the output of voice generation models without affecting the naturalness of the speech. Additionally, we develop LiveMask to protect streaming speech in real-time through a universal frequency filter and reverberation generator. Our experimental results show that ClearMask and LiveMask effectively prevent voice deepfake attacks from deceiving speaker verification models and human listeners, even for unseen voice synthesis models and black-box API services. Furthermore, ClearMask demonstrates resilience against adaptive attackers who attempt to recover the original audio signal from the protected speech samples.
Related papers
- VoxMorph: Scalable Zero-shot Voice Identity Morphing via Disentangled Embeddings [2.5925656171325127]
We propose VoxMorph, a framework that produces high-fidelity voice morphs from as little as five seconds of audio per subject without model retraining. VoxMorph achieves state-of-the-art performance, delivering a 2.6x gain in audio quality, a 73% reduction in intelligibility errors, and a 67.8% morphing attack success rate on automated speaker verification systems.
arXiv Detail & Related papers (2026-01-27T19:45:18Z) - FreeTalk: A plug-and-play and black-box defense against speech synthesis attacks [40.22853425929116]
We propose a lightweight, robust, plug-and-play privacy preservation method against speech synthesis attacks. Our method generates and adds a frequency-domain perturbation to the original speech to achieve privacy protection and high speech quality.
arXiv Detail & Related papers (2025-08-30T17:10:22Z) - On the Generation and Removal of Speaker Adversarial Perturbation for Voice-Privacy Protection [45.49915832081347]
Recent developments in voice-privacy protection have shown positive use cases of the same technique to conceal a speaker's voice attributes. This paper examines the reversibility property, where an entity generating adversarial perturbations is authorized to remove them and restore the original speech. A similar technique could also be used by an investigator to deanonymize voice-protected speech and restore criminals' identities in security and forensic analysis.
arXiv Detail & Related papers (2024-12-12T11:46:07Z) - Muting Whisper: A Universal Acoustic Adversarial Attack on Speech Foundation Models [5.942307521138583]
We show that special tokens can be exploited by adversarial attacks to manipulate the model's behavior.
We propose a simple yet effective method to learn a universal acoustic realization of Whisper's $\texttt{<|endoftext|>}$ token.
Experiments demonstrate that the same, universal 0.64-second adversarial audio segment can successfully mute a target Whisper ASR model for over 97% of speech samples.
arXiv Detail & Related papers (2024-05-09T22:59:23Z) - SpeechX: Neural Codec Language Model as a Versatile Speech Transformer [57.82364057872905]
SpeechX is a versatile speech generation model capable of zero-shot TTS and various speech transformation tasks.
Experimental results show SpeechX's efficacy in various tasks, including zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing with or without background noise.
arXiv Detail & Related papers (2023-08-14T01:01:19Z) - LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders [53.30016986953206]
We propose LA-VocE, a new two-stage approach that predicts mel-spectrograms from noisy audio-visual speech via a transformer-based architecture.
We train and evaluate our framework on thousands of speakers and 11+ different languages, and study our model's ability to adapt to different levels of background noise and speech interference.
arXiv Detail & Related papers (2022-11-20T15:27:55Z) - SceneFake: An Initial Dataset and Benchmarks for Scene Fake Audio Detection [54.74467470358476]
This paper proposes a dataset for scene fake audio detection named SceneFake.
A manipulated audio is generated by only tampering with the acoustic scene of an original audio.
Some scene fake audio detection benchmark results on the SceneFake dataset are reported in this paper.
arXiv Detail & Related papers (2022-11-11T09:05:50Z) - Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z) - Audio Adversarial Examples: Attacks Using Vocal Masks [0.0]
We construct audio adversarial examples on automatic Speech-To-Text systems.
We produce these by overlaying an audio vocal mask generated from the original audio.
We apply our audio adversarial attack to five SOTA STT systems: DeepSpeech, Julius, Kaldi, wav2letter@anywhere and CMUSphinx.
arXiv Detail & Related papers (2021-02-04T05:21:10Z) - Cortical Features for Defense Against Adversarial Audio Attacks [55.61885805423492]
We propose using a computational model of the auditory cortex as a defense against adversarial attacks on audio.
We show that the cortical features help defend against universal adversarial examples.
arXiv Detail & Related papers (2021-01-30T21:21:46Z) - VoiceFilter-Lite: Streaming Targeted Voice Separation for On-Device Speech Recognition [60.462770498366524]
We introduce VoiceFilter-Lite, a single-channel source separation model that runs on the device to preserve only the speech signals from a target user.
We show that such a model can be quantized as an 8-bit integer model and run in real time.
arXiv Detail & Related papers (2020-09-09T14:26:56Z)
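The VoiceFilter-Lite entry above mentions quantizing the separation model to 8-bit integers for on-device real-time use. As a toy illustration only (the paper uses a full quantized-inference stack; this sketch shows just the symmetric per-tensor weight mapping, with all names being assumptions of this sketch):

```python
import numpy as np

def quantize_int8(w):
    """Map float weights to int8 with a symmetric per-tensor scale."""
    m = float(np.max(np.abs(w)))
    scale = m / 127.0 if m > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float tensor from int8 codes."""
    return q.astype(np.float32) * scale

# Round-trip a small weight vector through int8.
w = np.array([-0.5, 0.0, 0.25, 0.5], dtype=np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
```

The design point is that storing weights as int8 cuts memory by 4x versus float32 and enables integer arithmetic on mobile hardware, at the cost of a small, bounded rounding error per weight.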
This list is automatically generated from the titles and abstracts of the papers in this site.