RoVo: Robust Voice Protection Against Unauthorized Speech Synthesis with Embedding-Level Perturbations
- URL: http://arxiv.org/abs/2505.12686v1
- Date: Mon, 19 May 2025 04:14:58 GMT
- Title: RoVo: Robust Voice Protection Against Unauthorized Speech Synthesis with Embedding-Level Perturbations
- Authors: Seungmin Kim, Sohee Park, Donghyun Kim, Jisu Lee, Daeseon Choi,
- Abstract summary: We propose RoVo, a novel proactive defense technique that injects adversarial perturbations into high-dimensional embedding vectors of audio signals.<n>RoVo effectively defends against speech synthesis attacks and also provides strong resistance to speech enhancement models.<n>A user study confirmed that RoVo preserves both naturalness and usability of protected speech.
- Score: 5.777711921986914
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the advancement of AI-based speech synthesis technologies such as Deep Voice, there is an increasing risk of voice spoofing attacks, including voice phishing and fake news, through unauthorized use of others' voices. Existing defenses that inject adversarial perturbations directly into audio signals have limited effectiveness, as these perturbations can easily be neutralized by speech enhancement methods. To overcome this limitation, we propose RoVo (Robust Voice), a novel proactive defense technique that injects adversarial perturbations into high-dimensional embedding vectors of audio signals, reconstructing them into protected speech. This approach effectively defends against speech synthesis attacks and also provides strong resistance to speech enhancement models, which represent a secondary attack threat. In extensive experiments, RoVo increased the Defense Success Rate (DSR) by over 70% compared to unprotected speech, across four state-of-the-art speech synthesis models. Specifically, RoVo achieved a DSR of 99.5% on a commercial speaker-verification API, effectively neutralizing speech synthesis attack. Moreover, RoVo's perturbations remained robust even under strong speech enhancement conditions, outperforming traditional methods. A user study confirmed that RoVo preserves both naturalness and usability of protected speech, highlighting its effectiveness in complex and evolving threat scenarios.
Related papers
- De-AntiFake: Rethinking the Protective Perturbations Against Voice Cloning Attacks [68.41885995006643]
We study the first systematic evaluation of protective perturbations against voice cloning (VC) under realistic threat models.<n>Our findings reveal that while existing purification methods can neutralize a considerable portion of the protective perturbations, they still lead to distortions in the feature space of VC models.<n>We propose a novel two-stage purification method: (1) Purify the perturbed speech; (2) Refine it using phoneme guidance to align it with the clean speech distribution.
arXiv Detail & Related papers (2025-07-03T13:30:58Z) - SafeSpeech: Robust and Universal Voice Protection Against Malicious Speech Synthesis [8.590034271906289]
Speech synthesis technology has brought great convenience, while the widespread usage of realistic deepfake audio has triggered hazards.<n>Malicious adversaries may unauthorizedly collect victims' speeches and clone a similar voice for illegal exploitation.<n>We propose a framework, textittextbfSafeSpeech, which protects the users' audio before uploading by embedding imperceptible perturbations on original speeches.
arXiv Detail & Related papers (2025-04-14T03:21:23Z) - VocalCrypt: Novel Active Defense Against Deepfake Voice Based on Masking Effect [2.417762825674103]
rapid advancements in AI voice cloning, fueled by machine learning, have significantly impacted text-to-speech (TTS) and voice conversion (VC) fields.<n>We propose a novel active defense method, VocalCrypt, which embeds pseudo-timbre (jamming information) based on SFS into audio segments that are imperceptible to the human ear.<n>In comparison to existing methods, such as adversarial noise incorporation, VocalCrypt significantly enhances robustness and real-time performance.
arXiv Detail & Related papers (2025-02-14T17:43:01Z) - Mitigating Unauthorized Speech Synthesis for Voice Protection [7.1578783467799]
malicious voice exploitation has brought huge hazards in our daily lives.
It is crucial to protect publicly accessible speech data that contains sensitive information, such as personal voiceprints.
We devise Pivotal Objective Perturbation (POP) that applies imperceptible error-minimizing noises on original speech samples.
arXiv Detail & Related papers (2024-10-28T05:16:37Z) - TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech shows a significant improvement in inference latency, enabling speedup up to 21.4x than autoregressive technique.
arXiv Detail & Related papers (2022-05-25T06:34:14Z) - Mel Frequency Spectral Domain Defenses against Adversarial Attacks on
Speech Recognition Systems [33.21836814000979]
This paper explores speech specific defenses using the mel spectral domain, and introduces a novel defense method called'mel domain noise flooding' (MDNF)
MDNF applies additive noise to the mel spectrogram of a speech utterance prior to re-synthesising the audio signal.
We test the defenses against strong white-box adversarial attacks such as projected gradient descent (PGD) and Carlini-Wagner (CW) attacks.
arXiv Detail & Related papers (2022-03-29T06:58:26Z) - Speaker Identity Preservation in Dysarthric Speech Reconstruction by
Adversarial Speaker Adaptation [59.41186714127256]
Dysarthric speech reconstruction (DSR) aims to improve the quality of dysarthric speech.
Speaker encoder (SE) optimized for speaker verification has been explored to control the speaker identity.
We propose a novel multi-task learning strategy, i.e., adversarial speaker adaptation (ASA)
arXiv Detail & Related papers (2022-02-18T08:59:36Z) - Recent Progress in the CUHK Dysarthric Speech Recognition System [66.69024814159447]
Disordered speech presents a wide spectrum of challenges to current data intensive deep neural networks (DNNs) based automatic speech recognition technologies.
This paper presents recent research efforts at the Chinese University of Hong Kong to improve the performance of disordered speech recognition systems.
arXiv Detail & Related papers (2022-01-15T13:02:40Z) - Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs
for Robust Speech Recognition [52.71604809100364]
We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech.
Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network.
In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets.
arXiv Detail & Related papers (2021-10-11T00:08:48Z) - Perceptual-based deep-learning denoiser as a defense against adversarial
attacks on ASR systems [26.519207339530478]
Adversarial attacks attempt to force misclassification by adding small perturbations to the original speech signal.
We propose to counteract this by employing a neural-network based denoiser as a pre-processor in the ASR pipeline.
We found that training the denoisier using a perceptually motivated loss function resulted in increased adversarial robustness.
arXiv Detail & Related papers (2021-07-12T07:00:06Z) - Towards Robust Speech-to-Text Adversarial Attack [78.5097679815944]
This paper introduces a novel adversarial algorithm for attacking the state-of-the-art speech-to-text systems, namely DeepSpeech, Kaldi, and Lingvo.
Our approach is based on developing an extension for the conventional distortion condition of the adversarial optimization formulation.
Minimizing over this metric, which measures the discrepancies between original and adversarial samples' distributions, contributes to crafting signals very close to the subspace of legitimate speech recordings.
arXiv Detail & Related papers (2021-03-15T01:51:41Z) - High Fidelity Speech Regeneration with Application to Speech Enhancement [96.34618212590301]
We propose a wav-to-wav generative model for speech that can generate 24khz speech in a real-time manner.
Inspired by voice conversion methods, we train to augment the speech characteristics while preserving the identity of the source.
arXiv Detail & Related papers (2021-01-31T10:54:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.