CloneShield: A Framework for Universal Perturbation Against Zero-Shot Voice Cloning
- URL: http://arxiv.org/abs/2505.19119v1
- Date: Sun, 25 May 2025 12:22:00 GMT
- Title: CloneShield: A Framework for Universal Perturbation Against Zero-Shot Voice Cloning
- Authors: Renyuan Li, Zhibo Liang, Haichuan Zhang, Tianyu Shi, Zhiyuan Cheng, Jia Shi, Carl Yang, Mingjie Tang
- Abstract summary: Recent breakthroughs in text-to-speech (TTS) voice cloning have raised serious privacy concerns. We introduce CloneShield, a universal time-domain adversarial perturbation framework specifically designed to defend against zero-shot voice cloning. Our method provides protection that is robust across speakers and utterances, without requiring any prior knowledge of the synthesized text.
- Score: 30.85443077082408
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent breakthroughs in text-to-speech (TTS) voice cloning have raised serious privacy concerns, allowing highly accurate vocal identity replication from just a few seconds of reference audio, while retaining the speaker's vocal authenticity. In this paper, we introduce CloneShield, a universal time-domain adversarial perturbation framework specifically designed to defend against zero-shot voice cloning. Our method provides protection that is robust across speakers and utterances, without requiring any prior knowledge of the synthesized text. We formulate perturbation generation as a multi-objective optimization problem, and propose a Multi-Gradient Descent Algorithm (MGDA)-based approach to ensure robust protection across diverse utterances. To preserve natural auditory perception for users, we decompose the adversarial perturbation via Mel-spectrogram representations and fine-tune it for each sample. This design ensures imperceptibility while maintaining strong degradation effects on zero-shot cloned outputs. Experiments on three state-of-the-art zero-shot TTS systems, five benchmark datasets and evaluations from 60 human listeners demonstrate that our method preserves near-original audio quality in protected inputs (PESQ = 3.90, SRS = 0.93) while substantially degrading both speaker similarity and speech quality in cloned samples (PESQ = 1.07, SRS = 0.08).
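The multi-objective step the abstract describes can be illustrated with a minimal sketch. MGDA seeks the minimum-norm convex combination of the per-objective gradients; its negative is a common descent direction for all objectives. The two-objective closed form below is a hedged reconstruction of that idea, not the authors' implementation: the toy step size, the L-infinity budget standing in for the imperceptibility constraint, and the function names are all assumptions.

```python
import numpy as np

def mgda_two_task_alpha(g1: np.ndarray, g2: np.ndarray) -> float:
    """Closed-form weight alpha in [0, 1] minimizing
    ||alpha * g1 + (1 - alpha) * g2||^2 (two-objective MGDA)."""
    diff = g1 - g2
    denom = float(np.dot(diff, diff))
    if denom == 0.0:            # identical gradients: any weight is optimal
        return 0.5
    alpha = float(np.dot(g2 - g1, g2)) / denom
    return float(np.clip(alpha, 0.0, 1.0))

def mgda_step(delta: np.ndarray, g1: np.ndarray, g2: np.ndarray,
              lr: float = 0.01, eps: float = 0.05) -> np.ndarray:
    """One sign-descent update of the perturbation along the MGDA-combined
    gradient, projected onto an L-infinity ball of radius eps (an assumed
    stand-in for the paper's Mel-spectrogram imperceptibility constraint)."""
    alpha = mgda_two_task_alpha(g1, g2)
    g = alpha * g1 + (1.0 - alpha) * g2
    return np.clip(delta - lr * np.sign(g), -eps, eps)
```

For example, with orthogonal unit gradients the min-norm combination weights both objectives equally (alpha = 0.5), so neither utterance's protection loss dominates the shared perturbation update.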
Related papers
- Quantum-Inspired Audio Unlearning: Towards Privacy-Preserving Voice Biometrics [44.60499998155848]
QPAudioEraser is a quantum-inspired audio unlearning framework. It consistently surpasses conventional baselines across single-class, multi-class, sequential, and accent-level erasure scenarios.
arXiv Detail & Related papers (2025-07-29T20:12:24Z) - VoiceCloak: A Multi-Dimensional Defense Framework against Unauthorized Diffusion-based Voice Cloning [14.907575859145423]
Diffusion Models (DMs) have achieved remarkable success in realistic voice cloning (VC). However, DMs have proven incompatible with proactive defenses due to the intricate generative mechanisms of diffusion. We introduce VoiceCloak, a multi-dimensional proactive defense framework with the goal of obfuscating speaker identity and degrading quality in potential unauthorized VC.
arXiv Detail & Related papers (2025-05-18T09:58:48Z) - VocalCrypt: Novel Active Defense Against Deepfake Voice Based on Masking Effect [2.417762825674103]
Rapid advancements in AI voice cloning, fueled by machine learning, have significantly impacted the text-to-speech (TTS) and voice conversion (VC) fields. We propose a novel active defense method, VocalCrypt, which embeds pseudo-timbre (jamming information) based on SFS into audio segments that are imperceptible to the human ear. In comparison to existing methods, such as adversarial noise incorporation, VocalCrypt significantly enhances robustness and real-time performance.
arXiv Detail & Related papers (2025-02-14T17:43:01Z) - Mitigating Unauthorized Speech Synthesis for Voice Protection [7.1578783467799]
Malicious voice exploitation has posed serious hazards to our daily lives.
It is crucial to protect publicly accessible speech data that contains sensitive information, such as personal voiceprints.
We devise Pivotal Objective Perturbation (POP) that applies imperceptible error-minimizing noises on original speech samples.
arXiv Detail & Related papers (2024-10-28T05:16:37Z) - VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment [101.2489492032816]
VALL-E R is a robust and efficient zero-shot Text-to-Speech system.
This research has the potential to be applied to meaningful projects, including the creation of speech for those affected by aphasia.
arXiv Detail & Related papers (2024-06-12T04:09:44Z) - Speech collage: code-switched audio generation by collaging monolingual corpora [50.356820349870986]
Speech Collage is a method that synthesizes code-switched (CS) data from monolingual corpora by splicing audio segments.
We investigate the impact of generated data on speech recognition in two scenarios.
arXiv Detail & Related papers (2023-09-27T14:17:53Z) - Controllable speech synthesis by learning discrete phoneme-level prosodic representations [53.926969174260705]
We present a novel method for phoneme-level prosody control of F0 and duration using intuitive discrete labels.
We propose an unsupervised prosodic clustering process which is used to discretize phoneme-level F0 and duration features from a multispeaker speech dataset.
arXiv Detail & Related papers (2022-11-29T15:43:36Z) - Prosody Cloning in Zero-Shot Multispeaker Text-to-Speech [25.707717591185386]
We show that it is possible to clone the voice of a speaker as well as the prosody of a spoken reference independently without any degradation in quality.
All of our code and trained models are available, alongside static and interactive demos.
arXiv Detail & Related papers (2022-06-24T11:54:59Z) - Speech Resynthesis from Discrete Disentangled Self-Supervised Representations [49.48053138928408]
We propose using self-supervised discrete representations for the task of speech resynthesis.
We extract low-bitrate representations for speech content, prosodic information, and speaker identity.
Using the obtained representations, we can get to a rate of 365 bits per second while providing better speech quality than the baseline methods.
arXiv Detail & Related papers (2021-04-01T09:20:33Z) - Towards Robust Speech-to-Text Adversarial Attack [78.5097679815944]
This paper introduces a novel adversarial algorithm for attacking the state-of-the-art speech-to-text systems, namely DeepSpeech, Kaldi, and Lingvo.
Our approach is based on developing an extension for the conventional distortion condition of the adversarial optimization formulation.
Minimizing over this metric, which measures the discrepancies between original and adversarial samples' distributions, contributes to crafting signals very close to the subspace of legitimate speech recordings.
arXiv Detail & Related papers (2021-03-15T01:51:41Z) - F0-consistent many-to-many non-parallel voice conversion via conditional autoencoder [53.901873501494606]
We modified and improved autoencoder-based voice conversion to disentangle content, F0, and speaker identity at the same time.
We can control the F0 contour, generate speech with F0 consistent with the target speaker, and significantly improve quality and similarity.
arXiv Detail & Related papers (2020-04-15T22:00:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.