Related papers: VoiceCloak: A Multi-Dimensional Defense Framework against Unauthorized Diffusion-based Voice Cloning

VoiceCloak: A Multi-Dimensional Defense Framework against Unauthorized Diffusion-based Voice Cloning

URL: http://arxiv.org/abs/2505.12332v2
Date: Wed, 21 May 2025 02:08:03 GMT
Title: VoiceCloak: A Multi-Dimensional Defense Framework against Unauthorized Diffusion-based Voice Cloning
Authors: Qianyue Hu, Junyan Wu, Wei Lu, Xiangyang Luo,
Abstract summary: Diffusion Models (DMs) have achieved remarkable success in realistic voice cloning (VC)<n>DMs have been proven incompatible with proactive defenses due to intricate generative mechanisms of diffusion.<n>We introduce VoiceCloak, a multi-dimensional proactive defense framework with the goal of obfuscating speaker identity and degrading quality in potential unauthorized VC.
Score: 14.907575859145423
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Diffusion Models (DMs) have achieved remarkable success in realistic voice cloning (VC), while they also increase the risk of malicious misuse. Existing proactive defenses designed for traditional VC models aim to disrupt the forgery process, but they have been proven incompatible with DMs due to the intricate generative mechanisms of diffusion. To bridge this gap, we introduce VoiceCloak, a multi-dimensional proactive defense framework with the goal of obfuscating speaker identity and degrading perceptual quality in potential unauthorized VC. To achieve these goals, we conduct a focused analysis to identify specific vulnerabilities within DMs, allowing VoiceCloak to disrupt the cloning process by introducing adversarial perturbations into the reference audio. Specifically, to obfuscate speaker identity, VoiceCloak first targets speaker identity by distorting representation learning embeddings to maximize identity variation, which is guided by auditory perception principles. Additionally, VoiceCloak disrupts crucial conditional guidance processes, particularly attention context, thereby preventing the alignment of vocal characteristics that are essential for achieving convincing cloning. Then, to address the second objective, VoiceCloak introduces score magnitude amplification to actively steer the reverse trajectory away from the generation of high-quality speech. Noise-guided semantic corruption is further employed to disrupt structural speech semantics captured by DMs, degrading output quality. Extensive experiments highlight VoiceCloak's outstanding defense success rate against unauthorized diffusion-based voice cloning.

Related papers

VocalBridge: Latent Diffusion-Bridge Purification for Defeating Perturbation-Based Voiceprint Defenses [3.348046946735795]
Recent defenses attempt to prevent unauthorized cloning by embedding protective perturbations into speech.<n>We proposeVocalBridge, a purification framework that learns a latent mapping from perturbed to clean speech in the EnCodec latent space.<n>We show that our approach consistently outperforms existing purification methods in recovering cloneable voices from protected speech.
arXiv Detail & Related papers (2026-01-05T13:43:30Z)
MM-Sonate: Multimodal Controllable Audio-Video Generation with Zero-Shot Voice Cloning [18.636738208526676]
MM-Sonate is a multimodal flow-matching framework that unifies controllable audio-video joint generation with zero-shot voice cloning capabilities.<n>To enable zero-shot voice cloning, we introduce a classifier injection mechanism that effectively decouples speaker identity from linguistic content.<n> Empirical evaluations demonstrate that MM-Sonate establishes new state-of-the-art performance in joint generation benchmarks.
arXiv Detail & Related papers (2026-01-04T15:26:15Z)
DELULU: Discriminative Embedding Learning Using Latent Units for Speaker-Aware Self-Supervised Speech Foundational Model [65.93900011975238]
DELULU is a speaker-aware self-supervised foundational model for verification, diarization, and profiling applications.<n>It is trained using a dual objective that combines masked prediction and denoising, further enhancing robustness and generalization.<n>Our findings demonstrate that DELULU is a strong universal encoder for speaker-aware speech processing, enabling superior performance even without task-specific fine-tuning.
arXiv Detail & Related papers (2025-10-20T15:35:55Z)
Impact of Phonetics on Speaker Identity in Adversarial Voice Attack [10.019452425301303]
Adversarial perturbations in speech pose a serious threat to automatic speech recognition (ASR) and speaker verification.<n>We analyze adversarial audio at the phonetic level and show that perturbations exploit systematic confusions such as vowel centralization and consonant substitutions.<n>Results across 16 phonetically diverse target phrases demonstrate that adversarial audio induces both transcription errors and identity drift.
arXiv Detail & Related papers (2025-09-18T21:19:53Z)
De-AntiFake: Rethinking the Protective Perturbations Against Voice Cloning Attacks [68.41885995006643]
We study the first systematic evaluation of protective perturbations against voice cloning (VC) under realistic threat models.<n>Our findings reveal that while existing purification methods can neutralize a considerable portion of the protective perturbations, they still lead to distortions in the feature space of VC models.<n>We propose a novel two-stage purification method: (1) Purify the perturbed speech; (2) Refine it using phoneme guidance to align it with the clean speech distribution.
arXiv Detail & Related papers (2025-07-03T13:30:58Z)
SRD: Reinforcement-Learned Semantic Perturbation for Backdoor Defense in VLMs [57.880467106470775]
Attackers can inject imperceptible perturbations into the training data, causing the model to generate malicious, attacker-controlled captions.<n>We propose Semantic Reward Defense (SRD), a reinforcement learning framework that mitigates backdoor behavior without prior knowledge of triggers.<n>SRD uses a Deep Q-Network to learn policies for applying discrete perturbations to sensitive image regions, aiming to disrupt the activation of malicious pathways.
arXiv Detail & Related papers (2025-06-05T08:22:24Z)
CloneShield: A Framework for Universal Perturbation Against Zero-Shot Voice Cloning [30.85443077082408]
Recent breakthroughs in text-to-speech (TTS) voice cloning have raised serious privacy concerns.<n>We introduce CloneShield, a universal time-domain adversarial perturbation framework specifically designed to defend against zero-shot voice cloning.<n>Our method provides protection that is robust across speakers and utterances, without requiring any prior knowledge of the synthesized text.
arXiv Detail & Related papers (2025-05-25T12:22:00Z)
VocalCrypt: Novel Active Defense Against Deepfake Voice Based on Masking Effect [2.417762825674103]
rapid advancements in AI voice cloning, fueled by machine learning, have significantly impacted text-to-speech (TTS) and voice conversion (VC) fields.<n>We propose a novel active defense method, VocalCrypt, which embeds pseudo-timbre (jamming information) based on SFS into audio segments that are imperceptible to the human ear.<n>In comparison to existing methods, such as adversarial noise incorporation, VocalCrypt significantly enhances robustness and real-time performance.
arXiv Detail & Related papers (2025-02-14T17:43:01Z)
GenVC: Self-Supervised Zero-Shot Voice Conversion [31.94758615908198]
GenVC is a novel framework that disentangles speaker identity and linguistic content from speech signals in a self-supervised manner.<n>Results show that GenVC achieves notably higher speaker similarity, with naturalness on par with leading zero-shot approaches.
arXiv Detail & Related papers (2025-02-06T21:40:09Z)
SEF-VC: Speaker Embedding Free Zero-Shot Voice Conversion with Cross Attention [24.842378497026154]
SEF-VC is a speaker embedding free voice conversion model. It learns and incorporates speaker timbre from reference speech via a powerful position-agnostic cross-attention mechanism. It reconstructs waveform from HuBERT semantic tokens in a non-autoregressive manner.
arXiv Detail & Related papers (2023-12-14T06:26:55Z)
Controllable speech synthesis by learning discrete phoneme-level prosodic representations [53.926969174260705]
We present a novel method for phoneme-level prosody control of F0 and duration using intuitive discrete labels. We propose an unsupervised prosodic clustering process which is used to discretize phoneme-level F0 and duration features from a multispeaker speech dataset.
arXiv Detail & Related papers (2022-11-29T15:43:36Z)
DeID-VC: Speaker De-identification via Zero-shot Pseudo Voice Conversion [0.0]
DeID-VC is a speaker de-identification system that converts a real speaker to pseudo speakers. With the help of PSG, DeID-VC can assign unique pseudo speakers at speaker level or even at utterance level.
arXiv Detail & Related papers (2022-09-09T21:13:08Z)
Self supervised learning for robust voice cloning [3.7989740031754806]
We use features learned in a self-supervised framework to produce high quality speech representations. The learned features are used as pre-trained utterance-level embeddings and as inputs to a Non-Attentive Tacotron based architecture. This method enables us to train our model in an unlabeled multispeaker dataset as well as use unseen speaker embeddings to copy a speaker's voice.
arXiv Detail & Related papers (2022-04-07T13:05:24Z)
VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement. We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training. Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
FoolHD: Fooling speaker identification by Highly imperceptible adversarial Disturbances [63.80959552818541]
We propose a white-box steganography-inspired adversarial attack that generates imperceptible perturbations against a speaker identification model. Our approach, FoolHD, uses a Gated Convolutional Autoencoder that operates in the DCT domain and is trained with a multi-objective loss function. We validate FoolHD with a 250-speaker identification x-vector network, trained using VoxCeleb, in terms of accuracy, success rate, and imperceptibility.
arXiv Detail & Related papers (2020-11-17T07:38:26Z)
Speaker De-identification System using Autoencoders and Adversarial Training [58.720142291102135]
We propose a speaker de-identification system based on adversarial training and autoencoders. Experimental results show that combining adversarial learning and autoencoders increase the equal error rate of a speaker verification system.
arXiv Detail & Related papers (2020-11-09T19:22:05Z)
FragmentVC: Any-to-Any Voice Conversion by End-to-End Extracting and Fusing Fine-Grained Voice Fragments With Attention [66.77490220410249]
We propose FragmentVC, in which the latent phonetic structure of the utterance from the source speaker is obtained from Wav2Vec 2.0. FragmentVC is able to extract fine-grained voice fragments from the target speaker utterance(s) and fuse them into the desired utterance. This approach is trained with reconstruction loss only without any disentanglement considerations between content and speaker information.
arXiv Detail & Related papers (2020-10-27T09:21:03Z)
Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach. In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module. Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)

This list is automatically generated from the titles and abstracts of the papers in this site.