Related papers: VocalBridge: Latent Diffusion-Bridge Purification for Defeating Perturbation-Based Voiceprint Defenses

URL: http://arxiv.org/abs/2601.02444v1
Date: Mon, 05 Jan 2026 13:43:30 GMT
Title: VocalBridge: Latent Diffusion-Bridge Purification for Defeating Perturbation-Based Voiceprint Defenses
Authors: Maryam Abbasihafshejani, AHM Nazmus Sakib, Murtuza Jadliwala
Abstract summary: Recent defenses attempt to prevent unauthorized cloning by embedding protective perturbations into speech. We propose VocalBridge, a purification framework that learns a latent mapping from perturbed to clean speech in the EnCodec latent space. We show that our approach consistently outperforms existing purification methods in recovering cloneable voices from protected speech.
Score: 3.348046946735795
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: The rapid advancement of speech synthesis technologies, including text-to-speech (TTS) and voice conversion (VC), has intensified security and privacy concerns related to voice cloning. Recent defenses attempt to prevent unauthorized cloning by embedding protective perturbations into speech to obscure speaker identity while maintaining intelligibility. However, adversaries can apply advanced purification techniques to remove these perturbations, recover authentic acoustic characteristics, and regenerate cloneable voices. Despite the growing realism of such attacks, the robustness of existing defenses under adaptive purification remains insufficiently studied. Most existing purification methods are designed to counter adversarial noise in automatic speech recognition (ASR) systems rather than speaker verification or voice cloning pipelines. As a result, they fail to suppress the fine-grained acoustic cues that define speaker identity and are often ineffective against speaker verification attacks (SVA). To address these limitations, we propose Diffusion-Bridge (VocalBridge), a purification framework that learns a latent mapping from perturbed to clean speech in the EnCodec latent space. Using a time-conditioned 1D U-Net with a cosine noise schedule, the model enables efficient, transcript-free purification while preserving speaker-discriminative structure. We further introduce a Whisper-guided phoneme variant that incorporates lightweight temporal guidance without requiring ground-truth transcripts. Experimental results show that our approach consistently outperforms existing purification methods in recovering cloneable voices from protected speech. Our findings demonstrate the fragility of current perturbation-based defenses and highlight the need for more robust protection mechanisms against evolving voice-cloning and speaker verification threats.
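The abstract mentions a cosine noise schedule but gives no parameters. As a rough illustration only, the sketch below implements the widely used cosine schedule from the diffusion literature; the offset `s = 0.008`, the clipping value `max_beta`, and the function names `cosine_alpha_bar` and `cosine_betas` are assumptions for this example and are not taken from the paper.

```python
import math


def cosine_alpha_bar(t: float, s: float = 0.008) -> float:
    """Cumulative signal level at normalized time t in [0, 1].

    Normalized so that the value at t = 0 is exactly 1 (no noise).
    The offset s is an assumed default, not a value from the paper.
    """
    def f(u: float) -> float:
        return math.cos((u + s) / (1 + s) * math.pi / 2) ** 2

    return f(t) / f(0.0)


def cosine_betas(num_steps: int, s: float = 0.008, max_beta: float = 0.999) -> list[float]:
    """Per-step noise variances derived from the cumulative schedule."""
    betas = []
    for i in range(num_steps):
        a_prev = cosine_alpha_bar(i / num_steps, s)
        a_next = cosine_alpha_bar((i + 1) / num_steps, s)
        # Clip so the final step does not reach a variance of exactly 1.
        betas.append(min(1.0 - a_next / a_prev, max_beta))
    return betas


if __name__ == "__main__":
    betas = cosine_betas(50)
    print(len(betas), betas[0], betas[-1])
```

In a latent-space purifier such as the one described, a schedule like this would typically set how much noise is mixed into the codec latents at each timestep that the time-conditioned network sees; how VocalBridge uses it in detail is not stated in the abstract.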

Related papers

Stream-Voice-Anon: Enhancing Utility of Real-Time Speaker Anonymization via Neural Audio Codec and Language Models [51.7170633585748]
Stream-Voice-Anon adapts modern causal LM-based NAC architectures specifically for streaming speaker anonymization. Our anonymization approach incorporates pseudo-speaker representation sampling, speaker embedding mixing, and diverse prompt selection strategies. Under the VoicePrivacy 2024 Challenge protocol, Stream-Voice-Anon achieves substantial improvements in intelligibility.
arXiv Detail & Related papers (2026-01-20T13:23:44Z)
Towards Low-Latency Tracking of Multiple Speakers With Short-Context Speaker Embeddings [52.985061676464554]
We propose a Knowledge Distillation based training approach for short context speaker embedding extraction. We leverage the spatial information of the speaker of interest using beamforming to reduce overlap. Results demonstrate that our models are effective at short-context embedding extraction and more robust to overlap.
arXiv Detail & Related papers (2025-08-18T11:32:13Z)
De-AntiFake: Rethinking the Protective Perturbations Against Voice Cloning Attacks [68.41885995006643]
We conduct the first systematic evaluation of protective perturbations against voice cloning (VC) under realistic threat models. Our findings reveal that while existing purification methods can neutralize a considerable portion of the protective perturbations, they still lead to distortions in the feature space of VC models. We propose a novel two-stage purification method: (1) Purify the perturbed speech; (2) Refine it using phoneme guidance to align it with the clean speech distribution.
arXiv Detail & Related papers (2025-07-03T13:30:58Z)
VoiceCloak: A Multi-Dimensional Defense Framework against Unauthorized Diffusion-based Voice Cloning [14.907575859145423]
Diffusion Models (DMs) have achieved remarkable success in realistic voice cloning (VC). DMs have been proven incompatible with proactive defenses due to the intricate generative mechanisms of diffusion. We introduce VoiceCloak, a multi-dimensional proactive defense framework with the goal of obfuscating speaker identity and degrading quality in potential unauthorized VC.
arXiv Detail & Related papers (2025-05-18T09:58:48Z)
SafeSpeech: Robust and Universal Voice Protection Against Malicious Speech Synthesis [8.590034271906289]
Speech synthesis technology has brought great convenience, while the widespread usage of realistic deepfake audio has triggered hazards. Malicious adversaries may collect victims' speech without authorization and clone a similar voice for illegal exploitation. We propose a framework, SafeSpeech, which protects users' audio before uploading by embedding imperceptible perturbations in the original speech.
arXiv Detail & Related papers (2025-04-14T03:21:23Z)
VocalCrypt: Novel Active Defense Against Deepfake Voice Based on Masking Effect [2.417762825674103]
Rapid advancements in AI voice cloning, fueled by machine learning, have significantly impacted text-to-speech (TTS) and voice conversion (VC) fields. We propose a novel active defense method, VocalCrypt, which embeds pseudo-timbre (jamming information) based on SFS into audio segments that are imperceptible to the human ear. In comparison to existing methods, such as adversarial noise incorporation, VocalCrypt significantly enhances robustness and real-time performance.
arXiv Detail & Related papers (2025-02-14T17:43:01Z)
Mitigating Unauthorized Speech Synthesis for Voice Protection [7.1578783467799]
Malicious voice exploitation has brought huge hazards to our daily lives. It is crucial to protect publicly accessible speech data that contains sensitive information, such as personal voiceprints. We devise Pivotal Objective Perturbation (POP), which applies imperceptible error-minimizing noise to original speech samples.
arXiv Detail & Related papers (2024-10-28T05:16:37Z)
Towards Unsupervised Speech Recognition Without Pronunciation Models [57.222729245842054]
In this article, we tackle the challenge of developing ASR systems without paired speech and text corpora. We experimentally demonstrate that an unsupervised speech recognizer can emerge from joint speech-to-speech and text-to-text masked token-infilling. This innovative model surpasses the performance of previous unsupervised ASR models under the lexicon-free setting.
arXiv Detail & Related papers (2024-06-12T16:30:58Z)
Speaker Identity Preservation in Dysarthric Speech Reconstruction by Adversarial Speaker Adaptation [59.41186714127256]
Dysarthric speech reconstruction (DSR) aims to improve the quality of dysarthric speech. A speaker encoder (SE) optimized for speaker verification has been explored to control the speaker identity. We propose a novel multi-task learning strategy, i.e., adversarial speaker adaptation (ASA).
arXiv Detail & Related papers (2022-02-18T08:59:36Z)
Voicy: Zero-Shot Non-Parallel Voice Conversion in Noisy Reverberant Environments [76.98764900754111]
Voice Conversion (VC) is a technique that aims to transform the non-linguistic information of a source utterance to change the perceived identity of the speaker. We propose Voicy, a new VC framework particularly tailored for noisy speech. Our method, which is inspired by the de-noising auto-encoders framework, is comprised of four encoders (speaker, content, phonetic and acoustic-ASR) and one decoder.
arXiv Detail & Related papers (2021-06-16T15:47:06Z)
High Fidelity Speech Regeneration with Application to Speech Enhancement [96.34618212590301]
We propose a wav-to-wav generative model for speech that can generate 24 kHz speech in real time. Inspired by voice conversion methods, we train to augment the speech characteristics while preserving the identity of the source.
arXiv Detail & Related papers (2021-01-31T10:54:27Z)

This list is automatically generated from the titles and abstracts of the papers on this site.