Mitigating Unauthorized Speech Synthesis for Voice Protection
- URL: http://arxiv.org/abs/2410.20742v1
- Date: Mon, 28 Oct 2024 05:16:37 GMT
- Title: Mitigating Unauthorized Speech Synthesis for Voice Protection
- Authors: Zhisheng Zhang, Qianyi Yang, Derui Wang, Pengyang Huang, Yuxin Cao, Kai Ye, Jie Hao,
- Abstract summary: malicious voice exploitation has brought huge hazards in our daily lives.
It is crucial to protect publicly accessible speech data that contains sensitive information, such as personal voiceprints.
We devise Pivotal Objective Perturbation (POP) that applies imperceptible error-minimizing noises on original speech samples.
- Score: 7.1578783467799
- License:
- Abstract: With just a few speech samples, it is possible to perfectly replicate a speaker's voice in recent years, while malicious voice exploitation (e.g., telecom fraud for illegal financial gain) has brought huge hazards in our daily lives. Therefore, it is crucial to protect publicly accessible speech data that contains sensitive information, such as personal voiceprints. Most previous defense methods have focused on spoofing speaker verification systems in timbre similarity but the synthesized deepfake speech is still of high quality. In response to the rising hazards, we devise an effective, transferable, and robust proactive protection technology named Pivotal Objective Perturbation (POP) that applies imperceptible error-minimizing noises on original speech samples to prevent them from being effectively learned for text-to-speech (TTS) synthesis models so that high-quality deepfake speeches cannot be generated. We conduct extensive experiments on state-of-the-art (SOTA) TTS models utilizing objective and subjective metrics to comprehensively evaluate our proposed method. The experimental results demonstrate outstanding effectiveness and transferability across various models. Compared to the speech unclarity score of 21.94% from voice synthesizers trained on samples without protection, POP-protected samples significantly increase it to 127.31%. Moreover, our method shows robustness against noise reduction and data augmentation techniques, thereby greatly reducing potential hazards.
Related papers
- VocalCrypt: Novel Active Defense Against Deepfake Voice Based on Masking Effect [2.417762825674103]
rapid advancements in AI voice cloning, fueled by machine learning, have significantly impacted text-to-speech (TTS) and voice conversion (VC) fields.
We propose a novel active defense method, VocalCrypt, which embeds pseudo-timbre (jamming information) based on SFS into audio segments that are imperceptible to the human ear.
In comparison to existing methods, such as adversarial noise incorporation, VocalCrypt significantly enhances robustness and real-time performance.
arXiv Detail & Related papers (2025-02-14T17:43:01Z) - Can DeepFake Speech be Reliably Detected? [17.10792531439146]
This work presents the first systematic study of active malicious attacks against state-of-the-art open-source speech detectors.
The results highlight the urgent need for more robust detection methods in the face of evolving adversarial threats.
arXiv Detail & Related papers (2024-10-09T06:13:48Z) - HiddenSpeaker: Generate Imperceptible Unlearnable Audios for Speaker Verification System [0.9591674293850556]
We propose a framework named HiddenSpeaker, embedding imperceptible perturbations within the training speech samples.
Our results demonstrate that HiddenSpeaker not only deceives the model with unlearnable samples but also enhances the imperceptibility of the perturbations.
arXiv Detail & Related papers (2024-05-24T15:49:00Z) - High-Fidelity Speech Synthesis with Minimal Supervision: All Using
Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
Non-autoregressive framework enhances controllability, and duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z) - Collaborative Watermarking for Adversarial Speech Synthesis [0.0]
We propose a collaborative training scheme for synthetic speech watermarking.
We show that a HiFi-GAN neural vocoder collaborating with the ASVspoof 2021 baseline countermeasure models consistently improves detection performance.
arXiv Detail & Related papers (2023-09-26T19:43:14Z) - Syn-Att: Synthetic Speech Attribution via Semi-Supervised Unknown
Multi-Class Ensemble of CNNs [1.262949092134022]
Novel strategy is proposed to attribute a synthetic speech track to the generator that is used to synthesize it.
The proposed detector transforms the audio into log-mel spectrogram, extracts features using CNN, and classifies it between five known and unknown algorithms.
The method outperforms other top teams in accuracy by 12-13% on Eval 2 and 1-2% on Eval 1, in the IEEE SP Cup challenge at ICASSP 2022.
arXiv Detail & Related papers (2023-09-15T04:26:39Z) - Deepfake audio detection by speaker verification [79.99653758293277]
We propose a new detection approach that leverages only the biometric characteristics of the speaker, with no reference to specific manipulations.
The proposed approach can be implemented based on off-the-shelf speaker verification tools.
We test several such solutions on three popular test sets, obtaining good performance, high generalization ability, and high robustness to audio impairment.
arXiv Detail & Related papers (2022-09-28T13:46:29Z) - TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech shows a significant improvement in inference latency, enabling speedup up to 21.4x than autoregressive technique.
arXiv Detail & Related papers (2022-05-25T06:34:14Z) - Improving Noise Robustness of Contrastive Speech Representation Learning
with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments.
We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition.
We achieve comparable performance to the best supervised approach reported with only 16% of labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z) - Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs
for Robust Speech Recognition [52.71604809100364]
We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech.
Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network.
In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets.
arXiv Detail & Related papers (2021-10-11T00:08:48Z) - Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis
Using Discrete Speech Representation [125.59372403631006]
We propose a semi-supervised learning approach for multi-speaker text-to-speech (TTS)
A multi-speaker TTS model can learn from the untranscribed audio via the proposed encoder-decoder framework with discrete speech representation.
We found the model can benefit from the proposed semi-supervised learning approach even when part of the unpaired speech data is noisy.
arXiv Detail & Related papers (2020-05-16T15:47:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.