DeID-VC: Speaker De-identification via Zero-shot Pseudo Voice Conversion
- URL: http://arxiv.org/abs/2209.04530v1
- Date: Fri, 9 Sep 2022 21:13:08 GMT
- Title: DeID-VC: Speaker De-identification via Zero-shot Pseudo Voice Conversion
- Authors: Ruibin Yuan, Yuxuan Wu, Jacob Li, Jaxter Kim
- Abstract summary: DeID-VC is a speaker de-identification system that converts a real speaker to pseudo speakers.
With the help of PSG, DeID-VC can assign unique pseudo speakers at speaker level or even at utterance level.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The widespread adoption of speech-based online services raises security and
privacy concerns regarding the data that they use and share. If the data were
compromised, attackers could exploit user speech to bypass speaker verification
systems or even impersonate users. To mitigate this, we propose DeID-VC, a
speaker de-identification system that converts a real speaker to pseudo
speakers, thus removing or obfuscating the speaker-dependent attributes from a
spoken voice. The key components of DeID-VC include a Variational Autoencoder
(VAE) based Pseudo Speaker Generator (PSG) and a voice conversion Autoencoder
(AE) under zero-shot settings. With the help of PSG, DeID-VC can assign unique
pseudo speakers at speaker level or even at utterance level. Also, two novel
learning objectives are added to bridge the gap between training and inference
of zero-shot voice conversion. We present our experimental results with word
error rate (WER) and equal error rate (EER), along with three subjective
metrics to evaluate the generated output of DeID-VC. The results show that our
method substantially improves intelligibility (WER 10% lower) and
de-identification effectiveness (EER 5% higher) compared to our baseline. Code
and listening demo: https://github.com/a43992899/DeID-VC
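The abstract names the PSG as a VAE that generates pseudo speaker identities but gives no implementation detail. As a rough illustration only (module names and dimensions are my assumptions, not the authors' code), here is a minimal sketch of a VAE-style generator whose prior can be sampled once per speaker or once per utterance:

```python
import torch
import torch.nn as nn

class PseudoSpeakerGenerator(nn.Module):
    """Sketch of a VAE over speaker embeddings: train it to reconstruct real
    speaker embeddings, then sample the N(0, I) prior for pseudo speakers."""
    def __init__(self, spk_dim=256, latent_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(spk_dim, 128), nn.ReLU())
        self.to_mu = nn.Linear(128, latent_dim)
        self.to_logvar = nn.Linear(128, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, spk_dim)
        )

    def forward(self, spk_emb):
        h = self.encoder(spk_emb)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        return self.decoder(z), mu, logvar

    @torch.no_grad()
    def sample_pseudo_speaker(self, n=1):
        # Each draw from the prior decodes to a new pseudo speaker embedding.
        z = torch.randn(n, self.to_mu.out_features)
        return self.decoder(z)

psg = PseudoSpeakerGenerator()
# Speaker-level de-identification reuses one sample for all of a speaker's
# utterances; utterance-level draws a fresh sample per utterance.
pseudo_emb = psg.sample_pseudo_speaker(n=1)
```

In such a setup, the sampled embedding would condition the zero-shot VC autoencoder in place of the real speaker's embedding.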
Related papers
- Seeing Your Speech Style: A Novel Zero-Shot Identity-Disentanglement Face-based Voice Conversion [5.483488375189695]
Face-based Voice Conversion (FVC) is a novel task that leverages facial images to generate the target speaker's voice style.
Previous work has two shortcomings: (1) difficulty in obtaining facial embeddings that are well aligned with the speaker's voice identity, and (2) inadequate decoupling of content and speaker identity information from the audio input.
We present a novel FVC method, Identity-Disentanglement Face-based Voice Conversion (ID-FaceVC), which overcomes the above two limitations.
arXiv Detail & Related papers (2024-09-01T11:51:18Z)
- Catch You and I Can: Revealing Source Voiceprint Against Voice Conversion [0.0]
We make the first attempt to restore, with high accuracy, the source voiceprint from audio synthesized by voice conversion methods.
We develop Revelio, a representation learning model, which learns to effectively extract the voiceprint of the source speaker from converted audio samples.
arXiv Detail & Related papers (2023-02-24T03:33:13Z)
- Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings [53.11450530896623]
This paper presents a streaming speaker-attributed automatic speech recognition (SA-ASR) model that can recognize "who spoke what."
Our model is based on token-level serialized output training (t-SOT) which was recently proposed to transcribe multi-talker speech in a streaming fashion.
The proposed model achieves substantially better accuracy than a prior streaming model and shows comparable or sometimes even superior results to the state-of-the-art offline SA-ASR model.
arXiv Detail & Related papers (2022-03-30T21:42:00Z)
- Enhancing Zero-Shot Many to Many Voice Conversion with Self-Attention VAE [8.144263449781967]
A variational autoencoder (VAE) is an effective neural network architecture for disentangling a speech utterance into speaker identity and linguistic content latent embeddings.
In this work, we identify a suitable location in the VAE's decoder to add a self-attention layer that incorporates non-local information when generating a converted utterance.
arXiv Detail & Related papers (2022-03-30T03:52:42Z)
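The entry above pins its contribution to where a self-attention layer is placed in the VAE's decoder. A minimal sketch of the general mechanism (shapes and module layout are my assumptions, not taken from the paper):

```python
import torch
import torch.nn as nn

class DecoderBlockWithSelfAttention(nn.Module):
    """Hypothetical VC decoder block: a self-attention layer lets each output
    frame attend to all frames, adding non-local context to the conversion."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())

    def forward(self, x):           # x: (batch, frames, dim)
        a, _ = self.attn(x, x, x)   # every frame attends to every frame
        x = self.norm(x + a)        # residual connection + layer norm
        return x + self.ff(x)

h = torch.randn(2, 100, 256)        # decoder hidden states for 100 frames
out = DecoderBlockWithSelfAttention()(h)
```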
- VCVTS: Multi-speaker Video-to-Speech synthesis via cross-modal knowledge transfer from voice conversion [77.50171525265056]
This paper proposes a novel multi-speaker Video-to-Speech (VTS) system based on cross-modal knowledge transfer from voice conversion (VC).
The Lip2Ind network can replace the content encoder of the VC model, forming a multi-speaker VTS system that converts silent video to acoustic units for reconstructing accurate spoken content.
arXiv Detail & Related papers (2022-02-18T08:58:45Z)
- Training Robust Zero-Shot Voice Conversion Models with Self-supervised Features [24.182732872327183]
Unsupervised Zero-Shot Voice Conversion (VC) aims to modify the speaker characteristics of an utterance to match an unseen target speaker.
We show that high-quality audio samples can be achieved by using a length resampling decoder.
arXiv Detail & Related papers (2021-12-08T17:27:39Z)
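The "length resampling decoder" in the entry above plausibly refers to matching the frame rate of self-supervised features to that of the vocoder's input. A hedged sketch of length resampling via linear interpolation (my reading of the idea, not the paper's code):

```python
import torch
import torch.nn.functional as F

def resample_length(feats, target_len):
    """Resample a (batch, frames, dim) feature sequence to target_len frames
    by linear interpolation, e.g. to map an SSL extractor's frame rate
    (typically 50 frames/s) onto a vocoder's mel frame rate."""
    x = feats.transpose(1, 2)                       # (batch, dim, frames)
    x = F.interpolate(x, size=target_len, mode="linear", align_corners=False)
    return x.transpose(1, 2)                        # (batch, target_len, dim)

ssl_feats = torch.randn(1, 50, 768)             # 1 s of features at 50 fps
mel_rate_feats = resample_length(ssl_feats, 86)  # ~86 mel frames per second
```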
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results demonstrate the superiority of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
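For the VQ half of the entry above, here is a minimal sketch of vector-quantized content encoding (codebook size and dimensions are assumed); the MI-based correlation penalty between content, speaker, and pitch representations is not shown:

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Sketch: snap each content frame to its nearest codebook entry, which
    discards fine-grained (speaker-like) variation from the content stream."""
    def __init__(self, codebook_size=512, dim=64):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, z):                        # z: (batch, frames, dim)
        flat = z.reshape(-1, z.size(-1))         # (batch*frames, dim)
        idx = torch.cdist(flat, self.codebook.weight).argmin(dim=-1)
        q = self.codebook(idx).view_as(z)
        # Straight-through estimator: gradients pass to the encoder unchanged.
        return z + (q - z).detach(), idx.view(z.shape[:-1])

vq = VectorQuantizer()
content, codes = vq(torch.randn(2, 100, 64))     # quantized content features
```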
- Speaker De-identification System using Autoencoders and Adversarial Training [58.720142291102135]
We propose a speaker de-identification system based on adversarial training and autoencoders.
Experimental results show that combining adversarial learning and autoencoders increases the equal error rate of a speaker verification system.
arXiv Detail & Related papers (2020-11-09T19:22:05Z)
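The entry above does not spell out its adversarial setup. One common instantiation, shown here purely as an assumed sketch, trains an autoencoder whose latent is pushed away from speaker identity by a gradient-reversed speaker classifier:

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; flips the gradient sign on the backward
    pass, so the encoder learns to *fool* the speaker classifier."""
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)
    @staticmethod
    def backward(ctx, g):
        return -g

encoder = nn.Sequential(nn.Linear(80, 128), nn.ReLU(), nn.Linear(128, 64))
decoder = nn.Linear(64, 80)
spk_clf = nn.Linear(64, 10)           # 10 training speakers (assumed)

frames = torch.randn(32, 80)          # batch of spectrogram frames
spk = torch.randint(0, 10, (32,))     # speaker labels
z = encoder(frames)
recon_loss = nn.functional.mse_loss(decoder(z), frames)
adv_loss = nn.functional.cross_entropy(spk_clf(GradReverse.apply(z)), spk)
loss = recon_loss + adv_loss          # encoder learns to hide speaker identity
loss.backward()
```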
- VQVC+: One-Shot Voice Conversion by Vector Quantization and U-Net architecture [71.45920122349628]
Auto-encoder-based VC methods disentangle the speaker and the content in input speech without being given the speaker's identity.
We use the U-Net architecture within an auto-encoder-based VC system to improve audio quality.
arXiv Detail & Related papers (2020-06-07T14:01:16Z)
- F0-consistent many-to-many non-parallel voice conversion via conditional autoencoder [53.901873501494606]
We modified and improved autoencoder-based voice conversion to disentangle content, F0, and speaker identity at the same time.
We can control the F0 contour, generate speech with F0 consistent with the target speaker, and significantly improve quality and similarity.
arXiv Detail & Related papers (2020-04-15T22:00:06Z)
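As a hedged sketch of the conditioning idea in the entry above (dimensions and normalization scheme are assumptions), the decoder can take content, a speaker embedding, and an explicit F0 contour, making pitch controllable and consistent with the target speaker:

```python
import torch
import torch.nn as nn

class F0ConditionedDecoder(nn.Module):
    """Hypothetical sketch: the decoder receives content, a speaker embedding,
    and a normalized F0 contour, so pitch is disentangled and controllable."""
    def __init__(self, content_dim=64, spk_dim=32, out_dim=80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(content_dim + spk_dim + 1, 256), nn.ReLU(),
            nn.Linear(256, out_dim),
        )

    def forward(self, content, spk_emb, f0):
        # content: (B, T, C); spk_emb: (B, S); f0: (B, T), e.g. log-F0
        # normalized per speaker so the target speaker's pitch range applies.
        spk = spk_emb.unsqueeze(1).expand(-1, content.size(1), -1)
        x = torch.cat([content, spk, f0.unsqueeze(-1)], dim=-1)
        return self.net(x)            # (B, T, out_dim) mel frames

dec = F0ConditionedDecoder()
mel = dec(torch.randn(2, 100, 64), torch.randn(2, 32), torch.randn(2, 100))
```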