Catch You and I Can: Revealing Source Voiceprint Against Voice
Conversion
- URL: http://arxiv.org/abs/2302.12434v1
- Date: Fri, 24 Feb 2023 03:33:13 GMT
- Title: Catch You and I Can: Revealing Source Voiceprint Against Voice
Conversion
- Authors: Jiangyi Deng (1), Yanjiao Chen (1), Yinan Zhong (1), Qianhao Miao (1),
Xueluan Gong (2), Wenyuan Xu (1) ((1) Zhejiang University, (2) Wuhan
University)
- Abstract summary: We make the first attempt to restore, with high credibility, the source voiceprint from audio synthesized by voice conversion methods.
We develop Revelio, a representation learning model, which learns to effectively extract the voiceprint of the source speaker from converted audio samples.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Voice conversion (VC) techniques can be abused by malicious parties to
transform their audio to sound like a target speaker, making it hard for a
human being or a speaker verification/identification system to trace the source
speaker. In this paper, we make the first attempt to restore, with high
credibility, the source voiceprint from audio synthesized by voice conversion
methods. However, unveiling the features of the source speaker from converted
audio is challenging since the voice conversion operation intends to
disentangle the original features and infuse the features of the target
speaker. To fulfill our goal, we develop Revelio, a representation learning
model, which learns to effectively extract the voiceprint of the source speaker
from converted audio samples. We equip Revelio with a carefully-designed
differential rectification algorithm to eliminate the influence of the target
speaker by removing the representation component that is parallel to the
voiceprint of the target speaker (a minimal sketch of this projection follows
the abstract). We have conducted extensive experiments to
evaluate the capability of Revelio in restoring voiceprint from audios
converted by VQVC, VQVC+, AGAIN, and BNE. The experiments verify that Revelio
is able to rebuild voiceprints that can be traced to the source speaker by
speaker verification and identification systems. Revelio also exhibits robust
performance under inter-gender conversion, unseen languages, and telephony
networks.
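The differential rectification step described above amounts to vector rejection: projecting the converted-audio representation onto the target speaker's voiceprint and subtracting that parallel component, so that what remains can be scored against enrolled speakers. Below is a minimal NumPy sketch of the idea; the function names, the cosine-similarity tracing step, and the assumption of fixed-length embedding vectors are illustrative, not the authors' actual Revelio implementation.

```python
import numpy as np

def differential_rectification(converted_emb: np.ndarray,
                               target_voiceprint: np.ndarray) -> np.ndarray:
    """Remove the component of a converted-audio embedding that is
    parallel to the target speaker's voiceprint (vector rejection).
    Illustrative sketch of the idea described in the abstract."""
    v = target_voiceprint / np.linalg.norm(target_voiceprint)
    # Component of the embedding along the target-voiceprint direction.
    parallel = np.dot(converted_emb, v) * v
    # The residual is attributed to the source speaker.
    return converted_emb - parallel

def trace_source(rectified_emb: np.ndarray, enrolled: dict) -> str:
    """Score the rectified embedding against enrolled voiceprints
    (speaker name -> embedding) by cosine similarity, as a
    speaker-identification system would."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(enrolled, key=lambda s: cosine(rectified_emb, enrolled[s]))
```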
Related papers
- Accent conversion using discrete units with parallel data synthesized from controllable accented TTS [56.18382038512251]
The goal of accent conversion (AC) is to convert speech accents while preserving content and speaker identity.
Previous methods either required reference utterances during inference, did not preserve speaker identity well, or used one-to-one systems that could only be trained for each non-native accent.
This paper presents a promising AC model that can convert many accents into native-sounding speech, overcoming these issues.
arXiv Detail & Related papers (2024-09-30T19:52:10Z)
- Who is Authentic Speaker [4.822108779108675]
Voice conversion can pose potential social issues when manipulated voices are employed for deceptive purposes.
Identifying the real speakers behind converted voices is a major challenge, as the acoustic characteristics of the source speakers are greatly altered.
This study is conducted with the assumption that certain information from the source speakers persists, even when their voices undergo conversion into different target voices.
arXiv Detail & Related papers (2024-04-30T23:41:00Z)
- Zero Shot Audio to Audio Emotion Transfer With Speaker Disentanglement [41.837538440839815]
We propose an efficient approach, termed Zero-shot Emotion Style Transfer (ZEST).
The proposed system builds upon decomposing speech into semantic tokens, speaker representations and emotion embeddings.
We show that the proposed ZEST model achieves zero-shot emotion transfer even without parallel training data or labels from the source or target audio.
arXiv Detail & Related papers (2024-01-09T12:10:04Z)
- DeID-VC: Speaker De-identification via Zero-shot Pseudo Voice Conversion [0.0]
DeID-VC is a speaker de-identification system that converts a real speaker's voice into that of pseudo speakers.
With the help of PSG, DeID-VC can assign unique pseudo speakers at the speaker level or even at the utterance level.
arXiv Detail & Related papers (2022-09-09T21:13:08Z)
- Are disentangled representations all you need to build speaker anonymization systems? [0.0]
Speech signals contain a lot of sensitive information, such as the speaker's identity.
Speaker anonymization aims to transform a speech signal to remove the source speaker's identity while leaving the spoken content unchanged.
arXiv Detail & Related papers (2022-08-22T07:51:47Z)
- Face-Dubbing++: Lip-Synchronous, Voice Preserving Translation of Videos [54.08224321456871]
The system is designed to combine multiple component models and produce a video of the original speaker speaking in the target language.
The pipeline starts with automatic speech recognition including emphasis detection, followed by a translation model.
The resulting synthetic voice is then mapped back to the original speaker's voice using a voice conversion model.
arXiv Detail & Related papers (2022-06-09T14:15:37Z)
- Many-to-Many Voice Conversion based Feature Disentanglement using Variational Autoencoder [2.4975981795360847]
We propose a new method based on feature disentanglement to tackle many-to-many voice conversion.
The method has the capability to disentangle speaker identity and linguistic content from utterances.
It can convert from many source speakers to many target speakers with a single autoencoder network.
arXiv Detail & Related papers (2021-07-11T13:31:16Z)
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training; a minimal VQ sketch follows this entry.
Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
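The VQ content encoding mentioned in the VQMIVC entry above is, at its core, a nearest-neighbor lookup into a learned codebook. A minimal sketch follows, assuming frame-level features and a fixed codebook; the mutual-information loss used during training is omitted, and the names are illustrative rather than the VQMIVC implementation.

```python
import numpy as np

def vector_quantize(frames: np.ndarray, codebook: np.ndarray):
    """Map each frame-level content feature to its nearest codebook entry.

    frames:   (T, D) features from a content encoder.
    codebook: (K, D) learned code vectors.
    Returns the quantized features (T, D) and the code indices (T,).
    """
    # Squared Euclidean distance from every frame to every code vector.
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)
    return codebook[indices], indices
```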
- FragmentVC: Any-to-Any Voice Conversion by End-to-End Extracting and Fusing Fine-Grained Voice Fragments With Attention [66.77490220410249]
We propose FragmentVC, in which the latent phonetic structure of the utterance from the source speaker is obtained from Wav2Vec 2.0.
FragmentVC is able to extract fine-grained voice fragments from the target speaker utterance(s) and fuse them into the desired utterance; a minimal attention-fusion sketch follows this entry.
This approach is trained with reconstruction loss only without any disentanglement considerations between content and speaker information.
arXiv Detail & Related papers (2020-10-27T09:21:03Z)
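The fragment fusion described in the FragmentVC entry above is essentially cross-attention: source-side phonetic queries attend over target-speaker frames and take a weighted combination of them. A minimal sketch, with illustrative names and none of the model's learned projection layers:

```python
import numpy as np

def fuse_fragments(source_queries: np.ndarray,
                   target_frames: np.ndarray) -> np.ndarray:
    """Cross-attention fusion: each source query (e.g., a Wav2Vec 2.0
    phonetic feature) attends over target-speaker frames and returns a
    weighted blend of them.

    source_queries: (Tq, D), target_frames: (Tk, D) -> output (Tq, D)
    """
    d = source_queries.shape[-1]
    scores = source_queries @ target_frames.T / np.sqrt(d)
    # Softmax over target frames for each query.
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ target_frames
```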
- VQVC+: One-Shot Voice Conversion by Vector Quantization and U-Net architecture [71.45920122349628]
Auto-encoder-based VC methods disentangle the speaker and the content in input speech without being given the speaker's identity.
We use the U-Net architecture within an auto-encoder-based VC system to improve audio quality.
arXiv Detail & Related papers (2020-06-07T14:01:16Z)
- Improving speaker discrimination of target speech extraction with time-domain SpeakerBeam [100.95498268200777]
SpeakerBeam exploits an adaptation utterance of the target speaker to extract his/her voice characteristics.
SpeakerBeam sometimes fails when speakers have similar voice characteristics, such as in same-gender mixtures.
We show experimentally that the proposed strategies greatly improve speech extraction performance, especially for same-gender mixtures.
arXiv Detail & Related papers (2020-01-23T05:36:06Z)