Quantifying Source Speaker Leakage in One-to-One Voice Conversion
- URL: http://arxiv.org/abs/2504.15822v1
- Date: Tue, 22 Apr 2025 12:09:03 GMT
- Title: Quantifying Source Speaker Leakage in One-to-One Voice Conversion
- Authors: Scott Wellington, Xuechen Liu, Junichi Yamagishi
- Abstract summary: We show that it is possible to quantify a degree of confidence about a source speaker's identity in the case of one-to-one voice conversion.
- Score: 32.92816245915008
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Using a multi-accented corpus of parallel utterances for use with commercial speech devices, we present a case study to show that it is possible to quantify a degree of confidence about a source speaker's identity in the case of one-to-one voice conversion. Following voice conversion using a HiFi-GAN vocoder, we compare information leakage for a range of speaker characteristics; assuming a "worst-case" white-box scenario, we quantify our confidence to perform inference and narrow the pool of likely source speakers, reinforcing the regulatory obligation and moral duty that providers of synthetic voices have to ensure the privacy of their speakers' data.
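No code accompanies the abstract here, so the following is only a minimal sketch of the white-box inference idea it describes: given converted audio and enrollment recordings for a pool of candidate source speakers, score each candidate by speaker-embedding similarity and keep the top matches. The embedding extractor `embed_fn` is a hypothetical stand-in (e.g., any pretrained x-vector or ECAPA-style model), not the paper's actual tooling.

```python
import numpy as np

def narrow_source_pool(converted_wavs, enrollment, embed_fn, top_k=5):
    """Rank candidate source speakers by cosine similarity to converted audio.

    converted_wavs: waveforms produced by the one-to-one VC system.
    enrollment: dict mapping candidate speaker_id -> list of waveforms.
    embed_fn: hypothetical callable mapping a waveform to a 1-D embedding.
    """
    # Average the embeddings of the converted utterances into one probe.
    probe = np.mean([embed_fn(w) for w in converted_wavs], axis=0)
    probe /= np.linalg.norm(probe)

    scores = {}
    for spk, wavs in enrollment.items():
        ref = np.mean([embed_fn(w) for w in wavs], axis=0)
        ref /= np.linalg.norm(ref)
        scores[spk] = float(probe @ ref)  # cosine similarity in [-1, 1]

    # Higher similarity -> more likely to be the source speaker.
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k], scores
```

In a white-box setting the same scoring idea extends naturally, since the attacker can also pass each candidate's enrollment audio through the known conversion model before comparing.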
Related papers
- UniSpeaker: A Unified Approach for Multimodality-driven Speaker Generation [66.49076386263509]
This paper introduces UniSpeaker, a unified approach for multimodality-driven speaker generation.
We propose a unified voice aggregator based on KV-Former, applying soft contrastive loss to map diverse voice description modalities into a shared voice space.
UniSpeaker is evaluated across five tasks using the MVC benchmark, and the experimental results demonstrate that UniSpeaker outperforms previous modality-specific models.
arXiv Detail & Related papers (2025-01-11T00:47:29Z)
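As a rough illustration of mapping modality embeddings into a shared voice space, here is a plain symmetric InfoNCE-style contrastive loss in PyTorch; UniSpeaker's actual soft contrastive formulation and KV-Former aggregator are not reproduced here, so treat this as a generic sketch.

```python
import torch
import torch.nn.functional as F

def contrastive_voice_loss(desc_emb, voice_emb, temperature=0.07):
    """Align voice-description embeddings (any modality) with target voice
    embeddings; rows of the two tensors are assumed to be paired."""
    desc = F.normalize(desc_emb, dim=-1)
    voice = F.normalize(voice_emb, dim=-1)
    logits = desc @ voice.t() / temperature           # (batch, batch)
    targets = torch.arange(desc.size(0), device=desc.device)
    # Symmetric InfoNCE: description -> voice and voice -> description.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```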
- Accent conversion using discrete units with parallel data synthesized from controllable accented TTS [56.18382038512251]
The goal of accent conversion (AC) is to convert speech accents while preserving content and speaker identity.
Previous methods either required reference utterances during inference, did not preserve speaker identity well, or used one-to-one systems that could only be trained for each non-native accent.
This paper presents a promising AC model that can convert many accents into a native accent, overcoming these issues.
arXiv Detail & Related papers (2024-09-30T19:52:10Z)
- Who is Authentic Speaker [4.822108779108675]
Voice conversion can pose potential social issues when manipulated voices are employed for deceptive purposes.
Identifying the real speakers behind converted voices is a significant challenge, as the acoustic characteristics of the source speakers are greatly altered.
This study is conducted with the assumption that certain information from the source speakers persists, even when their voices undergo conversion into different target voices.
arXiv Detail & Related papers (2024-04-30T23:41:00Z)
- Catch You and I Can: Revealing Source Voiceprint Against Voice Conversion [0.0]
We make the first attempt to reliably restore the source voiceprint from audio synthesized by voice conversion methods.
We develop Revelio, a representation learning model, which learns to effectively extract the voiceprint of the source speaker from converted audio samples.
arXiv Detail & Related papers (2023-02-24T03:33:13Z)
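A minimal sketch of that idea, with an illustrative architecture that is not Revelio's: train an embedding network on converted utterances, supervised by the source speaker's identity, so that whatever source information survives conversion is concentrated in the embedding.

```python
import torch
import torch.nn as nn

class SourceVoiceprintNet(nn.Module):
    """Embedding network trained on *converted* audio with the source
    speaker's identity as the label (illustrative, not Revelio's design)."""
    def __init__(self, n_mels=80, emb_dim=192, n_source_speakers=100):
        super().__init__()
        self.encoder = nn.GRU(n_mels, 256, num_layers=2, batch_first=True)
        self.proj = nn.Linear(256, emb_dim)
        self.classifier = nn.Linear(emb_dim, n_source_speakers)

    def forward(self, mel):             # mel: (batch, frames, n_mels)
        h, _ = self.encoder(mel)
        emb = self.proj(h.mean(dim=1))  # temporal average pooling
        return emb, self.classifier(emb)

# Training with cross-entropy against source-speaker labels forces the
# embedding to retain whatever source information survives conversion.
```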
- Controllable speech synthesis by learning discrete phoneme-level prosodic representations [53.926969174260705]
We present a novel method for phoneme-level prosody control of F0 and duration using intuitive discrete labels.
We propose an unsupervised prosodic clustering process which is used to discretize phoneme-level F0 and duration features from a multispeaker speech dataset.
arXiv Detail & Related papers (2022-11-29T15:43:36Z)
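A minimal sketch of such unsupervised prosodic discretization, assuming per-phoneme F0 and duration values are already extracted; the paper's exact normalization and clustering setup may differ. Ordering clusters by their centers keeps the labels intuitive (low to high, short to long).

```python
import numpy as np
from sklearn.cluster import KMeans

def discretize_prosody(values, n_clusters=5, seed=0):
    """Cluster a 1-D per-phoneme feature (e.g., mean F0 or duration) into
    discrete labels, re-indexed so label 0 corresponds to the lowest cluster."""
    x = np.asarray(values, dtype=float).reshape(-1, 1)
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit(x)
    # Re-index clusters by ascending center so the labels are interpretable.
    order = np.argsort(km.cluster_centers_.ravel())
    remap = np.empty(n_clusters, dtype=int)
    remap[order] = np.arange(n_clusters)
    return remap[km.labels_]

# e.g., f0_labels = discretize_prosody(phoneme_mean_f0)
#       dur_labels = discretize_prosody(phoneme_durations)
```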
- A unified one-shot prosody and speaker conversion system with self-supervised discrete speech units [94.64927912924087]
Existing systems ignore the correlation between prosody and language content, leading to degradation of naturalness in converted speech.
We devise a cascaded modular system leveraging self-supervised discrete speech units as language representation.
Experiments show that our system outperforms previous approaches in naturalness, intelligibility, speaker transferability, and prosody transferability.
arXiv Detail & Related papers (2022-11-12T00:54:09Z)
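For the discrete-unit side, a common recipe, sketched here as an assumption rather than this paper's exact setup, is to k-means-quantize frame-level features from a pretrained self-supervised model (e.g., HuBERT hidden states):

```python
import numpy as np
from sklearn.cluster import KMeans

def learn_unit_codebook(ssl_frames, n_units=100, seed=0):
    """ssl_frames: (n_frames, dim) features from a pretrained SSL model,
    pooled over a training corpus; returns a fitted k-means codebook."""
    return KMeans(n_clusters=n_units, random_state=seed, n_init=10).fit(ssl_frames)

def encode_units(codebook, ssl_frames):
    # One discrete unit id per frame; consecutive duplicates are often
    # collapsed before feeding a downstream conversion/synthesis model.
    return codebook.predict(ssl_frames)
```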
- Self supervised learning for robust voice cloning [3.7989740031754806]
We use features learned in a self-supervised framework to produce high quality speech representations.
The learned features are used as pre-trained utterance-level embeddings and as inputs to a Non-Attentive Tacotron-based architecture.
This method enables us to train our model on an unlabeled multispeaker dataset, as well as to use unseen speaker embeddings to copy a speaker's voice.
arXiv Detail & Related papers (2022-04-07T13:05:24Z)
- Cross-speaker style transfer for text-to-speech using data augmentation [11.686745250628247]
We address the problem of cross-speaker style transfer for text-to-speech (TTS) using data augmentation via voice conversion.
We assume access to a corpus of neutral, non-expressive data from a target speaker, along with supporting expressive conversational data from different speakers.
We conclude by scaling our proposed technology to a set of 14 speakers across 7 languages.
arXiv Detail & Related papers (2022-02-10T15:10:56Z)
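In outline, the augmentation recipe could look like the sketch below; `voice_convert` and the utterance fields are hypothetical stand-ins, not the paper's actual interfaces.

```python
def build_augmented_corpus(target_neutral, support_expressive, voice_convert):
    """Pool the target speaker's neutral data with expressive utterances
    from support speakers, voice-converted into the target's identity."""
    augmented = []
    for utt in support_expressive:
        # Voice conversion swaps speaker identity but keeps the expressive
        # style, yielding "target speaker, expressive" training examples.
        converted_wav = voice_convert(utt["wav"], target="target_speaker")
        augmented.append({"wav": converted_wav,
                          "text": utt["text"],
                          "style": utt["style"]})
    neutral = [{"wav": u["wav"], "text": u["text"], "style": "neutral"}
               for u in target_neutral]
    return neutral + augmented  # training set for an expressive target-voice TTS
```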
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results demonstrate the superiority of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
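Below is a minimal PyTorch sketch of the VQ-VAE-style quantizer typically used for such content encoding, with straight-through gradients; the mutual-information minimization that VQMIVC adds between content, speaker, and pitch representations is omitted here.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """VQ-VAE-style quantizer: snap each frame's content vector to its
    nearest codebook entry, with a straight-through gradient estimator."""
    def __init__(self, n_codes=512, dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(n_codes, dim)
        nn.init.uniform_(self.codebook.weight, -1.0 / n_codes, 1.0 / n_codes)
        self.beta = beta  # commitment-loss weight

    def forward(self, z):                        # z: (batch, frames, dim)
        flat = z.reshape(-1, z.size(-1))         # (batch*frames, dim)
        dist = torch.cdist(flat, self.codebook.weight)
        idx = dist.argmin(dim=-1).view(z.shape[:-1])  # nearest code per frame
        zq = self.codebook(idx)                  # quantized content features
        # Codebook loss pulls codes toward encoder outputs; commitment loss
        # keeps encoder outputs near their assigned codes.
        loss = ((zq - z.detach()) ** 2).mean() \
             + self.beta * ((z - zq.detach()) ** 2).mean()
        zq = z + (zq - z).detach()               # straight-through estimator
        return zq, idx, loss
```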
- Joint Speaker Counting, Speech Recognition, and Speaker Identification for Overlapped Speech of Any Number of Speakers [38.3469744871394]
We propose an end-to-end speaker-attributed automatic speech recognition model.
It unifies speaker counting, speech recognition, and speaker identification on overlapped speech.
arXiv Detail & Related papers (2020-06-19T02:05:18Z)
- Speech Enhancement using Self-Adaptation and Multi-Head Self-Attention [70.82604384963679]
This paper investigates a self-adaptation method for speech enhancement using auxiliary speaker-aware features.
We extract a speaker representation used for adaptation directly from the test utterance.
arXiv Detail & Related papers (2020-02-14T05:05:36Z)
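One way to realize such self-adaptation, sketched under assumptions rather than taken from the paper, is to extract a speaker embedding from the test utterance itself and inject it into the enhancement network with feature-wise scale-and-shift (FiLM-style) conditioning:

```python
import torch
import torch.nn as nn

class SpeakerConditionedLayer(nn.Module):
    """FiLM-style conditioning: scale and shift hidden features using a
    speaker embedding extracted from the (noisy) test utterance itself."""
    def __init__(self, hidden_dim=256, spk_dim=128):
        super().__init__()
        self.to_scale = nn.Linear(spk_dim, hidden_dim)
        self.to_shift = nn.Linear(spk_dim, hidden_dim)

    def forward(self, h, spk_emb):   # h: (B, T, hidden), spk_emb: (B, spk_dim)
        scale = self.to_scale(spk_emb).unsqueeze(1)  # broadcast over time
        shift = self.to_shift(spk_emb).unsqueeze(1)
        return h * (1.0 + scale) + shift
```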
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.