Related papers: Zero-shot Voice Conversion with Diffusion Transformers

URL: http://arxiv.org/abs/2411.09943v1
Date: Fri, 15 Nov 2024 04:43:44 GMT
Title: Zero-shot Voice Conversion with Diffusion Transformers
Authors: Songting Liu
Abstract summary: Zero-shot voice conversion aims to transform a source speech utterance to match the timbre of a reference speech from an unseen speaker. Traditional approaches struggle with timbre leakage, insufficient timbre representation, and mismatches between training and inference tasks. We propose Seed-VC, a novel framework that addresses these issues by introducing an external timbre shifter during training.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Zero-shot voice conversion aims to transform a source speech utterance to match the timbre of a reference speech from an unseen speaker. Traditional approaches struggle with timbre leakage, insufficient timbre representation, and mismatches between training and inference tasks. We propose Seed-VC, a novel framework that addresses these issues by introducing an external timbre shifter during training to perturb the source speech timbre, mitigating leakage and aligning training with inference. Additionally, we employ a diffusion transformer that leverages the entire reference speech context, capturing fine-grained timbre features through in-context learning. Experiments demonstrate that Seed-VC outperforms strong baselines like OpenVoice and CosyVoice, achieving higher speaker similarity and lower word error rates in zero-shot voice conversion tasks. We further extend our approach to zero-shot singing voice conversion by incorporating fundamental frequency (F0) conditioning, resulting in performance comparable to current state-of-the-art methods. Our findings highlight the effectiveness of Seed-VC in overcoming core challenges, paving the way for more accurate and versatile voice conversion systems.

Related papers

Towards Better Disentanglement in Non-Autoregressive Zero-Shot Expressive Voice Conversion [53.26424100244925]
Expressive voice conversion aims to transfer both speaker identity and expressive attributes from a target speech to a given source speech. In this work, we improve over a self-supervised, non-autoregressive framework with a conditional variational autoencoder.
arXiv Detail & Related papers (2025-06-04T14:42:12Z)
Discl-VC: Disentangled Discrete Tokens and In-Context Learning for Controllable Zero-Shot Voice Conversion [16.19865417052239]
Discl-VC is a novel zero-shot voice conversion framework. It disentangles content and prosody information from self-supervised speech representations. It synthesizes the target speaker's voice through in-context learning.
arXiv Detail & Related papers (2025-05-30T07:04:23Z)
VoicePrompter: Robust Zero-Shot Voice Conversion with Voice Prompt and Conditional Flow Matching [0.7366405857677227]
VoicePrompter is a robust zero-shot voice conversion model that leverages in-context learning with voice prompts. We show that VoicePrompter outperforms existing zero-shot VC systems in terms of speaker similarity, speech intelligibility, and audio quality.
arXiv Detail & Related papers (2025-01-29T12:34:58Z)
SKQVC: One-Shot Voice Conversion by K-Means Quantization with Self-Supervised Speech Representations [12.423959479216895]
One-shot voice conversion (VC) is a method that enables the transformation between any two speakers using only a single target speaker utterance. Recent works utilizing K-means quantization (KQ) with self-supervised learning (SSL) features have proven capable of capturing content information from speech. We propose a simple yet effective one-shot VC model that utilizes the characteristics of SSL features and speech attributes.
arXiv Detail & Related papers (2024-11-25T07:14:26Z)
Accent conversion using discrete units with parallel data synthesized from controllable accented TTS [56.18382038512251]
The goal of accent conversion (AC) is to convert speech accents while preserving content and speaker identity. Previous methods either required reference utterances during inference, did not preserve speaker identity well, or used one-to-one systems that could only be trained for each non-native accent. This paper presents a promising AC model that can convert many accents into native to overcome these issues.
arXiv Detail & Related papers (2024-09-30T19:52:10Z)
SEF-VC: Speaker Embedding Free Zero-Shot Voice Conversion with Cross Attention [24.842378497026154]
SEF-VC is a speaker embedding free voice conversion model. It learns and incorporates speaker timbre from reference speech via a powerful position-agnostic cross-attention mechanism. It reconstructs waveform from HuBERT semantic tokens in a non-autoregressive manner.
arXiv Detail & Related papers (2023-12-14T06:26:55Z)
Speech Representation Disentanglement with Adversarial Mutual Information Learning for One-shot Voice Conversion [42.43123253495082]
One-shot voice conversion (VC) with only a single target speaker's speech for reference has become a hot research topic. We employ random resampling for pitch and content encoder and use the variational contrastive log-ratio upper bound of mutual information to disentangle speech components. Experiments on the VCTK dataset show the model achieves state-of-the-art performance for one-shot VC in terms of naturalness and intelligibility.
arXiv Detail & Related papers (2022-08-18T10:36:27Z)
Robust Disentangled Variational Speech Representation Learning for Zero-shot Voice Conversion [34.139871476234205]
We investigate zero-shot voice conversion from a novel perspective of self-supervised disentangled speech representation learning. Zero-shot voice conversion is performed by feeding an arbitrary speaker embedding and content embeddings to a sequential variational autoencoder (VAE) decoder. On the TIMIT and VCTK datasets, we achieve state-of-the-art performance on both objective evaluation, i.e., speaker verification (SV) on speaker embedding and content embedding, and subjective evaluation, i.e., voice naturalness and similarity, and the model remains robust even with noisy source/target utterances.
arXiv Detail & Related papers (2022-03-30T23:03:19Z)
Toward Degradation-Robust Voice Conversion [94.60503904292916]
Any-to-any voice conversion technologies convert the vocal timbre of an utterance to that of any speaker, even one unseen during training. It is difficult to collect clean utterances of a speaker, and they are usually degraded by noise or reverberation. We report in this paper the first comprehensive study on the degradation robustness of any-to-any voice conversion.
arXiv Detail & Related papers (2021-10-14T17:00:34Z)
Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker. We generate the mel-spectrogram of the edited speech with a transformer-based decoder. It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z)
VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement. We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training. Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
Voicy: Zero-Shot Non-Parallel Voice Conversion in Noisy Reverberant Environments [76.98764900754111]
Voice Conversion (VC) is a technique that aims to transform the non-linguistic information of a source utterance to change the perceived identity of the speaker. We propose Voicy, a new VC framework particularly tailored for noisy speech. Our method, which is inspired by the de-noising auto-encoders framework, is comprised of four encoders (speaker, content, phonetic and acoustic-ASR) and one decoder.
arXiv Detail & Related papers (2021-06-16T15:47:06Z)
Learning Explicit Prosody Models and Deep Speaker Embeddings for Atypical Voice Conversion [60.808838088376675]
We propose a VC system with explicit prosodic modelling and deep speaker embedding learning. A prosody corrector takes in phoneme embeddings to infer typical phoneme duration and pitch values. A conversion model takes phoneme embeddings and typical prosody features as inputs to generate the converted speech.
arXiv Detail & Related papers (2020-11-03T13:08:53Z)

This list is automatically generated from the titles and abstracts of the papers on this site.