Towards Better Disentanglement in Non-Autoregressive Zero-Shot Expressive Voice Conversion
- URL: http://arxiv.org/abs/2506.04013v1
- Date: Wed, 04 Jun 2025 14:42:12 GMT
- Title: Towards Better Disentanglement in Non-Autoregressive Zero-Shot Expressive Voice Conversion
- Authors: Seymanur Akti, Tuan Nam Nguyen, Alexander Waibel
- Abstract summary: Expressive voice conversion aims to transfer both speaker identity and expressive attributes from a target speech to a given source speech. In this work, we improve over a self-supervised, non-autoregressive framework with a conditional variational autoencoder.
- Score: 53.26424100244925
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Expressive voice conversion aims to transfer both speaker identity and expressive attributes from a target speech to a given source speech. In this work, we improve over a self-supervised, non-autoregressive framework with a conditional variational autoencoder, focusing on reducing source timbre leakage and improving linguistic-acoustic disentanglement for better style transfer. To minimize style leakage, we use multilingual discrete speech units for content representation and reinforce embeddings with augmentation-based similarity loss and mix-style layer normalization. To enhance expressivity transfer, we incorporate local F0 information via cross-attention and extract style embeddings enriched with global pitch and energy features. Experiments show our model outperforms baselines in emotion and speaker similarity, demonstrating superior style adaptation and reduced source style leakage.
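The abstract names mix-style layer normalization as one of the mechanisms for suppressing source-timbre leakage in the content stream, but does not spell out its form here. The PyTorch sketch below is a minimal illustration of one common MixStyle-type variant, in which per-utterance channel statistics are interpolated across the batch on top of a layer-normalized content representation; the module name, tensor shapes, and the hyperparameters `alpha` and `p` are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn


class MixStyleLayerNorm(nn.Module):
    """Illustrative mix-style layer normalization for content embeddings.

    Content features of shape (batch, time, dim) are layer-normalized, then
    re-scaled and re-shifted with channel statistics randomly interpolated
    between the current utterance and another utterance in the batch. Mixing
    the statistics discourages the content stream from carrying residual
    speaker/style information.
    """

    def __init__(self, dim: int, alpha: float = 0.1, p: float = 0.5):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.beta = torch.distributions.Beta(alpha, alpha)
        self.p = p  # probability of mixing statistics during training

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if not self.training or torch.rand(1).item() > self.p:
            return self.norm(x)

        # Per-utterance, per-channel statistics over the time axis.
        mu = x.mean(dim=1, keepdim=True)              # (B, 1, D)
        sigma = x.std(dim=1, keepdim=True) + 1e-6     # (B, 1, D)

        # Pair each utterance with a random other utterance in the batch.
        perm = torch.randperm(x.size(0), device=x.device)
        lam = self.beta.sample((x.size(0), 1, 1)).to(x.device)

        mix_mu = lam * mu + (1.0 - lam) * mu[perm]
        mix_sigma = lam * sigma + (1.0 - lam) * sigma[perm]

        return self.norm(x) * mix_sigma + mix_mu
```

Because the scale and shift seen by the decoder no longer correspond to a single speaker, the model is pushed to recover style from the explicit style embedding (and, per the abstract, from the F0 cross-attention path) rather than from statistics leaking through the content features.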
Related papers
- Text2Lip: Progressive Lip-Synced Talking Face Generation from Text via Viseme-Guided Rendering [53.2204901422631]
Text2Lip is a viseme-centric framework that constructs an interpretable phonetic-visual bridge. We show that Text2Lip outperforms existing approaches in semantic fidelity, visual realism, and modality robustness.
arXiv Detail & Related papers (2025-08-04T12:50:22Z)
- ReverBERT: A State Space Model for Efficient Text-Driven Speech Style Transfer [0.0]
We present ReverBERT, an efficient framework for text-driven speech style transfer. Unlike image-domain techniques, our method operates in the speech space and integrates a discrete Fourier transform of latent speech features. Experiments on benchmark speech corpora demonstrate that ReverBERT significantly outperforms baselines in terms of naturalness, expressiveness, and computational efficiency.
arXiv Detail & Related papers (2025-03-26T21:11:17Z)
- AdaptVC: High Quality Voice Conversion with Adaptive Learning [28.25726543043742]
The key challenge is to extract disentangled linguistic content from the source and voice style from the reference. In this paper, we achieve successful disentanglement of content and speaker features by tuning self-supervised speech features with adapters. The adapters are trained to dynamically encode nuanced features from rich self-supervised features, and the decoder fuses them to produce speech that accurately resembles the reference.
arXiv Detail & Related papers (2025-01-02T16:54:08Z)
- Zero-shot Voice Conversion with Diffusion Transformers [0.0]
Zero-shot voice conversion aims to transform a source speech utterance to match the timbre of a reference speech from an unseen speaker.
Traditional approaches struggle with timbre leakage, insufficient timbre representation, and mismatches between training and inference tasks.
We propose Seed-VC, a novel framework that addresses these issues by introducing an external timbre shifter during training.
arXiv Detail & Related papers (2024-11-15T04:43:44Z)
- Takin-VC: Expressive Zero-Shot Voice Conversion via Adaptive Hybrid Content Encoding and Enhanced Timbre Modeling [14.98368067290024]
Takin-VC is a novel expressive zero-shot voice conversion framework. We introduce an innovative hybrid content encoder that incorporates an adaptive fusion module. For timbre modeling, we propose advanced memory-augmented and context-aware modules.
arXiv Detail & Related papers (2024-10-02T09:07:33Z)
- Speech-to-Speech Translation with Discrete-Unit-Based Style Transfer [53.72998363956454]
Direct speech-to-speech translation (S2ST) with discrete self-supervised representations has achieved remarkable accuracy.
The scarcity of high-quality speaker-parallel data poses a challenge for learning style transfer during translation.
We design an S2ST pipeline with style-transfer capability on the basis of discrete self-supervised speech representations and timbre units.
arXiv Detail & Related papers (2023-09-14T09:52:08Z)
- Improving Prosody for Cross-Speaker Style Transfer by Semi-Supervised Style Extractor and Hierarchical Modeling in Speech Synthesis [37.65745551401636]
Cross-speaker style transfer in speech synthesis aims to transfer a style from a source speaker to speech synthesized in a target speaker's timbre.
In most previous methods, the synthesized fine-grained prosody features often represent the source speaker's average style.
A strength-controlled semi-supervised style extractor is proposed to disentangle the style from content and timbre.
arXiv Detail & Related papers (2023-03-14T08:52:58Z)
- GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech Synthesis [68.42632589736881]
This paper proposes GenerSpeech, a text-to-speech model towards high-fidelity zero-shot style transfer of OOD custom voice.
GenerSpeech decomposes the speech variation into the style-agnostic and style-specific parts by introducing two components.
Our evaluations on zero-shot style transfer demonstrate that GenerSpeech surpasses the state-of-the-art models in terms of audio quality and style similarity.
arXiv Detail & Related papers (2022-05-15T08:16:02Z)
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
- High Fidelity Speech Regeneration with Application to Speech Enhancement [96.34618212590301]
We propose a wav-to-wav generative model for speech that can generate 24 kHz speech in real time.
Inspired by voice conversion methods, we train the model to augment the speech characteristics while preserving the identity of the source.
arXiv Detail & Related papers (2021-01-31T10:54:27Z)