MaskVCT: Masked Voice Codec Transformer for Zero-Shot Voice Conversion With Increased Controllability via Multiple Guidances
- URL: http://arxiv.org/abs/2509.17143v1
- Date: Sun, 21 Sep 2025 16:14:51 GMT
- Title: MaskVCT: Masked Voice Codec Transformer for Zero-Shot Voice Conversion With Increased Controllability via Multiple Guidances
- Authors: Junhyeok Lee, Helin Wang, Yaohan Guan, Thomas Thebaud, Laureano Moro-Velazquez, Jesús Villalba, Najim Dehak
- Abstract summary: MaskVCT is a zero-shot voice conversion model that offers multi-factor controllability. The model can leverage continuous or quantized linguistic features to enhance intelligibility and speaker similarity.
- Score: 30.21283213138901
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce MaskVCT, a zero-shot voice conversion (VC) model that offers multi-factor controllability through multiple classifier-free guidances (CFGs). While previous VC models rely on a fixed conditioning scheme, MaskVCT integrates diverse conditions in a single model. To further enhance robustness and control, the model can leverage continuous or quantized linguistic features to improve intelligibility and speaker similarity, and can use or omit the pitch contour to control prosody. These choices allow users to seamlessly balance speaker identity, linguistic content, and prosodic factors in a zero-shot VC setting. Extensive experiments demonstrate that MaskVCT achieves the best target speaker and accent similarities while obtaining competitive word and character error rates compared to existing baselines. Audio samples are available at https://maskvct.github.io/.
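Mechanically, the multiple-CFG control described in the abstract can be pictured as one unconditional forward pass blended with one guided pass per conditioning factor. The sketch below illustrates that blending step under an assumed `model(tokens, cond=...)` interface and assumed factor names (speaker, linguistic, pitch); it is not MaskVCT's published API.

```python
def multi_cfg(model, masked_tokens, conds, weights):
    """Blend several classifier-free guidance (CFG) terms into one prediction.

    Hypothetical interface for illustration: `model(tokens, cond=...)` returns
    logits, `conds` maps factor names to conditioning inputs, and `weights`
    holds one guidance scale per factor.
    """
    uncond = model(masked_tokens, cond=None)  # fully unconditional pass
    guided = uncond
    for name, weight in weights.items():
        cond_out = model(masked_tokens, cond={name: conds[name]})
        # Each scale pushes the prediction toward one factor; a weight of 0
        # effectively omits that condition (e.g. dropping the pitch contour).
        guided = guided + weight * (cond_out - uncond)
    return guided
```

Raising one factor's scale (e.g. the speaker condition) trades off against the others, which is how a single model can rebalance identity, content, and prosody at inference time.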
Related papers
- Discl-VC: Disentangled Discrete Tokens and In-Context Learning for Controllable Zero-Shot Voice Conversion [16.19865417052239]
Discl-VC is a novel zero-shot voice conversion framework.
It disentangles content and prosody information from self-supervised speech representations.
It synthesizes the target speaker's voice through in-context learning.
arXiv Detail & Related papers (2025-05-30T07:04:23Z)
- Zero-Shot Voice Conversion via Content-Aware Timbre Ensemble and Conditional Flow Matching [7.151257248661491]
CTEFM-VC is a framework that integrates content-aware timbre ensemble modeling with conditional flow matching; a generic sketch of the flow-matching objective follows this entry.
Experiments show that CTEFM-VC consistently achieves the best performance on all metrics assessing speaker similarity, speech naturalness, and intelligibility.
arXiv Detail & Related papers (2024-11-04T12:23:17Z)
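For the CTEFM-VC entry above, conditional flow matching trains a network to predict the velocity of a straight path from noise to clean features. Below is a minimal sketch of the generic objective, assuming a `velocity_net(x, t, cond)` signature; it is not the paper's actual code.

```python
import torch
import torch.nn.functional as F

def cfm_training_step(velocity_net, mel, cond):
    """Generic conditional flow matching loss (illustrative only).

    mel:  clean target features, shape (batch, frames, dims)
    cond: conditioning inputs (e.g. content features plus a timbre embedding)
    """
    x0 = torch.randn_like(mel)                 # noise endpoint of the path
    t = torch.rand(mel.size(0), 1, 1)          # random time in [0, 1] per sample
    xt = (1 - t) * x0 + t * mel                # point on the straight path
    target_velocity = mel - x0                 # constant velocity along the path
    pred = velocity_net(xt, t.view(-1), cond)  # predict velocity at (xt, t)
    return F.mse_loss(pred, target_velocity)
```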
- ControlSpeech: Towards Simultaneous and Independent Zero-shot Speaker Cloning and Zero-shot Language Style Control [50.27383290553548]
ControlSpeech is a text-to-speech (TTS) system capable of fully cloning the speaker's voice and enabling arbitrary control and adjustment of speaking style.
We show that ControlSpeech exhibits comparable or state-of-the-art (SOTA) performance in terms of controllability, timbre similarity, audio quality, robustness, and generalizability.
arXiv Detail & Related papers (2024-06-03T11:15:16Z)
- VCVTS: Multi-speaker Video-to-Speech synthesis via cross-modal knowledge transfer from voice conversion [77.50171525265056]
This paper proposes a novel multi-speaker Video-to-Speech (VTS) system based on cross-modal knowledge transfer from voice conversion (VC).
The Lip2Ind network can replace the content encoder of the VC model to form a multi-speaker VTS system that converts silent video to acoustic units for reconstructing accurate spoken content.
arXiv Detail & Related papers (2022-02-18T08:58:45Z)
- StarGANv2-VC: A Diverse, Unsupervised, Non-parallel Framework for Natural-Sounding Voice Conversion [19.74933410443264]
We present an unsupervised many-to-many voice conversion (VC) method using a generative adversarial network (GAN) called StarGAN v2.
Our model is trained on only 20 English speakers.
It generalizes to a variety of voice conversion tasks, such as any-to-many, cross-lingual, and singing conversion.
arXiv Detail & Related papers (2021-07-21T23:44:17Z)
- Voicy: Zero-Shot Non-Parallel Voice Conversion in Noisy Reverberant Environments [76.98764900754111]
Voice Conversion (VC) is a technique that aims to transform the non-linguistic information of a source utterance to change the perceived identity of the speaker.
We propose Voicy, a new VC framework particularly tailored for noisy speech.
Our method, inspired by the denoising autoencoder framework, comprises four encoders (speaker, content, phonetic, and acoustic-ASR) and one decoder.
arXiv Detail & Related papers (2021-06-16T15:47:06Z)
- DiffSVC: A Diffusion Probabilistic Model for Singing Voice Conversion [51.83469048737548]
We propose DiffSVC, an SVC system based on a denoising diffusion probabilistic model.
DiffSVC trains a denoising module that takes the corrupted mel spectrogram and its corresponding step information as input and predicts the added Gaussian noise; a generic sketch of this objective follows the entry.
Experiments show that DiffSVC achieves conversion performance superior to current state-of-the-art SVC approaches in terms of naturalness and voice similarity.
arXiv Detail & Related papers (2021-05-28T14:26:40Z)
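The DiffSVC entry above describes the standard DDPM training objective: corrupt a clean mel spectrogram at a random diffusion step and train the network to recover the injected Gaussian noise. A minimal sketch under that reading, assuming a `denoiser(noisy_mel, t)` signature rather than DiffSVC's exact interface:

```python
import torch
import torch.nn.functional as F

def ddpm_training_step(denoiser, mel, alphas_cumprod):
    """DDPM-style noise-prediction loss (illustrative only)."""
    batch = mel.size(0)
    t = torch.randint(0, alphas_cumprod.size(0), (batch,))    # random step per sample
    a_bar = alphas_cumprod[t].view(batch, 1, 1)               # cumulative alpha at t
    eps = torch.randn_like(mel)                               # Gaussian noise to add
    noisy = a_bar.sqrt() * mel + (1 - a_bar).sqrt() * eps     # corrupted spectrogram
    eps_hat = denoiser(noisy, t)                              # predict the added noise
    return F.mse_loss(eps_hat, eps)
```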
- FastVC: Fast Voice Conversion with non-parallel data [13.12834490248018]
This paper introduces FastVC, an end-to-end model for fast Voice Conversion (VC).
FastVC is based on a conditional AutoEncoder (AE) trained on non-parallel data and requires no annotations at all.
Despite its simple structure, the proposed model outperforms the VC Challenge 2020 baselines on the cross-lingual task in terms of naturalness.
arXiv Detail & Related papers (2020-10-08T18:05:30Z)
- Many-to-Many Voice Transformer Network [55.17770019619078]
This paper proposes a voice conversion (VC) method based on a sequence-to-sequence (S2S) learning framework.
It enables simultaneous conversion of the voice characteristics, pitch contour, and duration of input speech.
arXiv Detail & Related papers (2020-05-18T04:02:08Z)
- F0-consistent many-to-many non-parallel voice conversion via conditional autoencoder [53.901873501494606]
We modify and improve autoencoder-based voice conversion to disentangle content, F0, and speaker identity simultaneously.
We can control the F0 contour, generate speech with an F0 consistent with the target speaker, and significantly improve quality and similarity; a sketch of this F0 conditioning follows the entry.
arXiv Detail & Related papers (2020-04-15T22:00:06Z)
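For the entry above, the key mechanism is feeding an explicit F0 contour to the decoder alongside content and speaker codes, so pitch can be controlled directly or matched to the target speaker. A minimal sketch of such conditioning, with hypothetical module names and layout rather than the paper's actual architecture:

```python
import torch
import torch.nn as nn

class F0ConditionedDecoder(nn.Module):
    """Decoder conditioned on content, speaker, and F0 (illustrative only)."""

    def __init__(self, content_dim, speaker_dim, hidden_dim, mel_dim):
        super().__init__()
        # The +1 input channel carries the frame-level F0 contour.
        self.rnn = nn.GRU(content_dim + speaker_dim + 1, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, mel_dim)

    def forward(self, content, speaker_emb, f0):
        # content: (batch, frames, content_dim); speaker_emb: (batch, speaker_dim);
        # f0: (batch, frames), a normalized pitch contour.
        frames = content.size(1)
        spk = speaker_emb.unsqueeze(1).expand(-1, frames, -1)  # repeat per frame
        x = torch.cat([content, spk, f0.unsqueeze(-1)], dim=-1)
        h, _ = self.rnn(x)
        return self.proj(h)  # mel frames whose pitch follows the given contour
```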