Diff-HierVC: Diffusion-based Hierarchical Voice Conversion with Robust
Pitch Generation and Masked Prior for Zero-shot Speaker Adaptation
- URL: http://arxiv.org/abs/2311.04693v1
- Date: Wed, 8 Nov 2023 14:02:53 GMT
- Authors: Ha-Yeong Choi, Sang-Hoon Lee, Seong-Whan Lee
- Abstract summary: We introduce Diff-HierVC, a hierarchical VC system based on two diffusion models.
Our model achieves a CER of 0.83% and EER of 3.29% in zero-shot VC scenarios.
- Score: 41.98697872087318
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Although voice conversion (VC) systems have shown a remarkable ability to
transfer voice style, existing methods still suffer from inaccurate pitch and low
speaker adaptation quality. To address these challenges, we introduce
Diff-HierVC, a hierarchical VC system based on two diffusion models. We first
introduce DiffPitch, which can effectively generate F0 with the target voice
style. Subsequently, the generated F0 is fed to DiffVoice to convert the speech
with a target voice style. Furthermore, using the source-filter encoder, we
disentangle the speech and use the converted Mel-spectrogram as a data-driven
prior in DiffVoice to improve the voice style transfer capacity. Finally, by
using the masked prior in diffusion models, our model can improve the speaker
adaptation quality. Experimental results verify the superiority of our model in
pitch generation and voice style transfer performance, and our model also
achieves a CER of 0.83% and EER of 3.29% in zero-shot VC scenarios.
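The abstract describes a two-stage pipeline: DiffPitch first generates an F0 contour in the target voice style, and DiffVoice then converts the speech conditioned on that F0, with a source-filter encoder providing disentangled features and a masked, data-driven mel prior improving speaker adaptation. A minimal sketch of that data flow, with hypothetical stub functions standing in for the trained diffusion models (none of the function names, shapes, or constants below come from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def source_filter_encode(mel):
    """Stub source-filter encoder: split speech into content and pitch cues."""
    content = mel.mean(axis=0, keepdims=True)   # fake linguistic content
    f0_cues = mel.max(axis=0)                   # fake pitch-related cues
    return content, f0_cues

def diff_pitch(f0_cues, target_style):
    """Stub DiffPitch: generate an F0 contour in the target voice style."""
    return f0_cues * target_style["pitch_scale"]

def diff_voice(content, f0, prior_mel, mask_ratio=0.3):
    """Stub DiffVoice: start from a masked data-driven mel prior."""
    prior = prior_mel.copy()
    n_mask = int(mask_ratio * prior.shape[1])
    prior[:, :n_mask] = 0.0                     # masked prior
    return prior + 0.1 * content + 0.01 * f0    # fake conversion step

src_mel = rng.standard_normal((80, 120))        # 80-bin mel, 120 frames
style = {"pitch_scale": 1.2}                    # illustrative target style

content, f0_cues = source_filter_encode(src_mel)
f0 = diff_pitch(f0_cues, style)                 # stage 1: pitch generation
converted = diff_voice(content, f0, src_mel)    # stage 2: voice conversion
```

The point of the sketch is only the hierarchical ordering: pitch is generated first and then consumed, together with the masked prior, by the voice-conversion stage.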
Related papers
- Taming Data and Transformers for Audio Generation [49.54707963286065]
AutoCap is a high-quality and efficient automatic audio captioning model.
GenAu is a scalable transformer-based audio generation architecture.
We compile 57M ambient audio clips, forming AutoReCap-XL, the largest available audio-text dataset.
arXiv Detail & Related papers (2024-06-27T17:58:54Z)
- Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching [51.70360630470263]
Video-to-audio (V2A) generation aims to synthesize content-matching audio from silent video.
We propose Frieren, a V2A model based on rectified flow matching.
Experiments indicate that Frieren achieves state-of-the-art performance in both generation quality and temporal alignment.
arXiv Detail & Related papers (2024-06-01T06:40:22Z)
- Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness.
We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets.
Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching WER of 0.8%.
arXiv Detail & Related papers (2024-03-14T01:16:32Z)
- HierVST: Hierarchical Adaptive Zero-shot Voice Style Transfer [25.966328901566815]
We present HierVST, a hierarchical adaptive end-to-end zero-shot VST model.
The model is trained on the speech dataset alone, without any text transcripts.
With a hierarchical adaptive structure, the model can adapt to a novel voice style and convert speech progressively.
arXiv Detail & Related papers (2023-07-30T08:49:55Z)
- DisC-VC: Disentangled and F0-Controllable Neural Voice Conversion [17.83563578034567]
We propose a new variational-autoencoder-based voice conversion model accompanied by an auxiliary network.
We show the effectiveness of the proposed method by objective and subjective evaluations.
arXiv Detail & Related papers (2022-10-20T07:30:07Z)
- StarGAN-ZSVC: Towards Zero-Shot Voice Conversion in Low-Resource Contexts [32.170748231414365]
To be useful in a wider range of contexts, voice conversion systems need to be trainable without access to parallel data.
This paper extends recent voice conversion models based on generative adversarial networks (GANs).
We show that real-time zero-shot voice conversion is possible even for a model trained on very little data.
arXiv Detail & Related papers (2021-05-31T18:21:28Z)
- DiffSVC: A Diffusion Probabilistic Model for Singing Voice Conversion [51.83469048737548]
We propose DiffSVC, an SVC system based on denoising diffusion probabilistic model.
A denoising module is trained in DiffSVC, which takes a noise-corrupted mel spectrogram and its corresponding step information as input and predicts the added Gaussian noise.
Experiments show that DiffSVC can achieve superior conversion performance in terms of naturalness and voice similarity to current state-of-the-art SVC approaches.
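The DiffSVC summary describes the standard denoising-diffusion training step: corrupt the mel spectrogram at a random diffusion step and train a network to regress the added Gaussian noise. A minimal NumPy illustration of that objective (the toy beta schedule and the zero "network" are my own stand-ins, not the paper's):

```python
import numpy as np

def q_sample(x0, t, alpha_bar, rng):
    """Forward diffusion: corrupt x0 with Gaussian noise at step t."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

# Toy linear beta schedule (illustrative, not the paper's schedule).
T = 100
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
mel = rng.standard_normal((80, 50))   # stand-in 80-bin mel spectrogram
xt, eps = q_sample(mel, t=50, alpha_bar=alpha_bar, rng=rng)

# Training target: the denoising network eps_hat(xt, t) regresses eps.
eps_hat = np.zeros_like(eps)          # zero stand-in for the network
loss = np.mean((eps_hat - eps) ** 2)  # MSE on the added noise
```

At inference, the trained network is applied step by step to recover a clean mel spectrogram from noise, conditioned on content and target-speaker features.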
arXiv Detail & Related papers (2021-05-28T14:26:40Z)
- F0-consistent many-to-many non-parallel voice conversion via conditional autoencoder [53.901873501494606]
We modified and improved autoencoder-based voice conversion to disentangle content, F0, and speaker identity at the same time.
We can control the F0 contour, generate speech with F0 consistent with the target speaker, and significantly improve quality and similarity.
arXiv Detail & Related papers (2020-04-15T22:00:06Z)
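The last entry describes a conditional autoencoder that disentangles content from F0 and speaker identity: the decoder receives the F0 contour and a speaker embedding as explicit conditions, so swapping them at inference controls the output pitch and voice. A toy linear sketch of that conditioning pattern (all dimensions and weights are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
D_MEL, D_Z, D_SPK, T = 80, 16, 8, 50

W_enc = rng.standard_normal((D_Z, D_MEL)) * 0.1               # content encoder
W_dec = rng.standard_normal((D_MEL, D_Z + 1 + D_SPK)) * 0.1   # decoder

def encode(mel):
    """Bottleneck encoder: keeps content, squeezes out pitch/speaker info."""
    return W_enc @ mel                                   # (D_Z, T)

def decode(z, f0, spk):
    """Decoder conditioned on an F0 contour and a speaker embedding."""
    spk_t = np.repeat(spk[:, None], z.shape[1], axis=1)  # broadcast speaker
    cond = np.concatenate([z, f0[None, :], spk_t], axis=0)
    return W_dec @ cond                                  # (D_MEL, T)

mel = rng.standard_normal((D_MEL, T))
z = encode(mel)

src_f0 = rng.uniform(100, 200, size=T)   # Hz, source pitch contour
tgt_f0 = src_f0 * 1.3                    # shifted contour for the target
tgt_spk = rng.standard_normal(D_SPK)     # target speaker embedding

converted = decode(z, tgt_f0, tgt_spk)   # same content, new F0 and speaker
```

Because F0 enters only through the decoder conditions, the contour can be edited or replaced independently of content, which is what enables the F0-consistent conversion the summary claims.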
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.