Towards High-fidelity Singing Voice Conversion with Acoustic Reference
and Contrastive Predictive Coding
- URL: http://arxiv.org/abs/2110.04754v1
- Date: Sun, 10 Oct 2021 10:27:20 GMT
- Title: Towards High-fidelity Singing Voice Conversion with Acoustic Reference
and Contrastive Predictive Coding
- Authors: Chao Wang, Zhonghao Li, Benlai Tang, Xiang Yin, Yuan Wan, Yibiao Yu,
Zejun Ma
- Abstract summary: Phonetic posteriorgram (PPG)-based methods have been quite popular in non-parallel singing voice conversion systems.
Due to the lack of acoustic information in PPGs, the style and naturalness of the converted singing voices are still limited.
Our proposed model significantly improves the naturalness of converted singing voices and their similarity to the target singer.
- Score: 6.278338686038089
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, phonetic posteriorgram (PPG)-based methods have become quite
popular in non-parallel singing voice conversion systems. However, due to the
lack of acoustic information in PPGs, the style and naturalness of the converted
singing voices are still limited. To address these problems, in this paper we
utilize an acoustic reference encoder to implicitly model singing
characteristics. We experiment with different auxiliary features, including mel
spectrograms, HuBERT features, and the middle hidden feature (PPG-Mid) of a
pretrained automatic speech recognition (ASR) model, as the input of the
reference encoder, and find that the HuBERT feature is the best choice. In
addition, we use a contrastive predictive coding (CPC) module to further smooth
the voices by predicting future observations in latent space. Experiments show
that, compared with the baseline models, our proposed model significantly
improves the naturalness of the converted singing voices and their similarity
to the target singer. Moreover, our proposed model can also enable speakers
with only speech data to sing.
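The CPC module described above scores a predicted future latent frame against the true one and a set of negatives (the InfoNCE objective). The sketch below is a minimal NumPy illustration of that scoring step, not the paper's implementation; the latent dimensions, the toy sequence, and the stand-in predictor output are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def info_nce(c_t, z_pos, z_negs):
    """InfoNCE loss for one prediction step.

    c_t:    (d,) predicted future latent (e.g. W_k applied to the context)
    z_pos:  (d,) latent of the true future frame z_{t+k}
    z_negs: (n, d) latents of negative frames drawn from other time steps
    """
    # Similarity of the prediction to the positive and each negative.
    scores = np.concatenate([[c_t @ z_pos], z_negs @ c_t])
    scores -= scores.max()  # numerical stability before the softmax
    log_softmax = scores - np.log(np.exp(scores).sum())
    # Negative log-probability of picking the true future frame.
    return -log_softmax[0]

# Toy latent sequence: 20 frames of 8-dim latents.
z = rng.normal(size=(20, 8))
t, k, n_neg = 5, 3, 10

# Stand-in for the predictor output; here it is deliberately placed near
# the true future latent, so the loss should be small.
c_t = z[t + k] + 0.1 * rng.normal(size=8)
negs = z[rng.choice([i for i in range(20) if i != t + k], size=n_neg), :]

loss = info_nce(c_t, z[t + k], negs)
```

Minimizing this loss pushes the model to predict latents that are closer to the true future frame than to frames sampled elsewhere, which is the "smoothing by predicting future observations" effect the abstract refers to.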
Related papers
- Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt [50.25271407721519]
We propose Prompt-Singer, the first SVS method that enables controlling singer attributes (gender, vocal range, and volume) with natural language.
We adopt a model architecture based on a decoder-only transformer with a multi-scale hierarchy, and design a range-melody decoupled pitch representation.
Experiments show that our model achieves favorable controlling ability and audio quality.
arXiv Detail & Related papers (2024-03-18T13:39:05Z) - Enhancing the vocal range of single-speaker singing voice synthesis with
melody-unsupervised pre-training [82.94349771571642]
This work proposes a melody-unsupervised multi-speaker pre-training method to enhance the vocal range of a single-speaker SVS system.
It is also the first to introduce a differentiable duration regulator to improve the rhythm naturalness of the synthesized voice.
Experimental results verify that the proposed SVS system outperforms the baseline on both sound quality and naturalness.
arXiv Detail & Related papers (2023-09-01T06:40:41Z) - Towards Improving the Expressiveness of Singing Voice Synthesis with
BERT Derived Semantic Information [51.02264447897833]
This paper presents an end-to-end high-quality singing voice synthesis (SVS) system that uses bidirectional encoder representation from Transformers (BERT) derived semantic embeddings.
The proposed SVS system produces higher-quality singing voices, outperforming VISinger.
arXiv Detail & Related papers (2023-08-31T16:12:01Z) - Karaoker: Alignment-free singing voice synthesis with speech training
data [3.9795908407245055]
Karaoker is a multispeaker Tacotron-based model conditioned on voice characteristic features.
The model is jointly conditioned with a single deep convolutional encoder on continuous data.
We extend the text-to-speech training objective with feature reconstruction, classification and speaker identification tasks.
arXiv Detail & Related papers (2022-04-08T15:33:59Z) - Rapping-Singing Voice Synthesis based on Phoneme-level Prosody Control [47.33830090185952]
A text-to-rapping/singing system is introduced, which can be adapted to any speaker's voice.
It utilizes a Tacotron-based multispeaker acoustic model trained on read-only speech data.
Results show that the proposed approach can produce high quality rapping/singing voice with increased naturalness.
arXiv Detail & Related papers (2021-11-17T14:31:55Z) - DiffSVC: A Diffusion Probabilistic Model for Singing Voice Conversion [51.83469048737548]
We propose DiffSVC, an SVC system based on denoising diffusion probabilistic model.
A denoising module is trained in DiffSVC, which takes destroyed mel spectrogram and its corresponding step information as input to predict the added Gaussian noise.
Experiments show that DiffSVC can achieve superior conversion performance in terms of naturalness and voice similarity to current state-of-the-art SVC approaches.
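The training step this summary describes, noising a mel spectrogram to a given diffusion step and regressing the added Gaussian noise, can be sketched in closed form. The linear schedule, step count, and 80-bin frame below are common illustrative choices, not the settings used in DiffSVC.

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear noise schedule over T steps (an assumed, common choice).
T = 100
betas = np.linspace(1e-4, 0.05, T)
alphas_bar = np.cumprod(1.0 - betas)  # cumulative signal-retention factor

def q_sample(x0, t, eps):
    """Forward process: noise a clean frame x0 to step t in one shot."""
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps

x0 = rng.normal(size=80)   # stand-in for one 80-bin mel frame
eps = rng.normal(size=80)  # the Gaussian noise the denoiser must predict
x_t = q_sample(x0, 60, eps)

# Training target: given (x_t, t) and conditioning such as PPGs, the
# denoising module regresses eps, e.g. with an L2 loss.
eps_hat = np.zeros_like(eps)  # placeholder for a network's prediction
l2 = np.mean((eps_hat - eps) ** 2)
```

At sampling time the learned noise predictor is applied iteratively from pure noise back to a clean mel spectrogram, which a vocoder then converts to a waveform.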
arXiv Detail & Related papers (2021-05-28T14:26:40Z) - DiffSinger: Diffusion Acoustic Model for Singing Voice Synthesis [53.19363127760314]
DiffSinger is a parameterized Markov chain that iteratively converts noise into a mel-spectrogram conditioned on the music score.
The evaluations conducted on the Chinese singing dataset demonstrate that DiffSinger outperforms state-of-the-art SVS work with a notable margin.
arXiv Detail & Related papers (2021-05-06T05:21:42Z) - PPG-based singing voice conversion with adversarial representation
learning [18.937609682084034]
Singing voice conversion aims to convert the voice of one singer to that of other singers while keeping the singing content and melody.
We build an end-to-end architecture, taking posteriorgrams as inputs and generating mel spectrograms.
Our methods can significantly improve the conversion performance in terms of naturalness, melody, and voice similarity.
arXiv Detail & Related papers (2020-10-28T08:03:27Z) - Unsupervised Cross-Domain Singing Voice Conversion [105.1021715879586]
We present a wav-to-wav generative model for the task of singing voice conversion from any identity.
Our method utilizes an acoustic model trained for automatic speech recognition together with melody-derived features to drive a waveform-based generator.
arXiv Detail & Related papers (2020-08-06T18:29:11Z)