Semi-supervised Learning for Singing Synthesis Timbre
- URL: http://arxiv.org/abs/2011.02809v1
- Date: Thu, 5 Nov 2020 13:33:34 GMT
- Title: Semi-supervised Learning for Singing Synthesis Timbre
- Authors: Jordi Bonada, Merlijn Blaauw
- Abstract summary: We propose a semi-supervised singing synthesizer, which is able to learn new voices from audio data only.
Our system is an encoder-decoder model with two encoders, linguistic and acoustic, and one (acoustic) decoder.
We evaluate our system with a listening test and show that the results are comparable to those obtained with an equivalent supervised approach.
- Score: 22.75251024528604
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a semi-supervised singing synthesizer, which is able to learn new
voices from audio data only, without any annotations such as phonetic
segmentation. Our system is an encoder-decoder model with two encoders,
linguistic and acoustic, and one (acoustic) decoder. In a first step, the
system is trained in a supervised manner, using a labelled multi-singer
dataset. Here, we ensure that the embeddings produced by both encoders are
similar, so that we can later use the model with either acoustic or linguistic
input features. To learn a new voice in an unsupervised manner, the pretrained
acoustic encoder is used to train a decoder for the target singer. Finally, at
inference, the pretrained linguistic encoder is used together with the decoder
of the new voice, to produce acoustic features from linguistic input. We
evaluate our system with a listening test and show that the results are
comparable to those obtained with an equivalent supervised approach.
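The abstract outlines a three-step recipe: supervised training that ties the linguistic and acoustic embedding spaces together, unsupervised decoder adaptation to a new singer, and inference that swaps in the linguistic encoder. A minimal PyTorch sketch of that recipe follows; the frame-wise MLP modules, L1 losses, dimensions and uniform loss weighting are illustrative assumptions, not details taken from the paper.

```python
# Sketch of the two-encoder / one-decoder scheme from the abstract.
# Architecture, dimensions and losses are illustrative assumptions.
import torch
import torch.nn as nn

EMB, MEL, PHON = 256, 80, 64  # assumed embedding / mel / linguistic dims

def mlp(in_dim, out_dim):
    # Frame-wise MLP stand-in for the paper's actual modules.
    return nn.Sequential(nn.Linear(in_dim, EMB), nn.ReLU(),
                         nn.Linear(EMB, out_dim))

linguistic_enc = mlp(PHON, EMB)  # linguistic features -> shared embedding
acoustic_enc = mlp(MEL, EMB)     # acoustic features  -> shared embedding
decoder = mlp(EMB, MEL)          # shared embedding   -> acoustic features
l1 = nn.L1Loss()

def supervised_step(phon, mel):
    """Step 1: labelled multi-singer data; keep both embeddings similar."""
    z_l, z_a = linguistic_enc(phon), acoustic_enc(mel)
    return (l1(decoder(z_l), mel) + l1(decoder(z_a), mel)
            + l1(z_l, z_a))  # embedding-matching term

def adaptation_step(mel, new_decoder):
    """Step 2: unlabelled target-singer audio; acoustic encoder frozen."""
    with torch.no_grad():
        z_a = acoustic_enc(mel)
    return l1(new_decoder(z_a), mel)

@torch.no_grad()
def synthesize(phon, new_decoder):
    """Step 3: inference from linguistic input with the adapted decoder."""
    return new_decoder(linguistic_enc(phon))
```

In practice each step would run inside an ordinary optimizer loop, with only the new decoder's parameters updated during adaptation.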
Related papers
- Exploring bat song syllable representations in self-supervised audio encoders [0.0]
We analyze the encoding of bat song syllables in several self-supervised audio encoders.
We find that models pre-trained on human speech generate the most distinctive representations of different syllable types.
arXiv Detail & Related papers (2024-09-19T10:09:31Z)
- Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt [50.25271407721519]
We propose Prompt-Singer, the first SVS method that enables control over singer gender, vocal range and volume through natural language.
We adopt a model architecture based on a decoder-only transformer with a multi-scale hierarchy, and design a range-melody decoupled pitch representation.
Experiments show that our model achieves favorable controlling ability and audio quality.
arXiv Detail & Related papers (2024-03-18T13:39:05Z)
- Large-scale unsupervised audio pre-training for video-to-speech synthesis [64.86087257004883]
Video-to-speech synthesis is the task of reconstructing the speech signal from a silent video of a speaker.
In this paper we propose to train encoder-decoder models on more than 3,500 hours of audio data at 24kHz.
We then use the pre-trained decoders to initialize the audio decoders for the video-to-speech synthesis task.
arXiv Detail & Related papers (2023-06-27T13:31:33Z) - Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo
Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training (a minimal pseudo-language sketch appears after this list).
arXiv Detail & Related papers (2022-05-02T17:59:02Z)
- Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data [145.95460945321253]
We introduce two pre-training tasks for the encoder-decoder network using acoustic units, i.e., pseudo codes.
The proposed Speech2C can reduce the word error rate (WER) by a relative 19.2% over the method without decoder pre-training.
arXiv Detail & Related papers (2022-03-31T15:33:56Z)
- Pitch Preservation In Singing Voice Synthesis [6.99674326582747]
This paper presents a novel acoustic model with independent pitch and phoneme encoders, which disentangles the phoneme and pitch information in the music score to fully utilize the corpus.
Experimental results indicate that the proposed approaches can characterize the intrinsic structure of pitch inputs, yielding better pitch synthesis accuracy and superior singing synthesis performance compared to the advanced baseline system.
arXiv Detail & Related papers (2021-10-11T07:01:06Z)
- Towards High-fidelity Singing Voice Conversion with Acoustic Reference and Contrastive Predictive Coding [6.278338686038089]
Phonetic posteriorgram (PPG) based methods have been quite popular in non-parallel singing voice conversion systems.
Due to the lack of acoustic information in PPGs, the style and naturalness of the converted singing voices are still limited.
Our proposed model can significantly improve the naturalness of the converted singing voices and their similarity to the target singer (a minimal PPG sketch also appears after this list).
arXiv Detail & Related papers (2021-10-10T10:27:20Z)
- An Empirical Study on End-to-End Singing Voice Synthesis with Encoder-Decoder Architectures [11.440111473570196]
We use encoder-decoder neural models and a number of vocoders to achieve singing voice synthesis.
We conduct experiments to demonstrate that the models can be trained using voice data with pitch information, lyrics and beat information.
arXiv Detail & Related papers (2021-08-06T08:51:16Z)
- Collaborative Training of Acoustic Encoders for Speech Recognition [15.200846745937763]
We propose a method for collaboratively training acoustic encoders of different sizes for speech recognition.
We perform experiments using the LibriSpeech corpus and demonstrate that the collaboratively trained acoustic encoders can provide up to an 11% relative improvement in the word error rate.
arXiv Detail & Related papers (2021-06-16T17:05:47Z)
- Unsupervised Cross-Domain Singing Voice Conversion [105.1021715879586]
We present a wav-to-wav generative model for the task of singing voice conversion from any identity.
Our method combines an acoustic model trained for automatic speech recognition with melody-derived features to drive a waveform-based generator.
arXiv Detail & Related papers (2020-08-06T18:29:11Z)
- Unsupervised Audiovisual Synthesis via Exemplar Autoencoders [59.13989658692953]
We present an unsupervised approach that converts the input speech of any individual into audiovisual streams of potentially infinitely many output speakers.
We use Exemplar Autoencoders to learn the voice, stylistic prosody, and visual appearance of a specific target speech exemplar.
arXiv Detail & Related papers (2020-01-13T18:56:45Z)
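As noted in the Wav2Seq entry above, the pseudo-language idea can be sketched in a few lines: quantize frame features against a codebook, collapse repeated tokens into a compact discrete sequence, and train an encoder-decoder to "recognize" that sequence. The codebook size, feature dimension and toy Transformer below are illustrative assumptions, not Wav2Seq's actual setup, and the teacher-forcing target shift and causal mask of a real training loop are omitted for brevity.

```python
# Pseudo-language sketch: quantize frames, dedupe runs, then train an
# encoder-decoder on "pseudo speech recognition". Hypothetical sizes.
import torch
import torch.nn as nn

K, FEAT, DIM = 32, 80, 64  # assumed codebook size, feature dim, model dim

def induce_pseudo_tokens(feats, codebook):
    """Map each frame to its nearest codebook entry and collapse repeats,
    yielding a compact discrete token sequence."""
    ids = torch.cdist(feats, codebook).argmin(dim=-1)   # (frames,)
    keep = torch.ones_like(ids, dtype=torch.bool)
    keep[1:] = ids[1:] != ids[:-1]                      # drop repeated runs
    return ids[keep]

model = nn.Transformer(d_model=DIM, nhead=4, num_encoder_layers=2,
                       num_decoder_layers=2, batch_first=True)
in_proj, tok_emb = nn.Linear(FEAT, DIM), nn.Embedding(K, DIM)
out_proj = nn.Linear(DIM, K)

feats = torch.randn(1, 200, FEAT)        # one utterance of frame features
codebook = torch.randn(K, FEAT)          # stand-in for k-means centroids
tokens = induce_pseudo_tokens(feats[0], codebook).unsqueeze(0)

logits = out_proj(model(in_proj(feats), tok_emb(tokens)))
loss = nn.functional.cross_entropy(logits.reshape(-1, K), tokens.reshape(-1))
```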
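The PPG-based conversion baseline referenced in the Towards High-fidelity Singing Voice Conversion entry can likewise be sketched: a frozen "ASR" model turns target-singer frames into phoneme posteriors, and a decoder learns to map those posteriors (plus pitch) back to the target voice. The single-layer stand-in ASR model, phoneme inventory size and pitch conditioning are placeholders; a real system would use a trained ASR acoustic model, and that paper's contribution (acoustic reference features and contrastive predictive coding) is deliberately omitted here.

```python
# PPG baseline sketch: frozen ASR -> posteriors -> target-voice decoder.
# All modules and dimensions are placeholders, not the paper's system.
import torch
import torch.nn as nn

N_PHONES, MEL = 72, 80  # assumed phoneme-inventory and mel dims

class FrozenASR(nn.Module):
    """Stand-in for a pretrained ASR acoustic model emitting frame-level
    phonetic posteriorgrams; a real system would load trained weights."""
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(MEL, N_PHONES)
        for p in self.parameters():
            p.requires_grad = False

    def forward(self, mel):
        return torch.softmax(self.net(mel), dim=-1)  # (frames, N_PHONES)

class Converter(nn.Module):
    """Target-singer decoder: speaker-independent PPGs + pitch -> mel."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(N_PHONES + 1, 256), nn.ReLU(),
                                 nn.Linear(256, MEL))

    def forward(self, ppg, f0):
        return self.net(torch.cat([ppg, f0], dim=-1))

asr, converter = FrozenASR(), Converter()
mel = torch.randn(200, MEL)   # target-singer training frames
f0 = torch.rand(200, 1)       # normalized pitch track (assumed input)
loss = nn.functional.l1_loss(converter(asr(mel), f0), mel)
```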
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.