Karaoker: Alignment-free singing voice synthesis with speech training data
- URL: http://arxiv.org/abs/2204.04127v1
- Date: Fri, 8 Apr 2022 15:33:59 GMT
- Title: Karaoker: Alignment-free singing voice synthesis with speech training data
- Authors: Panos Kakoulidis, Nikolaos Ellinas, Georgios Vamvoukakis, Konstantinos Markopoulos, June Sig Sung, Gunu Jho, Pirros Tsiakoulis, Aimilios Chalamandaris
- Abstract summary: Karaoker is a multispeaker Tacotron-based model conditioned on voice characteristic features.
The model is jointly conditioned with a single deep convolutional encoder on continuous data.
We extend the text-to-speech training objective with feature reconstruction, classification and speaker identification tasks.
- Score: 3.9795908407245055
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing singing voice synthesis (SVS) models are usually trained on singing
data and depend on either error-prone time-alignment and duration features or
explicit music score information. In this paper, we propose Karaoker, a
multispeaker Tacotron-based model conditioned on voice characteristic features
that is trained exclusively on spoken data without requiring time-alignments.
Karaoker synthesizes singing voice following a multi-dimensional template
extracted from a source waveform of an unseen speaker/singer. The model is
jointly conditioned with a single deep convolutional encoder on continuous data
including pitch, intensity, harmonicity, formants, cepstral peak prominence and
octaves. We extend the text-to-speech training objective with feature
reconstruction, classification and speaker identification tasks that guide the
model to an accurate result. In addition to multi-tasking, we also employ a
Wasserstein GAN training scheme, as well as new losses on the acoustic model's
output, to further refine the quality of the model.
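To make the conditioning and training scheme concrete, here is a minimal, illustrative PyTorch sketch rather than the authors' implementation: the layer sizes, loss weights, and speaker count are assumptions, the abstract's unspecified classification task is omitted, and the critic architecture is left abstract.

```python
# Illustrative sketch only, not the authors' code. Layer sizes, loss weights
# and speaker count are assumptions; the conditioning features follow the
# abstract (pitch, intensity, harmonicity, formants, cepstral peak prominence,
# octaves). The abstract's unspecified classification task is omitted.
import torch
import torch.nn as nn
import torch.nn.functional as F

N_FEATS = 6       # one channel per conditioning feature stream (assumption)
COND_DIM = 128    # conditioning embedding width (assumption)
N_SPEAKERS = 100  # number of training speakers (assumption)

class ConditioningEncoder(nn.Module):
    """Single deep convolutional encoder over the continuous feature template."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(N_FEATS, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(128, COND_DIM, kernel_size=5, padding=2),
        )
        self.reconstruct = nn.Conv1d(COND_DIM, N_FEATS, kernel_size=1)
        self.speaker_head = nn.Linear(COND_DIM, N_SPEAKERS)

    def forward(self, feats):                          # feats: (B, N_FEATS, T)
        h = self.net(feats)                            # (B, COND_DIM, T)
        recon = self.reconstruct(h)                    # feature reconstruction task
        spk_logits = self.speaker_head(h.mean(dim=2))  # speaker identification task
        return h, recon, spk_logits

def extended_tts_loss(tts_loss, feats, recon, spk_logits, spk_targets):
    """Extend the text-to-speech objective with the auxiliary tasks."""
    l_recon = F.l1_loss(recon, feats)
    l_spk = F.cross_entropy(spk_logits, spk_targets)
    return tts_loss + l_recon + l_spk                  # equal weighting: an assumption

def wgan_losses(critic, real_mel, fake_mel):
    """Wasserstein-GAN terms on the acoustic model's output."""
    # In practice fake_mel is detached for the critic update.
    d_loss = critic(fake_mel).mean() - critic(real_mel).mean()
    g_loss = -critic(fake_mel).mean()
    return d_loss, g_loss
```

The conditioning output `h` would be broadcast to the Tacotron decoder. In practice a Wasserstein critic also needs a Lipschitz constraint (weight clipping or a gradient penalty); the abstract does not say which variant is used.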
Related papers
- Low-Resource Cross-Domain Singing Voice Synthesis via Reduced Self-Supervised Speech Representations [41.410556997285326]
Karaoker-SSL is a singing voice synthesis model that is trained only on text and speech data.
It does not utilize any singing data end-to-end, since its vocoder is also trained on speech data.
arXiv Detail & Related papers (2024-02-02T16:06:24Z)
- Enhancing the vocal range of single-speaker singing voice synthesis with melody-unsupervised pre-training [82.94349771571642]
This work proposes a melody-unsupervised multi-speaker pre-training method to enhance the vocal range of a single-speaker SVS system.
It is the first to introduce a differentiable duration regulator to improve the rhythm naturalness of the synthesized voice.
Experimental results verify that the proposed SVS system outperforms the baseline on both sound quality and naturalness.
arXiv Detail & Related papers (2023-09-01T06:40:41Z)
- MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training [74.32603591331718]
We propose an acoustic Music undERstanding model with large-scale self-supervised Training (MERT), which incorporates teacher models to provide pseudo labels in the masked language modelling (MLM) style acoustic pre-training (a minimal sketch of this objective follows this entry).
Experimental results indicate that our model can generalise and perform well on 14 music understanding tasks and attain state-of-the-art (SOTA) overall scores.
arXiv Detail & Related papers (2023-05-31T18:27:43Z)
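As flagged above, here is a minimal sketch of MLM-style acoustic pre-training with teacher-provided pseudo labels. The discrete teacher, mask probability, and model sizes are illustrative assumptions, not MERT's actual teachers or configuration.

```python
# Sketch of MLM-style acoustic pre-training with teacher pseudo labels.
# Vocabulary size, mask probability and model sizes are assumptions.
import torch
import torch.nn as nn

VOCAB = 512   # pseudo-label vocabulary size from the teacher (assumption)
D = 256       # model width (assumption)
MASK_P = 0.5  # probability of masking a frame (assumption)

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True),
    num_layers=4,
)
head = nn.Linear(D, VOCAB)
mask_embed = nn.Parameter(torch.zeros(D))

def mlm_step(frames, teacher_labels):
    """frames: (B, T, D) acoustic features; teacher_labels: (B, T) pseudo labels."""
    B, T, _ = frames.shape
    mask = torch.rand(B, T) < MASK_P   # choose frames to mask
    x = frames.clone()
    x[mask] = mask_embed               # replace masked frames with a learned embedding
    logits = head(encoder(x))          # (B, T, VOCAB)
    # Predict the teacher's pseudo labels, scoring only the masked positions.
    return nn.functional.cross_entropy(logits[mask], teacher_labels[mask])
```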
- Make-A-Voice: Unified Voice Synthesis With Discrete Representation [77.3998611565557]
Make-A-Voice is a unified framework for synthesizing and manipulating voice signals from discrete representations.
We show that Make-A-Voice exhibits superior audio quality and style similarity compared with competitive baseline models.
arXiv Detail & Related papers (2023-05-30T17:59:26Z)
- A Melody-Unsupervision Model for Singing Voice Synthesis [9.137554315375919]
We propose a melody-unsupervision model that requires only audio-and-lyrics pairs without temporal alignment in training time.
We show that the proposed model is capable of being trained with speech audio and text labels but can generate singing voice in inference time.
arXiv Detail & Related papers (2021-10-13T07:42:35Z)
- Towards High-fidelity Singing Voice Conversion with Acoustic Reference and Contrastive Predictive Coding [6.278338686038089]
Phonetic posteriorgram (PPG) based methods have been quite popular in non-parallel singing voice conversion systems.
Due to the lack of acoustic information in PPGs, style and naturalness of the converted singing voices are still limited.
Our proposed model can significantly improve the naturalness of converted singing voices and the similarity with the target singer.
arXiv Detail & Related papers (2021-10-10T10:27:20Z)
- DiffSinger: Diffusion Acoustic Model for Singing Voice Synthesis [53.19363127760314]
DiffSinger is a parameterized Markov chain which iteratively converts noise into a mel-spectrogram conditioned on the music score (a generic sketch of this denoising loop follows this entry).
The evaluations conducted on the Chinese singing dataset demonstrate that DiffSinger outperforms state-of-the-art SVS work with a notable margin.
arXiv Detail & Related papers (2021-05-06T05:21:42Z)
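As noted in the entry, the model is a denoising Markov chain. The sketch below shows a generic DDPM-style reverse process conditioned on a score encoding; the noise schedule, step count, and `denoiser` network are assumptions rather than DiffSinger's actual configuration.

```python
# Generic DDPM-style reverse process for a score-conditioned mel-spectrogram
# denoiser. Schedule, step count and denoiser are illustrative assumptions.
import torch

T_STEPS = 100
betas = torch.linspace(1e-4, 0.06, T_STEPS)
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def sample_mel(denoiser, score_cond, shape):
    """Iteratively convert noise into a mel-spectrogram, conditioned on the score."""
    x = torch.randn(shape)  # start from pure Gaussian noise
    for t in reversed(range(T_STEPS)):
        eps = denoiser(x, torch.tensor([t]), score_cond)  # predict the added noise
        coef = betas[t] / torch.sqrt(1.0 - alpha_bar[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise  # one Markov-chain step
    return x                                     # final mel-spectrogram
```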
- Unsupervised Cross-Domain Singing Voice Conversion [105.1021715879586]
We present a wav-to-wav generative model for the task of singing voice conversion from any identity.
Our method utilizes both an acoustic model, trained for the task of automatic speech recognition, together with melody extracted features to drive a waveform-based generator.
arXiv Detail & Related papers (2020-08-06T18:29:11Z)
- Audio ALBERT: A Lite BERT for Self-supervised Learning of Audio Representation [51.37980448183019]
We propose Audio ALBERT, a lite version of the self-supervised speech representation model.
We show that Audio ALBERT is capable of achieving competitive performance with those huge models in the downstream tasks.
In probing experiments, we find that the latent representations of intermediate layers encode richer phoneme and speaker information than those of the last layer (a sketch of the parameter-sharing idea follows this entry).
arXiv Detail & Related papers (2020-05-18T10:42:44Z)
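The "lite" design here follows ALBERT's defining trick, cross-layer parameter sharing: one transformer layer is reused at every depth, so the parameter count stays constant as the model gets deeper. A minimal sketch with sizes assumed for illustration:

```python
# Minimal sketch of ALBERT-style cross-layer parameter sharing: one
# transformer layer applied repeatedly. Sizes are assumptions.
import torch
import torch.nn as nn

class SharedLayerEncoder(nn.Module):
    def __init__(self, d_model=256, nhead=4, depth=12):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.depth = depth

    def forward(self, x, return_all=False):
        hiddens = []
        for _ in range(self.depth):
            x = self.layer(x)        # same weights at every depth
            hiddens.append(x)
        # Probing can read any intermediate layer, which the summary above
        # finds richer in phoneme/speaker information than the last layer.
        return hiddens if return_all else x

enc = SharedLayerEncoder()
out = enc(torch.randn(2, 100, 256))  # (batch, frames, features)
```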
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.