KaraSinger: Score-Free Singing Voice Synthesis with VQ-VAE using
Mel-spectrograms
- URL: http://arxiv.org/abs/2110.04005v1
- Date: Fri, 8 Oct 2021 10:00:23 GMT
- Title: KaraSinger: Score-Free Singing Voice Synthesis with VQ-VAE using
Mel-spectrograms
- Authors: Chien-Feng Liao, Jen-Yu Liu, Yi-Hsuan Yang
- Abstract summary: We propose a novel neural network model called KaraSinger for a singing voice synthesis task named score-free SVS.
KaraSinger comprises a vector-quantized variational autoencoder (VQ-VAE) that compresses the Mel-spectrograms of singing audio to sequences of discrete codes, and a language model (LM) that learns to predict the discrete codes given the corresponding lyrics.
We validate the effectiveness of the proposed design choices using a proprietary collection of 550 English pop songs sung by multiple amateur singers.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose a novel neural network model called KaraSinger for
a less-studied singing voice synthesis (SVS) task named score-free SVS, in
which the prosody and melody are spontaneously decided by machine. KaraSinger
comprises a vector-quantized variational autoencoder (VQ-VAE) that compresses
the Mel-spectrograms of singing audio to sequences of discrete codes, and a
language model (LM) that learns to predict the discrete codes given the
corresponding lyrics. For the VQ-VAE part, we employ a Connectionist Temporal
Classification (CTC) loss to encourage the discrete codes to carry
phoneme-related information. For the LM part, we use location-sensitive
attention for learning a robust alignment between the input phoneme sequence
and the output discrete code. We keep the architecture of both the VQ-VAE and
LM lightweight for fast training and inference. We validate the
effectiveness of the proposed design choices using a proprietary collection of
550 English pop songs sung by multiple amateur singers. A listening test shows
that KaraSinger achieves high scores in intelligibility, musicality, and
overall quality.
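The core of the VQ-VAE stage described above is mapping each continuous encoder output (one per Mel-spectrogram frame) to its nearest codebook vector, yielding the discrete code sequence the LM later predicts. The following is a minimal NumPy sketch of that nearest-neighbor quantization step, not the paper's actual implementation; the function and variable names are illustrative.

```python
import numpy as np

def vector_quantize(z_e, codebook):
    """Map each encoder frame to its nearest codebook entry.

    z_e: (T, D) array of continuous encoder outputs, e.g. one per
         Mel-spectrogram frame; codebook: (K, D) array of code vectors.
    Returns (indices, z_q): discrete code ids and quantized vectors.
    """
    # Squared Euclidean distance between every frame and every code: (T, K).
    d = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = d.argmin(axis=1)   # (T,) discrete code sequence for the LM
    z_q = codebook[indices]      # (T, D) quantized encoder output
    return indices, z_q
```

In a full VQ-VAE, gradients flow through this non-differentiable lookup via a straight-through estimator, and the codebook is trained with commitment and codebook losses; those parts are omitted here.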
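The CTC loss used to make the discrete codes phoneme-aware rests on CTC's many-to-one alignment rule: a frame-level path collapses to a label sequence by merging consecutive repeats and then dropping blanks. A short sketch of that collapse rule (standard CTC decoding behavior, not code from the paper):

```python
def ctc_collapse(path, blank="-"):
    """Collapse a frame-level CTC path to its label sequence:
    merge consecutive repeated symbols, then drop blank symbols."""
    out = []
    prev = None
    for sym in path:
        # Keep a symbol only if it differs from the previous frame
        # and is not the blank token.
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return out
```

Because many frame paths collapse to the same phoneme sequence, the CTC loss can supervise the code sequence with phoneme targets without requiring frame-level phoneme alignments.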
Related papers
- Cluster and Separate: a GNN Approach to Voice and Staff Prediction for Score Engraving [5.572472212662453]
This paper approaches the problem of separating the notes from a quantized symbolic music piece (e.g., a MIDI file) into multiple voices and staves.
We propose an end-to-end system based on graph neural networks that groups notes belonging to the same chord and connects notes with edges if they are part of the same voice.
arXiv Detail & Related papers (2024-07-15T14:36:13Z)
- VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment [101.2489492032816]
VALL-E R is a robust and efficient zero-shot Text-to-Speech system.
This research has the potential to be applied to meaningful projects, including the creation of speech for those affected by aphasia.
arXiv Detail & Related papers (2024-06-12T04:09:44Z)
- RMSSinger: Realistic-Music-Score based Singing Voice Synthesis [56.51475521778443]
RMS-SVS aims to generate high-quality singing voices given realistic music scores with different note types.
We propose RMSSinger, the first RMS-SVS method, which takes realistic music scores as input.
In RMSSinger, we introduce word-level modeling to avoid the time-consuming phoneme duration annotation and the complicated phoneme-level mel-note alignment.
arXiv Detail & Related papers (2023-05-18T03:57:51Z)
- AlignSTS: Speech-to-Singing Conversion via Cross-Modal Alignment [67.10208647482109]
The speech-to-singing (STS) voice conversion task aims to generate singing samples corresponding to speech recordings.
This paper proposes AlignSTS, an STS model based on explicit cross-modal alignment.
Experiments show that AlignSTS achieves superior performance in terms of both objective and subjective metrics.
arXiv Detail & Related papers (2023-05-08T06:02:10Z)
- Karaoker: Alignment-free singing voice synthesis with speech training data [3.9795908407245055]
Karaoker is a multispeaker Tacotron-based model conditioned on voice characteristic features.
The model is jointly conditioned with a single deep convolutional encoder on continuous data.
We extend the text-to-speech training objective with feature reconstruction, classification and speaker identification tasks.
arXiv Detail & Related papers (2022-04-08T15:33:59Z)
- Learning the Beauty in Songs: Neural Singing Voice Beautifier [69.21263011242907]
We are interested in a novel task, singing voice beautifying (SVB).
Given the singing voice of an amateur singer, SVB aims to improve the intonation and vocal tone of the voice, while keeping the content and vocal timbre.
We introduce Neural Singing Voice Beautifier (NSVB), the first generative model to solve the SVB task.
arXiv Detail & Related papers (2022-02-27T03:10:12Z)
- Sinsy: A Deep Neural Network-Based Singing Voice Synthesis System [25.573552964889963]
This paper presents Sinsy, a deep neural network (DNN)-based singing voice synthesis (SVS) system.
The proposed system is composed of four modules: a time-lag model, a duration model, an acoustic model, and a vocoder.
Experimental results show our system can synthesize a singing voice with better timing, more natural vibrato, and correct pitch.
arXiv Detail & Related papers (2021-08-05T17:59:58Z)
- DiffSinger: Diffusion Acoustic Model for Singing Voice Synthesis [53.19363127760314]
DiffSinger is a parameterized Markov chain which iteratively converts the noise into mel-spectrogram conditioned on the music score.
Evaluations conducted on a Chinese singing dataset demonstrate that DiffSinger outperforms state-of-the-art SVS work by a notable margin.
arXiv Detail & Related papers (2021-05-06T05:21:42Z)
- Speech-to-Singing Conversion in an Encoder-Decoder Framework [38.111942306157545]
We take a learning-based approach to the problem of converting spoken lines into sung ones.
We learn encodings that enable us to synthesize singing that preserves the linguistic content and timbre of the speaker.
arXiv Detail & Related papers (2020-02-16T15:33:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.