Singing-Tacotron: Global duration control attention and dynamic filter for End-to-end singing voice synthesis
- URL: http://arxiv.org/abs/2202.07907v1
- Date: Wed, 16 Feb 2022 07:35:17 GMT
- Title: Singing-Tacotron: Global duration control attention and dynamic filter for End-to-end singing voice synthesis
- Authors: Tao Wang, Ruibo Fu, Jiangyan Yi, Jianhua Tao, Zhengqi Wen
- Abstract summary: This paper proposes an end-to-end singing voice synthesis framework, named Singing-Tacotron.
The main difference from Tacotron is that the synthesized speech can be explicitly controlled by the duration information in the musical score.
- Score: 67.96138567288197
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: End-to-end singing voice synthesis (SVS) is attractive because it avoids
the need for pre-aligned data. However, the automatically learned alignment between
the singing voice and the lyrics often fails to match the duration information in the
musical score, which leads to model instability or even failure to synthesize the
voice. To learn accurate alignment information automatically, this paper proposes an
end-to-end SVS framework named Singing-Tacotron. The main difference between the
proposed framework and Tacotron is that the speech can be explicitly controlled by
the musical score's duration information. Firstly, we propose a global duration
control attention mechanism for the SVS model, which can control each phoneme's
duration. Secondly, a duration encoder is proposed to learn a set of global
transition tokens from the musical score. These transition tokens help the attention
mechanism decide whether to move to the next phoneme or stay on the current one at
each decoding step. Thirdly, to further improve the model's stability, a dynamic
filter is designed to help the model overcome noise interference and pay more
attention to local context information. Subjective and objective evaluations verify
the effectiveness of the method. Furthermore, the role of the global transition
tokens and the effect of duration control are explored. Audio examples from the
experiments can be found at https://hairuo55.github.io/SingingTacotron.
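The duration-control mechanism described above behaves like forward attention with a transition agent: at every decoding step, a score-derived transition token decides how much attention mass stays on the current phoneme versus moving to the next one. The following minimal PyTorch sketch illustrates that update rule; the function name, shapes, and the way the token is obtained are assumptions for illustration, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def duration_control_attention_step(prev_align, transition_token):
    """One forward-attention-style alignment update.

    prev_align:       (batch, num_phonemes) alignment weights at step t-1
    transition_token: (batch, 1) move-on probability in (0, 1), e.g. produced
                      by a duration encoder from the musical score
    """
    stay = (1.0 - transition_token) * prev_align
    # Shift the alignment one phoneme to the right: the mass that moves on.
    move = transition_token * F.pad(prev_align, (1, 0))[:, :-1]
    align = stay + move
    # Renormalize so the weights remain a distribution over phonemes.
    return align / (align.sum(dim=-1, keepdim=True) + 1e-8)

# Usage: long notes map to small transition tokens (attention lingers),
# short notes to large ones (attention advances quickly).
align = torch.zeros(2, 8)
align[:, 0] = 1.0                 # attention starts on the first phoneme
q = torch.full((2, 1), 0.3)       # hypothetical transition token
for _ in range(5):
    align = duration_control_attention_step(align, q)
```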
Related papers
- Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt [50.25271407721519]
We propose Prompt-Singer, the first SVS method that enables control of singer gender, vocal range, and volume with natural language.
We adopt a model architecture based on a decoder-only transformer with a multi-scale hierarchy, and design a range-melody decoupled pitch representation.
Experiments show that our model achieves favorable controlling ability and audio quality.
arXiv Detail & Related papers (2024-03-18T13:39:05Z)
- AutoCycle-VC: Towards Bottleneck-Independent Zero-Shot Cross-Lingual Voice Conversion [2.3443118032034396]
This paper proposes a simple and robust zero-shot voice conversion system with a cycle structure and mel-spectrogram pre-processing.
Our model outperforms existing state-of-the-art results in both subjective and objective evaluations.
arXiv Detail & Related papers (2023-10-10T11:50:16Z)
- Enhancing the vocal range of single-speaker singing voice synthesis with melody-unsupervised pre-training [82.94349771571642]
This work proposes a melody-unsupervised multi-speaker pre-training method to enhance the vocal range of a single-speaker SVS system.
It is the first to introduce a differentiable duration regulator to improve the rhythm naturalness of the synthesized voice.
Experimental results verify that the proposed SVS system outperforms the baseline on both sound quality and naturalness.
arXiv Detail & Related papers (2023-09-01T06:40:41Z)
- Singing voice synthesis based on frame-level sequence-to-sequence models considering vocal timing deviation [15.185681242504467]
This paper proposes singing voice synthesis (SVS) based on frame-level sequence-to-sequence models that consider vocal timing deviation.
In SVS, it is essential to synchronize the timing of singing with temporal structures represented by scores, taking into account differences between actual vocal timing and note start timing.
arXiv Detail & Related papers (2023-01-05T19:00:10Z)
- Controllable speech synthesis by learning discrete phoneme-level prosodic representations [53.926969174260705]
We present a novel method for phoneme-level prosody control of F0 and duration using intuitive discrete labels.
We propose an unsupervised prosodic clustering process which is used to discretize phoneme-level F0 and duration features from a multispeaker speech dataset (a minimal sketch of this kind of clustering appears after this list).
arXiv Detail & Related papers (2022-11-29T15:43:36Z)
- Prosodic Clustering for Phoneme-level Prosody Control in End-to-End Speech Synthesis [49.6007376399981]
We present a method for controlling the prosody at the phoneme level in an autoregressive attention-based text-to-speech system.
The proposed method retains the high quality of generated speech, while allowing phoneme-level control of F0 and duration.
By replacing the F0 cluster centroids with musical notes, the model can also provide control over the note and octave within the range of the speaker.
arXiv Detail & Related papers (2021-11-19T12:10:16Z)
- Rapping-Singing Voice Synthesis based on Phoneme-level Prosody Control [47.33830090185952]
A text-to-rapping/singing system is introduced, which can be adapted to any speaker's voice.
It utilizes a Tacotron-based multispeaker acoustic model trained on read-only speech data.
Results show that the proposed approach can produce a high-quality rapping/singing voice with increased naturalness.
arXiv Detail & Related papers (2021-11-17T14:31:55Z)
- DiffSinger: Diffusion Acoustic Model for Singing Voice Synthesis [53.19363127760314]
DiffSinger is a parameterized Markov chain that iteratively converts noise into a mel-spectrogram conditioned on the music score (see the sampling-loop sketch after this list).
Evaluations conducted on a Chinese singing dataset demonstrate that DiffSinger outperforms state-of-the-art SVS work by a notable margin.
arXiv Detail & Related papers (2021-05-06T05:21:42Z)
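The two phoneme-level prosody-control papers above share the same core trick: discretize continuous F0 and duration into a small label vocabulary by clustering, then condition the acoustic model on those labels. A minimal sketch of that discretization step follows; the feature layout, cluster counts, and variable names are illustrative assumptions, not the papers' exact setup.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical per-phoneme prosodic features: column 0 = mean F0 (Hz),
# column 1 = duration (frames), one row per phoneme instance.
rng = np.random.default_rng(0)
phoneme_feats = rng.random((10000, 2)) * [200.0, 30.0] + [80.0, 5.0]

# Cluster each prosodic dimension separately, so every phoneme gets a
# discrete F0 label and a discrete duration label.
f0_km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(phoneme_feats[:, :1])
dur_km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(phoneme_feats[:, 1:])

f0_labels = f0_km.predict(phoneme_feats[:, :1])    # discrete prosody tokens
dur_labels = dur_km.predict(phoneme_feats[:, 1:])  # fed to the TTS model as labels

# For note-level control, the learned F0 centroids can be swapped for the
# frequencies of musical notes, e.g. the semitone ladder from C4 up to A4.
note_centroids = 440.0 * 2.0 ** (np.arange(-9, 1) / 12.0)
```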
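DiffSinger's "parameterized Markov chain" is a denoising diffusion probabilistic model: generation starts from Gaussian noise and repeatedly applies a learned denoiser conditioned on the music score. The loop below is standard DDPM ancestral sampling, shown only to make the iterative conversion concrete; it is not DiffSinger's exact sampler (the paper also introduces a shallow-diffusion shortcut), and `denoiser` stands in for the trained network.

```python
import torch

def ddpm_sample(denoiser, score_cond, betas, shape):
    """Standard DDPM ancestral sampling toward a mel-spectrogram.

    denoiser(x_t, t, cond) is assumed to predict the added noise epsilon.
    """
    alphas = 1.0 - betas
    alphas_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                                # x_T ~ N(0, I)
    for t in reversed(range(len(betas))):
        eps = denoiser(x, torch.tensor([t]), score_cond)  # predicted noise
        # Posterior mean of x_{t-1} given x_t and the predicted noise.
        mean = (x - betas[t] / torch.sqrt(1.0 - alphas_bar[t]) * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise           # sigma_t^2 = beta_t choice
    return x  # approximate mel-spectrogram sample
```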