Rapping-Singing Voice Synthesis based on Phoneme-level Prosody Control
- URL: http://arxiv.org/abs/2111.09146v1
- Date: Wed, 17 Nov 2021 14:31:55 GMT
- Title: Rapping-Singing Voice Synthesis based on Phoneme-level Prosody Control
- Authors: Konstantinos Markopoulos, Nikolaos Ellinas, Alexandra Vioni, Myrsini
Christidou, Panos Kakoulidis, Georgios Vamvoukakis, Georgia Maniati, June Sig
Sung, Hyoungmin Park, Pirros Tsiakoulis and Aimilios Chalamandaris
- Abstract summary: A text-to-rapping/singing system is introduced, which can be adapted to any speaker's voice.
It utilizes a Tacotron-based multispeaker acoustic model trained on read-only speech data.
Results show that the proposed approach can produce high-quality rapping/singing voice with increased naturalness.
- Score: 47.33830090185952
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, a text-to-rapping/singing system is introduced, which
can be adapted to any speaker's voice. It utilizes a Tacotron-based multispeaker
acoustic model trained on read-only speech data and provides prosody control at
the phoneme level. Dataset augmentation and additional prosody manipulation
based on traditional DSP algorithms are also investigated. The neural TTS model
is fine-tuned to an unseen speaker's limited recordings, allowing
rapping/singing synthesis with the target speaker's voice. The system's detailed
pipeline is described, including the extraction of the target pitch and duration
values from an a cappella song and their conversion into the target speaker's
valid range of notes before synthesis. An additional stage of prosodic
manipulation of the output via WSOLA is also investigated to better match the
target duration values. The synthesized utterances can be mixed with an
instrumental accompaniment track to produce a complete song. The proposed system
is evaluated via subjective listening tests as well as in comparison to an
available alternative system that also aims to produce synthetic singing voice
from read-only training data. Results show that the proposed approach can
produce high-quality rapping/singing voice with increased naturalness.
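The pipeline stages named above (pitch and duration extraction from an a cappella track, folding the notes into the target speaker's valid range, and WSOLA-based duration matching) can be illustrated with a short sketch. This is a minimal illustration assuming librosa for F0 estimation and the third-party audiotsm package for WSOLA time-scale modification; every function name, note range, and parameter below is an assumption made for illustration, not the authors' implementation.

```python
# Sketch of the front end described in the abstract. All names, ranges,
# and library choices (librosa, audiotsm) are illustrative assumptions.
import librosa
import numpy as np


def extract_pitch_targets(wav_path, sr=22050):
    """Frame-level F0 contour of an a cappella recording (NaN = unvoiced)."""
    y, _ = librosa.load(wav_path, sr=sr)
    f0, voiced_flag, _ = librosa.pyin(
        y,
        fmin=librosa.note_to_hz("C2"),
        fmax=librosa.note_to_hz("C6"),
        sr=sr,
    )
    return f0, voiced_flag


def fold_into_speaker_range(f0_hz, low_note="C3", high_note="C5"):
    """Octave-shift each pitch target into the target speaker's valid note
    range (the 'conversion' step in the abstract); whole-octave shifts
    preserve the melodic contour."""
    low, high = librosa.note_to_midi(low_note), librosa.note_to_midi(high_note)
    midi = librosa.hz_to_midi(np.asarray(f0_hz, dtype=float))
    while True:
        below, above = midi < low, midi > high  # NaN frames compare False
        if not (below.any() or above.any()):
            break
        midi[below] += 12.0  # raise too-low notes by an octave
        midi[above] -= 12.0  # lower too-high notes by an octave
    return librosa.midi_to_hz(midi)


# Optional duration matching via WSOLA, here through the audiotsm package
# (an assumed stand-in; the paper describes WSOLA-based manipulation, not
# this specific library).
from audiotsm import wsola
from audiotsm.io.wav import WavReader, WavWriter


def stretch_to_duration(in_wav, out_wav, current_dur_s, target_dur_s):
    speed = current_dur_s / target_dur_s  # speed > 1 shortens the audio
    with WavReader(in_wav) as reader:
        with WavWriter(out_wav, reader.channels, reader.samplerate) as writer:
            wsola(reader.channels, speed=speed).run(reader, writer)
```

Folding by whole octaves, rather than clipping pitch values, guarantees every note lies within the range the fine-tuned voice can render while keeping the melody recognizable; mixing the stretched vocal with the instrumental accompaniment is then a plain gain-and-sum operation.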
Related papers
- MakeSinger: A Semi-Supervised Training Method for Data-Efficient Singing Voice Synthesis via Classifier-free Diffusion Guidance [14.22941848955693]
MakeSinger is a semi-supervised training method for singing voice synthesis.
Our novel dual guiding mechanism provides text and pitch guidance at each reverse diffusion step.
We demonstrate that by adding Text-to-Speech (TTS) data to training, the model can synthesize the singing voices of TTS speakers even without any singing recordings from them.
arXiv Detail & Related papers (2024-06-10T01:47:52Z)
- Creative Text-to-Audio Generation via Synthesizer Programming [1.1203110769488043]
We propose a text-to-audio generation method that leverages a virtual modular sound synthesizer with only 78 parameters.
Our method, CTAG, iteratively updates a synthesizer's parameters to produce high-quality audio renderings of text prompts.
arXiv Detail & Related papers (2024-06-01T04:08:31Z)
- Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt [50.25271407721519]
We propose Prompt-Singer, the first SVS method that enables control over singer gender, vocal range, and volume with natural language.
We adopt a model architecture based on a decoder-only transformer with a multi-scale hierarchy, and design a range-melody decoupled pitch representation.
Experiments show that our model achieves favorable controllability and audio quality.
arXiv Detail & Related papers (2024-03-18T13:39:05Z)
- Enhancing the vocal range of single-speaker singing voice synthesis with melody-unsupervised pre-training [82.94349771571642]
This work proposes a melody-unsupervised multi-speaker pre-training method to enhance the vocal range of a single-speaker SVS system.
It is the first to introduce a differentiable duration regulator to improve the rhythm naturalness of the synthesized voice.
Experimental results verify that the proposed SVS system outperforms the baseline on both sound quality and naturalness.
arXiv Detail & Related papers (2023-09-01T06:40:41Z)
- Make-A-Voice: Unified Voice Synthesis With Discrete Representation [77.3998611565557]
Make-A-Voice is a unified framework for synthesizing and manipulating voice signals from discrete representations.
We show that Make-A-Voice exhibits superior audio quality and style similarity compared with competitive baseline models.
arXiv Detail & Related papers (2023-05-30T17:59:26Z)
- Differentiable WORLD Synthesizer-based Neural Vocoder With Application To End-To-End Audio Style Transfer [6.29475963948119]
We propose a differentiable WORLD synthesizer and demonstrate its use in end-to-end audio style transfer tasks.
Our baseline differentiable synthesizer has no model parameters, yet it yields synthesis of adequate quality.
An alternative differentiable approach considers extraction of the source spectrum directly, which can improve naturalness.
arXiv Detail & Related papers (2022-08-15T15:48:36Z)
- Face-Dubbing++: Lip-Synchronous, Voice Preserving Translation of Videos [54.08224321456871]
The system combines multiple component models to produce a video of the original speaker speaking in the target language.
The pipeline starts with automatic speech recognition including emphasis detection, followed by a translation model.
The resulting synthetic voice is then mapped back to the original speaker's voice using a voice conversion model.
arXiv Detail & Related papers (2022-06-09T14:15:37Z)
- DiffSinger: Diffusion Acoustic Model for Singing Voice Synthesis [53.19363127760314]
DiffSinger is a parameterized Markov chain that iteratively converts noise into a mel-spectrogram conditioned on the music score (a toy sketch of this iterative denoising appears after this list).
Evaluations on a Chinese singing dataset demonstrate that DiffSinger outperforms state-of-the-art SVS work by a notable margin.
arXiv Detail & Related papers (2021-05-06T05:21:42Z)
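As a companion to the DiffSinger entry above, the following toy loop shows what "iteratively converts noise into a mel-spectrogram" means in code. This is the generic DDPM ancestral-sampling recursion, not DiffSinger's actual network or its shallow-diffusion variant; the denoiser, noise schedule, and conditioning are hypothetical stand-ins.

```python
# Toy DDPM-style reverse process: start from Gaussian noise and repeatedly
# apply a learned noise predictor, conditioned on the music score, to
# recover a mel-spectrogram. `denoiser`, `score_cond`, and `betas` are
# illustrative assumptions, e.g. betas = torch.linspace(1e-4, 0.02, 100).
import torch


@torch.no_grad()
def reverse_diffusion(denoiser, score_cond, shape, betas):
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)  # x_T ~ N(0, I)
    for t in reversed(range(len(betas))):
        # Predict the noise component of x_t, then form the posterior mean.
        eps = denoiser(x, torch.full((shape[0],), t), score_cond)
        mean = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / (
            torch.sqrt(alphas[t])
        )
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise  # sample x_{t-1}
    return x  # predicted mel-spectrogram
```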