Parametric Representation for Singing Voice Synthesis: a Comparative Evaluation
- URL: http://arxiv.org/abs/2006.04142v1
- Date: Sun, 7 Jun 2020 13:06:30 GMT
- Title: Parametric Representation for Singing Voice Synthesis: a Comparative Evaluation
- Authors: Onur Babacan, Thomas Drugman, Tuomo Raitio, Daniel Erro, Thierry Dutoit
- Abstract summary: The goal of this paper is twofold. First, a comparative subjective evaluation is performed across four existing techniques suitable for statistical parametric synthesis.
Second, the artifacts occurring in high-pitched voices are discussed, and possible approaches to overcome them are suggested.
- Score: 10.37199090634032
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Various parametric representations have been proposed to model the speech
signal. While the performance of such vocoders is well-known in the context of
speech processing, their extrapolation to singing voice synthesis might not be
straightforward. The goal of this paper is twofold. First, a comparative
subjective evaluation is performed across four existing techniques suitable for
statistical parametric synthesis: traditional pulse vocoder, Deterministic plus
Stochastic Model, Harmonic plus Noise Model and GlottHMM. The behavior of these
techniques as a function of the singer type (baritone, counter-tenor and
soprano) is studied. Second, the artifacts occurring in high-pitched voices
are discussed and possible approaches to overcome them are suggested.
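To make the first of these baselines concrete, below is a minimal, illustrative sketch of pulse-vocoder-style synthesis; it is not the authors' implementation, and the one-pole envelope and all parameter values are assumptions for illustration. It also hints at why high-pitched voices are problematic: at soprano-range F0, the widely spaced harmonics sample the spectral envelope only sparsely.

```python
import numpy as np
from scipy.signal import lfilter

def pulse_vocoder_frame(f0, env_coeffs, fs=16000, frame_len=1024):
    """Synthesize one voiced frame: an impulse train at f0 filtered by an
    all-pole spectral envelope (a stand-in for the envelope a statistical
    parametric system would predict, e.g. from LPC or mel-cepstral analysis)."""
    period = int(round(fs / f0))        # samples between glottal pulses
    excitation = np.zeros(frame_len)
    excitation[::period] = 1.0          # unit impulse train at the pitch period
    # All-pole filter 1/A(z); env_coeffs = [1, a1, a2, ...] as in LPC synthesis.
    return lfilter([1.0], env_coeffs, excitation)

envelope = np.array([1.0, -0.9])        # toy one-pole envelope (assumption)
baritone = pulse_vocoder_frame(f0=110.0, env_coeffs=envelope)
# At soprano-range F0 only a few harmonics fit below Nyquist, so the envelope
# is sampled sparsely -- one source of the high-pitch artifacts discussed.
soprano = pulse_vocoder_frame(f0=1000.0, env_coeffs=envelope)
```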
Related papers
- Hierarchical Generative Modeling of Melodic Vocal Contours in Hindustani Classical Music [3.491362957652171]
We focus on generative modeling of singers' vocal melodies extracted from audio recordings.
We propose GaMaDHaNi, a modular two-level hierarchy consisting of a generative model over pitch contours and a pitch-contour-to-audio synthesis model.
arXiv Detail & Related papers (2024-08-22T18:04:29Z)
- End-to-End Real-World Polyphonic Piano Audio-to-Score Transcription with Hierarchical Decoding [4.604877755214193]
Existing end-to-end piano A2S systems have been trained and evaluated with only synthetic data.
We propose a sequence-to-sequence (Seq2Seq) model with a hierarchical decoder that aligns with the hierarchical structure of musical scores.
We propose a two-stage training scheme, which involves pre-training the model using an expressive performance rendering system on synthetic audio, followed by fine-tuning the model using recordings of human performance.
arXiv Detail & Related papers (2024-05-22T10:52:04Z)
- High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
The non-autoregressive framework enhances controllability, and the duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z)
- Enhancing the vocal range of single-speaker singing voice synthesis with melody-unsupervised pre-training [82.94349771571642]
This work proposes a melody-unsupervised multi-speaker pre-training method to enhance the vocal range of a single speaker.
It is the first to introduce a differentiable duration regulator to improve the rhythm naturalness of the synthesized voice.
Experimental results verify that the proposed SVS system outperforms the baseline on both sound quality and naturalness.
arXiv Detail & Related papers (2023-09-01T06:40:41Z)
- Towards Improving the Expressiveness of Singing Voice Synthesis with BERT Derived Semantic Information [51.02264447897833]
This paper presents an end-to-end high-quality singing voice synthesis (SVS) system that uses bidirectional encoder representation from Transformers (BERT) derived semantic embeddings.
The proposed SVS system produces higher-quality singing voices, outperforming VISinger.
arXiv Detail & Related papers (2023-08-31T16:12:01Z)
- Make-A-Voice: Unified Voice Synthesis With Discrete Representation [77.3998611565557]
Make-A-Voice is a unified framework for synthesizing and manipulating voice signals from discrete representations.
We show that Make-A-Voice exhibits superior audio quality and style similarity compared with competitive baseline models.
arXiv Detail & Related papers (2023-05-30T17:59:26Z)
- Speaker Adaption with Intuitive Prosodic Features for Statistical Parametric Speech Synthesis [50.5027550591763]
We propose a method of speaker adaption with intuitive prosodic features for statistical parametric speech synthesis.
The intuitive prosodic features are extracted at utterance-level or speaker-level, and are further integrated into the existing speaker-encoding-based and speaker-embedding-based adaptation frameworks respectively.
arXiv Detail & Related papers (2022-03-02T09:00:31Z)
- Learning Joint Articulatory-Acoustic Representations with Normalizing Flows [7.183132975698293]
We find a joint latent representation between the articulatory and acoustic domains for vowel sounds via invertible neural network models.
Our approach achieves both articulatory-to-acoustic and acoustic-to-articulatory mapping, demonstrating a joint encoding of the two domains (see the coupling-layer sketch after this list).
arXiv Detail & Related papers (2020-05-16T04:34:36Z)
- VaPar Synth -- A Variational Parametric Model for Audio Synthesis [78.3405844354125]
We present VaPar Synth - a Variational Parametric Synthesizer which utilizes a conditional variational autoencoder (CVAE) trained on a suitable parametric representation.
We demonstrate our proposed model's capabilities via the reconstruction and generation of instrumental tones with flexible control over their pitch (a minimal CVAE sketch follows this list).
arXiv Detail & Related papers (2020-03-30T16:05:47Z)
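The normalizing-flows entry above relies on invertibility: a single model maps in both directions exactly. As a minimal sketch of that mechanism, here is an affine coupling layer in the RealNVP style; the toy linear scale/shift "networks", the dimensions, and the use of NumPy are assumptions chosen for brevity, not details from the paper. The round trip is exact by construction.

```python
import numpy as np

class AffineCoupling:
    """One RealNVP-style coupling layer: half the vector is rescaled and
    shifted using values computed from the other half, so inversion is exact."""
    def __init__(self, dim, rng):
        self.half = dim // 2
        # Toy linear "networks" for scale and shift (a real flow would stack
        # nonlinear layers here); random weights, purely for illustration.
        self.w_s = 0.1 * rng.standard_normal((self.half, dim - self.half))
        self.w_t = 0.1 * rng.standard_normal((self.half, dim - self.half))

    def forward(self, x):                 # e.g. articulatory -> acoustic
        x1, x2 = x[:self.half], x[self.half:]
        s, t = x1 @ self.w_s, x1 @ self.w_t
        return np.concatenate([x1, x2 * np.exp(s) + t])

    def inverse(self, y):                 # e.g. acoustic -> articulatory
        y1, y2 = y[:self.half], y[self.half:]
        s, t = y1 @ self.w_s, y1 @ self.w_t
        return np.concatenate([y1, (y2 - t) * np.exp(-s)])

rng = np.random.default_rng(0)
layer = AffineCoupling(dim=8, rng=rng)
x = rng.standard_normal(8)
assert np.allclose(layer.inverse(layer.forward(x)), x)  # exact round trip
```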
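Similarly, the VaPar Synth entry centers on a conditional variational autoencoder over a parametric representation, with pitch as the condition so that generation can be steered. Below is a minimal PyTorch-style sketch; the layer sizes, the use of log-F0 as the condition, and all names are illustrative assumptions rather than details from the paper.

```python
import torch
import torch.nn as nn

class ConditionalVAE(nn.Module):
    """Minimal CVAE: encode a parametric frame conditioned on pitch, and
    decode from (latent, pitch) so pitch can be controlled at generation."""
    def __init__(self, frame_dim=64, cond_dim=1, latent_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(frame_dim + cond_dim, 128), nn.ReLU(),
            nn.Linear(128, 2 * latent_dim))      # outputs [mu, log_var]
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + cond_dim, 128), nn.ReLU(),
            nn.Linear(128, frame_dim))
        self.latent_dim = latent_dim

    def forward(self, frame, pitch):
        stats = self.encoder(torch.cat([frame, pitch], dim=-1))
        mu, log_var = stats.chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)  # reparam.
        recon = self.decoder(torch.cat([z, pitch], dim=-1))
        return recon, mu, log_var

    def generate(self, pitch):
        z = torch.randn(pitch.shape[0], self.latent_dim)
        return self.decoder(torch.cat([z, pitch], dim=-1))

model = ConditionalVAE()
frames = torch.randn(4, 64)                 # toy parametric frames
pitch = torch.full((4, 1), 440.0).log()     # log-F0 as the condition
recon, mu, log_var = model(frames, pitch)
sample = model.generate(pitch)              # pitch-controlled generation
```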
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences arising from its use.