Prosodic Clustering for Phoneme-level Prosody Control in End-to-End
Speech Synthesis
- URL: http://arxiv.org/abs/2111.10177v1
- Date: Fri, 19 Nov 2021 12:10:16 GMT
- Title: Prosodic Clustering for Phoneme-level Prosody Control in End-to-End
Speech Synthesis
- Authors: Alexandra Vioni, Myrsini Christidou, Nikolaos Ellinas, Georgios
Vamvoukakis, Panos Kakoulidis, Taehoon Kim, June Sig Sung, Hyoungmin Park,
Aimilios Chalamandaris, Pirros Tsiakoulis
- Abstract summary: We present a method for controlling the prosody at the phoneme level in an autoregressive attention-based text-to-speech system.
The proposed method retains the high quality of generated speech, while allowing phoneme-level control of F0 and duration.
By replacing the F0 cluster centroids with musical notes, the model can also provide control over the note and octave within the range of the speaker.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents a method for controlling the prosody at the phoneme level
in an autoregressive attention-based text-to-speech system. Instead of learning
latent prosodic features with a variational framework as is commonly done, we
directly extract phoneme-level F0 and duration features from the speech data in
the training set. Each prosodic feature is discretized using unsupervised
clustering in order to produce a sequence of prosodic labels for each
utterance. This sequence is used in parallel to the phoneme sequence in order
to condition the decoder with the utilization of a prosodic encoder and a
corresponding attention module. Experimental results show that the proposed
method retains the high quality of generated speech, while allowing
phoneme-level control of F0 and duration. By replacing the F0 cluster centroids
with musical notes, the model can also provide control over the note and octave
within the range of the speaker.
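The two key steps of the abstract — discretizing phoneme-level F0 via unsupervised clustering, and replacing F0 cluster centroids with musical notes — can be illustrated with a minimal sketch. This is not the paper's exact pipeline (the paper conditions a Tacotron-style decoder on the resulting label sequence); the 1-D k-means below and the helper names `kmeans_1d` and `f0_to_note` are illustrative assumptions, and the note mapping uses the standard equal-temperament MIDI formula.

```python
import math
import random

def kmeans_1d(values, k, iters=50, seed=0):
    """Simple 1-D k-means: discretize scalar prosodic features
    (e.g. phoneme-level F0 in Hz) into k cluster labels."""
    rng = random.Random(seed)
    centroids = sorted(rng.sample(values, k))
    for _ in range(iters):
        # Assign each value to its nearest centroid.
        buckets = [[] for _ in range(k)]
        for v in values:
            i = min(range(k), key=lambda j: abs(v - centroids[j]))
            buckets[i].append(v)
        # Recompute centroids (keep the old one if a bucket is empty).
        centroids = [sum(b) / len(b) if b else centroids[j]
                     for j, b in enumerate(buckets)]
    labels = [min(range(k), key=lambda j: abs(v - centroids[j]))
              for v in values]
    return centroids, labels

def f0_to_note(f0_hz):
    """Map an F0 value (Hz) to the nearest equal-tempered note name,
    mirroring the idea of replacing centroids with musical notes."""
    names = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
    midi = round(69 + 12 * math.log2(f0_hz / 440.0))  # A4 = 440 Hz = MIDI 69
    return f"{names[midi % 12]}{midi // 12 - 1}"

# Toy phoneme-level F0 sequence (Hz) for one utterance.
f0 = [110.0, 112.0, 220.0, 218.0, 330.0, 328.0]
centroids, labels = kmeans_1d(f0, k=3)   # prosodic label per phoneme
notes = [f0_to_note(c) for c in centroids]  # centroid -> note/octave
```

In the paper's setup, the resulting label sequence runs in parallel to the phoneme sequence and conditions the decoder through a separate prosodic encoder and attention module; duration features are discretized the same way.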
Related papers
- Style Description based Text-to-Speech with Conditional Prosodic Layer Normalization based Diffusion GAN [17.876323494898536]
We present a Diffusion GAN based approach (Prosodic Diff-TTS) that takes a style description and content text as input and generates high-fidelity speech samples within only 4 denoising steps.
We demonstrate the efficacy of the proposed architecture on the multi-speaker LibriTTS and PromptSpeech datasets, using multiple quantitative metrics that measure generation accuracy and MOS.
arXiv Detail & Related papers (2023-10-27T14:28:41Z)
- High-Quality Automatic Voice Over with Accurate Alignment: Supervision through Self-Supervised Discrete Speech Units [69.06657692891447]
We propose a novel AVO method leveraging the learning objective of self-supervised discrete speech unit prediction.
Experimental results show that our proposed method achieves remarkable lip-speech synchronization and high speech quality.
arXiv Detail & Related papers (2023-06-29T15:02:22Z)
- Zero-shot text-to-speech synthesis conditioned using self-supervised speech representation model [13.572330725278066]
A novel point of the proposed method is the direct use of the SSL model to obtain embedding vectors from speech representations trained with a large amount of data.
The disentangled embeddings enable better reproduction performance for unseen speakers, as well as rhythm transfer conditioned on different speech samples.
arXiv Detail & Related papers (2023-04-24T10:15:58Z)
- Controllable speech synthesis by learning discrete phoneme-level prosodic representations [53.926969174260705]
We present a novel method for phoneme-level prosody control of F0 and duration using intuitive discrete labels.
We propose an unsupervised prosodic clustering process which is used to discretize phoneme-level F0 and duration features from a multispeaker speech dataset.
arXiv Detail & Related papers (2022-11-29T15:43:36Z)
- Singing-Tacotron: Global duration control attention and dynamic filter for End-to-end singing voice synthesis [67.96138567288197]
This paper proposes an end-to-end singing voice synthesis framework, named Singing-Tacotron.
The main difference between the proposed framework and Tacotron is that the generated speech can be controlled directly by the duration information in the musical score.
arXiv Detail & Related papers (2022-02-16T07:35:17Z)
- Improved Prosodic Clustering for Multispeaker and Speaker-independent Phoneme-level Prosody Control [48.3671993252296]
This paper presents a method for phoneme-level prosody control of F0 and duration on a multispeaker text-to-speech setup.
An autoregressive attention-based model is used, incorporating multispeaker architecture modules in parallel to a prosody encoder.
arXiv Detail & Related papers (2021-11-19T11:43:59Z)
- SCaLa: Supervised Contrastive Learning for End-to-End Automatic Speech Recognition [36.766303689895686]
This paper proposes a novel framework of Supervised Contrastive Learning (SCaLa) to enhance phonemic information learning for end-to-end ASR systems.
Specifically, we introduce the self-supervised Masked Contrastive Predictive Coding (MCPC) into the fully-supervised setting.
To supervise phoneme learning explicitly, SCaLa first masks the variable-length encoder features corresponding to phonemes given phoneme forced-alignment extracted from a pre-trained acoustic model, and then predicts the masked phonemes via contrastive learning.
arXiv Detail & Related papers (2021-10-08T15:15:38Z)
- Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.