Controllable speech synthesis by learning discrete phoneme-level
prosodic representations
- URL: http://arxiv.org/abs/2211.16307v1
- Date: Tue, 29 Nov 2022 15:43:36 GMT
- Authors: Nikolaos Ellinas, Myrsini Christidou, Alexandra Vioni, June Sig Sung,
Aimilios Chalamandaris, Pirros Tsiakoulis, Paris Mastorocostas
- Abstract summary: We present a novel method for phoneme-level prosody control of F0 and duration using intuitive discrete labels.
We propose an unsupervised prosodic clustering process which is used to discretize phoneme-level F0 and duration features from a multispeaker speech dataset.
- Score: 53.926969174260705
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In this paper, we present a novel method for phoneme-level prosody control of
F0 and duration using intuitive discrete labels. We propose an unsupervised
prosodic clustering process which is used to discretize phoneme-level F0 and
duration features from a multispeaker speech dataset. These features are fed as
an input sequence of prosodic labels to a prosody encoder module which augments
an autoregressive attention-based text-to-speech model. We utilize various
methods in order to improve prosodic control range and coverage, such as
augmentation, F0 normalization, balanced clustering for duration and
speaker-independent clustering. The final model enables fine-grained
phoneme-level prosody control for all speakers contained in the training set,
while maintaining the speaker identity. Instead of relying on reference
utterances for inference, we introduce a prior prosody encoder which learns the
style of each speaker and enables speech synthesis without the requirement of
reference audio. We also fine-tune the multispeaker model to unseen speakers
with limited amounts of data as a realistic application scenario, and show that
the prosody control capabilities are maintained, verifying that the
speaker-independent prosodic clustering is effective. Experimental results show
that the model has high output speech quality and that the proposed method
allows efficient prosody control within each speaker's range despite the
variability that a multispeaker setting introduces.
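As an illustration only, the discretization step described in the abstract could be sketched as follows. The abstract does not spell out the exact algorithms, so this sketch assumes per-speaker z-score normalization for F0, plain 1-D k-means on the pooled normalized values as the speaker-independent clustering, and equal-count quantile bins as the balanced clustering for duration. All function names, the cluster counts, and the synthetic data are hypothetical and are not taken from the paper.

```python
import numpy as np


def normalize_f0_per_speaker(f0, speaker_ids):
    """Z-score each phoneme-level F0 value with its own speaker's mean and std."""
    f0 = np.asarray(f0, dtype=float)
    speaker_ids = np.asarray(speaker_ids)
    out = np.empty_like(f0)
    for spk in np.unique(speaker_ids):
        mask = speaker_ids == spk
        std = f0[mask].std()
        out[mask] = (f0[mask] - f0[mask].mean()) / (std if std > 0 else 1.0)
    return out


def kmeans_1d(values, n_clusters, n_iter=50):
    """Plain 1-D k-means; returned centroids are sorted, so label 0 is the lowest."""
    values = np.asarray(values, dtype=float)
    # Deterministic initialisation at evenly spaced quantiles of the data.
    centroids = np.quantile(values, (np.arange(n_clusters) + 0.5) / n_clusters)
    for _ in range(n_iter):
        labels = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
        for k in range(n_clusters):
            if np.any(labels == k):
                centroids[k] = values[labels == k].mean()
    # Sort once at the end and relabel so that labels are ordinal.
    centroids = np.sort(centroids)
    labels = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
    return labels, centroids


def balanced_bins(values, n_clusters):
    """Quantile bins: each label covers roughly the same number of phonemes."""
    values = np.asarray(values, dtype=float)
    edges = np.quantile(values, np.arange(1, n_clusters) / n_clusters)
    return np.searchsorted(edges, values, side="right")


# Synthetic phoneme-level features for two hypothetical speakers.
rng = np.random.default_rng(0)
speaker_ids = np.repeat(["low_voice", "high_voice"], 500)
f0_hz = np.concatenate([rng.normal(110, 15, 500), rng.normal(220, 30, 500)])
dur_frames = rng.gamma(shape=2.0, scale=4.0, size=1000)

# Speaker-independent F0 labels: normalize per speaker, then cluster the pool.
f0_norm = normalize_f0_per_speaker(f0_hz, speaker_ids)
f0_labels, f0_centroids = kmeans_1d(f0_norm, n_clusters=5)

# Balanced duration labels: equal-count bins over all phonemes.
dur_labels = balanced_bins(dur_frames, n_clusters=5)
```

Under these assumptions, the same F0 label means "low relative to this speaker's own range" for every speaker, which is one way to read the claim that control works within each speaker's range. The resulting `f0_labels` and `dur_labels` play the role of the discrete prosodic label sequence that a prosody encoder would consume.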
Related papers
- Zero-shot text-to-speech synthesis conditioned using self-supervised
speech representation model [13.572330725278066]
A novel point of the proposed method is the direct use of the SSL model to obtain embedding vectors from speech representations trained with a large amount of data.
The disentangled embeddings will enable us to achieve better reproduction performance for unseen speakers and rhythm transfer conditioned by different speeches.
arXiv Detail & Related papers (2023-04-24T10:15:58Z)
- Prosody Cloning in Zero-Shot Multispeaker Text-to-Speech [25.707717591185386]
We show that it is possible to clone the voice of a speaker as well as the prosody of a spoken reference independently without any degradation in quality.
All of our code and trained models are available, alongside static and interactive demos.
arXiv Detail & Related papers (2022-06-24T11:54:59Z)
- Self supervised learning for robust voice cloning [3.7989740031754806]
We use features learned in a self-supervised framework to produce high quality speech representations.
The learned features are used as pre-trained utterance-level embeddings and as inputs to a Non-Attentive Tacotron based architecture.
This method enables us to train our model on an unlabeled multispeaker dataset as well as use unseen speaker embeddings to copy a speaker's voice.
arXiv Detail & Related papers (2022-04-07T13:05:24Z)
- Prosodic Clustering for Phoneme-level Prosody Control in End-to-End
Speech Synthesis [49.6007376399981]
We present a method for controlling the prosody at the phoneme level in an autoregressive attention-based text-to-speech system.
The proposed method retains the high quality of generated speech, while allowing phoneme-level control of F0 and duration.
By replacing the F0 cluster centroids with musical notes, the model can also provide control over the note and octave within the range of the speaker.
arXiv Detail & Related papers (2021-11-19T12:10:16Z)
- Improved Prosodic Clustering for Multispeaker and Speaker-independent
Phoneme-level Prosody Control [48.3671993252296]
This paper presents a method for phoneme-level prosody control of F0 and duration on a multispeaker text-to-speech setup.
An autoregressive attention-based model is used, incorporating multispeaker architecture modules in parallel to a prosody encoder.
arXiv Detail & Related papers (2021-11-19T11:43:59Z)
- Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence
Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
- Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis
Using Discrete Speech Representation [125.59372403631006]
We propose a semi-supervised learning approach for multi-speaker text-to-speech (TTS).
A multi-speaker TTS model can learn from the untranscribed audio via the proposed encoder-decoder framework with discrete speech representation.
We found the model can benefit from the proposed semi-supervised learning approach even when part of the unpaired speech data is noisy.
arXiv Detail & Related papers (2020-05-16T15:47:11Z)
- Speaker Diarization with Lexical Information [59.983797884955]
This work presents a novel approach for speaker diarization to leverage lexical information provided by automatic speech recognition.
We propose a speaker diarization system that can incorporate word-level speaker turn probabilities with speaker embeddings into a speaker clustering process to improve the overall diarization accuracy.
arXiv Detail & Related papers (2020-04-13T17:16:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.