Improved Prosodic Clustering for Multispeaker and Speaker-independent
Phoneme-level Prosody Control
- URL: http://arxiv.org/abs/2111.10168v1
- Date: Fri, 19 Nov 2021 11:43:59 GMT
- Title: Improved Prosodic Clustering for Multispeaker and Speaker-independent
Phoneme-level Prosody Control
- Authors: Myrsini Christidou, Alexandra Vioni, Nikolaos Ellinas, Georgios
Vamvoukakis, Konstantinos Markopoulos, Panos Kakoulidis, June Sig Sung,
Hyoungmin Park, Aimilios Chalamandaris, Pirros Tsiakoulis
- Abstract summary: This paper presents a method for phoneme-level prosody control of F0 and duration on a multispeaker text-to-speech setup.
An autoregressive attention-based model is used, incorporating multispeaker architecture modules in parallel to a prosody encoder.
- Score: 48.3671993252296
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents a method for phoneme-level prosody control of F0 and
duration on a multispeaker text-to-speech setup, which is based on prosodic
clustering. An autoregressive attention-based model is used, incorporating
multispeaker architecture modules in parallel to a prosody encoder. Several
improvements over the basic single-speaker method are proposed that increase
the prosodic control range and coverage. More specifically, we employ data
augmentation, F0 normalization, balanced clustering for duration, and
speaker-independent prosodic clustering. These modifications enable
fine-grained phoneme-level prosody control for all speakers contained in the
training set, while maintaining the speaker identity. The model is also
fine-tuned to unseen speakers with limited amounts of data, and it is shown to
maintain its prosody control capabilities, verifying that the
speaker-independent prosodic clustering is effective. Experimental results
verify that the model maintains high output speech quality and that the
proposed method allows efficient prosody control within each speaker's range
despite the variability that a multispeaker setting introduces.
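The two preprocessing ideas named in the abstract, speaker-wise F0 normalization and balanced clustering of durations, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, and quantile binning is used here as one simple way to obtain balanced clusters; the paper may use a different clustering algorithm.

```python
import numpy as np

def normalize_f0(f0, speaker_ids):
    """Z-score F0 per speaker, so that cluster labels derived from the
    normalized values are comparable across speakers (speaker-independent)."""
    f0 = np.asarray(f0, dtype=float)
    speaker_ids = np.asarray(speaker_ids)
    out = np.empty_like(f0)
    for spk in np.unique(speaker_ids):
        mask = speaker_ids == spk
        out[mask] = (f0[mask] - f0[mask].mean()) / (f0[mask].std() + 1e-8)
    return out

def balanced_duration_clusters(durations, n_clusters=5):
    """Assign each phoneme duration to one of n_clusters bins with roughly
    equal occupancy, by splitting at the empirical quantiles."""
    durations = np.asarray(durations, dtype=float)
    edges = np.quantile(durations, np.linspace(0, 1, n_clusters + 1)[1:-1])
    return np.digitize(durations, edges)
```

Normalizing F0 per speaker before clustering means a single set of cluster centroids can serve all speakers, while quantile-based binning prevents skewed duration distributions from starving some clusters of training examples.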
Related papers
- Leveraging Speaker Embeddings in End-to-End Neural Diarization for Two-Speaker Scenarios [0.9094127664014627]
End-to-end neural speaker diarization systems are able to address the speaker diarization task while effectively handling speech overlap.
This work explores the incorporation of speaker information embeddings into the end-to-end systems to enhance the speaker discriminative capabilities.
arXiv Detail & Related papers (2024-07-01T14:26:28Z)
- Controllable speech synthesis by learning discrete phoneme-level prosodic representations [53.926969174260705]
We present a novel method for phoneme-level prosody control of F0 and duration using intuitive discrete labels.
We propose an unsupervised prosodic clustering process which is used to discretize phoneme-level F0 and duration features from a multispeaker speech dataset.
arXiv Detail & Related papers (2022-11-29T15:43:36Z)
- Self supervised learning for robust voice cloning [3.7989740031754806]
We use features learned in a self-supervised framework to produce high quality speech representations.
The learned features are used as pre-trained utterance-level embeddings and as inputs to a Non-Attentive Tacotron based architecture.
This method enables us to train our model in an unlabeled multispeaker dataset as well as use unseen speaker embeddings to copy a speaker's voice.
arXiv Detail & Related papers (2022-04-07T13:05:24Z)
- Prosodic Clustering for Phoneme-level Prosody Control in End-to-End Speech Synthesis [49.6007376399981]
We present a method for controlling the prosody at the phoneme level in an autoregressive attention-based text-to-speech system.
The proposed method retains the high quality of generated speech, while allowing phoneme-level control of F0 and duration.
By replacing the F0 cluster centroids with musical notes, the model can also provide control over the note and octave within the range of the speaker.
arXiv Detail & Related papers (2021-11-19T12:10:16Z)
- Zero-Shot Personalized Speech Enhancement through Speaker-Informed Model Selection [25.05285328404576]
Optimizing speech enhancement towards a particular test-time speaker can improve performance and reduce run-time complexity.
We propose using an ensemble model wherein each specialist module denoises noisy utterances from a distinct partition of training set speakers.
Grouping the training set speakers into non-overlapping semantically similar groups is non-trivial and ill-defined.
arXiv Detail & Related papers (2021-05-08T00:15:57Z)
- Self-supervised Text-independent Speaker Verification using Prototypical Momentum Contrastive Learning [58.14807331265752]
We show that better speaker embeddings can be learned by momentum contrastive learning.
We generalize the self-supervised framework to a semi-supervised scenario where only a small portion of the data is labeled.
arXiv Detail & Related papers (2020-12-13T23:23:39Z)
- Target-Speaker Voice Activity Detection: a Novel Approach for Multi-Speaker Diarization in a Dinner Party Scenario [51.50631198081903]
We propose a novel Target-Speaker Voice Activity Detection (TS-VAD) approach.
TS-VAD directly predicts the activity of each speaker on each time frame.
Experiments on the CHiME-6 unsegmented data show that TS-VAD achieves state-of-the-art results.
arXiv Detail & Related papers (2020-05-14T21:24:56Z)
- Speaker Diarization with Lexical Information [59.983797884955]
This work presents a novel approach for speaker diarization to leverage lexical information provided by automatic speech recognition.
We propose a speaker diarization system that can incorporate word-level speaker turn probabilities with speaker embeddings into a speaker clustering process to improve the overall diarization accuracy.
arXiv Detail & Related papers (2020-04-13T17:16:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.