Speaker Adaption with Intuitive Prosodic Features for Statistical
Parametric Speech Synthesis
- URL: http://arxiv.org/abs/2203.00951v1
- Date: Wed, 2 Mar 2022 09:00:31 GMT
- Title: Speaker Adaption with Intuitive Prosodic Features for Statistical
Parametric Speech Synthesis
- Authors: Pengyu Cheng and Zhenhua Ling
- Abstract summary: We propose a method of speaker adaptation with intuitive prosodic features for statistical parametric speech synthesis.
The intuitive prosodic features are extracted at utterance-level or speaker-level, and are further integrated into the existing speaker-encoding-based and speaker-embedding-based adaptation frameworks respectively.
- Score: 50.5027550591763
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose a method of speaker adaptation with
intuitive prosodic features for statistical parametric speech synthesis. The
intuitive prosodic features employed in this method are pitch, pitch range,
speech rate, and energy, chosen because they are directly related to the
overall prosodic characteristics of different speakers. These features are
extracted at the utterance level or the speaker level, and are integrated
into the existing speaker-encoding-based and speaker-embedding-based
adaptation frameworks, respectively. The acoustic models are
sequence-to-sequence ones based on Tacotron2. The intuitive prosodic features
are concatenated with the text encoder outputs and speaker vectors for
decoding acoustic features. Experimental results demonstrate that the
proposed methods achieve better objective and subjective performance than
baseline methods without intuitive prosodic features. In addition, the
proposed speaker adaptation method with utterance-level prosodic features
achieves the best speaker similarity of synthetic speech among all compared
methods.
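As a rough illustration of this conditioning, the sketch below is our own
PyTorch approximation, not the authors' code: a 4-dimensional prosodic vector
(pitch, pitch range, speech rate, energy) is broadcast over time and
concatenated with text-encoder outputs and a speaker vector, then projected
back to the decoder width. All module names and dimensions are illustrative
assumptions.

```python
# Minimal sketch (not the authors' code) of prosody + speaker conditioning.
import torch
import torch.nn as nn

class ProsodyConditionedDecoderInput(nn.Module):
    def __init__(self, enc_dim=512, spk_dim=64, prosody_dim=4):
        super().__init__()
        # Project the concatenation back to the decoder's expected width.
        self.proj = nn.Linear(enc_dim + spk_dim + prosody_dim, enc_dim)

    def forward(self, enc_out, spk_vec, prosody):
        # enc_out: (batch, T_text, enc_dim) -- Tacotron2-style encoder outputs
        # spk_vec: (batch, spk_dim)         -- speaker embedding or encoding
        # prosody: (batch, prosody_dim)     -- [pitch, pitch_range, rate, energy]
        T = enc_out.size(1)
        spk = spk_vec.unsqueeze(1).expand(-1, T, -1)  # broadcast over time
        pro = prosody.unsqueeze(1).expand(-1, T, -1)
        return self.proj(torch.cat([enc_out, spk, pro], dim=-1))

cond = ProsodyConditionedDecoderInput()
enc_out = torch.randn(2, 37, 512)
spk_vec = torch.randn(2, 64)
prosody = torch.randn(2, 4)
print(cond(enc_out, spk_vec, prosody).shape)  # torch.Size([2, 37, 512])
```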
Related papers
- Speech Rhythm-Based Speaker Embeddings Extraction from Phonemes and Phoneme Duration for Multi-Speaker Speech Synthesis [16.497022070614236]
This paper proposes a speech rhythm-based method that extracts speaker embeddings for modeling phoneme duration from only a few utterances by the target speaker.
A novel feature of the method is rhythm-based embeddings extracted from phonemes and their durations, which are known to be related to speaking rhythm.
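As a loose illustration of how phoneme identities and durations can be pooled into a fixed-size rhythm vector, here is a sketch under our own assumptions; the feature design and phoneme inventory size are hypothetical, not the paper's architecture.

```python
# Minimal sketch: summarize phoneme durations into a fixed-size rhythm vector.
import numpy as np

def rhythm_embedding(phoneme_ids, durations_ms, num_phonemes=40):
    # phoneme_ids: phoneme indices; durations_ms: matching durations (ms).
    emb = np.zeros(num_phonemes + 2)
    d = np.asarray(durations_ms, dtype=float)
    for p, dur in zip(phoneme_ids, d):
        emb[p] += dur                    # total duration mass per phoneme
    total = emb[:num_phonemes].sum()
    if total > 0:
        emb[:num_phonemes] /= total      # normalize to a duration distribution
    emb[num_phonemes] = d.mean()         # global tempo
    emb[num_phonemes + 1] = d.std()      # tempo variability
    return emb

print(rhythm_embedding([3, 7, 3, 12], [80, 120, 95, 60]).shape)  # (42,)
```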
arXiv Detail & Related papers (2024-02-11T02:26:43Z)
- ELF: Encoding Speaker-Specific Latent Speech Feature for Speech Synthesis [5.824018496599849]
We propose a novel method for modeling a large number of speakers.
It expresses the overall characteristics of speakers in as much detail as a trained multi-speaker model.
arXiv Detail & Related papers (2023-11-20T13:13:24Z)
- Improving Speaker Diarization using Semantic Information: Joint Pairwise Constraints Propagation [53.01238689626378]
We propose a novel approach that leverages semantic information in speaker diarization systems.
We introduce spoken language understanding modules to extract speaker-related semantic information and derive pairwise constraints from it.
We present a novel framework that integrates these constraints into the speaker diarization pipeline.
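The paper's constraint propagation is more elaborate, but a minimal sketch of the underlying idea, injecting must-link / cannot-link pairwise constraints into a segment affinity matrix before clustering, might look like this (all names are our own illustration, not the paper's framework):

```python
# Minimal sketch: apply pairwise constraints to a diarization affinity matrix.
import numpy as np

def apply_pairwise_constraints(affinity, must_link, cannot_link):
    # affinity: (N, N) symmetric segment-similarity matrix in [0, 1]
    A = affinity.copy()
    for i, j in must_link:       # force segment pairs into the same cluster
        A[i, j] = A[j, i] = 1.0
    for i, j in cannot_link:     # force segment pairs apart
        A[i, j] = A[j, i] = 0.0
    return A

A = np.random.rand(5, 5); A = (A + A.T) / 2
A2 = apply_pairwise_constraints(A, must_link=[(0, 1)], cannot_link=[(0, 4)])
print(A2[0, 1], A2[0, 4])  # 1.0 0.0
```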
arXiv Detail & Related papers (2023-09-19T09:13:30Z)
- Zero-shot text-to-speech synthesis conditioned using self-supervised speech representation model [13.572330725278066]
A novel point of the proposed method is the direct use of an SSL model, trained on a large amount of data, to obtain embedding vectors from speech representations.
The disentangled embeddings enable better reproduction of unseen speakers and rhythm transfer conditioned on different reference utterances.
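As a stand-in illustration of obtaining an utterance-level embedding from a pretrained SSL model, the sketch below mean-pools wav2vec 2.0 features via torchaudio; the paper's actual SSL model and disentanglement scheme may differ.

```python
# Minimal sketch: utterance-level embedding from a pretrained SSL model.
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_BASE
model = bundle.get_model().eval()

waveform = torch.randn(1, bundle.sample_rate)   # 1 s of dummy audio
with torch.no_grad():
    features, _ = model.extract_features(waveform)
embedding = features[-1].mean(dim=1)            # (1, 768) utterance vector
print(embedding.shape)
```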
arXiv Detail & Related papers (2023-04-24T10:15:58Z)
- Controllable speech synthesis by learning discrete phoneme-level prosodic representations [53.926969174260705]
We present a novel method for phoneme-level prosody control of F0 and duration using intuitive discrete labels.
We propose an unsupervised prosodic clustering process which is used to discretize phoneme-level F0 and duration features from a multispeaker speech dataset.
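A minimal sketch of this discretization idea: cluster phoneme-level (F0, duration) pairs with k-means and use the cluster indices as discrete prosody labels. The normalization and cluster count here are our own assumptions, not the paper's settings.

```python
# Minimal sketch: k-means over phoneme-level prosodic features.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
f0 = rng.normal(180, 40, size=1000)     # per-phoneme mean F0 (Hz)
dur = rng.normal(90, 25, size=1000)     # per-phoneme duration (ms)
X = np.stack([f0, dur], axis=1)
X = (X - X.mean(0)) / X.std(0)          # z-score each dimension

kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X)
labels = kmeans.predict(X)              # discrete prosody label per phoneme
print(labels[:10])
```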
arXiv Detail & Related papers (2022-11-29T15:43:36Z)
- AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios [143.47967241972995]
We develop AdaSpeech 4, a zero-shot adaptive TTS system for high-quality speech synthesis.
We model the speaker characteristics systematically to improve the generalization on new speakers.
Without any fine-tuning, AdaSpeech 4 achieves better voice quality and similarity than baselines on multiple datasets.
arXiv Detail & Related papers (2022-04-01T13:47:44Z)
- Meta-TTS: Meta-Learning for Few-Shot Speaker Adaptive Text-to-Speech [62.95422526044178]
We use Model Agnostic Meta-Learning (MAML) as the training algorithm of a multi-speaker TTS model.
We show that Meta-TTS can synthesize high speaker-similarity speech from few enrollment samples with fewer adaptation steps than the speaker adaptation baseline.
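MAML itself is generic; the sketch below applies it to a toy regression problem (not the Meta-TTS model) to show the two nested updates: an inner adaptation step per task ("speaker") and an outer update on the post-adaptation loss.

```python
# Minimal MAML sketch on toy regression; tasks stand in for speakers.
import torch

w = torch.zeros(1, requires_grad=True)         # meta-parameter of y = w * x
meta_opt = torch.optim.SGD([w], lr=0.01)
inner_lr = 0.1

def task_batch(slope):
    x = torch.randn(16)
    return x, slope * x                        # each task = one target slope

for step in range(200):
    meta_opt.zero_grad()
    for slope in (1.0, 2.0, 3.0):              # sampled tasks ("speakers")
        x_s, y_s = task_batch(slope)           # support set (enrollment data)
        loss_s = ((w * x_s - y_s) ** 2).mean()
        (g,) = torch.autograd.grad(loss_s, w, create_graph=True)
        w_adapted = w - inner_lr * g           # one inner adaptation step
        x_q, y_q = task_batch(slope)           # query set
        loss_q = ((w_adapted * x_q - y_q) ** 2).mean()
        loss_q.backward()                      # second-order grads flow to w
    meta_opt.step()

print(float(w))  # a value from which one step adapts well to each slope
```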
arXiv Detail & Related papers (2021-11-07T09:53:31Z)
- Improving multi-speaker TTS prosody variance with a residual encoder and normalizing flows [9.515272632173884]
Disentangling speaker identity and prosody is crucial in text-to-speech systems to improve naturalness and produce more varied syntheses.
This paper proposes a new neural text-to-speech model that approaches the disentanglement problem by conditioning a Tacotron2-like architecture on flow-normalized speaker embeddings.
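A minimal sketch of the flow-normalization idea, as our own simplification: a single affine coupling layer maps speaker embeddings toward a Gaussian base distribution. A real model stacks many such layers and trains with a likelihood objective; all sizes here are assumptions.

```python
# Minimal sketch: one affine coupling layer over speaker embeddings.
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, dim=64, hidden=128):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.half, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.half)))

    def forward(self, x):
        # Split; transform the second half conditioned on the first.
        x1, x2 = x[:, :self.half], x[:, self.half:]
        log_s, t = self.net(x1).chunk(2, dim=-1)
        z2 = x2 * torch.exp(log_s) + t
        log_det = log_s.sum(dim=-1)          # change-of-variables term
        return torch.cat([x1, z2], dim=-1), log_det

flow = AffineCoupling()
emb = torch.randn(8, 64)                     # raw speaker embeddings
z, log_det = flow(emb)                       # flow-normalized embeddings
print(z.shape, log_det.shape)                # (8, 64) (8,)
```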
arXiv Detail & Related papers (2021-06-10T14:08:42Z)
- Speech Enhancement using Self-Adaptation and Multi-Head Self-Attention [70.82604384963679]
This paper investigates a self-adaptation method for speech enhancement using auxiliary speaker-aware features.
We extract a speaker representation used for adaptation directly from the test utterance.
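A minimal sketch of this self-adaptation idea: pool a speaker vector directly from the test utterance's own frames and feed it back into the enhancement network alongside each frame. The network and feature choices are illustrative, not the paper's architecture.

```python
# Minimal sketch: speaker-aware conditioning pooled from the test utterance.
import torch
import torch.nn as nn

class TinyEnhancer(nn.Module):
    def __init__(self, n_mels=80, spk_dim=32):
        super().__init__()
        self.spk_pool = nn.Linear(n_mels, spk_dim)   # crude speaker summary
        self.net = nn.Sequential(
            nn.Linear(n_mels + spk_dim, 256), nn.ReLU(),
            nn.Linear(256, n_mels))

    def forward(self, mel):
        # mel: (batch, T, n_mels) log-mel frames of the test utterance itself
        spk = self.spk_pool(mel.mean(dim=1))          # utterance-level vector
        spk = spk.unsqueeze(1).expand(-1, mel.size(1), -1)
        return self.net(torch.cat([mel, spk], dim=-1))  # enhanced frames

net = TinyEnhancer()
print(net(torch.randn(2, 120, 80)).shape)  # torch.Size([2, 120, 80])
```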
arXiv Detail & Related papers (2020-02-14T05:05:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences of its use.