Empirical Study Incorporating Linguistic Knowledge on Filled Pauses for
Personalized Spontaneous Speech Synthesis
- URL: http://arxiv.org/abs/2210.07559v2
- Date: Tue, 19 Sep 2023 06:51:38 GMT
- Authors: Yuta Matsunaga, Takaaki Saeki, Shinnosuke Takamichi and Hiroshi
Saruwatari
- Abstract summary: We focus on personalized spontaneous speech synthesis that can clone both the individual's voice timbre and speech disfluency.
We develop a speech synthesis method with a non-personalized external filled pause predictor trained with a multi-speaker corpus.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We present a comprehensive empirical study for personalized spontaneous
speech synthesis on the basis of linguistic knowledge. With the advent of voice
cloning for reading-style speech synthesis, a new voice cloning paradigm for
human-like and spontaneous speech synthesis is required. We therefore focus
on personalized spontaneous speech synthesis that can clone both an
individual's voice timbre and speech disfluency. Specifically, we deal with
filled pauses, a major source of speech disfluency, which are known in
psychology and linguistics to play an important role in speech generation
and communication. To comparatively evaluate personalized filled pause
insertion and non-personalized filled pause prediction methods, we developed
a speech synthesis method with a non-personalized external filled pause
predictor trained on a multi-speaker corpus. The results clarify the
position-word entanglement of filled pauses: in the evaluation of
synthesized speech, precise position prediction is necessary for
naturalness, and precise word prediction is necessary for individuality.
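The position-word entanglement can be pictured as two separate decisions: *where* to insert a filled pause (which the abstract ties to naturalness) and *which* filler word to insert (which it ties to individuality). The sketch below is a hypothetical two-stage predictor with rule-based stand-ins; the function names, rules, and filler inventory are illustrative assumptions, not the paper's implementation:

```python
# Hypothetical two-stage filled pause predictor. Rule-based stand-ins
# replace the trained models described in the paper.
FILLER_WORDS = ["uh", "um", "eto"]  # "eto" is a common Japanese filler

def predict_positions(words):
    """Return word indices before which a filled pause is inserted.
    Stand-in rule: pause before long, content-like words."""
    return [i for i, w in enumerate(words) if i > 0 and len(w) >= 10]

def predict_word(context_word, speaker_style):
    """Pick a filler word. A personalized model would condition on the
    speaker; the stand-in looks up the speaker's preferred filler."""
    return speaker_style.get("preferred_filler", FILLER_WORDS[0])

def insert_filled_pauses(text, speaker_style):
    """Run position prediction, then word prediction at each position."""
    words = text.split()
    positions = set(predict_positions(words))
    out = []
    for i, w in enumerate(words):
        if i in positions:
            out.append(predict_word(w, speaker_style))
        out.append(w)
    return " ".join(out)

print(insert_filled_pauses(
    "we evaluate personalized spontaneous speech synthesis",
    {"preferred_filler": "um"},
))
```

Separating the two stages makes the entanglement concrete: swapping `predict_positions` changes where pauses fall (naturalness), while swapping `predict_word` changes what the speaker sounds like (individuality).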
Related papers
- Spontaneous Style Text-to-Speech Synthesis with Controllable Spontaneous Behaviors Based on Language Models (2024-07-18): Proposes a spontaneous speech synthesis system based on language models, with fine-grained prosody modeling introduced to capture subtle prosody variations in spontaneous speech.
- We Need Variations in Speech Synthesis: Sub-center Modelling for Speaker Embeddings (2024-07-05): In speech synthesis, modeling the rich emotions and prosodic variations present in the human voice is crucial for synthesizing natural speech. Proposes a speaker embedding network that uses multiple class centers in speaker classification training rather than the single class center of traditional embeddings.
- Speech Rhythm-Based Speaker Embeddings Extraction from Phonemes and Phoneme Duration for Multi-Speaker Speech Synthesis (2024-02-11): Proposes a speech rhythm-based method for speaker embeddings that models phoneme duration from a few utterances by the target speaker. The novel feature is rhythm-based embeddings extracted from phonemes and their durations, which are known to be related to speaking rhythm.
- Towards Spontaneous Style Modeling with Semi-supervised Pre-training for Conversational Text-to-Speech Synthesis (2023-08-31): Proposes a semi-supervised pre-training method to increase the amount of spontaneous-style speech and spontaneous behavioral labels. During semi-supervised learning, both text and speech information are used to detect spontaneous behavior labels in speech.
- EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis (2023-08-10): Introduces Expresso, a high-quality expressive speech dataset for textless speech synthesis, comprising both read speech and improvised dialogues rendered in 26 spontaneous expressive styles. Resynthesis quality is evaluated with automatic metrics for different self-supervised discrete encoders.
- Zero-shot personalized lip-to-speech synthesis with face image based voice control (2023-05-09): Lip-to-Speech (Lip2Speech) synthesis, which predicts speech from talking-face images, has seen significant progress with various models and training strategies. Proposes a zero-shot personalized Lip2Speech method in which face images control speaker identity.
- An Overview of Affective Speech Synthesis and Conversion in the Deep Learning Era (2022-10-06): Affect, or expressivity, can turn speech into a medium for conveying intimate thoughts, feelings, and emotions. Following recent advances in text-to-speech synthesis, a paradigm shift is under way in affective speech synthesis and conversion, spearheaded by deep learning.
- Towards Modelling Coherence in Spoken Discourse (2020-12-31): Coherence in spoken discourse depends on prosodic and acoustic patterns in speech; models coherence with audio-based coherence models.
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.