Empirical Study Incorporating Linguistic Knowledge on Filled Pauses for
  Personalized Spontaneous Speech Synthesis
- URL: http://arxiv.org/abs/2210.07559v2
- Date: Tue, 19 Sep 2023 06:51:38 GMT
- Authors: Yuta Matsunaga, Takaaki Saeki, Shinnosuke Takamichi and Hiroshi
  Saruwatari
- Abstract summary: We focus on personalized spontaneous speech synthesis that can clone both the individual's voice timbre and speech disfluency.
We develop a speech synthesis method with a non-personalized external filled pause predictor trained with a multi-speaker corpus.
- Score: 35.32703818003108
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We present a comprehensive empirical study for personalized spontaneous
speech synthesis on the basis of linguistic knowledge. With the advent of voice
cloning for reading-style speech synthesis, a new voice cloning paradigm for
human-like and spontaneous speech synthesis is required. We, therefore, focus
on personalized spontaneous speech synthesis that can clone both the
individual's voice timbre and speech disfluency. Specifically, we deal with
filled pauses, a major source of speech disfluency, which is known to play an
important role in speech generation and communication in psychology and
linguistics. To comparatively evaluate personalized filled pause insertion and
non-personalized filled pause prediction methods, we developed a speech
synthesis method with a non-personalized external filled pause predictor
trained with a multi-speaker corpus. The results clarify the position-word
entanglement of filled pauses, i.e., the necessity of precisely predicting
positions for naturalness and the necessity of precisely predicting words for
individuality on the evaluation of synthesized speech.
 
      
Related papers
        - Aligning Spoken Dialogue Models from User Interactions [55.192134724622235]
 We propose a novel preference alignment framework to improve spoken dialogue models on real-time conversations from user interactions.
We create a dataset of more than 150,000 preference pairs from raw multi-turn speech conversations annotated with AI feedback.
Our findings shed light on the importance of a well-calibrated balance among various dynamics, crucial for natural real-time speech dialogue systems.
 arXiv (2025-06-26T16:45:20Z)
- Spontaneous Style Text-to-Speech Synthesis with Controllable Spontaneous Behaviors Based on Language Models [55.898594710420326]
 We propose a novel spontaneous speech synthesis system based on language models.
Fine-grained prosody modeling is introduced to enhance the model's ability to capture subtle prosody variations in spontaneous speech.
 arXiv (2024-07-18T13:42:38Z)
- We Need Variations in Speech Synthesis: Sub-center Modelling for Speaker Embeddings [47.2515056854372]
 In speech synthesis, modeling the rich emotions and prosodic variations present in the human voice is crucial for synthesizing natural speech.
We propose a novel speaker embedding network that uses multiple class centers in speaker classification training, rather than a single class center as in traditional embeddings.
 arXiv (2024-07-05T06:54:24Z)
- Speech Rhythm-Based Speaker Embeddings Extraction from Phonemes and
  Phoneme Duration for Multi-Speaker Speech Synthesis [16.497022070614236]
 This paper proposes a speech rhythm-based method for speaker embeddings to model phoneme duration using a few utterances by the target speaker.
A novel feature of the proposed method is the rhythm-based embeddings extracted from phonemes and their durations, which are known to be related to speaking rhythm.
 arXiv (2024-02-11T02:26:43Z)
- Towards Spontaneous Style Modeling with Semi-supervised Pre-training for
  Conversational Text-to-Speech Synthesis [53.511443791260206]
 We propose a semi-supervised pre-training method to increase the amount of spontaneous-style speech and spontaneous behavioral labels.
In the process of semi-supervised learning, both text and speech information are considered for detecting spontaneous behavior labels in speech.
 arXiv (2023-08-31T09:50:33Z)
- EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech
  Resynthesis [49.04496602282718]
 We introduce Expresso, a high-quality expressive speech dataset for textless speech synthesis.
This dataset includes both read speech and improvised dialogues rendered in 26 spontaneous expressive styles.
We evaluate resynthesis quality with automatic metrics for different self-supervised discrete encoders.
 arXiv (2023-08-10T17:41:19Z)
- Zero-shot personalized lip-to-speech synthesis with face image based
  voice control [41.17483247506426]
 Lip-to-Speech (Lip2Speech) synthesis, which predicts corresponding speech from talking face images, has witnessed significant progress with various models and training strategies.
We propose a zero-shot personalized Lip2Speech synthesis method, in which face images control speaker identities.
 arXiv (2023-05-09T02:37:29Z)
- An Overview of Affective Speech Synthesis and Conversion in the Deep
  Learning Era [39.91844543424965]
 Affect, or expressivity, has the capacity to turn speech into a medium capable of conveying intimate thoughts, feelings, and emotions.
Following recent advances in text-to-speech synthesis, a paradigm shift is well under way in the fields of affective speech synthesis and conversion.
Deep learning, the technology which underlies most of the recent advances in artificial intelligence, is spearheading these efforts.
 arXiv (2022-10-06T13:55:59Z)
- Towards Modelling Coherence in Spoken Discourse [48.80477600384429]
 Coherence in spoken discourse is dependent on the prosodic and acoustic patterns in speech.
We model coherence in spoken discourse with audio-based coherence models.
 arXiv (2020-12-31T20:18:29Z)
- Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis [37.37319356008348]
 We explore the task of lip to speech synthesis, i.e., learning to generate natural speech given only the lip movements of a speaker.
We focus on learning accurate lip sequences to speech mappings for individual speakers in unconstrained, large vocabulary settings.
We propose a novel approach with key design choices to achieve accurate, natural lip to speech synthesis.
 arXiv (2020-05-17T10:29:19Z)