FCTalker: Fine and Coarse Grained Context Modeling for Expressive
Conversational Speech Synthesis
- URL: http://arxiv.org/abs/2210.15360v1
- Date: Thu, 27 Oct 2022 12:20:20 GMT
- Title: FCTalker: Fine and Coarse Grained Context Modeling for Expressive
Conversational Speech Synthesis
- Authors: Yifan Hu, Rui Liu, Guanglai Gao, Haizhou Li
- Abstract summary: Conversational Text-to-Speech (TTS) aims to synthesize an utterance with the right linguistic and affective prosody in a conversational context.
We propose a novel expressive conversational TTS model, termed FCTalker, that learns fine- and coarse-grained context dependencies simultaneously during speech generation.
- Score: 75.74906149219817
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Conversational Text-to-Speech (TTS) aims to synthesize an utterance with the right linguistic and affective prosody in a conversational context. Prior work improved the expressiveness of synthesized speech by exploiting the utterance-level correlation between the current utterance and the dialogue history. However, the fine-grained, word-level information in the dialogue history also has an important impact on the prosodic expression of an utterance, and this has not been well studied. We therefore propose a novel expressive conversational TTS model, termed FCTalker, which learns fine- and coarse-grained context dependencies simultaneously during speech generation. Specifically, FCTalker includes fine- and coarse-grained encoders that exploit word- and utterance-level context dependencies, respectively. To model the word-level dependencies between an utterance and its dialogue history, the fine-grained dialogue encoder is built on top of a dialogue BERT model. Experimental results show that the proposed method outperforms all baselines and generates more expressive speech that is contextually appropriate. We release the source code at: https://github.com/walker-hyf/FCTalker.
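To make the abstract's architecture concrete, below is a minimal, illustrative sketch of fine- and coarse-grained context encoders of the kind described. This is not the released FCTalker code (see the repository above); the module names, dimensions, and fusion step are assumptions.

```python
# Illustrative sketch of fine/coarse-grained context modeling for
# conversational TTS. NOT the official FCTalker implementation; module
# names, dimensions, and the fusion step are assumptions.
import torch
import torch.nn as nn

class CoarseContextEncoder(nn.Module):
    """Utterance-level (coarse) context: one vector per past utterance,
    aggregated with a GRU over the dialogue history."""
    def __init__(self, utt_dim=256, hidden=256):
        super().__init__()
        self.gru = nn.GRU(utt_dim, hidden, batch_first=True)

    def forward(self, utt_embs):           # (B, n_turns, utt_dim)
        _, h = self.gru(utt_embs)          # h: (1, B, hidden)
        return h.squeeze(0)                # (B, hidden)

class FineContextEncoder(nn.Module):
    """Word-level (fine) context: the current utterance's word embeddings
    attend to word embeddings of the history, e.g., taken from a
    dialogue-adapted BERT."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, cur_words, hist_words):  # (B, Tc, d), (B, Th, d)
        out, _ = self.attn(cur_words, hist_words, hist_words)
        return out                             # (B, Tc, d)

class ContextFusion(nn.Module):
    """Broadcast the coarse vector over time and fuse it with the fine
    features, giving a context-aware encoding for the acoustic model."""
    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, fine, coarse):           # (B, Tc, d), (B, d)
        coarse = coarse.unsqueeze(1).expand(-1, fine.size(1), -1)
        return self.proj(torch.cat([fine, coarse], dim=-1))
```

In a full system, the fused features would condition the acoustic model; the sketch stops at the context encoding, which is the part the abstract describes.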
Related papers
- Emphasis Rendering for Conversational Text-to-Speech with Multi-modal Multi-scale Context Modeling [40.32021786228235]
Conversational Text-to-Speech (CTTS) aims to accurately express an utterance with the appropriate style within a conversational setting.
We propose a novel Emphasis Rendering scheme for the CTTS model, termed ER-CTTS.
To address data scarcity, we create emphasis intensity annotations on the existing conversational dataset (DailyTalk).
arXiv Detail & Related papers (2024-10-12T13:02:31Z)
- Style-Talker: Finetuning Audio Language Model and Style-Based Text-to-Speech Model for Fast Spoken Dialogue Generation [16.724603503894166]
Style-Talker is an innovative framework that fine-tunes an audio LLM alongside a style-based TTS model for fast spoken dialogue generation.
Our experimental results show that Style-Talker significantly outperforms the conventional cascade and speech-to-speech baselines in terms of both dialogue naturalness and coherence.
arXiv Detail & Related papers (2024-08-13T04:35:11Z)
- Generative Expressive Conversational Speech Synthesis [47.53014375797254]
Conversational Speech Synthesis (CSS) aims to express a target utterance with the proper speaking style in a user-agent conversation setting.
We propose a novel generative expressive CSS system, termed GPT-Talker.
We transform the multimodal information of the multi-turn dialogue history into discrete token sequences and seamlessly integrate them to form a comprehensive user-agent dialogue context.
arXiv Detail & Related papers (2024-07-31T10:02:21Z)
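As a rough sketch of the mechanism summarized above, the snippet below flattens a multimodal, multi-turn history into one discrete token sequence. The role markers and the two tokenizer callables are placeholders, not GPT-Talker's actual vocabulary or speech codec.

```python
# Sketch of flattening a multimodal, multi-turn dialogue into one discrete
# token sequence, in the spirit of the GPT-Talker summary above. Role
# markers and tokenizers are placeholders, not the paper's actual format.
from typing import Callable, Dict, List, Tuple

def build_dialogue_context(
    turns: List[Tuple[str, str, bytes]],       # (role, text, raw audio)
    text_tok: Callable[[str], List[int]],      # text -> token ids
    speech_tok: Callable[[bytes], List[int]],  # audio -> discrete units
    markers: Dict[str, int],                   # e.g. {"user": 1, "agent": 2, "sep": 3}
) -> List[int]:
    """Interleave per-turn text and speech tokens with role and boundary
    markers, so a decoder-only LM can condition on the full user-agent
    history when generating the agent's next utterance."""
    seq: List[int] = []
    for role, text, audio in turns:
        seq.append(markers[role])              # who is speaking
        seq.extend(text_tok(text))             # what was said (text)
        seq.extend(speech_tok(audio))          # how it was said (speech units)
        seq.append(markers["sep"])             # turn boundary
    return seq
```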
- ContextSpeech: Expressive and Efficient Text-to-Speech for Paragraph Reading [65.88161811719353]
This work develops a lightweight yet effective Text-to-Speech system, ContextSpeech.
We first design a memory-cached recurrence mechanism to incorporate global text and speech context into sentence encoding.
We construct hierarchically-structured textual semantics to broaden the scope for global context enhancement.
Experiments show that ContextSpeech significantly improves the voice quality and prosody in paragraph reading with competitive model efficiency.
arXiv Detail & Related papers (2023-07-03T06:55:03Z)
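One plausible reading of the memory-cached recurrence named above is a Transformer-XL-style cache, sketched below; the cache length, shapes, and gradient-free detachment are assumptions, not ContextSpeech's actual implementation.

```python
# Transformer-XL-style cached recurrence: each sentence attends to cached
# states from earlier sentences in the paragraph, so encoding sees global
# context. Shapes and cache handling are illustrative assumptions.
import torch
import torch.nn as nn

class CachedSelfAttention(nn.Module):
    def __init__(self, dim=256, heads=4, mem_len=128):
        super().__init__()
        self.mem_len = mem_len
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, memory=None):
        # x: (B, T, d) current sentence; memory: (B, M, d) cached states
        # from previously encoded sentences.
        kv = x if memory is None else torch.cat([memory, x], dim=1)
        out, _ = self.attn(x, kv, kv)                # queries see past + present
        new_memory = kv[:, -self.mem_len:].detach()  # truncated cache, no grads
        return out, new_memory                       # carry cache to next sentence
```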
- Visual-Aware Text-to-Speech [101.89332968344102]
We present a new visual-aware text-to-speech (VA-TTS) task to synthesize speech conditioned on both textual inputs and visual feedback of the listener in face-to-face communication.
We devise a baseline model to fuse phoneme linguistic information and listener visual signals for speech synthesis.
arXiv Detail & Related papers (2023-06-21T05:11:39Z)
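A minimal sketch of how listener visual signals might condition synthesis, assuming a learned gating fusion; this is illustrative only, not the paper's baseline architecture.

```python
# Illustrative gated fusion of listener visual cues into phoneme features;
# an assumption for illustration, not the VA-TTS baseline.
import torch.nn as nn

class GatedVisualFusion(nn.Module):
    def __init__(self, ph_dim=256, vis_dim=512):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(vis_dim, ph_dim), nn.Sigmoid())

    def forward(self, phoneme_feats, visual_feats):
        # phoneme_feats: (B, T, ph_dim) linguistic features;
        # visual_feats: (B, vis_dim) pooled listener cues for the turn
        # (e.g., facial-expression features from a video encoder).
        g = self.gate(visual_feats).unsqueeze(1)  # (B, 1, ph_dim)
        return phoneme_feats * g                  # scale features frame-wise
```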
- Channel-aware Decoupling Network for Multi-turn Dialogue Comprehension [81.47133615169203]
We propose compositional learning for holistic interaction across utterances beyond the sequential contextualization from PrLMs.
We employ domain-adaptive training strategies to help the model adapt to the dialogue domains.
Experimental results show that our method substantially boosts the strong PrLM baselines on four public benchmark datasets.
arXiv Detail & Related papers (2023-01-10T13:18:25Z)
- Acoustic Modeling for End-to-End Empathetic Dialogue Speech Synthesis Using Linguistic and Prosodic Contexts of Dialogue History [38.65020349874135]
We propose an end-to-end empathetic dialogue speech synthesis (DSS) model.
Our model is conditioned on the history of linguistic and prosodic features to predict an appropriate dialogue context.
To train the empathetic DSS model effectively, we investigate 1) a self-supervised learning model pretrained on large speech corpora, 2) style-guided training, in which a prosody embedding of the current utterance is predicted from the dialogue context embedding, 3) cross-modal attention to combine the text and speech modalities, and 4) sentence-wise embeddings for prosody modeling at a finer grain than utterance-wise modeling.
arXiv Detail & Related papers (2022-06-16T09:47:25Z)
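Item 2), the style-guided training, can be sketched as an auxiliary loss in which the dialogue-context embedding predicts the prosody embedding of the utterance to be synthesized; the linear predictor and L1 loss below are assumptions made for illustration.

```python
# Sketch of style-guided training (item 2 above): predict the prosody
# embedding of the current utterance from the dialogue-context embedding
# and penalize the gap. Predictor and loss are illustrative assumptions.
import torch.nn as nn

class StyleGuide(nn.Module):
    def __init__(self, ctx_dim=256, prosody_dim=128):
        super().__init__()
        self.predict = nn.Linear(ctx_dim, prosody_dim)

    def forward(self, ctx_emb, prosody_emb):
        # ctx_emb: (B, ctx_dim) from the dialogue-history encoder.
        # prosody_emb: (B, prosody_dim) extracted from the reference speech
        # of the utterance to be synthesized.
        pred = self.predict(ctx_emb)
        return nn.functional.l1_loss(pred, prosody_emb)
```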
- Towards Expressive Speaking Style Modelling with Hierarchical Context Information for Mandarin Speech Synthesis [37.93814851450597]
We propose a hierarchical framework to model speaking style from context.
A hierarchical context encoder is proposed to explore a wider range of contextual information.
To encourage this encoder to learn style representation better, we introduce a novel training strategy.
arXiv Detail & Related papers (2022-03-23T05:27:57Z)
- Spoken Style Learning with Multi-modal Hierarchical Context Encoding for Conversational Text-to-Speech Synthesis [59.27994987902646]
Research on learning spoken styles from the conversation history is still in its infancy.
Prior approaches consider only the transcripts of the historical conversations, neglecting the spoken styles of the historical speech.
We propose a spoken style learning approach with multi-modal hierarchical context encoding.
arXiv Detail & Related papers (2021-06-11T08:33:52Z)
- DialogBERT: Discourse-Aware Response Generation via Learning to Recover and Rank Utterances [18.199473005335093]
This paper presents DialogBERT, a novel conversational response generation model that enhances previous PLM-based dialogue models.
To efficiently capture the discourse-level coherence among utterances, we propose two training objectives, including masked utterance regression.
Experiments on three multi-turn conversation datasets show that our approach remarkably outperforms the baselines.
arXiv Detail & Related papers (2020-12-03T09:06:23Z)
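A masked utterance regression objective of the kind named above can be sketched as follows: mask one utterance vector in the dialogue and regress it from the surrounding context. The zero-vector masking, the generic context encoder, and the MSE loss are illustrative assumptions, not DialogBERT's exact formulation.

```python
# Sketch of a masked utterance regression objective: hide one utterance
# embedding and regress it from the surrounding dialogue context.
# Zero-masking and the MSE loss are assumptions for illustration.
import torch.nn as nn

def masked_utterance_regression(utt_embs, context_encoder, mask_idx):
    # utt_embs: (B, n_utts, d) utterance vectors from an utterance encoder;
    # context_encoder: any sequence-to-sequence module over utterances,
    # e.g. an nn.TransformerEncoder with batch_first=True.
    target = utt_embs[:, mask_idx].clone()      # (B, d) ground truth
    masked = utt_embs.clone()
    masked[:, mask_idx] = 0.0                   # stand-in for a [MASK] slot
    ctx = context_encoder(masked)               # (B, n_utts, d) re-encoded
    pred = ctx[:, mask_idx]                     # prediction at masked slot
    return nn.functional.mse_loss(pred, target)
```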