CONCSS: Contrastive-based Context Comprehension for Dialogue-appropriate
Prosody in Conversational Speech Synthesis
- URL: http://arxiv.org/abs/2312.10358v1
- Date: Sat, 16 Dec 2023 07:05:16 GMT
- Title: CONCSS: Contrastive-based Context Comprehension for Dialogue-appropriate
Prosody in Conversational Speech Synthesis
- Authors: Yayue Deng, Jinlong Xue, Yukang Jia, Qifei Li, Yichen Han, Fengping
Wang, Yingming Gao, Dengfeng Ke, Ya Li
- Abstract summary: We introduce a contrastive learning-based CSS framework, CONCSS.
Within this framework, we define an innovative pretext task specific to CSS.
We also introduce a sampling strategy for negative sample augmentation to enhance context vectors' discriminability.
- Score: 14.067804301298498
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Conversational speech synthesis (CSS) incorporates historical dialogue as
supplementary information with the aim of generating speech that has
dialogue-appropriate prosody. While previous methods have explored ways to enhance
context comprehension, the resulting context representations still lack expressive
power and context-sensitive discriminability. In this
paper, we introduce a contrastive learning-based CSS framework, CONCSS. Within
this framework, we define an innovative pretext task specific to CSS that
enables the model to perform self-supervised learning on unlabeled
conversational datasets to boost the model's context understanding.
Additionally, we introduce a sampling strategy for negative sample augmentation
to enhance context vectors' discriminability. This is the first attempt to
integrate contrastive learning into CSS. We conduct ablation studies on
different contrastive learning strategies and comprehensive experiments in
comparison with prior CSS systems. Results demonstrate that the synthesized
speech from our proposed method exhibits more contextually appropriate and
sensitive prosody.
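The abstract does not spell out the pretext task or the negative-sampling strategy, so the following is only a minimal, generic sketch of the underlying idea: an InfoNCE-style contrastive loss (in PyTorch) that pulls a dialogue context vector toward the representation of its matching history and pushes it away from augmented negatives. The shapes, temperature, and random stand-ins for encoder outputs are assumptions for illustration, not the paper's actual design.

```python
# Sketch of an InfoNCE-style contrastive objective over context vectors.
# Everything here (shapes, temperature, the random "encoder outputs") is
# an illustrative assumption, not CONCSS's actual pretext task.
import torch
import torch.nn.functional as F


def info_nce_loss(anchor, positive, negatives, temperature=0.07):
    """anchor:    (B, D) context vectors for the current utterances.
    positive:  (B, D) vectors from the matching dialogue histories.
    negatives: (B, K, D) augmented negative samples per anchor."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)

    # Cosine similarities: one positive and K negatives per anchor.
    pos_sim = (anchor * positive).sum(-1, keepdim=True)          # (B, 1)
    neg_sim = torch.einsum("bd,bkd->bk", anchor, negatives)      # (B, K)

    logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature  # (B, 1+K)
    labels = torch.zeros(anchor.size(0), dtype=torch.long)       # positive at index 0
    return F.cross_entropy(logits, labels)


# Toy usage with random tensors standing in for encoder outputs.
B, K, D = 4, 8, 256
loss = info_nce_loss(torch.randn(B, D), torch.randn(B, D), torch.randn(B, K, D))
print(loss.item())
```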
Related papers
- Retrieval-Augmented Dialogue Knowledge Aggregation for Expressive Conversational Speech Synthesis [39.25088200618052]
Conversational speech synthesis (CSS) aims to take the current dialogue (CD) history as a reference to synthesize expressive speech that aligns with the conversational style.
Unlike CD, stored dialogue (SD) contains preserved dialogue fragments from earlier stages of user-agent interaction.
This knowledge plays a significant role in enabling the agent to synthesize expressive conversational speech with empathetic feedback.
arXiv Detail & Related papers (2025-01-11T07:43:18Z)
- Intra- and Inter-modal Context Interaction Modeling for Conversational Speech Synthesis [3.391256280235937]
Conversational Speech Synthesis (CSS) aims to effectively exploit the multimodal dialogue history (MDH) to generate speech with appropriate conversational prosody for the target utterance.
The key challenge of CSS is to model the interaction between the MDH and the target utterance.
We propose a new CSS system based on an intra-modal and inter-modal context interaction scheme, termed III-CSS (see the sketch after this entry).
arXiv Detail & Related papers (2024-12-25T01:35:59Z)
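As a purely speculative illustration of inter-modal interaction between the multimodal dialogue history and the target utterance, the sketch below cross-attends target text states over history audio states; the module, shapes, and tensor names are invented and are not the III-CSS architecture.

```python
# Speculative sketch: target text states attend over history audio states
# (inter-modal); swapping in same-modality history states would give the
# intra-modal case. Shapes and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

cross_attn = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)

target_text = torch.randn(2, 20, 256)     # (batch, target text tokens, dim)
history_audio = torch.randn(2, 120, 256)  # (batch, history audio frames, dim)

interaction, _ = cross_attn(query=target_text, key=history_audio, value=history_audio)
print(interaction.shape)  # torch.Size([2, 20, 256])
```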
- Emphasis Rendering for Conversational Text-to-Speech with Multi-modal Multi-scale Context Modeling [40.32021786228235]
Conversational Text-to-Speech (CTTS) aims to accurately express an utterance with the appropriate style within a conversational setting.
We propose a novel Emphasis Rendering scheme for the CTTS model, termed ER-CTTS.
To address data scarcity, we create emphasis intensity annotations on the existing conversational dataset (DailyTalk).
arXiv Detail & Related papers (2024-10-12T13:02:31Z)
- Generative Expressive Conversational Speech Synthesis [47.53014375797254]
Conversational Speech Synthesis (CSS) aims to express a target utterance with the proper speaking style in a user-agent conversation setting.
We propose a novel generative expressive CSS system, termed GPT-Talker.
We transform the multimodal information of the multi-turn dialogue history into discrete token sequences and seamlessly integrate them to form a comprehensive user-agent dialogue context (see the sketch after this entry).
arXiv Detail & Related papers (2024-07-31T10:02:21Z)
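The GPT-Talker summary above describes flattening a multi-turn, multimodal history into one discrete token sequence. Below is a hypothetical illustration of such a flattening; the marker tokens and per-turn fields are invented for illustration, not the paper's actual scheme.

```python
# Hypothetical flattening of a multi-turn dialogue into one token sequence:
# each turn contributes a speaker marker, its text tokens, a modality
# boundary, its discrete speech (codec) tokens, and an end-of-turn marker.
def build_dialogue_context(turns):
    """turns: list of dicts with keys 'speaker', 'text_tokens', 'speech_tokens'."""
    sequence = []
    for turn in turns:
        sequence.append(f"<{turn['speaker']}>")  # speaker marker, e.g. <user>/<agent>
        sequence.extend(turn["text_tokens"])     # text tokens for the turn
        sequence.append("<speech>")              # text/speech boundary marker
        sequence.extend(turn["speech_tokens"])   # discrete speech tokens
        sequence.append("<eot>")                 # end-of-turn marker
    return sequence


history = [
    {"speaker": "user",  "text_tokens": ["how", "are", "you"], "speech_tokens": [312, 87, 945]},
    {"speaker": "agent", "text_tokens": ["fine", "thanks"],    "speech_tokens": [21, 660]},
]
print(build_dialogue_context(history))
```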
- Channel-aware Decoupling Network for Multi-turn Dialogue Comprehension [81.47133615169203]
We propose compositional learning for holistic interaction across utterances beyond the sequential contextualization from PrLMs.
We employ domain-adaptive training strategies to help the model adapt to the dialogue domains.
Experimental results show that our method substantially boosts the strong PrLM baselines on four public benchmark datasets.
arXiv Detail & Related papers (2023-01-10T13:18:25Z)
- FCTalker: Fine and Coarse Grained Context Modeling for Expressive Conversational Speech Synthesis [75.74906149219817]
Conversational Text-to-Speech (TTS) aims to synthesize an utterance with the right linguistic and affective prosody in a conversational context.
We propose a novel expressive conversational TTS model, termed FCTalker, that learns fine- and coarse-grained context dependencies simultaneously during speech generation.
arXiv Detail & Related papers (2022-10-27T12:20:20Z)
- SPACE-2: Tree-Structured Semi-Supervised Contrastive Pre-training for Task-Oriented Dialog Understanding [68.94808536012371]
We propose a tree-structured pre-trained conversation model, which learns dialog representations from limited labeled dialogs and large-scale unlabeled dialog corpora.
Our method can achieve new state-of-the-art results on the DialoGLUE benchmark consisting of seven datasets and four popular dialog understanding tasks.
arXiv Detail & Related papers (2022-09-14T13:42:50Z)
- Towards Expressive Speaking Style Modelling with Hierarchical Context Information for Mandarin Speech Synthesis [37.93814851450597]
We propose a hierarchical framework to model speaking style from context.
A hierarchical context encoder is proposed to explore a wider range of contextual information.
To encourage this encoder to learn style representation better, we introduce a novel training strategy.
arXiv Detail & Related papers (2022-03-23T05:27:57Z)
- $C^3$: Compositional Counterfactual Contrastive Learning for Video-grounded Dialogues [97.25466640240619]
Video-grounded dialogue systems aim to integrate video understanding and dialogue understanding to generate responses relevant to both the dialogue and video context.
Most existing approaches employ deep learning models and have achieved remarkable performance, given the relatively small datasets available.
We propose Compositional Counterfactual Contrastive Learning ($C^3$), which contrasts factual and counterfactual samples in video-grounded dialogues (see the sketch after this entry).
arXiv Detail & Related papers (2021-06-16T16:05:27Z)
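As a loose illustration of contrasting factual with counterfactual pairs, the sketch below builds counterfactuals by rolling video embeddings across the batch and applies a hinge loss; the counterfactual construction and the margin are assumptions, not the $C^3$ method.

```python
# Loose sketch of factual-vs-counterfactual contrastive training: aligned
# dialogue/video rows are factual pairs; rolling the video embeddings by
# one position yields mismatched (counterfactual) pairs. Illustrative only.
import torch
import torch.nn.functional as F


def counterfactual_contrastive_loss(dialogue_emb, video_emb, margin=0.5):
    """dialogue_emb, video_emb: (B, D) embeddings of paired dialogue/video."""
    counterfactual_video = torch.roll(video_emb, shifts=1, dims=0)  # mismatched videos
    factual_sim = F.cosine_similarity(dialogue_emb, video_emb)
    counter_sim = F.cosine_similarity(dialogue_emb, counterfactual_video)
    # Hinge: factual similarity should beat counterfactual by at least `margin`.
    return F.relu(margin - factual_sim + counter_sim).mean()


loss = counterfactual_contrastive_loss(torch.randn(4, 128), torch.randn(4, 128))
print(loss.item())
```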
- Spoken Style Learning with Multi-modal Hierarchical Context Encoding for Conversational Text-to-Speech Synthesis [59.27994987902646]
The study of learning spoken styles from historical conversations is still in its infancy.
Prior work considers only the transcripts of the historical conversations, neglecting the spoken styles conveyed by the historical speech.
We propose a spoken style learning approach with multi-modal hierarchical context encoding.
arXiv Detail & Related papers (2021-06-11T08:33:52Z)