CONCSS: Contrastive-based Context Comprehension for Dialogue-appropriate
Prosody in Conversational Speech Synthesis
- URL: http://arxiv.org/abs/2312.10358v1
- Date: Sat, 16 Dec 2023 07:05:16 GMT
- Title: CONCSS: Contrastive-based Context Comprehension for Dialogue-appropriate
Prosody in Conversational Speech Synthesis
- Authors: Yayue Deng, Jinlong Xue, Yukang Jia, Qifei Li, Yichen Han, Fengping
Wang, Yingming Gao, Dengfeng Ke, Ya Li
- Abstract summary: We introduce a contrastive learning-based CSS framework, CONCSS.
Within this framework, we define an innovative pretext task specific to CSS.
We also introduce a sampling strategy for negative sample augmentation to enhance context vectors' discriminability.
- Score: 14.067804301298498
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Conversational speech synthesis (CSS) incorporates historical dialogue as
supplementary information with the aim of generating speech that has
dialogue-appropriate prosody. While previous methods have explored ways to enhance
context comprehension, the resulting context representations still lack expressive
power and context-sensitive discriminability. In this
paper, we introduce a contrastive learning-based CSS framework, CONCSS. Within
this framework, we define an innovative pretext task specific to CSS that
enables the model to perform self-supervised learning on unlabeled
conversational datasets to boost the model's context understanding.
Additionally, we introduce a sampling strategy for negative sample augmentation
to enhance context vectors' discriminability. This is the first attempt to
integrate contrastive learning into CSS. We conduct ablation studies on
different contrastive learning strategies and comprehensive experiments in
comparison with prior CSS systems. Results demonstrate that the synthesized
speech from our proposed method exhibits more contextually appropriate and
sensitive prosody.
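The abstract does not spell out the pretext task or the negative-sampling strategy, so the following is only a minimal, generic sketch of the underlying idea: an InfoNCE-style contrastive loss (in PyTorch) that pulls a dialogue context vector toward the representation of its matching history and pushes it away from augmented negatives. The shapes, temperature, and random stand-ins for encoder outputs are assumptions for illustration, not the paper's actual design.

```python
# Sketch of an InfoNCE-style contrastive objective over context vectors.
# Everything here (shapes, temperature, the random "encoder outputs") is
# an illustrative assumption, not CONCSS's actual pretext task.
import torch
import torch.nn.functional as F


def info_nce_loss(anchor, positive, negatives, temperature=0.07):
    """anchor:    (B, D) context vectors for the current utterances.
    positive:  (B, D) vectors from the matching dialogue histories.
    negatives: (B, K, D) augmented negative samples per anchor."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)

    # Cosine similarities: one positive and K negatives per anchor.
    pos_sim = (anchor * positive).sum(-1, keepdim=True)          # (B, 1)
    neg_sim = torch.einsum("bd,bkd->bk", anchor, negatives)      # (B, K)

    logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature  # (B, 1+K)
    labels = torch.zeros(anchor.size(0), dtype=torch.long)       # positive at index 0
    return F.cross_entropy(logits, labels)


# Toy usage with random tensors standing in for encoder outputs.
B, K, D = 4, 8, 256
loss = info_nce_loss(torch.randn(B, D), torch.randn(B, D), torch.randn(B, K, D))
print(loss.item())
```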
Related papers
- Retrieval-Augmented Dialogue Knowledge Aggregation for Expressive Conversational Speech Synthesis [39.25088200618052]
Conversational speech synthesis (CSS) aims to take the current dialogue (CD) history as a reference to synthesize expressive speech that aligns with the conversational style.
Unlike CD, stored dialogue (SD) contains preserved dialogue fragments from earlier stages of user-agent interaction.
This knowledge plays a significant role in enabling the agent to synthesize expressive conversational speech with empathetic feedback.
arXiv Detail & Related papers (2025-01-11T07:43:18Z)
- Intra- and Inter-modal Context Interaction Modeling for Conversational Speech Synthesis [3.391256280235937]
Conversational Speech Synthesis (CSS) aims to effectively exploit the multimodal dialogue history (MDH) to generate speech with appropriate conversational prosody for the target utterance.
The key challenge of CSS is to model the interaction between the MDH and the target utterance.
We propose a new CSS system based on an intra-modal and inter-modal context interaction scheme, termed III-CSS (see the sketch after this entry).
arXiv Detail & Related papers (2024-12-25T01:35:59Z)
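As a purely speculative illustration of inter-modal interaction between the multimodal dialogue history and the target utterance, the sketch below cross-attends target text states over history audio states; the module, shapes, and tensor names are invented and are not the III-CSS architecture.

```python
# Speculative sketch: target text states attend over history audio states
# (inter-modal); swapping in same-modality history states would give the
# intra-modal case. Shapes and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

cross_attn = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)

target_text = torch.randn(2, 20, 256)     # (batch, target text tokens, dim)
history_audio = torch.randn(2, 120, 256)  # (batch, history audio frames, dim)

interaction, _ = cross_attn(query=target_text, key=history_audio, value=history_audio)
print(interaction.shape)  # torch.Size([2, 20, 256])
```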
- Emphasis Rendering for Conversational Text-to-Speech with Multi-modal Multi-scale Context Modeling [40.32021786228235]
Conversational Text-to-Speech (CTTS) aims to accurately express an utterance with the appropriate style within a conversational setting.
We propose a novel Emphasis Rendering scheme for the CTTS model, termed ER-CTTS.
To address data scarcity, we create emphasis intensity annotations on the existing conversational dataset (DailyTalk).
arXiv Detail & Related papers (2024-10-12T13:02:31Z)
- Generative Expressive Conversational Speech Synthesis [47.53014375797254]
Conversational Speech Synthesis (CSS) aims to express a target utterance with the proper speaking style in a user-agent conversation setting.
We propose a novel generative expressive CSS system, termed GPT-Talker.
We transform the multimodal information of the multi-turn dialogue history into discrete token sequences and seamlessly integrate them to form a comprehensive user-agent dialogue context (see the sketch after this entry).
arXiv Detail & Related papers (2024-07-31T10:02:21Z)
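The GPT-Talker summary above describes flattening a multi-turn, multimodal history into one discrete token sequence. Below is a hypothetical illustration of such a flattening; the marker tokens and per-turn fields are invented for illustration, not the paper's actual scheme.

```python
# Hypothetical flattening of a multi-turn dialogue into one token sequence:
# each turn contributes a speaker marker, its text tokens, a modality
# boundary, its discrete speech (codec) tokens, and an end-of-turn marker.
def build_dialogue_context(turns):
    """turns: list of dicts with keys 'speaker', 'text_tokens', 'speech_tokens'."""
    sequence = []
    for turn in turns:
        sequence.append(f"<{turn['speaker']}>")  # speaker marker, e.g. <user>/<agent>
        sequence.extend(turn["text_tokens"])     # text tokens for the turn
        sequence.append("<speech>")              # text/speech boundary marker
        sequence.extend(turn["speech_tokens"])   # discrete speech tokens
        sequence.append("<eot>")                 # end-of-turn marker
    return sequence


history = [
    {"speaker": "user",  "text_tokens": ["how", "are", "you"], "speech_tokens": [312, 87, 945]},
    {"speaker": "agent", "text_tokens": ["fine", "thanks"],    "speech_tokens": [21, 660]},
]
print(build_dialogue_context(history))
```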
- Channel-aware Decoupling Network for Multi-turn Dialogue Comprehension [81.47133615169203]
We propose compositional learning for holistic interaction across utterances beyond the sequential contextualization from PrLMs.
We employ domain-adaptive training strategies to help the model adapt to the dialogue domains.
Experimental results show that our method substantially boosts the strong PrLM baselines on four public benchmark datasets.
arXiv Detail & Related papers (2023-01-10T13:18:25Z)
- FCTalker: Fine and Coarse Grained Context Modeling for Expressive Conversational Speech Synthesis [75.74906149219817]
Conversational Text-to-Speech (TTS) aims to synthesize an utterance with the right linguistic and affective prosody in a conversational context.
We propose a novel expressive conversational TTS model, termed FCTalker, that learns fine- and coarse-grained context dependencies simultaneously during speech generation.
arXiv Detail & Related papers (2022-10-27T12:20:20Z)
- SPACE-2: Tree-Structured Semi-Supervised Contrastive Pre-training for Task-Oriented Dialog Understanding [68.94808536012371]
We propose a tree-structured pre-trained conversation model, which learns dialog representations from limited labeled dialogs and large-scale unlabeled dialog corpora.
Our method can achieve new state-of-the-art results on the DialoGLUE benchmark consisting of seven datasets and four popular dialog understanding tasks.
arXiv Detail & Related papers (2022-09-14T13:42:50Z)
- Towards Expressive Speaking Style Modelling with Hierarchical Context Information for Mandarin Speech Synthesis [37.93814851450597]
We propose a hierarchical framework to model speaking style from context.
A hierarchical context encoder is proposed to explore a wider range of contextual information.
To encourage this encoder to learn style representation better, we introduce a novel training strategy.
arXiv Detail & Related papers (2022-03-23T05:27:57Z)
- $C^3$: Compositional Counterfactual Contrastive Learning for Video-grounded Dialogues [97.25466640240619]
Video-grounded dialogue systems aim to integrate video understanding and dialogue understanding to generate responses relevant to both the dialogue and video context.
Most existing approaches employ deep learning models and have achieved remarkable performance, given the relatively small datasets available.
We propose Compositional Counterfactual Contrastive Learning ($C^3$), which contrasts factual and counterfactual samples in video-grounded dialogues (see the sketch after this entry).
arXiv Detail & Related papers (2021-06-16T16:05:27Z)
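As a loose illustration of contrasting factual with counterfactual pairs, the sketch below builds counterfactuals by rolling video embeddings across the batch and applies a hinge loss; the counterfactual construction and the margin are assumptions, not the $C^3$ method.

```python
# Loose sketch of factual-vs-counterfactual contrastive training: aligned
# dialogue/video rows are factual pairs; rolling the video embeddings by
# one position yields mismatched (counterfactual) pairs. Illustrative only.
import torch
import torch.nn.functional as F


def counterfactual_contrastive_loss(dialogue_emb, video_emb, margin=0.5):
    """dialogue_emb, video_emb: (B, D) embeddings of paired dialogue/video."""
    counterfactual_video = torch.roll(video_emb, shifts=1, dims=0)  # mismatched videos
    factual_sim = F.cosine_similarity(dialogue_emb, video_emb)
    counter_sim = F.cosine_similarity(dialogue_emb, counterfactual_video)
    # Hinge: factual similarity should beat counterfactual by at least `margin`.
    return F.relu(margin - factual_sim + counter_sim).mean()


loss = counterfactual_contrastive_loss(torch.randn(4, 128), torch.randn(4, 128))
print(loss.item())
```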
- Spoken Style Learning with Multi-modal Hierarchical Context Encoding for Conversational Text-to-Speech Synthesis [59.27994987902646]
The study of learning spoken styles from historical conversations is still in its infancy.
Prior work considers only the transcripts of the historical conversations, neglecting the spoken styles conveyed by the historical speech.
We propose a spoken style learning approach with multi-modal hierarchical context encoding.
arXiv Detail & Related papers (2021-06-11T08:33:52Z)