Intra- and Inter-modal Context Interaction Modeling for Conversational Speech Synthesis
- URL: http://arxiv.org/abs/2412.18733v1
- Date: Wed, 25 Dec 2024 01:35:59 GMT
- Title: Intra- and Inter-modal Context Interaction Modeling for Conversational Speech Synthesis
- Authors: Zhenqi Jia, Rui Liu
- Abstract summary: Conversational Speech Synthesis (CSS) aims to effectively leverage the multimodal dialogue history (MDH) to generate speech with appropriate conversational prosody for the target utterance.
The key challenge of CSS is to model the interaction between the MDH and the target utterance.
We propose a new intra-modal and inter-modal context interaction scheme-based CSS system, termed III-CSS.
- Score: 3.391256280235937
- Abstract: Conversational Speech Synthesis (CSS) aims to effectively leverage the multimodal dialogue history (MDH) to generate speech with appropriate conversational prosody for the target utterance. The key challenge of CSS is to model the interaction between the MDH and the target utterance. Note that the text and speech modalities in the MDH have their own unique influences, and they complement each other to produce a comprehensive impact on the target utterance. Previous works did not explicitly model such intra-modal and inter-modal interactions. To address this issue, we propose a new intra-modal and inter-modal context interaction scheme-based CSS system, termed III-CSS. Specifically, in the training phase, we combine the MDH with the text and speech modalities of the target utterance to obtain four modal combinations: Historical Text-Next Text, Historical Speech-Next Speech, Historical Text-Next Speech, and Historical Speech-Next Text. We then design two contrastive learning-based intra-modal and two inter-modal interaction modules to deeply learn the intra-modal and inter-modal context interactions. In the inference phase, we take the MDH and adopt the trained interaction modules to fully infer the speech prosody for the target utterance's text content. Subjective and objective experiments on the DailyTalk dataset show that III-CSS outperforms advanced baselines in terms of prosody expressiveness. Code and speech samples are available at https://github.com/AI-S2-Lab/I3CSS.
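The abstract outlines but does not detail the interaction modules, so here is a minimal sketch of one of them. It assumes upstream encoders have already reduced each side of a modal pairing (e.g., Historical Text-Next Speech) to fixed-size embeddings; the module name, projection sizes, and InfoNCE-style objective are illustrative assumptions, not the authors' released implementation.

```python
# A hedged sketch of one contrastive interaction module; III-CSS trains
# four of these, one per modal pairing (HT-NT, HS-NS, HT-NS, HS-NT).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveInteractionModule(nn.Module):
    """Pulls matched (history, target) embeddings together and pushes
    mismatched in-batch pairs apart with an InfoNCE-style loss."""

    def __init__(self, hist_dim: int, tgt_dim: int, proj_dim: int = 256,
                 temperature: float = 0.07):
        super().__init__()
        self.hist_proj = nn.Linear(hist_dim, proj_dim)  # history branch
        self.tgt_proj = nn.Linear(tgt_dim, proj_dim)    # target branch
        self.temperature = temperature

    def forward(self, hist_emb: torch.Tensor, tgt_emb: torch.Tensor):
        # Project both streams into a shared space and L2-normalize.
        h = F.normalize(self.hist_proj(hist_emb), dim=-1)  # (B, D)
        t = F.normalize(self.tgt_proj(tgt_emb), dim=-1)    # (B, D)
        logits = h @ t.T / self.temperature                # (B, B) similarities
        labels = torch.arange(h.size(0), device=h.device)  # diagonal = positives
        # Symmetric InfoNCE: history-to-target plus target-to-history.
        loss = 0.5 * (F.cross_entropy(logits, labels) +
                      F.cross_entropy(logits.T, labels))
        return loss, h  # projected history embedding can condition the decoder

# Example: an inter-modal HT-NS pairing with a dummy batch of 8
# pre-computed utterance-level embeddings (dimensions are assumptions).
module = ContrastiveInteractionModule(hist_dim=768, tgt_dim=192)
loss, ctx = module(torch.randn(8, 768), torch.randn(8, 192))
loss.backward()
```

At inference time no target speech exists yet, so only the history-side projections would be reused to condition prosody prediction for the target text.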
Related papers
- Retrieval-Augmented Dialogue Knowledge Aggregation for Expressive Conversational Speech Synthesis [39.25088200618052]
Conversational speech synthesis (CSS) aims to take the current dialogue (CD) history as a reference to synthesize expressive speech that aligns with the conversational style.
Unlike CD, stored dialogue (SD) contains preserved dialogue fragments from earlier stages of user-agent interaction.
This knowledge plays a significant role in enabling the agent to synthesize expressive conversational speech that conveys empathetic feedback.
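As a rough illustration of the retrieval step this implies, the sketch below scores pre-embedded stored-dialogue (SD) fragments against the current dialogue (CD) with cosine similarity and returns the top-k as style references. The function name and embedding setup are assumptions for illustration, not the paper's actual pipeline.

```python
# A minimal retrieval sketch, assuming SD fragments are pre-embedded.
import numpy as np

def retrieve_fragments(cd_emb: np.ndarray, sd_embs: np.ndarray, k: int = 3):
    """Return best-first indices of the k SD fragments closest to the
    CD embedding under cosine similarity, plus the raw scores."""
    cd = cd_emb / np.linalg.norm(cd_emb)
    sd = sd_embs / np.linalg.norm(sd_embs, axis=1, keepdims=True)
    sims = sd @ cd                      # one cosine score per fragment
    return np.argsort(-sims)[:k], sims

# Example with random 512-d embeddings for 100 stored fragments.
idx, sims = retrieve_fragments(np.random.randn(512), np.random.randn(100, 512))
```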
arXiv Detail & Related papers (2025-01-11T07:43:18Z)
- VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction [105.88658935310605]
We propose a multi-stage training methodology that progressively trains the LLM to understand both visual and speech information.
Our approach not only preserves strong vision-language capacity, but also enables efficient speech-to-speech dialogue capabilities.
By comparing our method against state-of-the-art counterparts across benchmarks for image, video, and speech tasks, we demonstrate that our model is equipped with both strong visual and speech capabilities.
arXiv Detail & Related papers (2025-01-03T18:59:52Z)
- OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation [53.7173034249361]
OmniFlatten is an end-to-end GPT-based model capable of effectively modeling the complex behaviors inherent in natural conversations with low latency.
Our approach offers a simple modeling technique and a promising research direction for developing efficient and natural end-to-end full-duplex spoken dialogue systems.
arXiv Detail & Related papers (2024-10-23T11:58:58Z) - Emphasis Rendering for Conversational Text-to-Speech with Multi-modal Multi-scale Context Modeling [40.32021786228235]
Conversational Text-to-Speech (CTTS) aims to accurately express an utterance with the appropriate style within a conversational setting.
We propose a novel Emphasis Rendering scheme for the CTTS model, termed ER-CTTS.
To address data scarcity, we create emphasis intensity annotations on the existing conversational dataset (DailyTalk).
arXiv Detail & Related papers (2024-10-12T13:02:31Z)
- Generative Expressive Conversational Speech Synthesis [47.53014375797254]
Conversational Speech Synthesis (CSS) aims to express a target utterance with the proper speaking style in a user-agent conversation setting.
We propose a novel generative expressive CSS system, termed GPT-Talker.
We transform the multimodal information of the multi-turn dialogue history into discrete token sequences and seamlessly integrate them to form a comprehensive user-agent dialogue context.
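The following is a minimal sketch of that flattening idea, assuming per-turn text tokens and discrete speech tokens (e.g., from a neural codec) are already available; the separator scheme is a hypothetical stand-in for whatever special tokens GPT-Talker actually uses.

```python
# Flatten a multi-turn dialogue into one token sequence a GPT-style
# decoder can consume: [SEP_T] text tokens [SEP_S] speech tokens, per turn.
from typing import List, Tuple

def flatten_dialogue(turns: List[Tuple[List[int], List[int]]],
                     sep_text: int = -1, sep_speech: int = -2) -> List[int]:
    """turns: [(text_token_ids, speech_token_ids), ...] in dialogue order."""
    seq: List[int] = []
    for text_ids, speech_ids in turns:
        seq += [sep_text] + text_ids + [sep_speech] + speech_ids
    return seq

# Two-turn example with dummy token ids:
ctx = flatten_dialogue([([11, 12], [901, 902, 903]), ([21], [911, 912])])
# ctx == [-1, 11, 12, -2, 901, 902, 903, -1, 21, -2, 911, 912]
```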
arXiv Detail & Related papers (2024-07-31T10:02:21Z)
- Channel-aware Decoupling Network for Multi-turn Dialogue Comprehension [81.47133615169203]
We propose compositional learning for holistic interaction across utterances, going beyond the sequential contextualization from pre-trained language models (PrLMs).
We employ domain-adaptive training strategies to help the model adapt to the dialogue domains.
Experimental results show that our method substantially boosts the strong PrLM baselines in four public benchmark datasets.
arXiv Detail & Related papers (2023-01-10T13:18:25Z)
- FCTalker: Fine and Coarse Grained Context Modeling for Expressive Conversational Speech Synthesis [75.74906149219817]
Conversational Text-to-Speech (TTS) aims to synthesize an utterance with the right linguistic and affective prosody in a conversational context.
We propose a novel expressive conversational TTS model, termed FCTalker, that learns fine- and coarse-grained context dependencies simultaneously during speech generation.
arXiv Detail & Related papers (2022-10-27T12:20:20Z)
- Spoken Style Learning with Multi-modal Hierarchical Context Encoding for Conversational Text-to-Speech Synthesis [59.27994987902646]
Research on learning spoken styles from historical conversations is still in its infancy.
Existing approaches consider only the transcripts of the historical conversations, neglecting the spoken styles in the historical speech.
We propose a spoken style learning approach with multi-modal hierarchical context encoding.
arXiv Detail & Related papers (2021-06-11T08:33:52Z)