Multimodal Fine-grained Context Interaction Graph Modeling for Conversational Speech Synthesis
- URL: http://arxiv.org/abs/2509.06074v1
- Date: Sun, 07 Sep 2025 14:32:29 GMT
- Title: Multimodal Fine-grained Context Interaction Graph Modeling for Conversational Speech Synthesis
- Authors: Zhenqi Jia, Rui Liu, Berrak Sisman, Haizhou Li
- Abstract summary: Conversational Speech Synthesis (CSS) aims to generate speech with natural prosody. Existing methods overlook the fine-grained semantic and prosodic interaction modeling. We propose MFCIG-CSS, a novel Multimodal Fine-grained Context Interaction Graph-based CSS system.
- Score: 34.487544170634884
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Conversational Speech Synthesis (CSS) aims to generate speech with natural prosody by understanding the multimodal dialogue history (MDH). The latest work predicts the accurate prosody expression of the target utterance by modeling the utterance-level interaction characteristics of MDH and the target utterance. However, MDH contains fine-grained semantic and prosody knowledge at the word level. Existing methods overlook the fine-grained semantic and prosodic interaction modeling. To address this gap, we propose MFCIG-CSS, a novel Multimodal Fine-grained Context Interaction Graph-based CSS system. Our approach constructs two specialized multimodal fine-grained dialogue interaction graphs: a semantic interaction graph and a prosody interaction graph. These two interaction graphs effectively encode interactions between word-level semantics, prosody, and their influence on subsequent utterances in MDH. The encoded interaction features are then leveraged to enhance synthesized speech with natural conversational prosody. Experiments on the DailyTalk dataset demonstrate that MFCIG-CSS outperforms all baseline models in terms of prosodic expressiveness. Code and speech samples are available at https://github.com/AI-S2-Lab/MFCIG-CSS.
Related papers
- MIBURI: Towards Expressive Interactive Gesture Synthesis [62.45332399212876]
Embodied Conversational Agents (ECAs) aim to emulate human face-to-face interaction through speech, gestures, and facial expressions. Existing solutions for ECAs produce rigid, low-diversity motions that are unsuitable for human-like interaction. We present MIBURI, the first online, causal framework for generating expressive full-body gestures and facial expressions synchronized with real-time spoken dialogue.
arXiv Detail & Related papers (2026-03-03T18:59:51Z)
- TAVID: Text-Driven Audio-Visual Interactive Dialogue Generation [72.46711449668814]
We introduce TAVID, a unified framework that generates both interactive faces and conversational speech in a synchronized manner. We evaluate our system across four dimensions: talking face realism, listening head responsiveness, dyadic interaction, and speech quality.
arXiv Detail & Related papers (2025-12-23T12:04:23Z)
- DiffCSS: Diverse and Expressive Conversational Speech Synthesis with Diffusion Models [19.259178812147287]
Conversational speech synthesis (CSS) aims to synthesize both contextually appropriate and expressive speech. We propose DiffCSS, an innovative CSS framework that leverages diffusion models and an LM-based TTS backbone to generate diverse, expressive, and contextually coherent speech. Experimental results demonstrate that the synthesized speech from DiffCSS is more diverse, contextually coherent, and expressive than existing CSS systems.
arXiv Detail & Related papers (2025-02-27T09:53:48Z)
- Intra- and Inter-modal Context Interaction Modeling for Conversational Speech Synthesis [3.391256280235937]
Conversational Speech Synthesis (CSS) aims to effectively take the multimodal dialogue history (MDH) to generate speech with appropriate conversational prosody for target utterance. The key challenge of CSS is to model the interaction between the MDH and the target utterance. We propose a new intra-modal and inter-modal context interaction scheme-based CSS system, termed III-CSS.
arXiv Detail & Related papers (2024-12-25T01:35:59Z)
- Emphasis Rendering for Conversational Text-to-Speech with Multi-modal Multi-scale Context Modeling [40.32021786228235]
Conversational Text-to-Speech (CTTS) aims to accurately express an utterance with the appropriate style within a conversational setting.
We propose a novel Emphasis Rendering scheme for the CTTS model, termed ER-CTTS.
To address data scarcity, we create emphasis intensity annotations on the existing conversational dataset (DailyTalk).
arXiv Detail & Related papers (2024-10-12T13:02:31Z)
- Generative Expressive Conversational Speech Synthesis [47.53014375797254]
Conversational Speech Synthesis (CSS) aims to express a target utterance with the proper speaking style in a user-agent conversation setting.
We propose a novel generative expressive CSS system, termed GPT-Talker.
We transform the multimodal information of the multi-turn dialogue history into discrete token sequences and seamlessly integrate them to form a comprehensive user-agent dialogue context.
arXiv Detail & Related papers (2024-07-31T10:02:21Z)
- Emotion Rendering for Conversational Speech Synthesis with Heterogeneous Graph-Based Context Modeling [50.99252242917458]
Conversational Speech Synthesis (CSS) aims to accurately express an utterance with the appropriate prosody and emotional inflection within a conversational setting.
To address the issue of data scarcity, we meticulously create emotional labels in terms of category and intensity.
Our model outperforms the baseline models in understanding and rendering emotions.
arXiv Detail & Related papers (2023-12-19T08:47:50Z)
- Conversational Semantic Parsing using Dynamic Context Graphs [68.72121830563906]
We consider the task of conversational semantic parsing over general purpose knowledge graphs (KGs) with millions of entities, and thousands of relation-types.
We focus on models which are capable of interactively mapping user utterances into executable logical forms.
arXiv Detail & Related papers (2023-05-04T16:04:41Z)
- FCTalker: Fine and Coarse Grained Context Modeling for Expressive Conversational Speech Synthesis [75.74906149219817]
Conversational Text-to-Speech (TTS) aims to synthesize an utterance with the right linguistic and affective prosody in a conversational context.
We propose a novel expressive conversational TTS model, termed FCTalker, that learns the fine and coarse grained context dependency at the same time during speech generation.
arXiv Detail & Related papers (2022-10-27T12:20:20Z)
- Discovering Dialog Structure Graph for Open-Domain Dialog Generation [51.29286279366361]
We conduct unsupervised discovery of dialog structure from chitchat corpora.
We then leverage it to facilitate dialog generation in downstream systems.
We present a Discrete Variational Auto-Encoder with Graph Neural Network (DVAE-GNN), to discover a unified human-readable dialog structure.
arXiv Detail & Related papers (2020-12-31T10:58:37Z)
- Dialogue Relation Extraction with Document-level Heterogeneous Graph Attention Networks [21.409522845011907]
Dialogue relation extraction (DRE) aims to detect the relation between two entities mentioned in a multi-party dialogue.
We present a graph attention network-based method for DRE where a graph contains meaningfully connected speaker, entity, entity-type, and utterance nodes.
We empirically show that this graph-based approach quite effectively captures the relations between different entity pairs in a dialogue as it outperforms the state-of-the-art approaches.
arXiv Detail & Related papers (2020-09-10T18:51:48Z)
- A Graph-based Interactive Reasoning for Human-Object Interaction Detection [71.50535113279551]
We present a novel graph-based interactive reasoning model called Interactive Graph (abbr. in-Graph) to infer HOIs.
We construct a new framework to assemble in-Graph models for detecting HOIs, namely in-GraphNet.
Our framework is end-to-end trainable and free from costly annotations like human pose.
arXiv Detail & Related papers (2020-07-14T09:29:03Z)