Related papers: Emphasis Rendering for Conversational Text-to-Speech with Multi-modal Multi-scale Context Modeling

URL: http://arxiv.org/abs/2410.09524v1
Date: Sat, 12 Oct 2024 13:02:31 GMT
Title: Emphasis Rendering for Conversational Text-to-Speech with Multi-modal Multi-scale Context Modeling
Authors: Rui Liu, Zhenqi Jia, Jie Yang, Yifan Hu, Haizhou Li
Abstract summary: Conversational Text-to-Speech (CTTS) aims to accurately express an utterance with the appropriate style within a conversational setting. We propose a novel Emphasis Rendering scheme for the CTTS model, termed ER-CTTS. To address data scarcity, we create emphasis intensity annotations on the existing conversational dataset (DailyTalk).
Score: 40.32021786228235
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Conversational Text-to-Speech (CTTS) aims to accurately express an utterance with the appropriate style within a conversational setting, which attracts more attention nowadays. While recognizing the significance of the CTTS task, prior studies have not thoroughly investigated speech emphasis expression, which is essential for conveying the underlying intention and attitude in human-machine interaction scenarios, due to the scarcity of conversational emphasis datasets and the difficulty in context understanding. In this paper, we propose a novel Emphasis Rendering scheme for the CTTS model, termed ER-CTTS, that includes two main components: 1) we simultaneously take into account textual and acoustic contexts, with both global and local semantic modeling to understand the conversation context comprehensively; 2) we deeply integrate multi-modal and multi-scale context to learn the influence of context on the emphasis expression of the current utterance. Finally, the inferred emphasis feature is fed into the neural speech synthesizer to generate conversational speech. To address data scarcity, we create emphasis intensity annotations on the existing conversational dataset (DailyTalk). Both objective and subjective evaluations suggest that our model outperforms the baseline models in emphasis rendering within a conversational setting. The code and audio samples are available at https://github.com/CodeStoreTTS/ER-CTTS.

Related papers

Detecting Referring Expressions in Visually Grounded Dialogue with Autoregressive Language Models [3.8673630752805446]
The aim is to investigate the extent to which the linguistic context alone can inform the detection of mentions. We adapt a pretrained large language model (LLM) to perform a relatively coarse-grained annotation of mention spans in unfolding conversations. Our findings indicate that even when using a moderately sized LLM, relatively small datasets, and parameter-efficient fine-tuning, a text-only approach can be effective.
arXiv Detail & Related papers (2025-06-26T14:14:20Z)
Conversational Speech Recognition by Learning Audio-textual Cross-modal Contextual Representation [27.926862030684926]
We introduce a novel conversational ASR system, extending the Conformer encoder-decoder model with cross-modal conversational representation. Our approach combines pre-trained speech and text models through a specialized encoder and a modal-level mask input. By introducing both cross-modal and conversational representations into the decoder, our model retains context over longer sentences without information loss.
arXiv Detail & Related papers (2023-10-22T11:57:33Z)
Learning Speech Representation From Contrastive Token-Acoustic Pretraining [57.08426714676043]
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space. The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR.
arXiv Detail & Related papers (2023-09-01T12:35:43Z)
Visual-Aware Text-to-Speech [101.89332968344102]
We present a new visual-aware text-to-speech (VA-TTS) task to synthesize speech conditioned on both textual inputs and visual feedback of the listener in face-to-face communication. We devise a baseline model to fuse phoneme linguistic information and listener visual signals for speech synthesis.
arXiv Detail & Related papers (2023-06-21T05:11:39Z)
M2-CTTS: End-to-End Multi-scale Multi-modal Conversational Text-to-Speech Synthesis [38.85861825252267]
M2-CTTS aims to comprehensively utilize historical conversation and enhance prosodic expression. We design a textual context module and an acoustic context module with both coarse-grained and fine-grained modeling.
arXiv Detail & Related papers (2023-05-03T16:59:38Z)
Contextual Expressive Text-to-Speech [25.050361896378533]
We introduce a new task setting, Contextual Text-to-speech (CTTS). The main idea of CTTS is that how a person speaks depends on the particular context she is in, where the context can typically be represented as text. We construct a synthetic dataset and develop an effective framework to generate high-quality expressive speech based on the given context.
arXiv Detail & Related papers (2022-11-26T12:06:21Z)
FCTalker: Fine and Coarse Grained Context Modeling for Expressive Conversational Speech Synthesis [75.74906149219817]
Conversational Text-to-Speech (TTS) aims to synthesize an utterance with the right linguistic and affective prosody in a conversational context. We propose a novel expressive conversational TTS model, termed FCTalker, that learns the fine- and coarse-grained context dependency at the same time during speech generation.
arXiv Detail & Related papers (2022-10-27T12:20:20Z)
Spoken Style Learning with Multi-modal Hierarchical Context Encoding for Conversational Text-to-Speech Synthesis [59.27994987902646]
The study about learning spoken styles from historical conversations is still in its infancy. Only the transcripts of the historical conversations are considered, which neglects the spoken styles in historical speeches. We propose a spoken style learning approach with multi-modal hierarchical context encoding.
arXiv Detail & Related papers (2021-06-11T08:33:52Z)
Topic-Aware Multi-turn Dialogue Modeling [91.52820664879432]
This paper presents a novel solution for multi-turn dialogue modeling, which segments and extracts topic-aware utterances in an unsupervised way. Our topic-aware modeling is implemented by a newly proposed unsupervised topic-aware segmentation algorithm and Topic-Aware Dual-attention Matching (TADAM) Network.
arXiv Detail & Related papers (2020-09-26T08:43:06Z)

This list is automatically generated from the titles and abstracts of the papers in this site.
