M2-CTTS: End-to-End Multi-scale Multi-modal Conversational
Text-to-Speech Synthesis
- URL: http://arxiv.org/abs/2305.02269v1
- Date: Wed, 3 May 2023 16:59:38 GMT
- Title: M2-CTTS: End-to-End Multi-scale Multi-modal Conversational
Text-to-Speech Synthesis
- Authors: Jinlong Xue, Yayue Deng, Fengping Wang, Ya Li, Yingming Gao, Jianhua
Tao, Jianqing Sun, Jiaen Liang
- Abstract summary: M2-CTTS aims to comprehensively utilize historical conversation and enhance prosodic expression.
We design a textual context module and an acoustic context module with both coarse-grained and fine-grained modeling.
- Score: 38.85861825252267
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Conversational text-to-speech (TTS) aims to synthesize speech whose prosody
suits the reply, based on the historical conversation. However, it remains challenging to
comprehensively model the conversation, and the majority of conversational TTS systems
only focus on extracting global information while neglecting local prosody features,
which carry important fine-grained cues such as keywords and emphasis. Moreover,
considering only textual features is insufficient, since acoustic features also contain
rich prosody information. Hence, we propose M2-CTTS, an end-to-end multi-scale
multi-modal conversational text-to-speech system, aiming to comprehensively utilize the
historical conversation and enhance prosodic expression. More specifically, we design a
textual context module and an acoustic context module, each with both coarse-grained and
fine-grained modeling. Experimental results demonstrate that our model, which
incorporates fine-grained context information and additionally considers acoustic
features, achieves better prosody performance and naturalness in CMOS tests.
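The sketch below is a minimal, hypothetical illustration of the multi-scale multi-modal idea in the abstract: each modality (text and audio) contributes a coarse-grained, utterance-level summary of the conversation history and a fine-grained, token- or frame-level view that the current sentence attends to, and the two modalities are then fused into one conditioning sequence. Module names, dimensions, and fusion choices are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch (PyTorch): coarse + fine context per modality, then fusion.
# Not the M2-CTTS code; shapes and modules are illustrative assumptions.
import torch
import torch.nn as nn


class ContextModule(nn.Module):
    """One modality: coarse utterance-level summary + fine-grained attention."""

    def __init__(self, dim: int = 256):
        super().__init__()
        # Coarse-grained: summarize the sequence of utterance-level embeddings.
        self.coarse_rnn = nn.GRU(dim, dim, batch_first=True)
        # Fine-grained: the current sentence attends to token/frame features.
        self.fine_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, query, utt_embs, fine_feats):
        # query:      (B, T_q, D) encoder states of the current sentence
        # utt_embs:   (B, N, D)   one embedding per historical utterance
        # fine_feats: (B, L, D)   token- or frame-level features of the history
        _, h = self.coarse_rnn(utt_embs)              # h: (1, B, D)
        coarse = h[-1].unsqueeze(1).expand_as(query)  # broadcast over T_q
        fine, _ = self.fine_attn(query, fine_feats, fine_feats)
        return self.proj(torch.cat([coarse, fine], dim=-1))


class MultiModalContext(nn.Module):
    """Fuse textual and acoustic conversation context into one sequence."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.text_ctx = ContextModule(dim)
        self.audio_ctx = ContextModule(dim)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, query, text_utts, text_tokens, audio_utts, audio_frames):
        t = self.text_ctx(query, text_utts, text_tokens)
        a = self.audio_ctx(query, audio_utts, audio_frames)
        return self.fuse(torch.cat([t, a], dim=-1))   # (B, T_q, D)
```

In a FastSpeech 2-style backbone, such a conditioning sequence could simply be added to the phoneme encoder output before the variance adaptor and decoder; the attention mechanisms and fusion strategy actually used in M2-CTTS may differ.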
Related papers
- MM-TTS: A Unified Framework for Multimodal, Prompt-Induced Emotional Text-to-Speech Synthesis [70.06396781553191]
Multimodal Emotional Text-to-Speech System (MM-TTS) is a unified framework that leverages emotional cues from multiple modalities to generate highly expressive and emotionally resonant speech.
MM-TTS consists of two key components: the Emotion Prompt Alignment Module (EP-Align), which employs contrastive learning to align emotional features across text, audio, and visual modalities, and the Emotion Embedding-Induced TTS (EMI-TTS), which integrates the aligned emotional embeddings with state-of-the-art TTS models to synthesize speech that accurately reflects the intended emotions.
arXiv Detail & Related papers (2024-04-29T03:19:39Z)
- Towards Joint Modeling of Dialogue Response and Speech Synthesis based on Large Language Model [8.180382743037082]
This paper explores the potential of constructing an AI spoken dialogue system that "thinks how to respond" and "thinks how to speak" simultaneously.
arXiv Detail & Related papers (2023-09-20T01:48:27Z)
- Visual-Aware Text-to-Speech [101.89332968344102]
We present a new visual-aware text-to-speech (VA-TTS) task to synthesize speech conditioned on both textual inputs and visual feedback of the listener in face-to-face communication.
We devise a baseline model to fuse phoneme linguistic information and listener visual signals for speech synthesis.
arXiv Detail & Related papers (2023-06-21T05:11:39Z)
- End-to-end Spoken Conversational Question Answering: Task, Dataset and Model [92.18621726802726]
In spoken question answering, the systems are designed to answer questions from contiguous text spans within the related speech transcripts.
We propose a new Spoken Conversational Question Answering task (SCQA), aiming at enabling the systems to model complex dialogue flows.
Our main objective is to build a system that can handle conversational questions based on audio recordings, and to explore the plausibility of providing the system with additional cues from different modalities during information gathering.
arXiv Detail & Related papers (2022-04-29T17:56:59Z)
- Who says like a style of Vitamin: Towards Syntax-Aware Dialogue Summarization using Multi-task Learning [2.251583286448503]
We focus on the association between utterances from individual speakers and unique syntactic structures.
Speakers have unique textual styles that can contain linguistic information, such as voiceprint.
We employ multi-task learning of both syntax-aware information and dialogue summarization.
arXiv Detail & Related papers (2021-09-29T05:30:39Z)
- Spoken Style Learning with Multi-modal Hierarchical Context Encoding for Conversational Text-to-Speech Synthesis [59.27994987902646]
Research on learning spoken styles from historical conversations is still in its infancy.
Existing approaches consider only the transcripts of the historical conversations, neglecting the spoken styles present in the historical speech.
We propose a spoken style learning approach with multi-modal hierarchical context encoding.
arXiv Detail & Related papers (2021-06-11T08:33:52Z)
- SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze the input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)