Spoken Style Learning with Multi-modal Hierarchical Context Encoding for
Conversational Text-to-Speech Synthesis
- URL: http://arxiv.org/abs/2106.06233v1
- Date: Fri, 11 Jun 2021 08:33:52 GMT
- Title: Spoken Style Learning with Multi-modal Hierarchical Context Encoding for
Conversational Text-to-Speech Synthesis
- Authors: Jingbei Li, Yi Meng, Chenyi Li, Zhiyong Wu, Helen Meng, Chao Weng and
Dan Su
- Abstract summary: Research on learning spoken styles from historical conversations is still in its infancy.
Prior work considers only the transcripts of the historical conversations, neglecting the spoken styles in the historical speech.
We propose a spoken style learning approach with multi-modal hierarchical context encoding.
- Score: 59.27994987902646
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: For conversational text-to-speech (TTS) systems, it is vital that the systems
can adjust the spoken styles of synthesized speech according to the different
content and spoken styles in historical conversations. However, research on
learning spoken styles from historical conversations is still in its infancy.
Existing work considers only the transcripts of the historical conversations,
neglecting the spoken styles in the historical speech. Moreover, only the
global-aspect interactions between speakers are modeled, missing the
party-aspect self-interactions within each speaker. In this paper, to achieve
better spoken style learning for conversational TTS, we propose a spoken style
learning approach with multi-modal hierarchical context encoding. The textual
information and spoken styles in the historical conversations are processed
through multiple hierarchical recurrent neural networks to learn spoken
style related features in the global and party aspects. An attention mechanism
is then employed to summarize these features into a conversational context
encoding. Experimental results demonstrate the effectiveness of the proposed
approach, which outperforms a baseline method using a context encoding learnt
only from transcripts in the global aspect: the MOS for naturalness of
synthesized speech increases from 3.138 to 3.408, and the ABX preference rate
exceeds the baseline by 36.45%.
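The encoder described in the abstract can be sketched roughly as follows: per-utterance text and style embeddings are fused, processed by recurrent branches for the global aspect (all utterances) and the party aspect (each speaker's own utterances), and summarized by attention into a single context vector. This is a minimal illustrative sketch, not the authors' implementation; the class name, dimensions, GRU choice, and the masking used to approximate the party branch are all assumptions.

```python
import torch
import torch.nn as nn

class HierarchicalContextEncoder(nn.Module):
    """Hypothetical sketch of a multi-modal hierarchical context encoder."""

    def __init__(self, text_dim=256, style_dim=128, hidden=128):
        super().__init__()
        fused = text_dim + style_dim
        # Global aspect: one GRU over the full utterance history.
        self.global_rnn = nn.GRU(fused, hidden, batch_first=True)
        # Party aspect: a GRU whose outputs are masked to each
        # speaker's own utterances (a simplification of running a
        # separate RNN per speaker).
        self.party_rnn = nn.GRU(fused, hidden, batch_first=True)
        # Additive attention scoring each per-utterance feature.
        self.attn = nn.Linear(2 * hidden, 1)
        self.out = nn.Linear(2 * hidden, hidden)

    def forward(self, text_feats, style_feats, speaker_ids, current_speaker):
        # text_feats:  (B, T, text_dim)  utterance-level text embeddings
        # style_feats: (B, T, style_dim) utterance-level style embeddings
        # speaker_ids: (B, T) integer speaker label per utterance
        # current_speaker: (B,) speaker of the utterance to synthesize
        x = torch.cat([text_feats, style_feats], dim=-1)
        g, _ = self.global_rnn(x)                          # (B, T, hidden)
        p, _ = self.party_rnn(x)                           # (B, T, hidden)
        # Keep party features only for the current speaker's own
        # history (the party-aspect self-interactions).
        mask = (speaker_ids == current_speaker.unsqueeze(1)).unsqueeze(-1)
        p = p * mask
        h = torch.cat([g, p], dim=-1)                      # (B, T, 2*hidden)
        w = torch.softmax(self.attn(h).squeeze(-1), dim=-1)  # (B, T)
        context = torch.einsum("bt,bth->bh", w, h)         # attention pooling
        return self.out(context)                           # (B, hidden)
```

The attention step collapses the variable-length conversation history into a fixed-size conversational context encoding that a TTS acoustic model could consume as a conditioning vector.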
Related papers
- Conversational Speech Recognition by Learning Audio-textual Cross-modal Contextual Representation [27.926862030684926]
We introduce a novel conversational ASR system, extending the Conformer encoder-decoder model with cross-modal conversational representation.
Our approach combines pre-trained speech and text models through a specialized encoder and a modal-level mask input.
By introducing both cross-modal and conversational representations into the decoder, our model retains context over longer sentences without information loss.
arXiv Detail & Related papers (2023-10-22T11:57:33Z) - ContextSpeech: Expressive and Efficient Text-to-Speech for Paragraph
Reading [65.88161811719353]
This work develops a lightweight yet effective Text-to-Speech system, ContextSpeech.
We first design a memory-cached recurrence mechanism to incorporate global text and speech context into sentence encoding.
We construct hierarchically-structured textual semantics to broaden the scope for global context enhancement.
Experiments show that ContextSpeech significantly improves the voice quality and prosody in paragraph reading with competitive model efficiency.
arXiv Detail & Related papers (2023-07-03T06:55:03Z) - Visual-Aware Text-to-Speech [101.89332968344102]
We present a new visual-aware text-to-speech (VA-TTS) task to synthesize speech conditioned on both textual inputs and visual feedback of the listener in face-to-face communication.
We devise a baseline model to fuse phoneme linguistic information and listener visual signals for speech synthesis.
arXiv Detail & Related papers (2023-06-21T05:11:39Z) - M2-CTTS: End-to-End Multi-scale Multi-modal Conversational
Text-to-Speech Synthesis [38.85861825252267]
M2-CTTS aims to comprehensively utilize historical conversation and enhance prosodic expression.
We design a textual context module and an acoustic context module with both coarse-grained and fine-grained modeling.
arXiv Detail & Related papers (2023-05-03T16:59:38Z) - A Vector Quantized Approach for Text to Speech Synthesis on Real-World
Spontaneous Speech [94.64927912924087]
We train TTS systems using real-world speech from YouTube and podcasts.
The proposed text-to-speech architecture is designed for multiple code generation and monotonic alignment.
We show that this architecture outperforms existing TTS systems in several objective and subjective measures.
arXiv Detail & Related papers (2023-02-08T17:34:32Z) - FCTalker: Fine and Coarse Grained Context Modeling for Expressive
Conversational Speech Synthesis [75.74906149219817]
Conversational text-to-speech (TTS) aims to synthesize an utterance with the right linguistic and affective prosody in a conversational context.
We propose a novel expressive conversational TTS model, termed FCTalker, that learns fine- and coarse-grained context dependencies simultaneously during speech generation.
arXiv Detail & Related papers (2022-10-27T12:20:20Z) - End-to-End Text-to-Speech Based on Latent Representation of Speaking
Styles Using Spontaneous Dialogue [19.149834552175076]
This study aims to realize a text-to-speech (TTS) that closely resembles human dialogue.
First, we record and transcribe actual spontaneous dialogues.
The proposed dialogue TTS is trained in two stages: in the first stage, a variational autoencoder (VAE)-VITS or Gaussian mixture variational autoencoder (GMVAE)-VITS model is trained.
arXiv Detail & Related papers (2022-06-24T02:32:12Z) - End-to-end Spoken Conversational Question Answering: Task, Dataset and
Model [92.18621726802726]
In spoken question answering, the systems are designed to answer questions from contiguous text spans within the related speech transcripts.
We propose a new Spoken Conversational Question Answering task (SCQA), aiming at enabling the systems to model complex dialogue flows.
Our main objective is to build a system that handles conversational questions over audio recordings, and to explore the plausibility of providing additional cues from different modalities to aid information gathering.
arXiv Detail & Related papers (2022-04-29T17:56:59Z) - Towards Expressive Speaking Style Modelling with Hierarchical Context
Information for Mandarin Speech Synthesis [37.93814851450597]
We propose a hierarchical framework to model speaking style from context.
A hierarchical context encoder is proposed to explore a wider range of contextual information.
To encourage this encoder to learn style representation better, we introduce a novel training strategy.
arXiv Detail & Related papers (2022-03-23T05:27:57Z) - Who says like a style of Vitamin: Towards Syntax-Aware
Dialogue Summarization using Multi-task Learning [2.251583286448503]
We focus on the association between utterances from individual speakers and unique syntactic structures.
Speakers have unique textual styles that can contain linguistic information, such as voiceprint.
We employ multi-task learning of both syntax-aware information and dialogue summarization.
arXiv Detail & Related papers (2021-09-29T05:30:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.