Contextual Expressive Text-to-Speech
- URL: http://arxiv.org/abs/2211.14548v1
- Date: Sat, 26 Nov 2022 12:06:21 GMT
- Title: Contextual Expressive Text-to-Speech
- Authors: Jianhong Tu, Zeyu Cui, Xiaohuan Zhou, Siqi Zheng, Kai Hu, Ju Fan,
Chang Zhou
- Abstract summary: We introduce a new task setting, Contextual Text-to-Speech (CTTS).
The main idea of CTTS is that how a person speaks depends on the particular context she is in, where the context can typically be represented as text.
We construct a synthetic dataset and develop an effective framework to generate high-quality expressive speech based on the given context.
- Score: 25.050361896378533
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The goal of expressive Text-to-Speech (TTS) is to synthesize natural speech
with the desired content, prosody, emotion, or timbre, and with high expressiveness.
Most previous studies attempt to generate speech from given labels of styles and
emotions, which over-simplifies the problem by classifying styles and emotions
into a fixed number of pre-defined categories. In this paper, we introduce a
new task setting, Contextual TTS (CTTS). The main idea of CTTS is that how a
person speaks depends on the particular context she is in, where the context
can typically be represented as text. Thus, in the CTTS task, we propose to
utilize such context to guide the speech synthesis process instead of relying
on explicit labels of styles and emotions. To support this task, we construct a
synthetic dataset and develop an effective framework. Experiments show that our
framework can generate high-quality expressive speech based on the given
context, in both synthetic datasets and real-world scenarios.
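The abstract describes the conditioning pattern but not the architecture. Below is a minimal, hypothetical sketch of the general idea: a context encoder maps the surrounding text to a style embedding that conditions a toy TTS acoustic model, in place of a discrete emotion label. All module names, dimensions, and the GRU-based design are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Maps a tokenized textual context to a fixed-size style embedding.
    Hypothetical module: CTTS conditions synthesis on context text rather
    than a discrete emotion label; the real architecture may differ."""
    def __init__(self, vocab_size=10000, dim=256, style_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.proj = nn.Linear(dim, style_dim)

    def forward(self, context_ids):            # (B, T_ctx)
        x = self.embed(context_ids)
        _, h = self.encoder(x)                 # h: (1, B, dim)
        return self.proj(h.squeeze(0))         # (B, style_dim)

class CTTSModel(nn.Module):
    """Toy acoustic model: phoneme encoder + style-conditioned decoder."""
    def __init__(self, n_phonemes=100, dim=256, style_dim=128, n_mels=80):
        super().__init__()
        self.phoneme_embed = nn.Embedding(n_phonemes, dim)
        self.context_encoder = ContextEncoder(style_dim=style_dim)
        self.decoder = nn.GRU(dim + style_dim, dim, batch_first=True)
        self.mel_out = nn.Linear(dim, n_mels)

    def forward(self, phoneme_ids, context_ids):
        style = self.context_encoder(context_ids)            # (B, style_dim)
        phones = self.phoneme_embed(phoneme_ids)             # (B, T, dim)
        # Broadcast the style vector over every phoneme frame.
        style_seq = style.unsqueeze(1).expand(-1, phones.size(1), -1)
        out, _ = self.decoder(torch.cat([phones, style_seq], dim=-1))
        return self.mel_out(out)                             # (B, T, n_mels)

# Usage: the same phoneme sequence can be rendered under different contexts.
model = CTTSModel()
phonemes = torch.randint(0, 100, (1, 20))
context = torch.randint(0, 10000, (1, 50))
mel = model(phonemes, context)   # (1, 20, 80) mel-spectrogram frames
```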
Related papers
- Making Social Platforms Accessible: Emotion-Aware Speech Generation with Integrated Text Analysis [3.8251125989631674]
We propose an end-to-end context-aware Text-to-Speech (TTS) synthesis system.
It derives the conveyed emotion from the text input and synthesizes audio that reflects emotion and speaker features, yielding natural and expressive speech.
Our system showcases competitive inference time performance when benchmarked against state-of-the-art TTS models.
arXiv Detail & Related papers (2024-10-24T23:18:02Z)
- Emphasis Rendering for Conversational Text-to-Speech with Multi-modal Multi-scale Context Modeling [40.32021786228235]
Conversational Text-to-Speech (CTTS) aims to accurately express an utterance with the appropriate style within a conversational setting.
We propose a novel Emphasis Rendering scheme for the CTTS model, termed ER-CTTS.
To address data scarcity, we create emphasis intensity annotations on an existing conversational dataset (DailyTalk).
arXiv Detail & Related papers (2024-10-12T13:02:31Z)
- StoryTTS: A Highly Expressive Text-to-Speech Dataset with Rich Textual Expressiveness Annotations [12.891344121936902]
We introduce StoryTTS, a highly expressive TTS (ETTS) dataset that contains rich expressiveness from both the acoustic and textual perspectives.
Drawing on linguistics, rhetoric, and related fields, we analyze and define speech-related textual expressiveness in StoryTTS along five distinct dimensions.
The resulting corpus contains 61 hours of consecutive and highly prosodic speech equipped with accurate text transcriptions and rich textual expressiveness annotations.
arXiv Detail & Related papers (2024-04-23T11:41:35Z)
- EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis [49.04496602282718]
We introduce Expresso, a high-quality expressive speech dataset for textless speech synthesis.
This dataset includes both read speech and improvised dialogues rendered in 26 spontaneous expressive styles.
We evaluate resynthesis quality with automatic metrics for different self-supervised discrete encoders.
arXiv Detail & Related papers (2023-08-10T17:41:19Z)
- ContextSpeech: Expressive and Efficient Text-to-Speech for Paragraph Reading [65.88161811719353]
This work develops a lightweight yet effective Text-to-Speech system, ContextSpeech.
We first design a memory-cached recurrence mechanism to incorporate global text and speech context into sentence encoding.
We construct hierarchically-structured textual semantics to broaden the scope for global context enhancement.
Experiments show that ContextSpeech significantly improves the voice quality and prosody in paragraph reading with competitive model efficiency.
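The "memory-cached recurrence" summarized above resembles segment-level recurrence in the style of Transformer-XL: hidden states from previously encoded sentences are cached and attended to when encoding the current one. Below is a minimal sketch of that general idea under assumed names and sizes; it is not ContextSpeech's actual code.

```python
import torch
import torch.nn as nn

class CachedRecurrenceEncoder(nn.Module):
    """Sentence encoder that attends over a cache of past hidden states,
    in the spirit of Transformer-XL segment-level recurrence."""
    def __init__(self, dim=256, n_heads=4, mem_len=128):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, dim * 4), nn.ReLU(),
                                nn.Linear(dim * 4, dim))
        self.mem_len = mem_len
        self.memory = None   # cached states from previous sentences

    def forward(self, x):                        # x: (B, T, dim)
        # Keys/values cover the cached context plus the current sentence.
        kv = x if self.memory is None else torch.cat([self.memory, x], dim=1)
        out, _ = self.attn(x, kv, kv)
        out = out + self.ff(out)
        # Update the cache; detach so gradients stay within one sentence.
        self.memory = kv[:, -self.mem_len:].detach()
        return out

# Encode a paragraph sentence by sentence; each call sees earlier context.
enc = CachedRecurrenceEncoder()
for sent in [torch.randn(1, 30, 256), torch.randn(1, 25, 256)]:
    h = enc(sent)    # (1, T, 256), informed by the cached paragraph context
```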
arXiv Detail & Related papers (2023-07-03T06:55:03Z)
- Visual-Aware Text-to-Speech [101.89332968344102]
We present a new visual-aware text-to-speech (VA-TTS) task to synthesize speech conditioned on both textual inputs and visual feedback of the listener in face-to-face communication.
We devise a baseline model to fuse phoneme linguistic information and listener visual signals for speech synthesis.
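As a toy illustration of such fusion (an assumed stand-in, not the paper's baseline), listener visual features can be cross-attended from the phoneme sequence and merged residually:

```python
import torch
import torch.nn as nn

class PhonemeVisualFusion(nn.Module):
    """Fuses phoneme states with listener visual features via cross-attention
    (illustrative stand-in for a VA-TTS fusion module)."""
    def __init__(self, dim=256, n_heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, phoneme_states, visual_feats):
        # Queries: phonemes; keys/values: per-frame listener visual features.
        fused, _ = self.cross_attn(phoneme_states, visual_feats, visual_feats)
        return phoneme_states + fused    # residual fusion

fusion = PhonemeVisualFusion()
out = fusion(torch.randn(1, 20, 256), torch.randn(1, 50, 256))
```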
arXiv Detail & Related papers (2023-06-21T05:11:39Z)
- FCTalker: Fine and Coarse Grained Context Modeling for Expressive Conversational Speech Synthesis [75.74906149219817]
Conversational Text-to-Speech (TTS) aims to synthesize an utterance with the right linguistic and affective prosody in a conversational context.
We propose a novel expressive conversational TTS model, termed FCTalker, that learns fine- and coarse-grained context dependencies simultaneously during speech generation.
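A rough sketch of modeling both granularities at once (an assumed structure, not FCTalker's actual design): a word-level encoder captures fine-grained dependencies within the current turn, an utterance-level encoder over dialogue-history embeddings captures coarse-grained dependencies, and the two states are merged into a single context vector.

```python
import torch
import torch.nn as nn

class FineCoarseContext(nn.Module):
    """Combines fine-grained (word-level, current turn) and coarse-grained
    (utterance-level, dialogue history) context (illustrative structure)."""
    def __init__(self, dim=256):
        super().__init__()
        self.fine = nn.GRU(dim, dim, batch_first=True)
        self.coarse = nn.GRU(dim, dim, batch_first=True)
        self.merge = nn.Linear(2 * dim, dim)

    def forward(self, turn_words, history_utts):
        # turn_words: (B, T_words, dim); history_utts: (B, N_utts, dim)
        _, h_fine = self.fine(turn_words)        # last word-level state
        _, h_coarse = self.coarse(history_utts)  # last utterance-level state
        ctx = torch.cat([h_fine.squeeze(0), h_coarse.squeeze(0)], dim=-1)
        return self.merge(ctx)                   # (B, dim) context vector

ctx = FineCoarseContext()(torch.randn(2, 12, 256), torch.randn(2, 5, 256))
```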
arXiv Detail & Related papers (2022-10-27T12:20:20Z)
- Self-supervised Context-aware Style Representation for Expressive Speech Synthesis [23.460258571431414]
We propose a novel framework for learning style representation from plain text in a self-supervised manner.
It leverages an emotion lexicon and uses contrastive learning and deep clustering.
Our method achieves improved results according to subjective evaluations on both in-domain and out-of-domain test sets in audiobook speech.
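As a rough illustration of the contrastive component of such a pipeline, an InfoNCE-style loss can pull together style embeddings of texts that an emotion lexicon marks as similar and push apart the rest. This is a hedged sketch; the batching and the way positives/negatives are mined are placeholders, not the paper's method.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE: anchor/positive are style embeddings of texts with matching
    lexicon-derived emotion; negatives are embeddings of mismatched texts."""
    anchor = F.normalize(anchor, dim=-1)         # (B, D)
    positive = F.normalize(positive, dim=-1)     # (B, D)
    negatives = F.normalize(negatives, dim=-1)   # (B, K, D)

    pos_sim = (anchor * positive).sum(-1, keepdim=True)       # (B, 1)
    neg_sim = torch.einsum('bd,bkd->bk', anchor, negatives)   # (B, K)
    logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature
    # The positive always sits at index 0 of the logits.
    labels = torch.zeros(anchor.size(0), dtype=torch.long)
    return F.cross_entropy(logits, labels)

loss = info_nce_loss(torch.randn(8, 128), torch.randn(8, 128),
                     torch.randn(8, 16, 128))
```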
arXiv Detail & Related papers (2022-06-25T05:29:48Z)
- EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model [56.75775793011719]
We introduce and publicly release a Mandarin emotion speech dataset comprising 9,724 samples with audio files and human-labeled emotion annotations.
Unlike those models which need additional reference audio as input, our model could predict emotion labels just from the input text and generate more expressive speech conditioned on the emotion embedding.
In the experiment phase, we first validate the effectiveness of our dataset by an emotion classification task. Then we train our model on the proposed dataset and conduct a series of subjective evaluations.
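A minimal sketch of the label-free conditioning pattern the summary describes: a classifier predicts an emotion from the input text, and the corresponding emotion embedding conditions synthesis, so no reference audio is needed at inference. Names, dimensions, and the soft-expectation trick are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TextEmotionConditioner(nn.Module):
    """Predicts an emotion from text and returns its embedding for
    conditioning a downstream TTS decoder (illustrative only)."""
    def __init__(self, vocab=10000, dim=256, n_emotions=7):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.classifier = nn.Linear(dim, n_emotions)
        self.emotion_embed = nn.Embedding(n_emotions, dim)

    def forward(self, token_ids):                  # (B, T)
        pooled = self.embed(token_ids).mean(dim=1)
        logits = self.classifier(pooled)           # emotion prediction
        # Soft conditioning: expected emotion embedding under the predicted
        # distribution (a hard argmax lookup would also work at inference).
        probs = logits.softmax(dim=-1)
        return probs @ self.emotion_embed.weight, logits

cond = TextEmotionConditioner()
emo_vec, emo_logits = cond(torch.randint(0, 10000, (2, 40)))
```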
arXiv Detail & Related papers (2021-06-17T08:34:21Z)
- Limited Data Emotional Voice Conversion Leveraging Text-to-Speech: Two-stage Sequence-to-Sequence Training [91.95855310211176]
Emotional voice conversion aims to change the emotional state of an utterance while preserving the linguistic content and speaker identity.
We propose a novel 2-stage training strategy for sequence-to-sequence emotional voice conversion with a limited amount of emotional speech data.
The proposed framework can perform both spectrum and prosody conversion and achieves significant improvement over the state-of-the-art baselines in both objective and subjective evaluation.
arXiv Detail & Related papers (2021-03-31T04:56:14Z)