Towards Expressive Speaking Style Modelling with Hierarchical Context
Information for Mandarin Speech Synthesis
- URL: http://arxiv.org/abs/2203.12201v1
- Date: Wed, 23 Mar 2022 05:27:57 GMT
- Authors: Shun Lei, Yixuan Zhou, Liyang Chen, Zhiyong Wu, Shiyin Kang, Helen
Meng
- Abstract summary: We propose a hierarchical framework to model speaking style from context.
A hierarchical context encoder is proposed to explore a wider range of contextual information.
To encourage this encoder to learn style representation better, we introduce a novel training strategy.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Previous works on expressive speech synthesis mainly focus on the
current sentence and neglect the context in adjacent sentences, resulting in an
inflexible speaking style for the same text and a lack of speech variation. In
this paper, we propose a hierarchical framework to model speaking style from
context. A hierarchical context encoder is proposed to explore a wider range of
contextual information by considering the structural relationships in context,
including inter-phrase and inter-sentence relations. Moreover, to encourage
this encoder to learn better style representations, we introduce a novel
training strategy with knowledge distillation, which provides the target for
encoder training. Both objective and subjective evaluations on a Mandarin
lecture dataset demonstrate that the proposed method significantly improves
the naturalness and expressiveness of the synthesized speech.
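The abstract's framework can be caricatured as two levels of aggregation (phrases into sentences, sentences into a context-level style vector) supervised by a distillation target extracted from ground-truth speech. Below is a minimal numpy sketch of that idea; mean pooling stands in for the paper's learned inter-phrase and inter-sentence modules, and all names, shapes, and the MSE loss form are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_pool(x):
    """Average a sequence of vectors into one summary vector."""
    return x.mean(axis=0)

def hierarchical_context_encoder(context_sentences):
    """Encode context hierarchically: phrase embeddings are pooled into
    sentence vectors, which are pooled into one context-level style vector.

    `context_sentences` is a list of sentences, each a list of phrase
    embeddings (np.ndarray of shape (dim,)).
    """
    sentence_vecs = np.stack([
        mean_pool(np.stack(phrases)) for phrases in context_sentences
    ])
    return mean_pool(sentence_vecs)  # shape (dim,)

def distillation_loss(student_style, teacher_style):
    """MSE between the encoder's predicted style vector and a target style
    vector distilled from ground-truth speech (the training target the
    abstract mentions)."""
    return float(np.mean((student_style - teacher_style) ** 2))

# Toy example: 3 context sentences with 2, 3, and 4 phrases, 8-dim embeddings.
dim = 8
context = [[rng.standard_normal(dim) for _ in range(n)] for n in (2, 3, 4)]
style = hierarchical_context_encoder(context)
target = rng.standard_normal(dim)  # stand-in for a teacher's style vector
loss = distillation_loss(style, target)
print(style.shape)  # (8,)
```

In the actual model the pooling stages would be learned (e.g. attention over phrases and sentences) and the teacher target would come from a reference-speech style extractor rather than random data.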
Related papers
- Emphasis Rendering for Conversational Text-to-Speech with Multi-modal Multi-scale Context Modeling [40.32021786228235]
Conversational Text-to-Speech (CTTS) aims to accurately express an utterance with the appropriate style within a conversational setting.
We propose a novel Emphasis Rendering scheme for the CTTS model, termed ER-CTTS.
To address data scarcity, we create emphasis intensity annotations on an existing conversational dataset (DailyTalk).
arXiv Detail & Related papers (2024-10-12T13:02:31Z) - Generative Adversarial Training for Text-to-Speech Synthesis Based on
Raw Phonetic Input and Explicit Prosody Modelling [0.36868085124383626]
We describe an end-to-end speech synthesis system that uses generative adversarial training.
We train our Vocoder for raw phoneme-to-audio conversion, using explicit phonetic, pitch and duration modeling.
arXiv Detail & Related papers (2023-10-14T18:15:51Z) - Revisiting Conversation Discourse for Dialogue Disentanglement [88.3386821205896]
We propose enhancing dialogue disentanglement by taking full advantage of the dialogue discourse characteristics.
We develop a structure-aware framework to integrate the rich structural features for better modeling the conversational semantic context.
Our work has great potential to facilitate broader multi-party multi-thread dialogue applications.
arXiv Detail & Related papers (2023-06-06T19:17:47Z) - FCTalker: Fine and Coarse Grained Context Modeling for Expressive
Conversational Speech Synthesis [75.74906149219817]
Conversational Text-to-Speech (TTS) aims to synthesize an utterance with the right linguistic and affective prosody in a conversational context.
We propose a novel expressive conversational TTS model, termed FCTalker, that learns fine- and coarse-grained context dependencies simultaneously during speech generation.
arXiv Detail & Related papers (2022-10-27T12:20:20Z) - Improve Discourse Dependency Parsing with Contextualized Representations [28.916249926065273]
We propose to take advantage of transformers to encode contextualized representations of units of different levels.
Motivated by the observation of writing patterns commonly shared across articles, we propose a novel method that treats discourse relation identification as a sequence labelling task.
arXiv Detail & Related papers (2022-05-04T14:35:38Z) - Spoken Style Learning with Multi-modal Hierarchical Context Encoding for
Conversational Text-to-Speech Synthesis [59.27994987902646]
Research on learning spoken styles from historical conversations is still in its infancy.
Existing methods consider only the transcripts of the historical conversations, neglecting the spoken styles conveyed in the historical speech.
We propose a spoken style learning approach with multi-modal hierarchical context encoding.
arXiv Detail & Related papers (2021-06-11T08:33:52Z) - Towards Multi-Scale Style Control for Expressive Speech Synthesis [60.08928435252417]
The proposed method employs a multi-scale reference encoder to extract both the global-scale utterance-level and the local-scale quasi-phoneme-level style features of the target speech.
During training, the multi-scale style model can be jointly trained with the speech synthesis model in an end-to-end fashion.
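The multi-scale idea above (one global utterance-level style vector plus local quasi-phoneme-level vectors) can be sketched with simple pooling. In this toy version, fixed-size frame grouping stands in for a learned quasi-phoneme segmentation; the function name, frame counts, and mel dimensions are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def multi_scale_style(mel_frames, frames_per_unit=4):
    """Toy multi-scale reference encoder: returns a global utterance-level
    style vector (mean over all frames) and local quasi-phoneme-level
    vectors (means over fixed-size groups of frames)."""
    global_style = mel_frames.mean(axis=0)
    n_units = len(mel_frames) // frames_per_unit
    usable = mel_frames[: n_units * frames_per_unit]
    local_styles = usable.reshape(
        n_units, frames_per_unit, mel_frames.shape[1]
    ).mean(axis=1)
    return global_style, local_styles

# 20 frames of an 80-bin mel spectrogram as reference speech.
mel = rng.standard_normal((20, 80))
g, locs = multi_scale_style(mel)
print(g.shape, locs.shape)  # (80,) (5, 80)
```

A real reference encoder would use convolutional and recurrent layers rather than mean pooling, but the output shapes (one vector per utterance, one per quasi-phoneme unit) capture the two style scales described above.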
arXiv Detail & Related papers (2021-04-08T05:50:09Z) - Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language directly into text in another language.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.