Text-driven Emotional Style Control and Cross-speaker Style Transfer in Neural TTS
- URL: http://arxiv.org/abs/2207.06000v1
- Date: Wed, 13 Jul 2022 07:05:44 GMT
- Title: Text-driven Emotional Style Control and Cross-speaker Style Transfer in Neural TTS
- Authors: Yookyung Shin, Younggun Lee, Suhee Jo, Yeongtae Hwang, Taesu Kim
- Abstract summary: Style control of synthetic speech is often restricted to discrete emotion categories.
We propose a text-based interface for emotional style control and cross-speaker style transfer in multi-speaker TTS.
- Score: 7.384726530165295
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Expressive text-to-speech has shown improved performance in recent years. However, the style control of synthetic speech is often restricted to discrete emotion categories and requires training data recorded by the target speaker in the target style. In many practical situations, users may not have reference speech recorded in the target emotion, yet they may still want to control the speech style simply by typing a text description of the desired emotional style. In this paper, we propose a text-based interface for emotional style control and cross-speaker style transfer in multi-speaker TTS. We propose a bi-modal style encoder that uses a pretrained language model to model the semantic relationship between the text description embedding and the speech style embedding. To further improve cross-speaker style transfer on disjoint, multi-style datasets, we propose a novel style loss. Experimental results show that our model can generate high-quality expressive speech even for unseen styles.
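To make the bi-modal idea more concrete, below is a minimal PyTorch sketch, assuming a BERT-style pretrained language model for the text description and a small GRU reference encoder for the speech side. The class names, dimensions, and the InfoNCE-style alignment loss are illustrative assumptions; they do not reproduce the paper's exact architecture or its proposed style loss for cross-speaker transfer.

```python
# Minimal sketch of a bi-modal style encoder: a pretrained language model embeds a
# free-form style description, a small reference encoder embeds the speech style,
# and a contrastive loss pulls matching pairs together. All module names, sizes,
# and the loss form are illustrative assumptions, not the paper's exact design.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel


class BiModalStyleEncoder(nn.Module):
    def __init__(self, lm_name: str = "bert-base-uncased", style_dim: int = 128):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(lm_name)
        self.lm = AutoModel.from_pretrained(lm_name)           # pretrained language model
        self.text_proj = nn.Linear(self.lm.config.hidden_size, style_dim)
        # Stand-in speech-side reference encoder: GRU over mel frames -> style vector.
        self.ref_rnn = nn.GRU(input_size=80, hidden_size=style_dim, batch_first=True)

    def encode_text(self, descriptions: list[str]) -> torch.Tensor:
        batch = self.tokenizer(descriptions, return_tensors="pt",
                               padding=True, truncation=True)
        hidden = self.lm(**batch).last_hidden_state[:, 0]       # [CLS] token embedding
        return F.normalize(self.text_proj(hidden), dim=-1)

    def encode_speech(self, mels: torch.Tensor) -> torch.Tensor:
        # mels: (batch, frames, 80) mel-spectrogram of the reference utterance
        _, last = self.ref_rnn(mels)
        return F.normalize(last.squeeze(0), dim=-1)


def style_alignment_loss(text_emb: torch.Tensor, speech_emb: torch.Tensor,
                         temperature: float = 0.1) -> torch.Tensor:
    """Symmetric InfoNCE-style loss tying matching description/speech pairs."""
    logits = text_emb @ speech_emb.t() / temperature
    targets = torch.arange(text_emb.size(0), device=text_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    enc = BiModalStyleEncoder()
    mels = torch.randn(2, 200, 80)                              # two dummy reference clips
    text_emb = enc.encode_text(["a cheerful, excited voice", "a calm, sad whisper"])
    speech_emb = enc.encode_speech(mels)
    print(style_alignment_loss(text_emb, speech_emb).item())
```

With such an encoder, inference-time style control would only require typing a description: the projected text embedding stands in for a reference-speech style embedding when conditioning the TTS decoder.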
Related papers
- Style Mixture of Experts for Expressive Text-To-Speech Synthesis [7.6732312922460055]
StyleMoE is an approach that addresses the issue of learning averaged style representations in the style encoder.
The proposed method replaces the style encoder in a TTS framework with a Mixture of Experts layer (a generic sketch of this idea appears after this list).
Our experiments, both objective and subjective, demonstrate improved style transfer for diverse and unseen reference speech.
arXiv Detail & Related papers (2024-06-05T22:17:47Z)
- Expressive TTS Driven by Natural Language Prompts Using Few Human Annotations [12.891344121936902]
Expressive text-to-speech (TTS) aims to synthesize speech with human-like tones, moods, or even artistic attributes.
Recent advancements in TTS empower users with the ability to directly control synthesis style through natural language prompts.
We present FreeStyleTTS (FS-TTS), a controllable expressive TTS model with minimal human annotations.
arXiv Detail & Related papers (2023-11-02T14:20:37Z)
- TextrolSpeech: A Text Style Control Speech Corpus With Codec Language Text-to-Speech Models [51.529485094900934]
We propose TextrolSpeech, which is the first large-scale speech emotion dataset annotated with rich text attributes.
We introduce a multi-stage prompt programming approach that effectively utilizes the GPT model for generating natural style descriptions in large volumes.
To address the need for generating audio with greater style diversity, we propose an efficient architecture called Salle.
arXiv Detail & Related papers (2023-08-28T09:06:32Z)
- ZS-MSTM: Zero-Shot Style Transfer for Gesture Animation driven by Text and Speech using Adversarial Disentanglement of Multimodal Style Encoding [3.609538870261841]
We propose a machine learning approach to synthesize gestures, driven by prosodic features and text, in the style of different speakers.
Our model incorporates zero-shot multimodal style transfer using multimodal data from the PATS database.
arXiv Detail & Related papers (2023-05-22T10:10:35Z)
- Conversation Style Transfer using Few-Shot Learning [56.43383396058639]
In this paper, we introduce conversation style transfer as a few-shot learning problem.
We propose a novel in-context learning approach to solve the task with style-free dialogues as a pivot.
We show that conversation style transfer can also benefit downstream tasks.
arXiv Detail & Related papers (2023-02-16T15:27:00Z)
- Self-supervised Context-aware Style Representation for Expressive Speech Synthesis [23.460258571431414]
We propose a novel framework for learning style representation from plain text in a self-supervised manner.
It leverages an emotion lexicon and uses contrastive learning and deep clustering.
Our method achieves improved results according to subjective evaluations on both in-domain and out-of-domain test sets in audiobook speech.
arXiv Detail & Related papers (2022-06-25T05:29:48Z)
- GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech Synthesis [68.42632589736881]
This paper proposes GenerSpeech, a text-to-speech model for high-fidelity zero-shot style transfer of out-of-domain (OOD) custom voices.
GenerSpeech decomposes the speech variation into the style-agnostic and style-specific parts by introducing two components.
Our evaluations on zero-shot style transfer demonstrate that GenerSpeech surpasses the state-of-the-art models in terms of audio quality and style similarity.
arXiv Detail & Related papers (2022-05-15T08:16:02Z)
- Spoken Style Learning with Multi-modal Hierarchical Context Encoding for Conversational Text-to-Speech Synthesis [59.27994987902646]
Research on learning spoken styles from historical conversations is still in its infancy. Existing approaches consider only the transcripts of the historical conversations, neglecting the spoken styles of the historical speech.
We propose a spoken style learning approach with multi-modal hierarchical context encoding.
arXiv Detail & Related papers (2021-06-11T08:33:52Z)
- Meta-StyleSpeech: Multi-Speaker Adaptive Text-to-Speech Generation [63.561944239071615]
StyleSpeech is a new TTS model which synthesizes high-quality speech and adapts to new speakers.
With Style-Adaptive Layer Normalization (SALN), our model effectively synthesizes speech in the style of the target speaker even from a single reference audio clip.
We extend it to Meta-StyleSpeech by introducing two discriminators trained with style prototypes, and performing episodic training.
arXiv Detail & Related papers (2021-06-06T15:34:11Z)
- Towards Multi-Scale Style Control for Expressive Speech Synthesis [60.08928435252417]
The proposed method employs a multi-scale reference encoder to extract both the global-scale utterance-level and the local-scale quasi-phoneme-level style features of the target speech.
During training, the multi-scale style model can be jointly trained with the speech synthesis model in an end-to-end fashion.
arXiv Detail & Related papers (2021-04-08T05:50:09Z)
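For the Style Mixture of Experts entry above, the following is a generic sketch of a Mixture-of-Experts style encoder, assuming a handful of small expert encoders mixed by a learned gate over a pooled utterance summary. The shapes, pooling, and expert design are illustrative assumptions rather than the StyleMoE paper's actual architecture.

```python
# Generic Mixture-of-Experts style encoder sketch: several small "expert" encoders
# each produce a style vector from a pooled mel summary, and a learned gate mixes
# them per utterance. All shapes and layer choices are illustrative assumptions.
import torch
import torch.nn as nn


class MoEStyleEncoder(nn.Module):
    def __init__(self, mel_dim: int = 80, style_dim: int = 128, n_experts: int = 4):
        super().__init__()
        # Each expert maps the pooled mel summary to its own style embedding.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(mel_dim, style_dim), nn.Tanh())
            for _ in range(n_experts)
        ])
        self.gate = nn.Linear(mel_dim, n_experts)  # soft routing over experts

    def forward(self, mels: torch.Tensor) -> torch.Tensor:
        # mels: (batch, frames, mel_dim); mean-pool frames as a cheap utterance summary.
        pooled = mels.mean(dim=1)
        weights = torch.softmax(self.gate(pooled), dim=-1)                  # (batch, n_experts)
        expert_out = torch.stack([e(pooled) for e in self.experts], dim=1)  # (batch, n_experts, style_dim)
        return (weights.unsqueeze(-1) * expert_out).sum(dim=1)              # mixed style embedding


if __name__ == "__main__":
    enc = MoEStyleEncoder()
    print(enc(torch.randn(2, 200, 80)).shape)  # torch.Size([2, 128])
```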
This list is automatically generated from the titles and abstracts of the papers on this site.