Fine-grained style control in Transformer-based Text-to-speech Synthesis
- URL: http://arxiv.org/abs/2110.06306v1
- Date: Tue, 12 Oct 2021 19:50:02 GMT
- Title: Fine-grained style control in Transformer-based Text-to-speech Synthesis
- Authors: Li-Wei Chen and Alexander Rudnicky
- Abstract summary: We present a novel architecture that realizes fine-grained style control in Transformer-based text-to-speech synthesis (TransformerTTS).
We model the speaking style by extracting a time sequence of local style tokens (LST) from the reference speech.
Experiments show that with fine-grained style control, our system performs better in terms of naturalness, intelligibility, and style transferability.
- Score: 78.92428622630861
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present a novel architecture to realize fine-grained style
control on the transformer-based text-to-speech synthesis (TransformerTTS).
Specifically, we model the speaking style by extracting a time sequence of
local style tokens (LST) from the reference speech. The existing content
encoder in TransformerTTS is then replaced by our designed cross-attention
blocks for fusion and alignment between content and style. As the fusion is
performed along with the skip connection, our cross-attention block provides a
good inductive bias to gradually infuse the phoneme representation with a given
style. Additionally, we prevent the style embedding from encoding linguistic
content by randomly truncating LST during training and using wav2vec 2.0
features. Experiments show that with fine-grained style control, our system
performs better in terms of naturalness, intelligibility, and style
transferability. Our code and samples are publicly available.
Related papers
- Style Mixture of Experts for Expressive Text-To-Speech Synthesis [7.6732312922460055]
StyleMoE is an approach that addresses the issue of learning averaged style representations in the style encoder.
The proposed method replaces the style encoder in a TTS framework with a Mixture of Experts layer.
Our experiments, both objective and subjective, demonstrate improved style transfer for diverse and unseen reference speech.
arXiv Detail & Related papers (2024-06-05T22:17:47Z)
- StyleCrafter: Enhancing Stylized Text-to-Video Generation with Style Adapter [78.75422651890776]
StyleCrafter is a generic method that enhances pre-trained T2V models with a style control adapter.
To promote content-style disentanglement, we remove style descriptions from the text prompt and extract style information solely from the reference image.
StyleCrafter efficiently generates high-quality stylized videos that align with the content of the texts and resemble the style of the reference images.
arXiv Detail & Related papers (2023-12-01T03:53:21Z)
- Expressive TTS Driven by Natural Language Prompts Using Few Human Annotations [12.891344121936902]
Expressive text-to-speech (TTS) aims to synthesize speech with human-like tones, moods, or even artistic attributes.
Recent advancements in TTS empower users with the ability to directly control synthesis style through natural language prompts.
We present FreeStyleTTS (FS-TTS), a controllable expressive TTS model with minimal human annotations.
arXiv Detail & Related papers (2023-11-02T14:20:37Z)
- ParaGuide: Guided Diffusion Paraphrasers for Plug-and-Play Textual Style Transfer [57.6482608202409]
Textual style transfer is the task of transforming stylistic properties of text while preserving meaning.
We introduce a novel diffusion-based framework for general-purpose style transfer that can be flexibly adapted to arbitrary target styles.
We validate the method on the Enron Email Corpus, with both human and automatic evaluations, and find that it outperforms strong baselines on formality, sentiment, and even authorship style transfer.
arXiv Detail & Related papers (2023-08-29T17:36:02Z)
- Improving Diffusion Models for Scene Text Editing with Dual Encoders [44.12999932588205]
Scene text editing is a challenging task that involves modifying or inserting specified texts in an image.
Recent advances in diffusion models show promise in overcoming the limitations of prior methods through text-conditional image editing.
We propose DIFFSTE to improve pre-trained diffusion models with a dual encoder design.
arXiv Detail & Related papers (2023-04-12T02:08:34Z)
- Fine-Grained Image Style Transfer with Visual Transformers [59.85619519384446]
We propose a novel STyle TRansformer (STTR) network which breaks both content and style images into visual tokens to achieve a fine-grained style transformation.
To compare STTR with existing approaches, we conduct user studies on Amazon Mechanical Turk.
arXiv Detail & Related papers (2022-10-11T06:26:00Z)
- Self-supervised Context-aware Style Representation for Expressive Speech Synthesis [23.460258571431414]
We propose a novel framework for learning style representation from plain text in a self-supervised manner.
It leverages an emotion lexicon and uses contrastive learning and deep clustering.
Our method achieves improved results according to subjective evaluations on both in-domain and out-of-domain test sets in audiobook speech.
arXiv Detail & Related papers (2022-06-25T05:29:48Z)
- GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech Synthesis [68.42632589736881]
This paper proposes GenerSpeech, a text-to-speech model for high-fidelity zero-shot style transfer to out-of-domain (OOD) custom voices.
GenerSpeech decomposes the speech variation into the style-agnostic and style-specific parts by introducing two components.
Our evaluations on zero-shot style transfer demonstrate that GenerSpeech surpasses the state-of-the-art models in terms of audio quality and style similarity.
arXiv Detail & Related papers (2022-05-15T08:16:02Z)
- StylePTB: A Compositional Benchmark for Fine-grained Controllable Text Style Transfer [90.6768813620898]
Style transfer aims to controllably generate text with targeted stylistic changes while preserving the core meaning of the source sentence.
We introduce a large-scale benchmark, StylePTB, with paired sentences undergoing 21 fine-grained stylistic changes spanning atomic lexical, syntactic, semantic, and thematic transfers of text.
We find that existing methods on StylePTB struggle to model fine-grained changes and have an even more difficult time composing multiple styles.
arXiv Detail & Related papers (2021-04-12T04:25:09Z)