Expressive TTS Driven by Natural Language Prompts Using Few Human
Annotations
- URL: http://arxiv.org/abs/2311.01260v1
- Date: Thu, 2 Nov 2023 14:20:37 GMT
- Title: Expressive TTS Driven by Natural Language Prompts Using Few Human
Annotations
- Authors: Hanglei Zhang, Yiwei Guo, Sen Liu, Xie Chen, Kai Yu
- Abstract summary: Expressive text-to-speech (TTS) aims to synthesize speech with human-like tones, moods, or even artistic attributes.
Recent advancements in TTS empower users with the ability to directly control synthesis style through natural language prompts.
We present FreeStyleTTS (FS-TTS), a controllable expressive TTS model with minimal human annotations.
- Score: 12.891344121936902
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Expressive text-to-speech (TTS) aims to synthesize speech with human-like
tones, moods, or even artistic attributes. Recent advancements in expressive
TTS empower users with the ability to directly control synthesis style through
natural language prompts. However, these methods often require excessive
training with a significant amount of style-annotated data, which can be
challenging to acquire. Moreover, they may have limited adaptability due to
fixed style annotations. In this work, we present FreeStyleTTS (FS-TTS), a
controllable expressive TTS model with minimal human annotations. Our approach
utilizes a large language model (LLM) to transform expressive TTS into a style
retrieval task. The LLM selects the best-matching style references from
annotated utterances based on external style prompts, which can be raw input
text or natural language style descriptions. The selected reference guides the
TTS pipeline to synthesize speeches with the intended style. This innovative
approach provides flexible, versatile, and precise style control with minimal
human workload. Experiments on a Mandarin storytelling corpus demonstrate
FS-TTS's proficiency in leveraging LLM's semantic inference ability to retrieve
desired styles from either input text or user-defined descriptions. This
results in synthetic speech that is closely aligned with the specified
styles.
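The retrieval-based pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `query_llm` stands in for any instruction-following LLM API, `synthesize_with_reference` for any TTS system that accepts a reference utterance for style conditioning, and the layout of the annotated references is an assumption made purely for illustration.

```python
# Minimal sketch of LLM-driven style retrieval for expressive TTS (illustrative only).
# `query_llm` and `synthesize_with_reference` are hypothetical placeholders.

def retrieve_style_reference(style_prompt, annotated_refs, query_llm):
    """Ask the LLM which annotated reference best matches the style prompt."""
    catalog = "\n".join(
        f"{i}: {ref['style_label']}" for i, ref in enumerate(annotated_refs)
    )
    instruction = (
        "Reply with the single index of the reference style that best matches "
        "the request below.\n"
        f"Style request: {style_prompt}\n"
        f"Candidate styles:\n{catalog}\n"
        "Index:"
    )
    answer = query_llm(instruction)
    index = int(answer.strip().split()[0])  # assumes the LLM replies with an index
    return annotated_refs[index]

def expressive_tts(text, style_prompt, annotated_refs, query_llm, synthesize_with_reference):
    """Synthesize `text` in the style of the reference selected by the LLM."""
    reference = retrieve_style_reference(style_prompt, annotated_refs, query_llm)
    return synthesize_with_reference(text=text, reference_audio=reference["audio"])
```

In this sketch the style prompt may be the raw input text itself or a separate natural language description, mirroring the two prompt types mentioned in the abstract.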
Related papers
- DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment [82.86363991170546] (2024-06-27)
We propose a Descriptive Speech-Text Alignment approach that leverages speech captioning to bridge the gap between speech and text modalities.
Our model demonstrates superior performance on the Dynamic-SUPERB benchmark, particularly in generalizing to unseen tasks.
These findings highlight the potential to reshape instruction-following SLMs by incorporating descriptive, rich speech captions.
- LibriTTS-P: A Corpus with Speaking Style and Speaker Identity Prompts for Text-to-Speech and Style Captioning [12.069474749489897] (2024-06-12)
We introduce LibriTTS-P, a new corpus based on LibriTTS-R that includes utterance-level descriptions (i.e., prompts) of speaking style and speaker-level prompts of speaker characteristics.
Results for style captioning tasks show that the model utilizing LibriTTS-P generates 2.5 times more accurate words than the model using a conventional dataset.
- Style Mixture of Experts for Expressive Text-To-Speech Synthesis [7.6732312922460055] (2024-06-05)
StyleMoE is an approach that addresses the issue of learning averaged style representations in the style encoder.
The proposed method replaces the style encoder in a TTS framework with a Mixture of Experts layer (a generic sketch of this idea appears after this list).
Our experiments, both objective and subjective, demonstrate improved style transfer for diverse and unseen reference speech.
- ParaGuide: Guided Diffusion Paraphrasers for Plug-and-Play Textual Style Transfer [57.6482608202409] (2023-08-29)
Textual style transfer is the task of transforming stylistic properties of text while preserving meaning.
We introduce a novel diffusion-based framework for general-purpose style transfer that can be flexibly adapted to arbitrary target styles.
We validate the method on the Enron Email Corpus, with both human and automatic evaluations, and find that it outperforms strong baselines on formality, sentiment, and even authorship style transfer.
- TextrolSpeech: A Text Style Control Speech Corpus With Codec Language Text-to-Speech Models [51.529485094900934] (2023-08-28)
We propose TextrolSpeech, which is the first large-scale speech emotion dataset annotated with rich text attributes.
We introduce a multi-stage prompt programming approach that effectively utilizes the GPT model for generating natural style descriptions in large volumes.
To address the need for generating audio with greater style diversity, we propose an efficient architecture called Salle.
- StylerDALLE: Language-Guided Style Transfer Using a Vector-Quantized Tokenizer of a Large-Scale Generative Model [64.26721402514957] (2023-03-16)
We propose StylerDALLE, a style transfer method that uses natural language to describe abstract art styles.
Specifically, we formulate the language-guided style transfer task as a non-autoregressive token sequence translation.
To incorporate style information, we propose a Reinforcement Learning strategy with CLIP-based language supervision.
- Text-driven Emotional Style Control and Cross-speaker Style Transfer in Neural TTS [7.384726530165295] (2022-07-13)
Style control of synthetic speech is often restricted to discrete emotion categories.
We propose a text-based interface for emotional style control and cross-speaker style transfer in multi-speaker TTS.
- Self-supervised Context-aware Style Representation for Expressive Speech Synthesis [23.460258571431414] (2022-06-25)
We propose a novel framework for learning style representation from plain text in a self-supervised manner.
It leverages an emotion lexicon and uses contrastive learning and deep clustering.
Our method achieves improved results according to subjective evaluations on both in-domain and out-of-domain test sets in audiobook speech.
- GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech Synthesis [68.42632589736881] (2022-05-15)
This paper proposes GenerSpeech, a text-to-speech model for high-fidelity zero-shot style transfer of out-of-domain (OOD) custom voice.
GenerSpeech decomposes speech variation into style-agnostic and style-specific parts by introducing two components.
Our evaluations on zero-shot style transfer demonstrate that GenerSpeech surpasses the state-of-the-art models in terms of audio quality and style similarity.
- Fine-grained style control in Transformer-based Text-to-speech Synthesis [78.92428622630861] (2021-10-12)
We present a novel architecture that realizes fine-grained style control in Transformer-based text-to-speech synthesis (TransformerTTS).
We model the speaking style by extracting a time sequence of local style tokens (LST) from the reference speech.
Experiments show that with fine-grained style control, our system performs better in terms of naturalness, intelligibility, and style transferability.
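For the Style Mixture of Experts entry above, the following generic sketch shows what replacing a single style encoder with a Mixture of Experts layer can look like. It is not the StyleMoE implementation: the expert architecture, dimensions, and dense softmax-weighted gating are assumptions chosen to keep the example short.

```python
# Generic Mixture-of-Experts style encoder sketch (illustrative, not StyleMoE itself).
# Each expert maps a pooled reference mel-spectrogram to a style embedding;
# a gating network mixes the expert outputs per utterance.
import torch
import torch.nn as nn

class MoEStyleEncoder(nn.Module):
    def __init__(self, mel_dim=80, style_dim=128, num_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(mel_dim, style_dim), nn.Tanh())
             for _ in range(num_experts)]
        )
        self.gate = nn.Linear(mel_dim, num_experts)  # scores the experts

    def forward(self, ref_mel):                       # ref_mel: (batch, frames, mel_dim)
        pooled = ref_mel.mean(dim=1)                  # time-average the reference
        weights = torch.softmax(self.gate(pooled), dim=-1)              # (batch, num_experts)
        expert_embs = torch.stack([e(pooled) for e in self.experts], dim=1)
        return (weights.unsqueeze(-1) * expert_embs).sum(dim=1)         # (batch, style_dim)

# Example: a batch of two 200-frame references yields two 128-dim style vectors.
style = MoEStyleEncoder()(torch.randn(2, 200, 80))
```

The intended benefit, per the summary above, is that different experts can specialize in different regions of the style space instead of a single encoder learning an averaged style representation.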