Related papers: Style Mixture of Experts for Expressive Text-To-Speech Synthesis

URL: http://arxiv.org/abs/2406.03637v2
Date: Mon, 28 Oct 2024 01:29:04 GMT
Title: Style Mixture of Experts for Expressive Text-To-Speech Synthesis
Authors: Ahad Jawaid, Shreeram Suresh Chandra, Junchen Lu, Berrak Sisman
Abstract summary: StyleMoE is an approach that addresses the issue of learning averaged style representations in the style encoder. The proposed method replaces the style encoder in a TTS framework with a Mixture of Experts layer. Our experiments, both objective and subjective, demonstrate improved style transfer for diverse and unseen reference speech.
Score: 7.6732312922460055
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Recent advances in style transfer text-to-speech (TTS) have improved the expressiveness of synthesized speech. However, encoding stylistic information (e.g., timbre, emotion, and prosody) from diverse and unseen reference speech remains a challenge. This paper introduces StyleMoE, an approach that addresses the issue of learning averaged style representations in the style encoder by creating style experts that learn from subsets of data. The proposed method replaces the style encoder in a TTS framework with a Mixture of Experts (MoE) layer. The style experts specialize by learning from subsets of reference speech routed to them by the gating network, enabling them to handle different aspects of the style space. As a result, StyleMoE improves the style coverage of the style encoder for style transfer TTS. Our experiments, both objective and subjective, demonstrate improved style transfer for diverse and unseen reference speech. The proposed method enhances the performance of existing state-of-the-art style transfer TTS models and represents the first study of style MoE in TTS.
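
As a rough illustration of the idea in the abstract (a gating network routing reference speech to style experts, whose outputs form the style embedding), the following is a minimal sketch in plain Python. It is not the authors' implementation. The class name `StyleMoESketch`, the single-linear-map experts, the top-k routing rule, the dimensions, and the input features are all assumptions made for illustration; the abstract does not specify any of them.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, vector):
    """Multiply a (rows x cols) weight matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def random_matrix(rows, cols, rng):
    """Random weights standing in for trained parameters."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


class StyleMoESketch:
    """Hypothetical stand-in for a style encoder built from several experts."""

    def __init__(self, in_dim, style_dim, num_experts, top_k, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        # Assumption: each expert is one linear map; a real expert would be a full encoder.
        self.experts = [random_matrix(style_dim, in_dim, rng) for _ in range(num_experts)]
        # The gating network produces one score per expert for a given reference.
        self.gate = random_matrix(num_experts, in_dim, rng)

    def route(self, reference):
        """Return (expert_index, weight) pairs for the top-k scoring experts."""
        scores = linear(self.gate, reference)
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        chosen = ranked[: self.top_k]
        # Assumption: weights are a softmax over the selected experts only.
        weights = softmax([scores[i] for i in chosen])
        return list(zip(chosen, weights))

    def encode(self, reference):
        """Weighted sum of the selected experts' style embeddings."""
        style = [0.0] * len(self.experts[0])
        for index, weight in self.route(reference):
            expert_out = linear(self.experts[index], reference)
            style = [s + weight * e for s, e in zip(style, expert_out)]
        return style


moe = StyleMoESketch(in_dim=4, style_dim=3, num_experts=4, top_k=2)
reference_features = [0.2, -0.5, 0.9, 0.1]  # made-up pooled reference-speech features
routing = moe.route(reference_features)
style_embedding = moe.encode(reference_features)
print(routing)
print(style_embedding)
```

Because only the selected experts process a given reference, training such a layer would update each expert mainly on the references routed to it, which is one way to read the abstract's claim that experts "learn from subsets of data".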

Related papers

Spotlight-TTS: Spotlighting the Style via Voiced-Aware Style Extraction and Style Direction Adjustment for Expressive Text-to-Speech [26.656512860918262]
We propose Spotlight-TTS, which emphasizes style via voiced-aware style extraction and style direction adjustment. We adjust the direction of the extracted style for optimal integration into the TTS model, which improves speech quality.
arXiv Detail & Related papers (2025-05-27T08:20:01Z)
Masked Language Prompting for Generative Data Augmentation in Few-shot Fashion Style Recognition [1.03590082373586]
A dataset for fashion style recognition is challenging due to the inherent subjectivity and ambiguity of style concepts. Recent advances in text-to-image models have facilitated generative data augmentation by synthesizing images from labeled data. We propose Masked Language Prompting (MLP), a novel prompting strategy that masks selected words in a reference caption and leverages large language models to generate diverse yet semantically coherent completions.
arXiv Detail & Related papers (2025-04-28T03:42:42Z)
StyleCrafter: Enhancing Stylized Text-to-Video Generation with Style Adapter [78.75422651890776]
StyleCrafter is a generic method that enhances pre-trained T2V models with a style control adapter. To promote content-style disentanglement, we remove style descriptions from the text prompt and extract style information solely from the reference image. StyleCrafter efficiently generates high-quality stylized videos that align with the content of the texts and resemble the style of the reference images.
arXiv Detail & Related papers (2023-12-01T03:53:21Z)
Expressive TTS Driven by Natural Language Prompts Using Few Human Annotations [12.891344121936902]
Expressive text-to-speech (TTS) aims to synthesize speech with human-like tones, moods, or even artistic attributes. Recent advancements in TTS empower users with the ability to directly control synthesis style through natural language prompts. We present FreeStyleTTS (FS-TTS), a controllable expressive TTS model with minimal human annotations.
arXiv Detail & Related papers (2023-11-02T14:20:37Z)
ParaGuide: Guided Diffusion Paraphrasers for Plug-and-Play Textual Style Transfer [57.6482608202409]
Textual style transfer is the task of transforming stylistic properties of text while preserving meaning. We introduce a novel diffusion-based framework for general-purpose style transfer that can be flexibly adapted to arbitrary target styles. We validate the method on the Enron Email Corpus, with both human and automatic evaluations, and find that it outperforms strong baselines on formality, sentiment, and even authorship style transfer.
arXiv Detail & Related papers (2023-08-29T17:36:02Z)
StylerDALLE: Language-Guided Style Transfer Using a Vector-Quantized Tokenizer of a Large-Scale Generative Model [64.26721402514957]
We propose StylerDALLE, a style transfer method that uses natural language to describe abstract art styles. Specifically, we formulate the language-guided style transfer task as a non-autoregressive token sequence translation. To incorporate style information, we propose a Reinforcement Learning strategy with CLIP-based language supervision.
arXiv Detail & Related papers (2023-03-16T12:44:44Z)
Conversation Style Transfer using Few-Shot Learning [56.43383396058639]
In this paper, we introduce conversation style transfer as a few-shot learning problem. We propose a novel in-context learning approach to solve the task with style-free dialogues as a pivot. We show that conversation style transfer can also benefit downstream tasks.
arXiv Detail & Related papers (2023-02-16T15:27:00Z)
Text-driven Emotional Style Control and Cross-speaker Style Transfer in Neural TTS [7.384726530165295]
Style control of synthetic speech is often restricted to discrete emotion categories. We propose a text-based interface for emotional style control and cross-speaker style transfer in multi-speaker TTS.
arXiv Detail & Related papers (2022-07-13T07:05:44Z)
GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech Synthesis [68.42632589736881]
This paper proposes GenerSpeech, a text-to-speech model towards high-fidelity zero-shot style transfer of OOD custom voice. GenerSpeech decomposes the speech variation into the style-agnostic and style-specific parts by introducing two components. Our evaluations on zero-shot style transfer demonstrate that GenerSpeech surpasses the state-of-the-art models in terms of audio quality and style similarity.
arXiv Detail & Related papers (2022-05-15T08:16:02Z)
Fine-grained style control in Transformer-based Text-to-speech Synthesis [78.92428622630861]
We present a novel architecture to realize fine-grained style control on the Transformer-based text-to-speech synthesis (TransformerTTS). We model the speaking style by extracting a time sequence of local style tokens (LST) from the reference speech. Experiments show that with fine-grained style control, our system performs better in terms of naturalness, intelligibility, and style transferability.
arXiv Detail & Related papers (2021-10-12T19:50:02Z)
