Spotlight-TTS: Spotlighting the Style via Voiced-Aware Style Extraction and Style Direction Adjustment for Expressive Text-to-Speech
- URL: http://arxiv.org/abs/2505.20868v2
- Date: Sun, 29 Jun 2025 12:42:04 GMT
- Title: Spotlight-TTS: Spotlighting the Style via Voiced-Aware Style Extraction and Style Direction Adjustment for Expressive Text-to-Speech
- Authors: Nam-Gyu Kim, Deok-Hyeon Cho, Seung-Bin Kim, Seong-Whan Lee
- Abstract summary: We propose Spotlight-TTS, which emphasizes style via voiced-aware style extraction and style direction adjustment. We adjust the direction of the extracted style for optimal integration into the TTS model, which improves speech quality.
- Score: 26.656512860918262
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Recent advances in expressive text-to-speech (TTS) have introduced diverse methods based on style embedding extracted from reference speech. However, synthesizing high-quality expressive speech remains challenging. We propose Spotlight-TTS, which exclusively emphasizes style via voiced-aware style extraction and style direction adjustment. Voiced-aware style extraction focuses on voiced regions highly related to style while maintaining continuity across different speech regions to improve expressiveness. We adjust the direction of the extracted style for optimal integration into the TTS model, which improves speech quality. Experimental results demonstrate that Spotlight-TTS achieves superior performance compared to baseline models in terms of expressiveness, overall speech quality, and style transfer capability. Our audio samples are publicly available.
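The abstract gives no implementation details, so the following is only a minimal sketch of the two ideas under stated assumptions: voicing is read off an F0 track, "voiced-aware" extraction is approximated by smoothly weighting voiced frames before pooling, and "style direction adjustment" is approximated by re-scaling the pooled vector to a fixed norm so that only its direction conditions the decoder. All function names and parameters are hypothetical, not the authors' code.

```python
import numpy as np

def extract_voiced_aware_style(frame_feats: np.ndarray,
                               f0: np.ndarray,
                               smooth: float = 0.9) -> np.ndarray:
    """Pool frame-level style features, weighting voiced frames more heavily.

    frame_feats: (T, D) per-frame style features from a reference encoder.
    f0:          (T,) fundamental frequency; 0 marks unvoiced frames.
    smooth:      exponential smoothing factor that keeps the weighting
                 continuous across voiced/unvoiced boundaries.
    """
    voiced = (f0 > 0).astype(np.float32)          # hard voiced mask
    weight = np.empty_like(voiced)
    acc = voiced[0]
    for t, v in enumerate(voiced):                # smooth the mask over time
        acc = smooth * acc + (1.0 - smooth) * v
        weight[t] = acc
    weight = weight / (weight.sum() + 1e-8)
    return (frame_feats * weight[:, None]).sum(axis=0)   # (D,) style vector

def adjust_style_direction(style: np.ndarray, target_norm: float = 1.0) -> np.ndarray:
    """Re-scale the style vector to a fixed norm so only its direction
    (not its magnitude) conditions the TTS decoder."""
    return target_norm * style / (np.linalg.norm(style) + 1e-8)

# Toy usage with random features standing in for a real reference encoder.
T, D = 200, 128
feats = np.random.randn(T, D).astype(np.float32)
f0 = np.where(np.random.rand(T) > 0.4, 120.0, 0.0)   # fake voiced/unvoiced F0 track
style = adjust_style_direction(extract_voiced_aware_style(feats, f0))
print(style.shape, np.linalg.norm(style))
```

In a real pipeline the frame features would come from a learned reference encoder and the voiced mask from a pitch tracker; the smoothing term only stands in for the continuity across speech regions that the abstract mentions.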
Related papers
- Towards Better Disentanglement in Non-Autoregressive Zero-Shot Expressive Voice Conversion [53.26424100244925]
Expressive voice conversion aims to transfer both speaker identity and expressive attributes from a target speech to a given source speech.
In this work, we improve over a self-supervised, non-autoregressive framework with a conditional variational autoencoder.
arXiv Detail & Related papers (2025-06-04T14:42:12Z) - MegaTTS 3: Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis [56.25862714128288]
This paper introduces MegaTTS 3, a zero-shot text-to-speech (TTS) system featuring an innovative sparse alignment algorithm.
Specifically, we provide sparse alignment boundaries to MegaTTS 3 to reduce the difficulty of alignment without limiting the search space.
Experiments demonstrate that MegaTTS 3 achieves state-of-the-art zero-shot TTS speech quality and supports highly flexible control over accent intensity.
arXiv Detail & Related papers (2025-02-26T08:22:00Z) - DEX-TTS: Diffusion-based EXpressive Text-to-Speech with Style Modeling on Time Variability [7.005068872406135]
Diffusion-based EXpressive TTS (DEX-TTS) is an acoustic model designed for reference-based speech synthesis with enhanced style representations.
DEX-TTS includes encoders and adapters to handle styles extracted from reference speech.
In addition, we introduce overlapping patchify and convolution-frequency patch embedding strategies to improve DiT-based diffusion networks for TTS.
arXiv Detail & Related papers (2024-06-27T12:39:55Z) - Style Mixture of Experts for Expressive Text-To-Speech Synthesis [7.6732312922460055]
StyleMoE is an approach that addresses the issue of learning averaged style representations in the style encoder.
The proposed method replaces the style encoder in a TTS framework with a Mixture of Experts layer.
Our experiments, both objective and subjective, demonstrate improved style transfer for diverse and unseen reference speech.
arXiv Detail & Related papers (2024-06-05T22:17:47Z) - StyleSpeech: Self-supervised Style Enhancing with VQ-VAE-based Pre-training for Expressive Audiobook Speech Synthesis [63.019962126807116]
The expressive quality of synthesized speech for audiobooks is limited by generalized model architecture and unbalanced style distribution.
We propose a self-supervised style enhancing method with VQ-VAE-based pre-training for expressive audiobook speech synthesis.
arXiv Detail & Related papers (2023-12-19T14:13:26Z) - Any-speaker Adaptive Text-To-Speech Synthesis with Diffusion Models [65.28001444321465]
Grad-StyleSpeech is an any-speaker adaptive TTS framework based on a diffusion model.
It can generate highly natural speech with extremely high similarity to target speakers' voice, given a few seconds of reference speech.
It significantly outperforms speaker-adaptive TTS baselines on English benchmarks.
arXiv Detail & Related papers (2022-11-17T07:17:24Z) - Text-driven Emotional Style Control and Cross-speaker Style Transfer in Neural TTS [7.384726530165295]
Style control of synthetic speech is often restricted to discrete emotion categories.
We propose a text-based interface for emotional style control and cross-speaker style transfer in multi-speaker TTS.
arXiv Detail & Related papers (2022-07-13T07:05:44Z) - StyleTTS: A Style-Based Generative Model for Natural and Diverse Text-to-Speech Synthesis [23.17929822987861]
StyleTTS is a style-based generative model for parallel TTS that can synthesize diverse speech with natural prosody from a reference speech utterance.
Our method significantly outperforms state-of-the-art models on both single and multi-speaker datasets.
arXiv Detail & Related papers (2022-05-30T21:34:40Z) - GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech Synthesis [68.42632589736881]
This paper proposes GenerSpeech, a text-to-speech model towards high-fidelity zero-shot style transfer of OOD custom voice.
GenerSpeech decomposes the speech variation into the style-agnostic and style-specific parts by introducing two components.
Our evaluations on zero-shot style transfer demonstrate that GenerSpeech surpasses the state-of-the-art models in terms of audio quality and style similarity.
arXiv Detail & Related papers (2022-05-15T08:16:02Z) - Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation [63.561944239071615]
StyleSpeech is a new TTS model which synthesizes high-quality speech and adapts to new speakers.
With SALN (Style-Adaptive Layer Normalization), our model effectively synthesizes speech in the style of the target speaker even from a single speech sample.
We extend it to Meta-StyleSpeech by introducing two discriminators trained with style prototypes, and performing episodic training.
arXiv Detail & Related papers (2021-06-06T15:34:11Z) - Towards Multi-Scale Style Control for Expressive Speech Synthesis [60.08928435252417]
The proposed method employs a multi-scale reference encoder to extract both the global-scale utterance-level and the local-scale quasi-phoneme-level style features of the target speech.
During training time, the multi-scale style model could be jointly trained with the speech synthesis model in an end-to-end fashion.
arXiv Detail & Related papers (2021-04-08T05:50:09Z)
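For the multi-scale reference encoder described in the last entry above, a rough illustration (not the authors' implementation) is to pool one global utterance-level vector over all frames and local quasi-phoneme-level vectors over short segments; the fixed segment length used here is an assumption standing in for predicted phoneme boundaries, and the function name is hypothetical.

```python
import numpy as np

def multi_scale_style(frame_feats: np.ndarray, seg_len: int = 20):
    """Toy multi-scale style extraction.

    frame_feats: (T, D) frame-level features from a reference encoder.
    Returns a global utterance-level vector (D,) and local quasi-phoneme-level
    vectors (num_segments, D) obtained by mean-pooling fixed-length segments;
    a real system would pool over predicted phoneme boundaries instead.
    """
    global_style = frame_feats.mean(axis=0)
    T = frame_feats.shape[0]
    local_style = np.stack([
        frame_feats[s:s + seg_len].mean(axis=0)
        for s in range(0, T, seg_len)
    ])
    return global_style, local_style

# Toy usage: 200 frames of 128-dim features.
g, l = multi_scale_style(np.random.randn(200, 128))
print(g.shape, l.shape)
```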
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.