Beyond Global Emotion: Fine-Grained Emotional Speech Synthesis with Dynamic Word-Level Modulation
- URL: http://arxiv.org/abs/2509.20378v1
- Date: Sat, 20 Sep 2025 14:26:15 GMT
- Title: Beyond Global Emotion: Fine-Grained Emotional Speech Synthesis with Dynamic Word-Level Modulation
- Authors: Sirui Wang, Andong Chen, Tiejun Zhao
- Abstract summary: Emotional text-to-speech (E-TTS) is central to creating natural and trustworthy human-computer interaction. We introduce Emo-FiLM, a fine-grained emotion modeling framework for LLM-based TTS. Emo-FiLM aligns frame-level features from emotion2vec to words to obtain word-level emotion annotations, and maps them through a Feature-wise Linear Modulation (FiLM) layer, enabling word-level emotion control by directly modulating text embeddings.
- Score: 27.668177917370144
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Emotional text-to-speech (E-TTS) is central to creating natural and trustworthy human-computer interaction. Existing systems typically rely on sentence-level control through predefined labels, reference audio, or natural language prompts. While effective for global emotion expression, these approaches fail to capture dynamic shifts within a sentence. To address this limitation, we introduce Emo-FiLM, a fine-grained emotion modeling framework for LLM-based TTS. Emo-FiLM aligns frame-level features from emotion2vec to words to obtain word-level emotion annotations, and maps them through a Feature-wise Linear Modulation (FiLM) layer, enabling word-level emotion control by directly modulating text embeddings. To support evaluation, we construct the Fine-grained Emotion Dynamics Dataset (FEDD) with detailed annotations of emotional transitions. Experiments show that Emo-FiLM outperforms existing approaches on both global and fine-grained tasks, demonstrating its effectiveness and generality for expressive speech synthesis.
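To make the core mechanism concrete, here is a minimal PyTorch sketch of word-level FiLM modulation as the abstract describes it: each word's emotion vector (e.g., emotion2vec features pooled over the frames aligned to that word) produces a feature-wise scale and shift applied to the word's text embedding before it enters the TTS backbone. The class name, dimensions, and pooling are illustrative assumptions, not the paper's actual code.

```python
import torch
import torch.nn as nn

class WordLevelFiLM(nn.Module):
    """Sketch of a FiLM layer conditioning text embeddings on
    word-level emotion vectors (names and dims are assumptions)."""

    def __init__(self, emo_dim: int, text_dim: int):
        super().__init__()
        # A single linear map yields both scale (gamma) and shift (beta).
        self.to_gamma_beta = nn.Linear(emo_dim, 2 * text_dim)

    def forward(self, text_emb: torch.Tensor, emo_emb: torch.Tensor) -> torch.Tensor:
        # text_emb: (batch, n_words, text_dim) word-level text embeddings
        # emo_emb:  (batch, n_words, emo_dim) per-word emotion features,
        #           e.g. emotion2vec frames mean-pooled per aligned word
        gamma, beta = self.to_gamma_beta(emo_emb).chunk(2, dim=-1)
        # Feature-wise affine modulation: each word is scaled and
        # shifted by its own emotion condition.
        return gamma * text_emb + beta

# Example: modulate 12 word embeddings with per-word emotion vectors.
film = WordLevelFiLM(emo_dim=768, text_dim=1024)
text = torch.randn(2, 12, 1024)
emo = torch.randn(2, 12, 768)
out = film(text, emo)  # (2, 12, 1024), ready for the TTS backbone
```

Because FiLM applies only a per-word affine transform, the modulation is cheap and leaves the backbone architecture untouched, which is what makes word-level control practical on top of an LLM-based TTS.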
Related papers
- EmoShift: Lightweight Activation Steering for Enhanced Emotion-Aware Speech Synthesis [36.831497786147864]
We present EmoShift, a lightweight activation-steering framework incorporating an EmoSteer layer. EmoShift learns a steering vector for each target emotion in the output embedding space to capture its latent offset and maintain stable, appropriate expression (a minimal sketch of this idea appears after the list below). With only 10M trainable parameters, less than 1/30 of full fine-tuning, EmoShift outperforms zero-shot and fully fine-tuned baselines in objective and subjective evaluations.
arXiv Detail & Related papers (2026-01-30T11:50:23Z) - MUSE: Manipulating Unified Framework for Synthesizing Emotions in Images via Test-Time Optimization [37.71704315894968]
Current Image Emotional Synthesis (IES) approaches artificially separate generation and editing tasks. MUSE is the first unified framework capable of both emotional generation and editing.
arXiv Detail & Related papers (2025-11-26T04:39:25Z) - EmoCAST: Emotional Talking Portrait via Emotive Text Description [56.42674612728354]
EmoCAST is a diffusion-based framework for precise text-driven emotional synthesis. In appearance modeling, emotional prompts are integrated through a text-guided decoupled emotive module. EmoCAST achieves state-of-the-art performance in generating realistic, emotionally expressive, and audio-synchronized talking-head videos.
arXiv Detail & Related papers (2025-08-28T10:02:06Z) - UDDETTS: Unifying Discrete and Dimensional Emotions for Controllable Emotional Text-to-Speech [61.989360995528905]
We propose UDDETTS, a universal framework unifying discrete and dimensional emotions for controllable emotional TTS. The model introduces the interpretable Arousal-Dominance-Valence (ADV) space for dimensional emotion description and supports emotion control driven by either discrete emotion labels or nonlinearly quantified ADV values. Experiments show that UDDETTS achieves linear emotion control along the three interpretable dimensions and exhibits superior end-to-end emotional speech synthesis capabilities.
arXiv Detail & Related papers (2025-05-15T12:57:19Z) - EmoSphere++: Emotion-Controllable Zero-Shot Text-to-Speech via Emotion-Adaptive Spherical Vector [26.656512860918262]
EmoSphere++ is an emotion-controllable zero-shot TTS model that can control emotional style and intensity to resemble natural human speech. We introduce a novel emotion-adaptive spherical vector that models emotional style and intensity without human annotation. We employ a conditional flow matching-based decoder to achieve high-quality and expressive emotional TTS in a few sampling steps.
arXiv Detail & Related papers (2024-11-04T21:33:56Z) - EmoKnob: Enhance Voice Cloning with Fine-Grained Emotion Control [7.596581158724187]
EmoKnob is a framework that allows fine-grained emotion control in speech synthesis with few-shot demonstrative samples of arbitrary emotion.
We show that our emotion control framework effectively embeds emotions into speech and surpasses the emotional expressiveness of commercial TTS services.
arXiv Detail & Related papers (2024-10-01T01:29:54Z) - UMETTS: A Unified Framework for Emotional Text-to-Speech Synthesis with Multimodal Prompts [64.02363948840333]
UMETTS is a novel framework that leverages emotional cues from multiple modalities to generate highly expressive and emotionally resonant speech. Its EP-Align module employs contrastive learning to align emotional features across text, audio, and visual modalities, ensuring a coherent fusion of multimodal information. Its EMI-TTS module integrates the aligned emotional embeddings with state-of-the-art TTS models to synthesize speech that accurately reflects the intended emotions.
arXiv Detail & Related papers (2024-04-29T03:19:39Z) - Attention-based Interactive Disentangling Network for Instance-level Emotional Voice Conversion [81.1492897350032]
Emotional Voice Conversion aims to manipulate speech according to a given emotion while preserving non-emotion components.
We propose an Attention-based Interactive diseNtangling Network (AINN) that leverages instance-wise emotional knowledge for voice conversion.
arXiv Detail & Related papers (2023-12-29T08:06:45Z) - Emotion Rendering for Conversational Speech Synthesis with Heterogeneous Graph-Based Context Modeling [50.99252242917458]
Conversational Speech Synthesis (CSS) aims to accurately express an utterance with the appropriate prosody and emotional inflection within a conversational setting.
To address the issue of data scarcity, we meticulously create emotional labels in terms of category and intensity.
Our model outperforms the baseline models in understanding and rendering emotions.
arXiv Detail & Related papers (2023-12-19T08:47:50Z) - EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model [56.75775793011719]
We introduce and publicly release a Mandarin emotion speech dataset of 9,724 samples with audio files and human-labeled emotion annotations.
Unlike models that need additional reference audio as input, our model predicts emotion labels from the input text alone and generates more expressive speech conditioned on the emotion embedding.
In the experiments, we first validate the effectiveness of the dataset on an emotion classification task, then train our model on the proposed dataset and conduct a series of subjective evaluations.
arXiv Detail & Related papers (2021-06-17T08:34:21Z)
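Returning to the EmoShift entry above, the activation-steering idea it summarizes can be sketched as follows: one trainable offset vector per target emotion, added to the backbone's output embeddings. This is a reading of the abstract only; the class name, scale parameter, and dimensions are hypothetical, not the authors' implementation.

```python
import torch
import torch.nn as nn

class EmoSteer(nn.Module):
    """Illustrative activation-steering layer: a learned offset per
    emotion, added to output embeddings (based on the abstract only)."""

    def __init__(self, num_emotions: int, hidden_dim: int):
        super().__init__()
        # One trainable steering vector per target emotion.
        self.steering = nn.Embedding(num_emotions, hidden_dim)

    def forward(self, hidden: torch.Tensor, emotion_id: torch.Tensor,
                scale: float = 1.0) -> torch.Tensor:
        # hidden: (batch, seq_len, hidden_dim) frozen-backbone activations
        # emotion_id: (batch,) index of the target emotion
        offset = self.steering(emotion_id).unsqueeze(1)  # (batch, 1, hidden)
        # Shift every position's activation toward the target emotion.
        return hidden + scale * offset

# Example: steer 40 frames of activations toward two target emotions.
steer = EmoSteer(num_emotions=5, hidden_dim=1024)
h = torch.randn(2, 40, 1024)
ids = torch.tensor([3, 1])  # hypothetical indices, e.g. "happy", "sad"
h_steered = steer(h, ids, scale=0.8)  # (2, 40, 1024)
```

Only the small embedding table is trained here, which is consistent with the abstract's claim of a trainable-parameter budget far below full fine-tuning.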
This list is automatically generated from the titles and abstracts of the papers on this site.