EmoShift: Lightweight Activation Steering for Enhanced Emotion-Aware Speech Synthesis
- URL: http://arxiv.org/abs/2601.22873v1
- Date: Fri, 30 Jan 2026 11:50:23 GMT
- Title: EmoShift: Lightweight Activation Steering for Enhanced Emotion-Aware Speech Synthesis
- Authors: Li Zhou, Hao Jiang, Junjie Li, Tianrui Wang, Haizhou Li
- Abstract summary: We present EmoShift, a lightweight activation-steering framework incorporating an EmoSteer layer. EmoShift learns a steering vector for each target emotion in the output embedding space to capture its latent offset and maintain stable, appropriate expression. With only 10M trainable parameters, less than 1/30 of full fine-tuning, EmoShift outperforms zero-shot and fully fine-tuned baselines in objective and subjective evaluations.
- Score: 36.831497786147864
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Achieving precise and controllable emotional expression is crucial for producing natural and context-appropriate speech in text-to-speech (TTS) synthesis. However, many emotion-aware TTS systems, including large language model (LLM)-based designs, rely on scaling fixed emotion embeddings or external guidance, limiting their ability to model emotion-specific latent characteristics. To address this gap, we present EmoShift, a lightweight activation-steering framework incorporating an EmoSteer layer, which learns a steering vector for each target emotion in the output embedding space to capture its latent offset and maintain stable, appropriate expression across utterances and categories. With only 10M trainable parameters, less than 1/30 of full fine-tuning, EmoShift outperforms zero-shot and fully fine-tuned baselines in objective and subjective evaluations, enhancing emotional expressiveness while preserving naturalness and speaker similarity. Further analysis confirms the proposed EmoSteer layer's effectiveness and reveals its potential for controllable emotional intensity in speech synthesis.
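The core mechanism described in the abstract, adding a learned per-emotion steering vector to the backbone's output embeddings, can be sketched as follows. This is a minimal illustrative sketch assuming a PyTorch-style frozen TTS backbone; the class and parameter names (EmoSteerLayer, hidden_dim, alpha) are assumptions for illustration, not the paper's actual implementation.

```python
# Minimal sketch of an activation-steering layer in the spirit of EmoShift.
# All names (EmoSteerLayer, hidden_dim, alpha) are illustrative assumptions,
# not the authors' implementation.
import torch
import torch.nn as nn


class EmoSteerLayer(nn.Module):
    """Adds a learned per-emotion steering vector to the backbone's
    output embeddings, leaving the frozen TTS model untouched."""

    def __init__(self, num_emotions: int, hidden_dim: int):
        super().__init__()
        # One steering vector per target emotion (the only trainable parameters).
        self.steering = nn.Embedding(num_emotions, hidden_dim)
        # Learnable scale controlling the steering strength (assumed, not from the paper).
        self.alpha = nn.Parameter(torch.ones(1))

    def forward(self, hidden: torch.Tensor, emotion_id: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, hidden_dim) activations from the frozen backbone.
        # emotion_id: (batch,) integer label of the target emotion.
        offset = self.steering(emotion_id).unsqueeze(1)  # (batch, 1, hidden_dim)
        return hidden + self.alpha * offset              # shift activations toward the emotion


# Usage: steer frozen-backbone activations toward emotion id 2 (arbitrary label).
layer = EmoSteerLayer(num_emotions=5, hidden_dim=1024)
h = torch.randn(2, 50, 1024)                 # dummy backbone activations
steered = layer(h, torch.tensor([2, 2]))
print(steered.shape)                         # torch.Size([2, 50, 1024])
```

Scaling the assumed alpha parameter at inference time is one plausible way such a layer could expose the controllable emotional intensity mentioned at the end of the abstract.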
Related papers
- CoCoEmo: Composable and Controllable Human-Like Emotional TTS via Activation Steering [25.10244503397448]
Emotional expression in human speech is nuanced and compositional, often involving multiple, conflicting affective cues.
Most expressive text-to-speech systems enforce a single utterance-level emotion, collapsing affective diversity and suppressing mixed or text-emotion-misaligned expression.
This paper introduces a quantitative, controllable steering framework and multi-rater evaluation protocols that enable composable mixed-emotion synthesis and reliable text-emotion mismatch synthesis.
arXiv Detail & Related papers (2026-02-03T11:45:00Z) - RLAIF-SPA: Optimizing LLM-based Emotional Speech Synthesis via RLAIF [23.474332076771308]
Text-To-Speech synthesis has achieved near-human quality in neutral speech, but emotional expressiveness remains a challenge.
We propose the RLAIF-SPA framework, which incorporates a Reinforcement Learning from AI Feedback mechanism employing Automatic Speech Recognition (ASR) and Large Language Model (LLM) techniques.
Experiments on the LibriSpeech dataset show that RLAIF-SPA outperforms Chat-TTS, with a 26.1% reduction in WER, a 9.1% increase in SIM-O, and over 10% improvement in human evaluation.
arXiv Detail & Related papers (2025-10-16T12:40:37Z) - EmoCAST: Emotional Talking Portrait via Emotive Text Description [56.42674612728354]
EmoCAST is a diffusion-based framework for precise text-driven emotional synthesis.
In appearance modeling, emotional prompts are integrated through a text-guided decoupled emotive module.
EmoCAST achieves state-of-the-art performance in generating realistic, emotionally expressive, and audio-synchronized talking-head videos.
arXiv Detail & Related papers (2025-08-28T10:02:06Z) - UDDETTS: Unifying Discrete and Dimensional Emotions for Controllable Emotional Text-to-Speech [61.989360995528905]
We propose UDDETTS, a universal framework unifying discrete and dimensional emotions for controllable emotional TTS.
This model introduces the interpretable Arousal-Dominance-Valence (ADV) space for dimensional emotion description and supports emotion control driven by either discrete emotion labels or nonlinearly quantified ADV values.
Experiments show that UDDETTS achieves linear emotion control along three interpretable dimensions, and exhibits superior end-to-end emotional speech synthesis capabilities.
arXiv Detail & Related papers (2025-05-15T12:57:19Z) - When Words Smile: Generating Diverse Emotional Facial Expressions from Text [77.1867389815291]
We introduce an end-to-end text-to-expression model that explicitly focuses on emotional dynamics.
Our model learns expressive facial variations in a continuous latent space and generates expressions that are diverse, fluid, and emotionally coherent.
arXiv Detail & Related papers (2024-12-03T15:39:05Z) - EmoSphere++: Emotion-Controllable Zero-Shot Text-to-Speech via Emotion-Adaptive Spherical Vector [26.656512860918262]
EmoSphere++ is an emotion-controllable zero-shot TTS model that can control emotional style and intensity to resemble natural human speech.
We introduce a novel emotion-adaptive spherical vector that models emotional style and intensity without human annotation.
We employ a conditional flow matching-based decoder to achieve high-quality and expressive emotional TTS in a few sampling steps.
arXiv Detail & Related papers (2024-11-04T21:33:56Z) - EmoKnob: Enhance Voice Cloning with Fine-Grained Emotion Control [7.596581158724187]
EmoKnob is a framework that allows fine-grained emotion control in speech synthesis with few-shot demonstrative samples of arbitrary emotion.
We show that our emotion control framework effectively embeds emotions into speech and surpasses emotion expressiveness of commercial TTS services.
arXiv Detail & Related papers (2024-10-01T01:29:54Z) - EmoSphere-TTS: Emotional Style and Intensity Modeling via Spherical Emotion Vector for Controllable Emotional Text-to-Speech [34.03787613163788]
EmoSphere-TTS synthesizes expressive emotional speech by using a spherical emotion vector to control the emotional style and intensity of the synthetic speech.
We propose a dual conditional adversarial network to improve the quality of generated speech by reflecting the multi-aspect characteristics.
arXiv Detail & Related papers (2024-06-12T01:40:29Z) - UMETTS: A Unified Framework for Emotional Text-to-Speech Synthesis with Multimodal Prompts [64.02363948840333]
UMETTS is a novel framework that leverages emotional cues from multiple modalities to generate highly expressive and emotionally resonant speech.
EP-Align employs contrastive learning to align emotional features across text, audio, and visual modalities, ensuring a coherent fusion of multimodal information.
EMI-TTS integrates the aligned emotional embeddings with state-of-the-art TTS models to synthesize speech that accurately reflects the intended emotions.
arXiv Detail & Related papers (2024-04-29T03:19:39Z) - Enhancing Emotional Generation Capability of Large Language Models via Emotional Chain-of-Thought [50.13429055093534]
Large Language Models (LLMs) have shown remarkable performance in various emotion recognition tasks.
We propose the Emotional Chain-of-Thought (ECoT) to enhance the performance of LLMs on various emotional generation tasks.
arXiv Detail & Related papers (2024-01-12T16:42:10Z) - Emotion Intensity and its Control for Emotional Voice Conversion [77.05097999561298]
Emotional voice conversion (EVC) seeks to convert the emotional state of an utterance while preserving the linguistic content and speaker identity.
In this paper, we aim to explicitly characterize and control the intensity of emotion.
We propose to disentangle the speaker style from linguistic content and encode the speaker style into a style embedding in a continuous space that forms the prototype of emotion embedding.
arXiv Detail & Related papers (2022-01-10T02:11:25Z)