EmotionRankCLAP: Bridging Natural Language Speaking Styles and Ordinal Speech Emotion via Rank-N-Contrast
- URL: http://arxiv.org/abs/2505.23732v1
- Date: Thu, 29 May 2025 17:56:55 GMT
- Authors: Shreeram Suresh Chandra, Lucas Goncalves, Junchen Lu, Carlos Busso, Berrak Sisman
- Abstract summary: We introduce EmotionRankCLAP, a supervised contrastive learning approach that uses dimensional attributes of emotional speech and natural language prompts to jointly capture fine-grained emotion variations. EmotionRankCLAP outperforms existing emotion-CLAP methods in modeling emotion ordinality across modalities, measured via a cross-modal retrieval task.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current emotion-based contrastive language-audio pretraining (CLAP) methods typically learn by naïvely aligning audio samples with corresponding text prompts. Consequently, this approach fails to capture the ordinal nature of emotions, hindering inter-emotion understanding and often resulting in a wide modality gap between the audio and text embeddings due to insufficient alignment. To handle these drawbacks, we introduce EmotionRankCLAP, a supervised contrastive learning approach that uses dimensional attributes of emotional speech and natural language prompts to jointly capture fine-grained emotion variations and improve cross-modal alignment. Our approach utilizes a Rank-N-Contrast objective to learn ordered relationships by contrasting samples based on their rankings in the valence-arousal space. EmotionRankCLAP outperforms existing emotion-CLAP methods in modeling emotion ordinality across modalities, measured via a cross-modal retrieval task.
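The abstract's core idea, ranking samples by their distance in valence-arousal space and contrasting against all samples at least as far from the anchor, can be illustrated with a minimal sketch of the Rank-N-Contrast loss. This is not the authors' released code; the NumPy formulation, the `temperature` value, and the Euclidean label distance are assumptions for illustration:

```python
import numpy as np

def rank_n_contrast_loss(embeddings, labels, temperature=0.1):
    """Sketch of a Rank-N-Contrast objective over continuous labels.

    embeddings: (N, D) array, assumed L2-normalized.
    labels:     (N,) or (N, K) array of continuous targets,
                e.g. valence-arousal coordinates (K = 2).
    """
    labels = np.atleast_2d(labels.T).T            # ensure shape (N, K)
    sim = embeddings @ embeddings.T / temperature  # scaled cosine sims
    # Pairwise distances between labels (ranking signal).
    dist = np.linalg.norm(labels[:, None] - labels[None, :], axis=-1)

    n = len(embeddings)
    loss, count = 0.0, 0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Denominator set for the (i, j) pair: every sample k that is
            # at least as far from anchor i in label space as j is.
            mask = dist[i] >= dist[i, j]
            mask[i] = False
            denom = np.sum(np.exp(sim[i, mask]))
            loss += -np.log(np.exp(sim[i, j]) / denom)
            count += 1
    return loss / count
```

Intuitively, an embedding space whose similarity structure follows the label ordering incurs a lower loss than one that violates it, which is what pushes the learned audio-text space toward emotion ordinality.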
Related papers
- ES4R: Speech Encoding Based on Prepositive Affective Modeling for Empathetic Response Generation [30.006550552714938]
Empathetic speech dialogue requires not only understanding linguistic content but also perceiving rich paralinguistic information. Existing speech-to-speech large language models either rely on ASR transcription or use encoders to extract latent representations. We propose ES4R, a framework for speech-based empathetic response generation.
arXiv Detail & Related papers (2026-01-16T10:26:50Z) - Emotion-Disentangled Embedding Alignment for Noise-Robust and Cross-Corpus Speech Emotion Recognition [7.243987016933212]
This paper introduces a two-step approach to enhance the robustness and generalization of speech emotion recognition models. Improved performance under challenging conditions demonstrates the effectiveness of the proposed method.
arXiv Detail & Related papers (2025-10-10T07:17:07Z) - Emotional Text-To-Speech Based on Mutual-Information-Guided Emotion-Timbre Disentanglement [37.959531845352274]
Experimental results demonstrate that our method outperforms baseline TTS systems in generating natural and emotionally rich speech. This work highlights the potential of disentangled and fine-grained representations in advancing the quality and flexibility of emotional TTS systems.
arXiv Detail & Related papers (2025-10-02T07:03:50Z) - Speaker Style-Aware Phoneme Anchoring for Improved Cross-Lingual Speech Emotion Recognition [58.74986434825755]
Cross-lingual speech emotion recognition is a challenging task due to differences in phonetic variability and speaker-specific expressive styles. We propose a speaker-style aware phoneme anchoring framework that aligns emotional expression at the phonetic and speaker levels. Our method builds emotion-specific speaker communities via graph-based clustering to capture shared speaker traits.
arXiv Detail & Related papers (2025-09-19T21:03:21Z) - From Coarse to Nuanced: Cross-Modal Alignment of Fine-Grained Linguistic Cues and Visual Salient Regions for Dynamic Emotion Recognition [7.362433184546492]
Dynamic Facial Expression Recognition aims to identify human emotions from temporally evolving facial movements. Our method integrates dynamic motion modeling, semantic text refinement, and token-level cross-modal alignment to facilitate the precise localization of emotionally salient features.
arXiv Detail & Related papers (2025-07-16T04:15:06Z) - Prompt-Unseen-Emotion: Zero-shot Expressive Speech Synthesis with Prompt-LLM Contextual Knowledge for Mixed Emotions [38.122477830163255]
This paper proposes a novel prompt-unseen-emotion (PUE) approach to generate unseen emotional speech via emotion-guided prompt learning. Our proposed PUE successfully facilitates expressive speech synthesis of unseen emotions in a zero-shot setting.
arXiv Detail & Related papers (2025-06-03T10:59:22Z) - VAEmo: Efficient Representation Learning for Visual-Audio Emotion with Knowledge Injection [50.57849622045192]
We propose VAEmo, an efficient framework for emotion-centric joint VA representation learning with external knowledge injection. VAEmo achieves state-of-the-art performance with a compact design, highlighting the benefit of unified cross-modal encoding and emotion-aware semantic guidance.
arXiv Detail & Related papers (2025-05-05T03:00:51Z) - GatedxLSTM: A Multimodal Affective Computing Approach for Emotion Recognition in Conversations [35.63053777817013]
GatedxLSTM is a novel multimodal Emotion Recognition in Conversation (ERC) model. It considers voice and transcripts of both the speaker and their conversational partner to identify the most influential sentences driving emotional shifts. It achieves state-of-the-art (SOTA) performance among open-source methods in four-class emotion classification.
arXiv Detail & Related papers (2025-03-26T18:46:18Z) - EmoReg: Directional Latent Vector Modeling for Emotional Intensity Regularization in Diffusion-based Voice Conversion [30.25632448893884]
Emotional Voice Conversion (EVC) aims to convert the discrete emotional state of a given speech utterance from the source emotion to the target. We propose regularizing emotion intensity in the diffusion-based EVC framework to generate precise speech of the target emotion. The effectiveness of the proposed method has been shown against state-of-the-art (SOTA) baselines in terms of subjective and objective evaluations for the English and Hindi languages.
arXiv Detail & Related papers (2024-12-29T05:30:06Z) - UMETTS: A Unified Framework for Emotional Text-to-Speech Synthesis with Multimodal Prompts [64.02363948840333]
UMETTS is a novel framework that leverages emotional cues from multiple modalities to generate highly expressive and emotionally resonant speech. EP-Align employs contrastive learning to align emotional features across text, audio, and visual modalities, ensuring a coherent fusion of multimodal information. EMI-TTS integrates the aligned emotional embeddings with state-of-the-art TTS models to synthesize speech that accurately reflects the intended emotions.
arXiv Detail & Related papers (2024-04-29T03:19:39Z) - Two in One Go: Single-stage Emotion Recognition with Decoupled Subject-context Transformer [78.35816158511523]
We present a single-stage emotion recognition approach, employing a Decoupled Subject-Context Transformer (DSCT) for simultaneous subject localization and emotion classification.
We evaluate our single-stage framework on two widely used context-aware emotion recognition datasets, CAER-S and EMOTIC.
arXiv Detail & Related papers (2024-04-26T07:30:32Z) - Revealing Emotional Clusters in Speaker Embeddings: A Contrastive Learning Strategy for Speech Emotion Recognition [27.098672790099304]
It has been assumed that emotion information is indirectly embedded within speaker embeddings, leading to their under-utilization.
Our study reveals a direct and useful link between emotion and state-of-the-art speaker embeddings in the form of intra-speaker clusters.
We introduce a novel contrastive pretraining approach applied to emotion-unlabeled data for speech emotion recognition.
arXiv Detail & Related papers (2024-01-19T20:31:53Z) - Attention-based Interactive Disentangling Network for Instance-level Emotional Voice Conversion [81.1492897350032]
Emotional Voice Conversion aims to manipulate a speech according to a given emotion while preserving non-emotion components.
We propose an Attention-based Interactive diseNtangling Network (AINN) that leverages instance-wise emotional knowledge for voice conversion.
arXiv Detail & Related papers (2023-12-29T08:06:45Z) - Emotion Rendering for Conversational Speech Synthesis with Heterogeneous Graph-Based Context Modeling [50.99252242917458]
Conversational Speech Synthesis (CSS) aims to accurately express an utterance with the appropriate prosody and emotional inflection within a conversational setting.
To address the issue of data scarcity, we meticulously create emotional labels in terms of category and intensity.
Our model outperforms the baseline models in understanding and rendering emotions.
arXiv Detail & Related papers (2023-12-19T08:47:50Z) - Deep Learning of Segment-level Feature Representation for Speech Emotion Recognition in Conversations [9.432208348863336]
We propose a conversational speech emotion recognition method to deal with capturing attentive contextual dependency and speaker-sensitive interactions.
First, we use a pretrained VGGish model to extract segment-based audio representation in individual utterances.
Second, an attentive bi-directional gated recurrent unit (GRU) models context-sensitive information and explores intra- and inter-speaker dependencies jointly.
arXiv Detail & Related papers (2023-02-05T16:15:46Z) - Reinforcement Learning for Emotional Text-to-Speech Synthesis with Improved Emotion Discriminability [82.39099867188547]
Emotional text-to-speech synthesis (ETTS) has seen much progress in recent years.
We propose a new interactive training paradigm for ETTS, denoted as i-ETTS.
We formulate an iterative training strategy with reinforcement learning to ensure the quality of i-ETTS optimization.
arXiv Detail & Related papers (2021-04-03T13:52:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.