Large Language Models Meet Contrastive Learning: Zero-Shot Emotion Recognition Across Languages
- URL: http://arxiv.org/abs/2503.21806v1
- Date: Tue, 25 Mar 2025 05:58:18 GMT
- Title: Large Language Models Meet Contrastive Learning: Zero-Shot Emotion Recognition Across Languages
- Authors: Heqing Zou, Fengmao Lv, Desheng Zheng, Eng Siong Chng, Deepu Rajan,
- Abstract summary: We propose leveraging contrastive learning to refine multilingual speech features and extend large language models.<n>Specifically, we employ a novel two-stage training framework to align speech signals with linguistic features in the emotional space.<n>To advance research in this field, we introduce a large-scale synthetic multilingual speech emotion dataset, M5SER.
- Score: 31.15696076055884
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multilingual speech emotion recognition aims to estimate a speaker's emotional state using a contactless method across different languages. However, variability in voice characteristics and linguistic diversity poses significant challenges for zero-shot speech emotion recognition, especially with multilingual datasets. In this paper, we propose leveraging contrastive learning to refine multilingual speech features and extend large language models for zero-shot multilingual speech emotion estimation. Specifically, we employ a novel two-stage training framework to align speech signals with linguistic features in the emotional space, capturing both emotion-aware and language-agnostic speech representations. To advance research in this field, we introduce a large-scale synthetic multilingual speech emotion dataset, M5SER. Our experiments demonstrate the effectiveness of the proposed method in both speech emotion recognition and zero-shot multilingual speech emotion recognition, including previously unseen datasets and languages.
Related papers
- EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions [152.41217651729738]
We propose the EMOVA (EMotionally Omni-present Voice Assistant) to enable Large Language Models with end-to-end speech abilities.<n>With a semantic-acoustic disentangled speech tokenizer, we surprisingly notice that omni-modal alignment can further enhance vision-language and speech abilities.<n>For the first time, EMOVA achieves state-of-the-art performance on both the vision-language and speech benchmarks.
arXiv Detail & Related papers (2024-09-26T16:44:02Z) - CLARA: Multilingual Contrastive Learning for Audio Representation
Acquisition [5.520654376217889]
CLARA minimizes reliance on labelled data, enhancing generalization across languages.
Our approach adeptly captures emotional nuances in speech, overcoming subjective assessment issues.
It adapts to low-resource languages, marking progress in multilingual speech representation learning.
arXiv Detail & Related papers (2023-10-18T09:31:56Z) - Effect of Attention and Self-Supervised Speech Embeddings on
Non-Semantic Speech Tasks [3.570593982494095]
We look at speech emotion understanding as a perception task which is a more realistic setting.
We leverage ComParE rich dataset of multilingual speakers and multi-label regression target of 'emotion share' or perception of that emotion.
Our results show that HuBERT-Large with a self-attention-based light-weight sequence model provides 4.6% improvement over the reported baseline.
arXiv Detail & Related papers (2023-08-28T07:11:27Z) - AffectEcho: Speaker Independent and Language-Agnostic Emotion and Affect
Transfer for Speech Synthesis [13.918119853846838]
Affect is an emotional characteristic encompassing valence, arousal, and intensity, and is a crucial attribute for enabling authentic conversations.
We propose AffectEcho, an emotion translation model, that uses a Vector Quantized codebook to model emotions within a quantized space.
We demonstrate the effectiveness of our approach in controlling the emotions of generated speech while preserving identity, style, and emotional cadence unique to each speaker.
arXiv Detail & Related papers (2023-08-16T06:28:29Z) - Cross-Lingual Cross-Age Group Adaptation for Low-Resource Elderly Speech
Emotion Recognition [48.29355616574199]
We analyze the transferability of emotion recognition across three different languages--English, Mandarin Chinese, and Cantonese.
This study concludes that different language and age groups require specific speech features, thus making cross-lingual inference an unsuitable method.
arXiv Detail & Related papers (2023-06-26T08:48:08Z) - MoLE : Mixture of Language Experts for Multi-Lingual Automatic Speech
Recognition [12.23416994447554]
We present a multi-lingual speech recognition network named Mixture-of-Language-Expert(MoLE)
MoLE analyzes linguistic expression from input speech in arbitrary languages, activating a language-specific expert with a lightweight language tokenizer.
Based on the reliability, the activated expert and the language-agnostic expert are aggregated to represent language-conditioned embedding.
arXiv Detail & Related papers (2023-02-27T13:26:17Z) - Multilingual Speech Emotion Recognition With Multi-Gating Mechanism and
Neural Architecture Search [15.51730246937201]
Speech emotion recognition (SER) classifies audio into emotion categories such as Happy, Angry, Fear, Disgust and Neutral.
This paper proposes a language-specific model that extract emotional information from multiple pre-trained speech models.
Our model raises the state-of-the-art accuracy by 3% for German and 14.3% for French.
arXiv Detail & Related papers (2022-10-31T19:55:33Z) - Perception Point: Identifying Critical Learning Periods in Speech for
Bilingual Networks [58.24134321728942]
We compare and identify cognitive aspects on deep neural-based visual lip-reading models.
We observe a strong correlation between these theories in cognitive psychology and our unique modeling.
arXiv Detail & Related papers (2021-10-13T05:30:50Z) - Exploring Teacher-Student Learning Approach for Multi-lingual
Speech-to-Intent Classification [73.5497360800395]
We develop an end-to-end system that supports multiple languages.
We exploit knowledge from a pre-trained multi-lingual natural language processing model.
arXiv Detail & Related papers (2021-09-28T04:43:11Z) - Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representation from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and name each group as a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z) - AM2iCo: Evaluating Word Meaning in Context across Low-ResourceLanguages
with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z) - Limited Data Emotional Voice Conversion Leveraging Text-to-Speech:
Two-stage Sequence-to-Sequence Training [91.95855310211176]
Emotional voice conversion aims to change the emotional state of an utterance while preserving the linguistic content and speaker identity.
We propose a novel 2-stage training strategy for sequence-to-sequence emotional voice conversion with a limited amount of emotional speech data.
The proposed framework can perform both spectrum and prosody conversion and achieves significant improvement over the state-of-the-art baselines in both objective and subjective evaluation.
arXiv Detail & Related papers (2021-03-31T04:56:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.