Related papers: Large Language Models Meet Contrastive Learning: Zero-Shot Emotion Recognition Across Languages

URL: http://arxiv.org/abs/2503.21806v1
Date: Tue, 25 Mar 2025 05:58:18 GMT
Title: Large Language Models Meet Contrastive Learning: Zero-Shot Emotion Recognition Across Languages
Authors: Heqing Zou, Fengmao Lv, Desheng Zheng, Eng Siong Chng, Deepu Rajan
Abstract summary: We propose leveraging contrastive learning to refine multilingual speech features and extend large language models. Specifically, we employ a novel two-stage training framework to align speech signals with linguistic features in the emotional space. To advance research in this field, we introduce a large-scale synthetic multilingual speech emotion dataset, M5SER.
Score: 31.15696076055884
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multilingual speech emotion recognition aims to estimate a speaker's emotional state using a contactless method across different languages. However, variability in voice characteristics and linguistic diversity poses significant challenges for zero-shot speech emotion recognition, especially with multilingual datasets. In this paper, we propose leveraging contrastive learning to refine multilingual speech features and extend large language models for zero-shot multilingual speech emotion estimation. Specifically, we employ a novel two-stage training framework to align speech signals with linguistic features in the emotional space, capturing both emotion-aware and language-agnostic speech representations. To advance research in this field, we introduce a large-scale synthetic multilingual speech emotion dataset, M5SER. Our experiments demonstrate the effectiveness of the proposed method in both speech emotion recognition and zero-shot multilingual speech emotion recognition, including previously unseen datasets and languages.

Related papers

EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions [152.41217651729738]
We propose the EMOVA (EMotionally Omni-present Voice Assistant) to enable Large Language Models with end-to-end speech abilities. With a semantic-acoustic disentangled speech tokenizer, we surprisingly notice that omni-modal alignment can further enhance vision-language and speech abilities. For the first time, EMOVA achieves state-of-the-art performance on both the vision-language and speech benchmarks.
arXiv Detail & Related papers (2024-09-26T16:44:02Z)
CLARA: Multilingual Contrastive Learning for Audio Representation Acquisition [5.520654376217889]
CLARA minimizes reliance on labelled data, enhancing generalization across languages. Our approach adeptly captures emotional nuances in speech, overcoming subjective assessment issues. It adapts to low-resource languages, marking progress in multilingual speech representation learning.
arXiv Detail & Related papers (2023-10-18T09:31:56Z)
Effect of Attention and Self-Supervised Speech Embeddings on Non-Semantic Speech Tasks [3.570593982494095]
We look at speech emotion understanding as a perception task, which is a more realistic setting. We leverage the rich ComParE dataset of multilingual speakers and a multi-label regression target of 'emotion share', or perception of that emotion. Our results show that HuBERT-Large with a self-attention-based light-weight sequence model provides a 4.6% improvement over the reported baseline.
arXiv Detail & Related papers (2023-08-28T07:11:27Z)
AffectEcho: Speaker Independent and Language-Agnostic Emotion and Affect Transfer for Speech Synthesis [13.918119853846838]
Affect is an emotional characteristic encompassing valence, arousal, and intensity, and is a crucial attribute for enabling authentic conversations. We propose AffectEcho, an emotion translation model, that uses a Vector Quantized codebook to model emotions within a quantized space. We demonstrate the effectiveness of our approach in controlling the emotions of generated speech while preserving identity, style, and emotional cadence unique to each speaker.
arXiv Detail & Related papers (2023-08-16T06:28:29Z)
Cross-Lingual Cross-Age Group Adaptation for Low-Resource Elderly Speech Emotion Recognition [48.29355616574199]
We analyze the transferability of emotion recognition across three different languages--English, Mandarin Chinese, and Cantonese. This study concludes that different language and age groups require specific speech features, thus making cross-lingual inference an unsuitable method.
arXiv Detail & Related papers (2023-06-26T08:48:08Z)
MoLE : Mixture of Language Experts for Multi-Lingual Automatic Speech Recognition [12.23416994447554]
We present a multi-lingual speech recognition network named Mixture-of-Language-Expert (MoLE). MoLE analyzes linguistic expression from input speech in arbitrary languages, activating a language-specific expert with a lightweight language tokenizer. Based on the reliability, the activated expert and the language-agnostic expert are aggregated to represent a language-conditioned embedding.
arXiv Detail & Related papers (2023-02-27T13:26:17Z)
Multilingual Speech Emotion Recognition With Multi-Gating Mechanism and Neural Architecture Search [15.51730246937201]
Speech emotion recognition (SER) classifies audio into emotion categories such as Happy, Angry, Fear, Disgust and Neutral. This paper proposes a language-specific model that extracts emotional information from multiple pre-trained speech models. Our model raises the state-of-the-art accuracy by 3% for German and 14.3% for French.
arXiv Detail & Related papers (2022-10-31T19:55:33Z)
Perception Point: Identifying Critical Learning Periods in Speech for Bilingual Networks [58.24134321728942]
We compare and identify cognitive aspects on deep neural-based visual lip-reading models. We observe a strong correlation between these theories in cognitive psychology and our unique modeling.
arXiv Detail & Related papers (2021-10-13T05:30:50Z)
Exploring Teacher-Student Learning Approach for Multi-lingual Speech-to-Intent Classification [73.5497360800395]
We develop an end-to-end system that supports multiple languages. We exploit knowledge from a pre-trained multi-lingual natural language processing model.
arXiv Detail & Related papers (2021-09-28T04:43:11Z)
Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representation from multilingual pre-trained models and conduct linguistic analysis. We cluster all the target languages into multiple groups and name each group as a representation sprachbund. Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z)
AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context. It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts. Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z)
Limited Data Emotional Voice Conversion Leveraging Text-to-Speech: Two-stage Sequence-to-Sequence Training [91.95855310211176]
Emotional voice conversion aims to change the emotional state of an utterance while preserving the linguistic content and speaker identity. We propose a novel 2-stage training strategy for sequence-to-sequence emotional voice conversion with a limited amount of emotional speech data. The proposed framework can perform both spectrum and prosody conversion and achieves significant improvement over the state-of-the-art baselines in both objective and subjective evaluation.
arXiv Detail & Related papers (2021-03-31T04:56:14Z)

This list is automatically generated from the titles and abstracts of the papers on this site.