MERaLiON-SER: Robust Speech Emotion Recognition Model for English and SEA Languages
- URL: http://arxiv.org/abs/2511.04914v3
- Date: Thu, 13 Nov 2025 01:38:53 GMT
- Title: MERaLiON-SER: Robust Speech Emotion Recognition Model for English and SEA Languages
- Authors: Hardik B. Sailor, Aw Ai Ti, Chen Fang Yih Nancy, Chiu Ying Lay, Ding Yang, He Yingxu, Jiang Ridong, Li Jingtao, Liao Jingyi, Liu Zhuohan, Lu Yanfeng, Ma Yi, Manas Gupta, Muhammad Huzaifah Bin Md Shahrin, Nabilah Binte Md Johan, Nattadaporn Lertcheva, Pan Chunlei, Pham Minh Duc, Siti Maryam Binte Ahmad Subaidi, Siti Umairah Binte Mohammad Salleh, Sun Shuo, Tarun Kumar Vangani, Wang Qiongqiong, Won Cheng Yi Lewis, Wong Heng Meng Jeremy, Wu Jinyang, Zhang Huayun, Zhang Longyin, Zou Xunlong
- Abstract summary: We present MERaLiON-SER, a robust speech emotion recognition model for English and Southeast Asian languages. The model is trained using a hybrid objective combining weighted categorical cross-entropy and Concordance Correlation Coefficient losses. We show that MERaLiON-SER consistently surpasses both open-source speech encoders and large Audio-LLMs.
- Score: 1.8158194662712928
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present MERaLiON-SER, a robust speech emotion recognition model designed for English and Southeast Asian languages. The model is trained using a hybrid objective combining weighted categorical cross-entropy and Concordance Correlation Coefficient (CCC) losses for joint discrete and dimensional emotion modelling. This dual approach enables the model to capture both discrete emotion categories (such as happy or angry) and fine-grained dimensional attributes, namely arousal (intensity), valence (positivity/negativity), and dominance (sense of control), leading to a more comprehensive and robust representation of human affect. Extensive evaluations across Singaporean languages (English, Chinese, Malay, and Tamil) and other public benchmarks show that MERaLiON-SER consistently surpasses both open-source speech encoders and large Audio-LLMs. These results underscore the importance of specialised speech-only models for accurate paralinguistic understanding and cross-lingual generalisation. Furthermore, the proposed framework provides a foundation for integrating emotion-aware perception into future agentic audio systems, enabling more empathetic and contextually adaptive multimodal reasoning.
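The hybrid objective described in the abstract is straightforward to express in code. Below is a minimal PyTorch sketch, not the authors' implementation: it combines a class-weighted cross-entropy over discrete emotion labels with a CCC loss (1 - CCC, where CCC = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))^2)) averaged over the arousal, valence, and dominance dimensions. The trade-off weight `alpha`, the tensor shapes, and the function names are illustrative assumptions; the abstract does not specify them.

```python
import torch
import torch.nn.functional as F

def ccc_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """1 - CCC for a single emotion dimension (e.g. arousal).

    CCC = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))^2)
    """
    pred_mean, target_mean = pred.mean(), target.mean()
    pred_var = pred.var(unbiased=False)
    target_var = target.var(unbiased=False)
    cov = ((pred - pred_mean) * (target - target_mean)).mean()
    ccc = 2.0 * cov / (pred_var + target_var + (pred_mean - target_mean) ** 2)
    return 1.0 - ccc

def hybrid_ser_loss(logits, labels, dim_preds, dim_targets,
                    class_weights, alpha=1.0):
    """Weighted categorical CE + mean CCC loss over the A/V/D dimensions.

    logits:        (batch, num_classes) discrete-emotion logits
    labels:        (batch,)             integer class targets
    dim_preds:     (batch, 3)           predicted arousal/valence/dominance
    dim_targets:   (batch, 3)           annotated arousal/valence/dominance
    class_weights: (num_classes,)       per-class weights for imbalance
    alpha:         assumed trade-off weight (not given in the abstract)
    """
    ce = F.cross_entropy(logits, labels, weight=class_weights)
    ccc = torch.stack([ccc_loss(dim_preds[:, d], dim_targets[:, d])
                       for d in range(3)]).mean()
    return ce + alpha * ccc
```

Unlike plain MSE, minimising 1 - CCC rewards predictions that match both the location and the scale of the annotations, which is why CCC-based losses are commonly used for dimensional emotion regression.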
Related papers
- A Unified Framework for Emotion Recognition and Sentiment Analysis via Expert-Guided Multimodal Fusion with Large Language Models [16.195689085967004]
We present EGMF, a unified framework combining expert-guided multimodal fusion with large language models. Our approach features three specialized expert networks: a fine-grained local expert for subtle emotional nuances, a semantic correlation expert for cross-modal relationships, and a global context expert for long-range dependencies.
arXiv Detail & Related papers (2026-01-12T14:21:32Z)
- A Unified Spoken Language Model with Injected Emotional-Attribution Thinking for Human-like Interaction [50.05919688888947]
This paper presents a unified spoken language model for emotional intelligence, enhanced by a novel data construction strategy termed Injected Emotional-Attribution Thinking (IEAT). IEAT incorporates user emotional states and their underlying causes into the model's internal reasoning process, enabling emotion-aware reasoning to be internalized rather than treated as explicit supervision. Experiments on the Human-like Spoken Dialogue Systems Challenge (HumDial) Emotional Intelligence benchmark demonstrate that the proposed approach achieves top-ranked performance across emotional trajectory modeling, emotional reasoning, and empathetic response generation.
arXiv Detail & Related papers (2026-01-08T14:07:30Z)
- EM2LDL: A Multilingual Speech Corpus for Mixed Emotion Recognition through Label Distribution Learning [43.19985438293247]
This study introduces EM2LDL, a novel multilingual speech corpus designed to advance mixed emotion recognition through label distribution learning. EM2LDL comprises expressive utterances in English, Mandarin, and Cantonese, capturing the intra-utterance code-switching prevalent in multilingual regions like Hong Kong and Macao.
arXiv Detail & Related papers (2025-11-25T09:26:15Z)
- Speaker Style-Aware Phoneme Anchoring for Improved Cross-Lingual Speech Emotion Recognition [58.74986434825755]
Cross-lingual speech emotion recognition is a challenging task due to differences in phonetic variability and speaker-specific expressive styles. We propose a speaker-style aware phoneme anchoring framework that aligns emotional expression at the phonetic and speaker levels. Our method builds emotion-specific speaker communities via graph-based clustering to capture shared speaker traits.
arXiv Detail & Related papers (2025-09-19T21:03:21Z)
- Towards Inclusive Communication: A Unified Framework for Generating Spoken Language from Sign, Lip, and Audio [52.859261069569165]
We propose the first unified framework capable of handling diverse combinations of sign language, lip movements, and audio for spoken-language text generation. We focus on three main objectives: (i) designing a unified, modality-agnostic architecture capable of effectively processing heterogeneous inputs; (ii) exploring the underexamined synergy among modalities, particularly the role of lip movements as non-manual cues in sign language comprehension; and (iii) achieving performance on par with or better than state-of-the-art models specialized for individual tasks.
arXiv Detail & Related papers (2025-08-28T06:51:42Z)
- GOAT-SLM: A Spoken Language Model with Paralinguistic and Speaker Characteristic Awareness [43.67571101152883]
We introduce GOAT-SLM, a novel spoken language model with paralinguistic and speaker characteristic awareness. GOAT-SLM adopts a dual-modality head architecture that decouples linguistic modeling from acoustic realization. We show that GOAT-SLM achieves well-balanced performance across both semantic and non-semantic tasks, and outperforms existing open-source models in handling emotion, dialectal variation, and age-sensitive interactions.
arXiv Detail & Related papers (2025-07-24T06:10:29Z)
- Large Language Models Meet Contrastive Learning: Zero-Shot Emotion Recognition Across Languages [31.15696076055884]
We propose leveraging contrastive learning to refine multilingual speech features and extend large language models. Specifically, we employ a novel two-stage training framework to align speech signals with linguistic features in the emotional space. To advance research in this field, we introduce a large-scale synthetic multilingual speech emotion dataset, M5SER.
arXiv Detail & Related papers (2025-03-25T05:58:18Z)
- EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions [152.41217651729738]
We propose EMOVA (EMotionally Omni-present Voice Assistant) to enable Large Language Models with end-to-end speech abilities. With a semantic-acoustic disentangled speech tokenizer, we surprisingly notice that omni-modal alignment can further enhance vision-language and speech abilities. For the first time, EMOVA achieves state-of-the-art performance on both the vision-language and speech benchmarks.
arXiv Detail & Related papers (2024-09-26T16:44:02Z)
- Cross-Lingual Speech Emotion Recognition: Humans vs. Self-Supervised Models [16.0617753653454]
This study presents a comparative analysis between human performance and SSL models. We also compare the SER ability of models and humans at both utterance- and segment-levels. Our findings reveal that models, with appropriate knowledge transfer, can adapt to the target language and achieve performance comparable to native speakers.
arXiv Detail & Related papers (2024-09-25T13:27:17Z)
- Joint Modeling of Code-Switched and Monolingual ASR via Conditional Factorization [75.98664099579392]
We propose a general framework to jointly model the likelihoods of the monolingual and code-switch sub-tasks that comprise bilingual speech recognition.
We demonstrate the efficacy of our proposed model on bilingual Mandarin-English speech recognition across both monolingual and code-switched corpora.
arXiv Detail & Related papers (2021-11-29T23:14:54Z)
- Unsupervised Cross-lingual Representation Learning for Speech Recognition [63.85924123692923]
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We build on wav2vec 2.0, which is trained by solving a contrastive task over masked latent speech representations (a minimal sketch of this objective follows the list).
Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
arXiv Detail & Related papers (2020-06-24T18:25:05Z)
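As context for the XLSR entry above, here is a hypothetical minimal sketch of the wav2vec 2.0-style contrastive task over masked latent representations: the context network's output at each masked time step must identify the true quantized latent among sampled distractors. The tensor shapes, the temperature `tau`, the distractor count, and the within-batch distractor sampling are simplifying assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def contrastive_masked_loss(context, targets, num_distractors=100, tau=0.1):
    """InfoNCE-style loss over masked latent speech representations.

    context: (num_masked, dim)  transformer outputs at masked time steps
    targets: (num_masked, dim)  quantized latents for those same steps
    Distractors are drawn from the other masked targets in the batch,
    a simplification of wav2vec 2.0's within-utterance sampling.
    """
    n = context.size(0)
    # Sample distractor indices (may collide with the positive; wav2vec 2.0
    # excludes the positive, which this sketch skips for brevity).
    idx = torch.randint(0, n, (n, num_distractors), device=context.device)
    # Candidate set per step: true latent at index 0, distractors after it.
    candidates = torch.cat([targets.unsqueeze(1), targets[idx]], dim=1)
    # Scaled cosine similarities, shape (n, 1 + num_distractors).
    sims = F.cosine_similarity(context.unsqueeze(1), candidates, dim=-1) / tau
    labels = torch.zeros(n, dtype=torch.long, device=context.device)
    return F.cross_entropy(sims, labels)
```

The actual wav2vec 2.0 objective samples distractors from other masked positions within the same utterance and adds a codebook diversity term; both are omitted here for brevity.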