Developing a High-performance Framework for Speech Emotion Recognition in Naturalistic Conditions Challenge for Emotional Attribute Prediction
- URL: http://arxiv.org/abs/2506.10930v1
- Date: Thu, 12 Jun 2025 17:38:06 GMT
- Title: Developing a High-performance Framework for Speech Emotion Recognition in Naturalistic Conditions Challenge for Emotional Attribute Prediction
- Authors: Thanathai Lertpetchpun, Tiantian Feng, Dani Byrd, Shrikanth Narayanan
- Abstract summary: Speech emotion recognition (SER) in naturalistic conditions presents a significant challenge for the speech processing community. This paper presents a reproducible framework that achieves superior (top 1) performance in the Emotion Recognition in Naturalistic Conditions Challenge (IS25-SER Challenge) - Task 2. Our system is designed to tackle the aforementioned challenges through multimodal learning, multi-task learning, and imbalanced data handling.
- Score: 31.454914712837933
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech emotion recognition (SER) in naturalistic conditions presents a significant challenge for the speech processing community. Challenges include disagreement in labeling among annotators and imbalanced data distributions. This paper presents a reproducible framework that achieves superior (top 1) performance in the Emotion Recognition in Naturalistic Conditions Challenge (IS25-SER Challenge) - Task 2, evaluated on the MSP-Podcast dataset. Our system is designed to tackle the aforementioned challenges through multimodal learning, multi-task learning, and imbalanced data handling. Specifically, our best system is trained by adding text embeddings, predicting gender, and including ``Other'' (O) and ``No Agreement'' (X) samples in the training set. Our system's results secured both first and second places in the IS25-SER Challenge, and the top performance was achieved by a simple two-system ensemble.
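The abstract names imbalanced data handling and a simple two-system ensemble as key ingredients. The following is a minimal illustrative sketch, not the authors' actual implementation: it assumes inverse-frequency class weighting (one common recipe for imbalance) and elementwise averaging of two systems' attribute predictions. All function names and values here are hypothetical.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Per-class weights inversely proportional to class frequency.
    A common recipe for imbalanced data; not necessarily the paper's exact scheme."""
    counts = Counter(labels)
    total = len(labels)
    return {c: total / (len(counts) * n) for c, n in counts.items()}

def ensemble_average(pred_a, pred_b):
    """Simple two-system ensemble: average attribute predictions elementwise."""
    return [(a + b) / 2 for a, b in zip(pred_a, pred_b)]

# Example: a label set that keeps the "Other" (O) and "No Agreement" (X)
# samples in training, as the paper describes
labels = ["A", "A", "A", "O", "X"]
weights = inverse_frequency_weights(labels)

# Two systems' attribute predictions (e.g. arousal/valence/dominance), fused
fused = ensemble_average([0.2, 0.6, 0.4], [0.4, 0.8, 0.6])
```

Rare classes such as "O" and "X" receive larger weights, so keeping them in the training set contributes signal without being drowned out by the majority classes.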
Related papers
- Recent Trends in Distant Conversational Speech Recognition: A Review of CHiME-7 and 8 DASR Challenges [58.80034860169605]
The CHiME-7 and 8 distant speech recognition (DASR) challenges focus on multi-channel, generalizable, joint automatic speech recognition (ASR) and diarization of conversational speech. This paper outlines the challenges' design, evaluation metrics, datasets, and baseline systems while analyzing key trends from participant submissions.
arXiv Detail & Related papers (2025-07-24T07:56:24Z) - Enhancing Speech Emotion Recognition with Graph-Based Multimodal Fusion and Prosodic Features for the Speech Emotion Recognition in Naturalistic Conditions Challenge at Interspeech 2025 [64.59170359368699]
We present a robust system for the INTERSPEECH 2025 Speech Emotion Recognition in Naturalistic Conditions Challenge. Our method combines state-of-the-art audio models with text features enriched by prosodic and spectral cues.
arXiv Detail & Related papers (2025-06-02T13:46:02Z) - Exploring Generative Error Correction for Dysarthric Speech Recognition [12.584296717901116]
We propose a two-stage framework for the Speech Accessibility Project Challenge at INTERSPEECH 2025. We assess different configurations of model scales and training strategies, incorporating specific hypothesis selection to improve transcription accuracy. We provide insights into the complementary roles of acoustic and linguistic modeling in dysarthric speech recognition.
arXiv Detail & Related papers (2025-05-26T16:06:31Z) - ABHINAYA -- A System for Speech Emotion Recognition In Naturalistic Conditions Challenge [26.88581786290044]
We present Abhinaya, a system integrating speech-based, text-based, and speech-text models. Our approach fine-tunes self-supervised and speech large language models (SLLM) for speech representations. To combat class imbalance, we apply tailored loss functions and generate categorical decisions through majority voting.
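The Abhinaya abstract describes producing categorical decisions by majority voting across its speech-based, text-based, and speech-text models. A minimal sketch of that voting step, with hypothetical model outputs:

```python
from collections import Counter

def majority_vote(predictions):
    """Combine categorical decisions from several models by majority vote.
    Ties break in favor of the first-seen label (Counter.most_common preserves
    insertion order among equal counts); a sketch, not the paper's exact rule."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical outputs from speech-only, text-only, and speech-text models
decision = majority_vote(["happy", "neutral", "happy"])
```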
arXiv Detail & Related papers (2025-05-23T08:01:56Z) - Towards Event Extraction from Speech with Contextual Clues [61.164413398231254]
We introduce the Speech Event Extraction (SpeechEE) task and construct three synthetic training sets and one human-spoken test set.
Compared to event extraction from text, SpeechEE poses greater challenges mainly due to complex speech signals that are continuous and have no word boundaries.
Our method brings significant improvements on all datasets, achieving a maximum F1 gain of 10.7%.
arXiv Detail & Related papers (2024-01-27T11:07:19Z) - SpeechEQ: Speech Emotion Recognition based on Multi-scale Unified Datasets and Multitask Learning [24.57668015470307]
We propose SpeechEQ, a framework for unifying SER tasks based on a multi-scale unified metric.
This metric can be trained by Multitask Learning (MTL), which includes two emotion recognition tasks: Emotion States Category and Emotion Intensity Scale (EIS).
We conducted experiments on the public CASIA and ESD datasets in Mandarin, which exhibit that our method outperforms baseline methods by a relatively large margin.
arXiv Detail & Related papers (2022-06-27T08:11:54Z) - NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality [123.97136358092585]
We develop a TTS system called NaturalSpeech that achieves human-level quality on a benchmark dataset.
Specifically, we leverage a variational autoencoder (VAE) for end-to-end text to waveform generation.
Experiment evaluations on popular LJSpeech dataset show that our proposed NaturalSpeech achieves -0.01 CMOS to human recordings at the sentence level.
arXiv Detail & Related papers (2022-05-09T16:57:35Z) - The RoyalFlush System of Speech Recognition for M2MeT Challenge [5.863625637354342]
This paper describes our RoyalFlush system for the track of multi-speaker automatic speech recognition (ASR) in the M2MeT challenge.
We adopted the serialized output training (SOT) based multi-speakers ASR system with large-scale simulation data.
Our system got a 12.22% absolute Character Error Rate (CER) reduction on the validation set and 12.11% on the test set.
arXiv Detail & Related papers (2022-02-03T14:38:26Z) - An Exploration of Self-Supervised Pretrained Representations for End-to-End Speech Recognition [98.70304981174748]
We focus on the general applications of pretrained speech representations, on advanced end-to-end automatic speech recognition (E2E-ASR) models.
We select several pretrained speech representations and present the experimental results on various open-source and publicly available corpora for E2E-ASR.
arXiv Detail & Related papers (2021-10-09T15:06:09Z) - ASVspoof 2021: accelerating progress in spoofed and deepfake speech
detection [70.45884214674057]
ASVspoof 2021 is the fourth edition in the series of biennial challenges which aim to promote the study of spoofing.
This paper describes all three tasks, the new databases for each of them, the evaluation metrics, four challenge baselines, the evaluation platform and a summary of challenge results.
arXiv Detail & Related papers (2021-09-01T16:17:31Z) - Exploiting Unsupervised Data for Emotion Recognition in Conversations [76.01690906995286]
Emotion Recognition in Conversations (ERC) aims to predict the emotional state of speakers in conversations.
The available supervised data for the ERC task is limited.
We propose a novel approach to leverage unsupervised conversation data.
arXiv Detail & Related papers (2020-10-02T13:28:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.