Developing a High-performance Framework for Speech Emotion Recognition in Naturalistic Conditions Challenge for Emotional Attribute Prediction
 - URL: http://arxiv.org/abs/2506.10930v1
 - Date: Thu, 12 Jun 2025 17:38:06 GMT
 - Title: Developing a High-performance Framework for Speech Emotion Recognition in Naturalistic Conditions Challenge for Emotional Attribute Prediction
 - Authors: Thanathai Lertpetchpun, Tiantian Feng, Dani Byrd, Shrikanth Narayanan
 - Abstract summary: Speech emotion recognition (SER) in naturalistic conditions presents a significant challenge for the speech processing community. This paper presents a reproducible framework that achieves first-place performance in Task 2 of the Emotion Recognition in Naturalistic Conditions Challenge (IS25-SER Challenge). Our system is designed to tackle the aforementioned challenges through multimodal learning, multi-task learning, and imbalanced data handling.
 - License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
 - Abstract: Speech emotion recognition (SER) in naturalistic conditions presents a significant challenge for the speech processing community. Challenges include annotator disagreement in labeling and imbalanced data distributions. This paper presents a reproducible framework that achieves first-place performance in Task 2 of the Emotion Recognition in Naturalistic Conditions Challenge (IS25-SER Challenge), evaluated on the MSP-Podcast dataset. Our system is designed to tackle these challenges through multimodal learning, multi-task learning, and imbalanced data handling. Specifically, our best system is trained by adding text embeddings, predicting gender as an auxiliary task, and including "Other" (O) and "No Agreement" (X) samples in the training set. Our submissions secured both first and second place in the IS25-SER Challenge, and the top performance was achieved by a simple two-system ensemble.
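As a rough illustration of how these three ingredients can fit together, the sketch below fuses pooled speech and text embeddings, adds gender prediction as an auxiliary task, and combines the losses with a fixed weight. All module names, dimensions, and the loss weighting are illustrative assumptions, not the authors' actual configuration.

```python
# Minimal sketch of the three ingredients named in the abstract:
# multimodal fusion (speech + text), multi-task learning (emotional
# attributes + gender), and a weighted multi-task loss. Hypothetical
# dimensions; not the authors' implementation.
import torch
import torch.nn as nn

class MultimodalSER(nn.Module):
    def __init__(self, speech_dim=1024, text_dim=768, hidden=256):
        super().__init__()
        # Fuse pooled speech and text embeddings by concatenation.
        self.fusion = nn.Sequential(
            nn.Linear(speech_dim + text_dim, hidden), nn.ReLU())
        self.attr_head = nn.Linear(hidden, 3)    # arousal/valence/dominance
        self.gender_head = nn.Linear(hidden, 2)  # auxiliary task

    def forward(self, speech_emb, text_emb):
        h = self.fusion(torch.cat([speech_emb, text_emb], dim=-1))
        return self.attr_head(h), self.gender_head(h)

def multitask_loss(attr_pred, attr_true, gen_logits, gen_true, alpha=0.1):
    # CCC loss is common for attribute SER; plain MSE keeps the sketch short.
    return nn.functional.mse_loss(attr_pred, attr_true) + \
           alpha * nn.functional.cross_entropy(gen_logits, gen_true)
```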
 
       
      
        Related papers
- Recent Trends in Distant Conversational Speech Recognition: A Review of CHiME-7 and 8 DASR Challenges [58.80034860169605]
The CHiME-7 and 8 distant speech recognition (DASR) challenges focus on multi-channel, generalizable, joint automatic speech recognition (ASR) and diarization of conversational speech.
This paper outlines the challenges' design, evaluation metrics, datasets, and baseline systems while analyzing key trends from participant submissions.
arXiv Detail & Related papers (2025-07-24T07:56:24Z)

- Enhancing Speech Emotion Recognition with Graph-Based Multimodal Fusion and Prosodic Features for the Speech Emotion Recognition in Naturalistic Conditions Challenge at Interspeech 2025 [64.59170359368699]
We present a robust system for the INTERSPEECH 2025 Speech Emotion Recognition in Naturalistic Conditions Challenge.
Our method combines state-of-the-art audio models with text features enriched by prosodic and spectral cues.
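As a loose sketch of combining a learned audio embedding with explicit prosodic cues, the snippet below extracts pitch and energy statistics with librosa and appends them to an utterance-level embedding. Plain concatenation stands in for the paper's graph-based fusion, and every parameter here is an assumption.

```python
# Rough sketch: enrich a learned audio embedding with explicit prosodic
# cues (pitch and energy). Concatenation is a stand-in for the paper's
# graph-based fusion; all parameters are illustrative.
import numpy as np
import librosa

def prosodic_features(wav: np.ndarray, sr: int) -> np.ndarray:
    f0 = librosa.yin(wav, fmin=50, fmax=400, sr=sr)  # frame-level pitch
    rms = librosa.feature.rms(y=wav)[0]              # frame-level energy
    # Summarize each contour by mean and std as utterance-level cues.
    stats = lambda x: [np.nanmean(x), np.nanstd(x)]
    return np.array(stats(f0) + stats(rms), dtype=np.float32)

def fuse(audio_emb: np.ndarray, wav: np.ndarray, sr: int) -> np.ndarray:
    # Simplest fusion: append prosodic statistics to the model embedding.
    return np.concatenate([audio_emb, prosodic_features(wav, sr)])
```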
arXiv Detail & Related papers (2025-06-02T13:46:02Z)

- Exploring Generative Error Correction for Dysarthric Speech Recognition [12.584296717901116]
We propose a two-stage framework for the Speech Accessibility Project Challenge at INTERSPEECH 2025.
We assess different configurations of model scales and training strategies, incorporating specific hypothesis selection to improve transcription accuracy.
We provide insights into the complementary roles of acoustic and linguistic modeling in dysarthric speech recognition.
arXiv Detail & Related papers (2025-05-26T16:06:31Z)

- ABHINAYA -- A System for Speech Emotion Recognition In Naturalistic Conditions Challenge [26.88581786290044]
We present Abhinaya, a system integrating speech-based, text-based, and speech-text models.
Our approach fine-tunes self-supervised models and speech large language models (SLLMs) for speech representations.
To combat class imbalance, we apply tailored loss functions and generate categorical decisions through majority voting.
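A minimal sketch of the two mechanisms mentioned for class imbalance and system combination: inverse-frequency loss weighting and majority voting. The weighting scheme and voting details are assumptions, not Abhinaya's exact recipe.

```python
# Sketch of imbalance handling and ensembling, under assumed details:
# inverse-frequency class weights and majority voting across systems.
from collections import Counter
import torch
import torch.nn as nn

def weighted_ce(class_counts: list[int]) -> nn.CrossEntropyLoss:
    # Inverse-frequency weights: rare classes contribute more to the loss.
    counts = torch.tensor(class_counts, dtype=torch.float)
    weights = counts.sum() / (len(counts) * counts)
    return nn.CrossEntropyLoss(weight=weights)

def majority_vote(predictions: list[list[int]]) -> list[int]:
    # predictions[k][i] = class chosen by system k for utterance i.
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]
```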
arXiv Detail & Related papers (2025-05-23T08:01:56Z)

- Towards Event Extraction from Speech with Contextual Clues [61.164413398231254]
We introduce the Speech Event Extraction (SpeechEE) task and construct three synthetic training sets and one human-spoken test set.
Compared to event extraction from text, SpeechEE poses greater challenges mainly due to complex speech signals that are continuous and have no word boundaries.
Our method brings significant improvements on all datasets, achieving a maximum F1 gain of 10.7%.
arXiv Detail & Related papers (2024-01-27T11:07:19Z)

- SpeechEQ: Speech Emotion Recognition based on Multi-scale Unified Datasets and Multitask Learning [24.57668015470307]
We propose SpeechEQ, a framework for unifying SER tasks based on a multi-scale unified metric.
This metric can be trained by Multitask Learning (MTL), which includes two emotion recognition tasks: Emotion States Category (ESC) and Emotion Intensity Scale (EIS).
We conducted experiments on the public CASIA and ESD datasets in Mandarin, which show that our method outperforms baseline methods by a relatively large margin.
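A minimal sketch of such a two-head multitask setup, assuming a shared encoder with separate category and intensity classifiers; layer sizes and class counts are illustrative, not SpeechEQ's actual architecture.

```python
# Sketch of a shared encoder with two classification heads, one per
# task (emotion category, emotion intensity). Sizes are hypothetical.
import torch.nn as nn

class TwoTaskHeads(nn.Module):
    def __init__(self, feat_dim=512, n_categories=6, n_levels=5):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU())
        self.category = nn.Linear(256, n_categories)   # emotion states category
        self.intensity = nn.Linear(256, n_levels)      # emotion intensity scale

    def forward(self, x):
        h = self.encoder(x)
        return self.category(h), self.intensity(h)
```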
arXiv Detail & Related papers (2022-06-27T08:11:54Z)

- NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality [123.97136358092585]
We develop a TTS system called NaturalSpeech that achieves human-level quality on a benchmark dataset.
Specifically, we leverage a variational autoencoder (VAE) for end-to-end text to waveform generation.
Experimental evaluations on the popular LJSpeech dataset show that our proposed NaturalSpeech achieves -0.01 CMOS relative to human recordings at the sentence level.
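As a bare-bones illustration of the VAE ingredient only, the toy below encodes to a Gaussian latent, samples with the reparameterization trick, and decodes; a real TTS system like NaturalSpeech additionally conditions on text/phonemes and uses far richer modules.

```python
# Toy VAE: encode to a latent posterior, sample via reparameterization,
# decode, and train with reconstruction + KL. Purely illustrative.
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    def __init__(self, in_dim=80, z_dim=16):
        super().__init__()
        self.enc = nn.Linear(in_dim, 2 * z_dim)  # predicts mean and log-var
        self.dec = nn.Linear(z_dim, in_dim)

    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        return self.dec(z), mu, logvar

def elbo_loss(x, recon, mu, logvar):
    # Reconstruction term plus KL divergence to the standard normal prior.
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
    return nn.functional.mse_loss(recon, x) + kl
```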
arXiv Detail & Related papers (2022-05-09T16:57:35Z)

- The RoyalFlush System of Speech Recognition for M2MeT Challenge [5.863625637354342]
This paper describes our RoyalFlush system for the track of multi-speaker automatic speech recognition (ASR) in the M2MeT challenge.
We adopted a serialized output training (SOT) based multi-speaker ASR system trained with large-scale simulated data.
Our system achieved a 12.22% absolute character error rate (CER) reduction on the validation set and 12.11% on the test set.
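An illustrative sketch of the SOT data format: overlapping speakers' transcripts become one target sequence, ordered first-in-first-out by start time and joined with a speaker-change token. The token spelling is an assumption; real systems use a dedicated symbol ID in the vocabulary.

```python
# Sketch of serialized output training (SOT) target construction:
# concatenate speaker segments in first-in-first-out order, separated
# by a speaker-change token. "<sc>" is a hypothetical spelling.
SC = "<sc>"  # speaker-change token

def serialize(transcripts: list[tuple[float, str]]) -> str:
    # transcripts: (start_time, text) per speaker segment.
    ordered = sorted(transcripts, key=lambda t: t[0])
    return f" {SC} ".join(text for _, text in ordered)

# serialize([(1.2, "how are you"), (0.4, "hello there")])
# -> "hello there <sc> how are you"
```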
arXiv Detail & Related papers (2022-02-03T14:38:26Z)

- An Exploration of Self-Supervised Pretrained Representations for End-to-End Speech Recognition [98.70304981174748]
We focus on the general application of pretrained speech representations to advanced end-to-end automatic speech recognition (E2E-ASR) models.
We select several pretrained speech representations and present the experimental results on various open-source and publicly available corpora for E2E-ASR.
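A hedged sketch of this setup, using torchaudio's WAV2VEC2_BASE bundle as one concrete choice of pretrained representation: the self-supervised model is frozen and used as a frame-level feature extractor whose outputs a downstream E2E-ASR encoder would consume.

```python
# Sketch: frozen self-supervised model as a feature extractor for
# downstream E2E-ASR. WAV2VEC2_BASE is one choice among many the
# paper compares; it is not claimed to be the best-performing one.
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_BASE
ssl_model = bundle.get_model().eval()

def pretrained_features(waveform: torch.Tensor) -> torch.Tensor:
    # waveform: (batch, time) sampled at bundle.sample_rate (16 kHz).
    with torch.no_grad():
        feats, _ = ssl_model.extract_features(waveform)
    return feats[-1]  # last-layer frame representations, (batch, frames, 768)
```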
arXiv Detail & Related papers (2021-10-09T15:06:09Z)

- ASVspoof 2021: accelerating progress in spoofed and deepfake speech detection [70.45884214674057]
ASVspoof 2021 is the fourth edition in the series of biennial challenges that aim to promote the study of spoofing.
This paper describes all three tasks, the new databases for each of them, the evaluation metrics, four challenge baselines, the evaluation platform and a summary of challenge results.
arXiv Detail & Related papers (2021-09-01T16:17:31Z)

- Exploiting Unsupervised Data for Emotion Recognition in Conversations [76.01690906995286]
Emotion Recognition in Conversations (ERC) aims to predict the emotional state of speakers in conversations.
The available supervised data for the ERC task is limited.
We propose a novel approach to leverage unsupervised conversation data.
arXiv Detail & Related papers (2020-10-02T13:28:47Z)