Related papers: What Does it Take to Generalize SER Model Across Datasets? A Comprehensive Benchmark

What Does it Take to Generalize SER Model Across Datasets? A Comprehensive Benchmark

URL: http://arxiv.org/abs/2406.09933v1
Date: Fri, 14 Jun 2024 11:27:19 GMT
Title: What Does it Take to Generalize SER Model Across Datasets? A Comprehensive Benchmark
Authors: Adham Ibrahim, Shady Shehata, Ajinkya Kulkarni, Mukhtar Mohamed, Muhammad Abdul-Mageed,
Abstract summary: Speech emotion recognition (SER) is essential for enhancing human-computer interaction in speech-based applications. Despite improvements in specific emotional datasets, there is still a research gap in SER's capability to generalize across real-world situations. In this paper, we investigate approaches to generalize the SER system across different emotion datasets.
Score: 13.820963986497128
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Speech emotion recognition (SER) is essential for enhancing human-computer interaction in speech-based applications. Despite improvements in specific emotional datasets, there is still a research gap in SER's capability to generalize across real-world situations. In this paper, we investigate approaches to generalize the SER system across different emotion datasets. In particular, we incorporate 11 emotional speech datasets and illustrate a comprehensive benchmark on the SER task. We also address the challenge of imbalanced data distribution using over-sampling methods when combining SER datasets for training. Furthermore, we explore various evaluation protocols for adeptness in the generalization of SER. Building on this, we explore the potential of Whisper for SER, emphasizing the importance of thorough evaluation. Our approach is designed to advance SER technology by integrating speaker-independent methods.

Related papers

Toward Efficient Speech Emotion Recognition via Spectral Learning and Attention [0.5371337604556311]
Speech Emotion Recognition (SER) traditionally relies on auditory data analysis for emotion classification.<n>We use Mel-Frequency Cepstral Coefficients (MFCCs) as spectral features to bridge the gap between computational emotion processing and human auditory perception.<n>We propose a novel 1D-CNN-based SER framework that integrates data augmentation techniques.
arXiv Detail & Related papers (2025-07-04T01:55:49Z)
Summarizing Speech: A Comprehensive Survey [76.13011304983458]
Speech summarization has become an essential tool for efficiently managing and accessing the growing volume of spoken and audiovisual content.<n>This survey examines existing datasets and evaluation protocols, which are crucial for assessing the quality of summarization approaches.
arXiv Detail & Related papers (2025-04-10T17:50:53Z)
Evaluating Facial Expression Recognition Datasets for Deep Learning: A Benchmark Study with Novel Similarity Metrics [4.137346786534721]
This study investigates the key characteristics and suitability of widely used Facial Expression Recognition (FER) datasets for training deep learning models. We compiled and analyzed 24 FER datasets, including those targeting specific age groups such as children, adults, and the elderly. Benchmark experiments using state-of-the-art neural networks reveal that large-scale, automatically collected datasets tend to generalize better.
arXiv Detail & Related papers (2025-03-26T11:01:00Z)
ILAEDA: An Imitation Learning Based Approach for Automatic Exploratory Data Analysis [5.012314384895538]
We argue that not all of the essential features of what makes an operation important can be accurately captured mathematically using rewards. We propose an AutoEDA model trained through imitation learning from expert EDA sessions, bypassing the need for manually defined interestingness measures. Our method outperforms the existing state-of-the-art end-to-end EDA approach on benchmarks by upto 3x, showing strong performance and generalization.
arXiv Detail & Related papers (2024-10-15T04:56:13Z)
Speech Emotion Recognition under Resource Constraints with Data Distillation [64.36799373890916]
Speech emotion recognition (SER) plays a crucial role in human-computer interaction. The emergence of edge devices in the Internet of Things presents challenges in constructing intricate deep learning models. We propose a data distillation framework to facilitate efficient development of SER models in IoT applications.
arXiv Detail & Related papers (2024-06-21T13:10:46Z)
BIRB: A Generalization Benchmark for Information Retrieval in Bioacoustics [7.68184437595058]
We present BIRB, a complex benchmark centered on the retrieval of bird vocalizations from passively-recorded datasets. We propose a baseline system for this collection of tasks using representation learning and a nearest-centroid search.
arXiv Detail & Related papers (2023-12-12T17:06:39Z)
Speech and Text-Based Emotion Recognizer [0.9168634432094885]
We build a balanced corpus from publicly available datasets for speech emotion recognition. Our best system, a multi-modal speech, and text-based model, provides a performance of UA(Unweighed Accuracy) + WA (Weighed Accuracy) of 157.57 compared to the baseline algorithm performance of 119.66.
arXiv Detail & Related papers (2023-12-10T05:17:39Z)
SER_AMPEL: a multi-source dataset for speech emotion recognition of Italian older adults [58.49386651361823]
SER_AMPEL is a multi-source dataset for speech emotion recognition (SER) It is collected with the aim of providing a reference for speech emotion recognition in case of Italian older adults. The evidence of the need for such a dataset emerges from the analysis of the state of the art.
arXiv Detail & Related papers (2023-11-24T13:47:25Z)
End-to-End Continuous Speech Emotion Recognition in Real-life Customer Service Call Center Conversations [0.0]
We present our approach to constructing a large-scale reallife dataset (CusEmo) for continuous SER in customer service call center conversations. We adopted the dimensional emotion annotation approach to capture the subtlety, complexity, and continuity of emotions in real-life call center conversations. The study also addresses the challenges encountered during the application of the End-to-End (E2E) SER system to the dataset.
arXiv Detail & Related papers (2023-10-02T11:53:48Z)
Co-Located Human-Human Interaction Analysis using Nonverbal Cues: A Survey [71.43956423427397]
We aim to identify the nonverbal cues and computational methodologies resulting in effective performance. This survey differs from its counterparts by involving the widest spectrum of social phenomena and interaction settings. Some major observations are: the most often used nonverbal cue, computational method, interaction environment, and sensing approach are speaking activity, support vector machines, and meetings composed of 3-4 persons equipped with microphones and cameras, respectively.
arXiv Detail & Related papers (2022-07-20T13:37:57Z)
Attribute Inference Attack of Speech Emotion Recognition in Federated Learning Settings [56.93025161787725]
Federated learning (FL) is a distributed machine learning paradigm that coordinates clients to train a model collaboratively without sharing local data. We propose an attribute inference attack framework that infers sensitive attribute information of the clients from shared gradients or model parameters. We show that the attribute inference attack is achievable for SER systems trained using FL.
arXiv Detail & Related papers (2021-12-26T16:50:42Z)
Improved Speech Emotion Recognition using Transfer Learning and Spectrogram Augmentation [56.264157127549446]
Speech emotion recognition (SER) is a challenging task that plays a crucial role in natural human-computer interaction. One of the main challenges in SER is data scarcity. We propose a transfer learning strategy combined with spectrogram augmentation.
arXiv Detail & Related papers (2021-08-05T10:39:39Z)
Sense and Learn: Self-Supervision for Omnipresent Sensors [9.442811508809994]
We present a framework named Sense and Learn for representation or feature learning from raw sensory data. It consists of several auxiliary tasks that can learn high-level and broadly useful features entirely from unannotated data without any human involvement in the tedious labeling process. Our methodology achieves results that are competitive with the supervised approaches and close the gap through fine-tuning a network while learning the downstream tasks in most cases.
arXiv Detail & Related papers (2020-09-28T11:57:43Z)
Causal Feature Selection for Algorithmic Fairness [61.767399505764736]
We consider fairness in the integration component of data management. We propose an approach to identify a sub-collection of features that ensure the fairness of the dataset.
arXiv Detail & Related papers (2020-06-10T20:20:10Z)

This list is automatically generated from the titles and abstracts of the papers in this site.