Related papers: INTERSPEECH 2009 Emotion Challenge Revisited: Benchmarking 15 Years of Progress in Speech Emotion Recognition

INTERSPEECH 2009 Emotion Challenge Revisited: Benchmarking 15 Years of Progress in Speech Emotion Recognition

URL: http://arxiv.org/abs/2406.06401v1
Date: Mon, 10 Jun 2024 15:55:06 GMT
Title: INTERSPEECH 2009 Emotion Challenge Revisited: Benchmarking 15 Years of Progress in Speech Emotion Recognition
Authors: Andreas Triantafyllopoulos, Anton Batliner, Simon Rampp, Manuel Milling, Björn Schuller,
Abstract summary: We revisit the INTERSPEECH 2009 Emotion Challenge -- the first ever speech emotion recognition (SER) challenge. We evaluate a series of deep learning models that are representative of the major advances in SER research.
Score: 5.303788012608604
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: We revisit the INTERSPEECH 2009 Emotion Challenge -- the first ever speech emotion recognition (SER) challenge -- and evaluate a series of deep learning models that are representative of the major advances in SER research in the time since then. We start by training each model using a fixed set of hyperparameters, and further fine-tune the best-performing models of that initial setup with a grid search. Results are always reported on the official test set with a separate validation set only used for early stopping. Most models score below or close to the official baseline, while they marginally outperform the original challenge winners after hyperparameter tuning. Our work illustrates that, despite recent progress, FAU-AIBO remains a very challenging benchmark. An interesting corollary is that newer methods do not consistently outperform older ones, showing that progress towards `solving' SER is not necessarily monotonic.

Related papers

Energy-Based Transformers are Scalable Learners and Thinkers [84.7474634026213]
Energy-Based Transformers (EBTs) are a new class of Energy-Based Models (EBMs)<n>We train EBTs to assign an energy value to every input and candidate-prediction pair, enabling predictions through gradient descent-based energy until convergence.<n>During inference, EBTs improve performance with System 2 Thinking by 29% more than the Transformer++ on language tasks.
arXiv Detail & Related papers (2025-07-02T19:17:29Z)
FINEREASON: Evaluating and Improving LLMs' Deliberate Reasoning through Reflective Puzzle Solving [90.88021670297664]
FINEREASON is a logic-puzzle benchmark for evaluation of large language models' reasoning capabilities. We introduce two tasks: state checking, and state transition, for a comprehensive evaluation of how models assess the current situation and plan the next move. We show that models trained on our state checking and transition data demonstrate gains in math reasoning by up to 5.1% on GSM8K.
arXiv Detail & Related papers (2025-02-27T16:23:25Z)
How Vision-Language Tasks Benefit from Large Pre-trained Models: A Survey [59.23394353614928]
In recent years, the rise of pre-trained models is driving the research on vision-language tasks. Inspired by the powerful capabilities of pre-trained models, new paradigms have emerged to solve the classic challenges.
arXiv Detail & Related papers (2024-12-11T07:29:04Z)
Beyond human subjectivity and error: a novel AI grading system [67.410870290301]
The grading of open-ended questions is a high-effort, high-impact task in education. Recent breakthroughs in AI technology might facilitate such automation, but this has not been demonstrated at scale. We introduce a novel automatic short answer grading (ASAG) system.
arXiv Detail & Related papers (2024-05-07T13:49:59Z)
Leveraging TCN and Transformer for effective visual-audio fusion in continuous emotion recognition [0.5370906227996627]
We present our approach to the Valence-Arousal (VA) Estimation Challenge, Expression (Expr) Classification Challenge, and Action Unit (AU) Detection Challenge. We propose a novel multi-modal fusion model that leverages Temporal Convolutional Networks (TCN) and Transformer to enhance the performance of continuous emotion recognition.
arXiv Detail & Related papers (2023-03-15T04:15:57Z)
Don't Copy the Teacher: Data and Model Challenges in Embodied Dialogue [92.01165203498299]
Embodied dialogue instruction following requires an agent to complete a complex sequence of tasks from a natural language exchange. This paper argues that imitation learning (IL) and related low-level metrics are actually misleading and do not align with the goals of embodied dialogue research.
arXiv Detail & Related papers (2022-10-10T05:51:40Z)
An Overview & Analysis of Sequence-to-Sequence Emotional Voice Conversion [8.94336505787464]
Sequence-to-sequence modelling is emerging as a competitive paradigm for models to overcome EVC challenges. Recent sequence-to-sequence EVC papers were investigated and reviewed from six perspectives. This information is organised to provide the research community with an easily digestible overview of the current state-of-the-art.
arXiv Detail & Related papers (2022-03-29T19:41:34Z)
SUPERB-SG: Enhanced Speech processing Universal PERformance Benchmark for Semantic and Generative Capabilities [76.97949110580703]
We introduce SUPERB-SG, a new benchmark to evaluate pre-trained models across various speech tasks. We use a lightweight methodology to test the robustness of representations learned by pre-trained models under shifts in data domain. We also show that the task diversity of SUPERB-SG coupled with limited task supervision is an effective recipe for evaluating the generalizability of model representation.
arXiv Detail & Related papers (2022-03-14T04:26:40Z)
When Liebig's Barrel Meets Facial Landmark Detection: A Practical Model [87.25037167380522]
We propose a model that is accurate, robust, efficient, generalizable, and end-to-end trainable. In order to achieve a better accuracy, we propose two lightweight modules. DQInit dynamically initializes the queries of decoder from the inputs, enabling the model to achieve as good accuracy as the ones with multiple decoder layers. QAMem is designed to enhance the discriminative ability of queries on low-resolution feature maps by assigning separate memory values to each query rather than a shared one.
arXiv Detail & Related papers (2021-05-27T13:51:42Z)
Utilizing Self-supervised Representations for MOS Prediction [51.09985767946843]
Existing evaluations usually require clean references or parallel ground truth data. Subjective tests, on the other hand, do not need any additional clean or parallel data and correlates better to human perception. We develop an automatic evaluation approach that correlates well with human perception while not requiring ground truth data.
arXiv Detail & Related papers (2021-04-07T09:44:36Z)
A Brief Survey and Comparative Study of Recent Development of Pronoun Coreference Resolution [55.39835612617972]
Pronoun Coreference Resolution (PCR) is the task of resolving pronominal expressions to all mentions they refer to. As one important natural language understanding (NLU) component, pronoun resolution is crucial for many downstream tasks and still challenging for existing models. We conduct extensive experiments to show that even though current models are achieving good performance on the standard evaluation set, they are still not ready to be used in real applications.
arXiv Detail & Related papers (2020-09-27T01:40:01Z)
TAL EmotioNet Challenge 2020 Rethinking the Model Chosen Problem in Multi-Task Learning [24.365090805937083]
We pose the AU recognition problem as a multi-task learning problem. The co-occurrence of the expression features and the head pose features are explored. By choosing the optimal checkpoint for each AU, the recognition results are improved.
arXiv Detail & Related papers (2020-04-21T09:39:38Z)

This list is automatically generated from the titles and abstracts of the papers in this site.