MSP-Podcast SER Challenge 2024: L'antenne du Ventoux Multimodal Self-Supervised Learning for Speech Emotion Recognition
- URL: http://arxiv.org/abs/2407.05746v1
- Date: Mon, 8 Jul 2024 08:52:06 GMT
- Title: MSP-Podcast SER Challenge 2024: L'antenne du Ventoux Multimodal Self-Supervised Learning for Speech Emotion Recognition
- Authors: Jarod Duret, Mickael Rouvier, Yannick Estève
- Abstract summary: We detail our submission to the 2024 edition of the MSP-Podcast Speech Emotion Recognition (SER) Challenge.
This challenge is divided into two distinct tasks: Categorical Emotion Recognition and Emotional Attribute Prediction.
Our approach employs an ensemble of models, each trained independently and then fused at the score level using a Support Vector Machine (SVM) classifier.
This joint training methodology aims to enhance the system's ability to accurately classify emotional states.
- Score: 12.808666808009926
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we detail our submission to the 2024 edition of the MSP-Podcast Speech Emotion Recognition (SER) Challenge. This challenge is divided into two distinct tasks: Categorical Emotion Recognition and Emotional Attribute Prediction. We concentrated our efforts on Task 1, which involves the categorical classification of eight emotional states using data from the MSP-Podcast dataset. Our approach employs an ensemble of models, each trained independently and then fused at the score level using a Support Vector Machine (SVM) classifier. The models were trained using various strategies, including Self-Supervised Learning (SSL) fine-tuning across different modalities: speech alone, text alone, and a combined speech and text approach. This joint training methodology aims to enhance the system's ability to accurately classify emotional states. Thus, the system obtained an F1-macro of 0.35% on the development set.
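As an illustration of the fusion step, the sketch below shows score-level fusion with an SVM classifier and F1-macro evaluation. It is a minimal sketch only: it assumes each independently trained model (speech-only, text-only, combined speech and text) produces an 8-class score vector per utterance, and the array names, shapes, placeholder data, and SVM hyperparameters are illustrative assumptions rather than the authors' actual configuration.

```python
# Minimal sketch of score-level fusion with an SVM, assuming each independently
# trained model (speech-only, text-only, speech+text) already outputs an
# 8-class score vector per utterance. Random placeholder data stands in for
# real model scores and labels.
import numpy as np
from sklearn.metrics import f1_score
from sklearn.svm import SVC

n_train, n_dev, n_classes, n_models = 1000, 200, 8, 3
rng = np.random.default_rng(0)

# Hypothetical per-model score matrices of shape (n_samples, n_classes).
train_scores = [rng.random((n_train, n_classes)) for _ in range(n_models)]
dev_scores = [rng.random((n_dev, n_classes)) for _ in range(n_models)]
y_train = rng.integers(0, n_classes, n_train)
y_dev = rng.integers(0, n_classes, n_dev)

# Score-level fusion: concatenate the per-model class scores into one feature
# vector per utterance and train an SVM classifier on top of them.
X_train = np.concatenate(train_scores, axis=1)
X_dev = np.concatenate(dev_scores, axis=1)
fusion = SVC(kernel="rbf", C=1.0)
fusion.fit(X_train, y_train)

# The challenge metric is F1-macro, reported here on the development set.
pred = fusion.predict(X_dev)
print("F1-macro (dev):", f1_score(y_dev, pred, average="macro"))
```

Concatenating the scores, rather than averaging them, lets the fusion classifier learn how much to trust each model for each emotion class.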
Related papers
- Multimodal Emotion Recognition with Vision-language Prompting and Modality Dropout [5.721743498917423]
We introduce EmoVCLIP, a model fine-tuned from CLIP.
We employ modality dropout for robust information fusion (a minimal sketch of this idea appears after this list).
Lastly, we utilize a self-training strategy to leverage unlabeled videos.
arXiv Detail & Related papers (2024-09-11T08:06:47Z)
- Beyond Silent Letters: Amplifying LLMs in Emotion Recognition with Vocal Nuances [3.396456345114466]
We propose SpeechCueLLM, a method that translates speech characteristics into natural language descriptions.
We evaluate SpeechCueLLM on two datasets: IEMOCAP and MELD, showing significant improvements in emotion recognition accuracy.
arXiv Detail & Related papers (2024-07-31T03:53:14Z)
- Attention-based Interactive Disentangling Network for Instance-level Emotional Voice Conversion [81.1492897350032]
Emotional Voice Conversion aims to manipulate speech according to a given emotion while preserving its non-emotion components.
We propose an Attention-based Interactive diseNtangling Network (AINN) that leverages instance-wise emotional knowledge for voice conversion.
arXiv Detail & Related papers (2023-12-29T08:06:45Z)
- Unsupervised Representations Improve Supervised Learning in Speech Emotion Recognition [1.3812010983144798]
This study proposes an innovative approach that integrates self-supervised feature extraction with supervised classification for emotion recognition from small audio segments.
In the preprocessing step, we employed a self-supervised feature extractor, based on the Wav2Vec model, to capture acoustic features from audio data.
Then, the output feature maps of the preprocessing step are fed to a custom-designed Convolutional Neural Network (CNN)-based model to perform emotion classification.
arXiv Detail & Related papers (2023-09-22T08:54:06Z)
- VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning [119.49605266839053]
We propose a unified cross-modal representation learning framework, VATLM (Visual-Audio-Text Language Model).
The proposed VATLM employs a unified backbone network to model the modality-independent information.
In order to integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task of unified tokens.
arXiv Detail & Related papers (2022-11-21T09:10:10Z)
- MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound [90.1857707251566]
We introduce MERLOT Reserve, a model that represents videos jointly over time.
We replace snippets of text and audio with a MASK token; the model learns by choosing the correct masked-out snippet.
Our objective learns faster than alternatives, and performs well at scale.
arXiv Detail & Related papers (2022-01-07T19:00:21Z)
- WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing [102.45426364965887]
We propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks.
WavLM is built on the HuBERT framework, with an emphasis on both spoken content modeling and speaker identity preservation.
We scale up the training dataset from 60k hours to 94k hours of public audio data, and optimize its training procedure for better representation extraction.
arXiv Detail & Related papers (2021-10-26T17:55:19Z)
- Multimodal Emotion Recognition with High-level Speech and Text Features [8.141157362639182]
We propose a novel cross-representation speech model to perform emotion recognition on wav2vec 2.0 speech features.
We also train a CNN-based model to recognize emotions from text features extracted with Transformer-based models.
Our method is evaluated on the IEMOCAP dataset in a 4-class classification problem.
arXiv Detail & Related papers (2021-09-29T07:08:40Z)
- An Attribute-Aligned Strategy for Learning Speech Representation [57.891727280493015]
We propose an attribute-aligned learning strategy to derive speech representations that can be flexibly adapted through an attribute-selection mechanism.
Specifically, we propose a layered-representation variational autoencoder (LR-VAE), which factorizes speech representation into attribute-sensitive nodes.
Our proposed method achieves competitive performance on identity-free SER and better performance on emotionless SV.
arXiv Detail & Related papers (2021-06-05T06:19:14Z)
- Does Visual Self-Supervision Improve Learning of Speech Representations for Emotion Recognition? [63.564385139097624]
This work investigates visual self-supervision via face reconstruction to guide the learning of audio representations.
We show that a multi-task combination of the proposed visual and audio self-supervision is beneficial for learning richer features.
We evaluate our learned audio representations for discrete emotion recognition, continuous affect recognition and automatic speech recognition.
arXiv Detail & Related papers (2020-05-04T11:33:40Z)
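The EmoVCLIP entry above mentions modality dropout for robust information fusion; the sketch below illustrates the general idea in PyTorch. It is not that paper's implementation: the embedding size, number of modalities, dropout probability, and 8-class output are made-up assumptions for the example.

```python
# Minimal sketch of modality dropout in a simple late-fusion classifier.
# During training, entire modality embeddings are randomly zeroed out so the
# classifier cannot over-rely on any single modality.
import torch
import torch.nn as nn


class ModalityDropoutFusion(nn.Module):
    def __init__(self, dim: int = 256, n_modalities: int = 3, p_drop: float = 0.3):
        super().__init__()
        self.p_drop = p_drop
        self.classifier = nn.Linear(dim * n_modalities, 8)  # 8 emotion classes (assumed)

    def forward(self, embeddings: list) -> torch.Tensor:
        if self.training:
            dropped = []
            for emb in embeddings:
                # Keep each modality with probability (1 - p_drop), per sample.
                keep = (torch.rand(emb.size(0), 1, device=emb.device) > self.p_drop).float()
                dropped.append(emb * keep)
            embeddings = dropped
        # Late fusion: concatenate the (possibly masked) modality embeddings.
        return self.classifier(torch.cat(embeddings, dim=-1))


# Usage with hypothetical speech, text, and video embeddings (batch size 4).
model = ModalityDropoutFusion()
speech, text, video = torch.randn(4, 256), torch.randn(4, 256), torch.randn(4, 256)
logits = model([speech, text, video])
print(logits.shape)  # torch.Size([4, 8])
```

Real implementations often guarantee that at least one modality survives the dropout step; that refinement is omitted here to keep the sketch short.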
This list is automatically generated from the titles and abstracts of the papers on this site.