Self-Relation Attention and Temporal Awareness for Emotion Recognition via Vocal Burst
- URL: http://arxiv.org/abs/2209.07629v1
- Date: Thu, 15 Sep 2022 22:06:42 GMT
- Title: Self-Relation Attention and Temporal Awareness for Emotion Recognition via Vocal Burst
- Authors: Dang-Linh Trinh, Minh-Cong Vo, Guee-Sang Lee
- Abstract summary: This technical report presents our emotion recognition pipeline for the high-dimensional emotion task (A-VB High) in the ACII Affective Vocal Bursts (A-VB) 2022 Workshop & Competition.
In empirical experiments, our proposed method achieves a mean concordance correlation coefficient (CCC) of 0.7295 on the test set, compared to 0.5686 for the baseline model.
- Score: 4.6193503399184275
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This technical report presents our emotion recognition pipeline for the
high-dimensional emotion task (A-VB High) in the ACII Affective Vocal Bursts
(A-VB) 2022 Workshop & Competition. Our proposed method consists of three stages.
Firstly, we extract the latent features from the raw audio signal and its
Mel-spectrogram by self-supervised learning methods. Then, the features from
the raw signal are fed to the self-relation attention and temporal awareness
(SA-TA) module, which learns the valuable interactions among these latent
features. Finally, we concatenate all the features and utilize a
fully-connected layer to predict each emotion's score. In empirical
experiments, our proposed method achieves a mean concordance correlation
coefficient (CCC) of 0.7295 on the test set, compared to 0.5686 for the baseline
model. The code of our method is available at
https://github.com/linhtd812/A-VB2022.
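As a rough illustration of the three-stage pipeline described above, the sketch below passes frame-level SSL features of the raw waveform through a multi-head self-attention block with temporal mean pooling (a stand-in for the SA-TA module), concatenates the result with utterance-level Mel-spectrogram features, and predicts per-emotion scores with a fully-connected layer; a small helper computes the concordance correlation coefficient used for evaluation. This is a minimal sketch, not the authors' released implementation at https://github.com/linhtd812/A-VB2022: the names SATAHead, raw_dim, spec_dim, and num_emotions are illustrative, and the sigmoid output assumes emotion scores normalized to [0, 1].

```python
import torch
import torch.nn as nn


def concordance_cc(pred: torch.Tensor, gold: torch.Tensor) -> torch.Tensor:
    """Concordance correlation coefficient for one emotion dimension:
    CCC = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2)."""
    pred_mean, gold_mean = pred.mean(), gold.mean()
    cov = ((pred - pred_mean) * (gold - gold_mean)).mean()
    return 2 * cov / (pred.var(unbiased=False) + gold.var(unbiased=False)
                      + (pred_mean - gold_mean) ** 2)


class SATAHead(nn.Module):
    """Illustrative stand-in for the SA-TA stage: multi-head self-attention over
    frame-level SSL features of the raw waveform ("self-relation attention"),
    mean pooling over time ("temporal awareness"), concatenation with
    utterance-level Mel-spectrogram features, and a fully-connected scorer."""

    def __init__(self, raw_dim: int, spec_dim: int, num_emotions: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(raw_dim, heads, batch_first=True)
        self.head = nn.Linear(raw_dim + spec_dim, num_emotions)

    def forward(self, raw_feats: torch.Tensor, spec_feats: torch.Tensor) -> torch.Tensor:
        # raw_feats: (batch, frames, raw_dim) SSL features of the raw signal
        # spec_feats: (batch, spec_dim) pooled SSL features of the Mel-spectrogram
        attended, _ = self.attn(raw_feats, raw_feats, raw_feats)
        pooled = attended.mean(dim=1)                    # aggregate over time
        fused = torch.cat([pooled, spec_feats], dim=-1)  # concatenate all features
        return torch.sigmoid(self.head(fused))           # per-emotion scores in [0, 1]


# Toy usage with random tensors standing in for extracted features;
# 10 output dimensions correspond to the A-VB High emotion set.
model = SATAHead(raw_dim=768, spec_dim=256, num_emotions=10)
scores = model(torch.randn(4, 100, 768), torch.randn(4, 256))
print(scores.shape)                                      # torch.Size([4, 10])
print(concordance_cc(scores[:, 0], torch.rand(4)))       # CCC for one dimension
```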
Related papers
- EmotionIC: emotional inertia and contagion-driven dependency modeling for emotion recognition in conversation [34.24557248359872]
We propose an emotional inertia and contagion-driven dependency modeling approach (EmotionIC) for ERC task.
Our EmotionIC consists of three main components: Identity Masked Multi-Head Attention (IMMHA), Dialogue-based Gated Recurrent Unit (DiaGRU), and Skip-chain Conditional Random Field (SkipCRF).
Experimental results show that our method can significantly outperform the state-of-the-art models on four benchmark datasets.
arXiv Detail & Related papers (2023-03-20T13:58:35Z)
- A Hierarchical Regression Chain Framework for Affective Vocal Burst Recognition [72.36055502078193]
We propose a hierarchical framework, based on chain regression models, for affective recognition from vocal bursts.
To address the challenge of data sparsity, we also use self-supervised learning (SSL) representations with layer-wise and temporal aggregation modules.
The proposed systems participated in the ACII Affective Vocal Burst (A-VB) Challenge 2022 and ranked first in the "TWO" and "CULTURE" tasks.
arXiv Detail & Related papers (2023-03-14T16:08:45Z)
- Jointly Learning Visual and Auditory Speech Representations from Raw Data [108.68531445641769]
RAVEn is a self-supervised multi-modal approach to jointly learn visual and auditory speech representations.
Our design is asymmetric with respect to the two modalities, driven by the inherent differences between video and audio.
RAVEn surpasses all self-supervised methods on visual speech recognition.
arXiv Detail & Related papers (2022-12-12T21:04:06Z)
- Continuous Emotion Recognition using Visual-audio-linguistic information: A Technical Report for ABAW3 [15.077019278082673]
A cross-modal co-attention model is used for continuous emotion recognition.
Visual, audio, and linguistic blocks are used to learn the features of the multimodal input.
Cross-validation is carried out on the training and validation sets.
arXiv Detail & Related papers (2022-03-24T12:18:06Z)
- An Attention-based Method for Action Unit Detection at the 3rd ABAW Competition [6.229820412732652]
This paper describes our submission to the third Affective Behavior Analysis in-the-wild (ABAW) competition 2022.
We propose a method for detecting facial action units in videos.
We achieved a macro F1 score of 0.48 on the ABAW challenge validation set compared to 0.39 from the baseline model.
arXiv Detail & Related papers (2022-03-23T14:07:39Z)
- Sentiment-Aware Automatic Speech Recognition pre-training for enhanced Speech Emotion Recognition [11.760166084942908]
We propose a novel multi-task pre-training method for Speech Emotion Recognition (SER).
We pre-train the SER model simultaneously on Automatic Speech Recognition (ASR) and sentiment classification tasks.
We generate targets for sentiment classification using a text-to-sentiment model trained on publicly available data.
arXiv Detail & Related papers (2022-01-27T22:20:28Z)
- On the use of Self-supervised Pre-trained Acoustic and Linguistic Features for Continuous Speech Emotion Recognition [2.294014185517203]
We use wav2vec and camemBERT as self-supervised learned models to represent our data in order to perform continuous emotion recognition from speech.
To the authors' knowledge, this paper presents the first study showing that the joint use of wav2vec and BERT-like pre-trained features is highly relevant for the continuous SER task.
arXiv Detail & Related papers (2020-11-18T11:10:29Z)
- Emotions Don't Lie: An Audio-Visual Deepfake Detection Method Using Affective Cues [75.1731999380562]
We present a learning-based method for detecting real and fake deepfake multimedia content.
We extract and analyze the similarity between the audio and visual modalities within the same video.
We compare our approach with several SOTA deepfake detection methods and report a per-video AUC of 84.4% on the DFDC and 96.6% on the DF-TIMIT datasets.
arXiv Detail & Related papers (2020-03-14T22:07:26Z)
- EmotiCon: Context-Aware Multimodal Emotion Recognition using Frege's Principle [71.47160118286226]
We present EmotiCon, a learning-based algorithm for context-aware perceived human emotion recognition from videos and images.
Motivated by Frege's Context Principle from psychology, our approach combines three interpretations of context for emotion recognition.
We report an Average Precision (AP) score of 35.48 across 26 classes, which is an improvement of 7-8 over prior methods.
arXiv Detail & Related papers (2020-03-14T19:55:21Z)
- End-to-End Automatic Speech Recognition Integrated With CTC-Based Voice Activity Detection [48.80449801938696]
This paper integrates a voice activity detection function with end-to-end automatic speech recognition.
We focus on connectionist temporal classification (CTC) and its attention-based extensions.
We use the labels as a cue for detecting speech segments with simple thresholding (a minimal sketch appears after this list).
arXiv Detail & Related papers (2020-02-03T03:36:34Z)
- Take an Emotion Walk: Perceiving Emotions from Gaits Using Hierarchical Attention Pooling and Affective Mapping [55.72376663488104]
We present an autoencoder-based approach to classify perceived human emotions from walking styles obtained from videos or motion-captured data.
Given the motion of each joint at each time step, extracted from 3D pose sequences, we hierarchically pool these joint motions in the encoder.
We train the decoder to reconstruct the motions per joint per time step in a top-down manner from the latent embeddings.
arXiv Detail & Related papers (2019-11-20T05:04:16Z)
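The CTC-based voice activity detection entry above mentions using the labels as a cue for detecting speech segments with simple thresholding. The sketch below shows one plausible reading of that idea, thresholding the per-frame non-blank probability mass of a CTC model; it is an illustrative assumption, not the authors' exact recipe, and the name ctc_speech_segments and the blank_id, threshold, and min_frames parameters are hypothetical.

```python
import numpy as np


def ctc_speech_segments(log_probs: np.ndarray, blank_id: int = 0,
                        threshold: float = 0.5, min_frames: int = 3):
    """Derive speech segments from per-frame CTC posteriors by thresholding
    the non-blank probability mass (a hypothetical reading of the method).

    log_probs: (T, V) log-softmax output of a CTC acoustic model.
    Returns a list of (start_frame, end_frame) pairs, end exclusive."""
    probs = np.exp(log_probs)                          # (T, V) frame posteriors
    speech = (1.0 - probs[:, blank_id]) > threshold    # speech if enough mass is non-blank
    segments, start = [], None
    for t, is_speech in enumerate(speech):
        if is_speech and start is None:
            start = t                                  # segment begins
        elif not is_speech and start is not None:
            if t - start >= min_frames:                # drop very short blips
                segments.append((start, t))
            start = None
    if start is not None and len(speech) - start >= min_frames:
        segments.append((start, len(speech)))
    return segments


# Toy usage: 200 frames over a 30-symbol vocabulary plus blank at index 0.
dummy = np.log(np.random.dirichlet(np.ones(31), size=200))
print(ctc_speech_segments(dummy))
```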