Self-Relation Attention and Temporal Awareness for Emotion Recognition
via Vocal Burst
- URL: http://arxiv.org/abs/2209.07629v1
- Date: Thu, 15 Sep 2022 22:06:42 GMT
- Title: Self-Relation Attention and Temporal Awareness for Emotion Recognition
via Vocal Burst
- Authors: Dang-Linh Trinh, Minh-Cong Vo, Guee-Sang Lee
- Abstract summary: This technical report presents our emotion recognition pipeline for the high-dimensional emotion task (A-VB High) in The ACII Affective Vocal Bursts (A-VB) 2022 Workshop & Competition.
In empirical experiments, our proposed method achieves a mean concordance correlation coefficient (CCC) of 0.7295 on the test set, compared to 0.5686 for the baseline model.
- Score: 4.6193503399184275
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This technical report presents our emotion recognition pipeline for the
high-dimensional emotion task (A-VB High) in The ACII Affective Vocal Bursts
(A-VB) 2022 Workshop & Competition. Our proposed method contains three stages.
First, we extract latent features from the raw audio signal and its
Mel-spectrogram using self-supervised learning methods. Then, the features from
the raw signal are fed to the self-relation attention and temporal awareness
(SA-TA) module to learn valuable information across these latent
features. Finally, we concatenate all the features and utilize a
fully-connected layer to predict each emotion's score. In empirical
experiments, our proposed method achieves a mean concordance correlation
coefficient (CCC) of 0.7295 on the test set, compared to 0.5686 for the baseline
model. The code of our method is available at
https://github.com/linhtd812/A-VB2022.
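
To make the three-stage pipeline above concrete, the following is a minimal PyTorch sketch, not the authors' implementation (that is in the linked repository). The feature dimensions, the use of a standard multi-head self-attention layer as a stand-in for the SA-TA module, the temporal mean pooling, and the assumption of 10 emotion scores in [0, 1] are illustrative choices; the ccc helper corresponds to the concordance correlation coefficient used as the evaluation metric.

# Minimal sketch of the pipeline described in the abstract (not the authors' code).
# Assumed: 768-dim SSL features from the raw waveform, 512-dim pooled SSL features
# from the Mel-spectrogram, nn.MultiheadAttention standing in for the SA-TA module.
import torch
import torch.nn as nn

def ccc(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Concordance correlation coefficient between two 1-D score vectors."""
    pred_mean, target_mean = pred.mean(), target.mean()
    pred_var = pred.var(unbiased=False)
    target_var = target.var(unbiased=False)
    covariance = ((pred - pred_mean) * (target - target_mean)).mean()
    return 2 * covariance / (pred_var + target_var + (pred_mean - target_mean) ** 2 + 1e-8)

class EmotionRegressor(nn.Module):
    """Raw-signal features -> attention block; concatenated with spectrogram
    features -> fully-connected head that outputs one score per emotion."""

    def __init__(self, raw_dim: int = 768, spec_dim: int = 512, num_emotions: int = 10):
        super().__init__()
        # Stand-in for the self-relation attention / temporal awareness (SA-TA) module.
        self.attention = nn.MultiheadAttention(raw_dim, num_heads=8, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(raw_dim + spec_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_emotions),
            nn.Sigmoid(),  # emotion intensities assumed to lie in [0, 1]
        )

    def forward(self, raw_feats: torch.Tensor, spec_feats: torch.Tensor) -> torch.Tensor:
        # raw_feats: (batch, time, raw_dim), spec_feats: (batch, spec_dim)
        attended, _ = self.attention(raw_feats, raw_feats, raw_feats)
        pooled = attended.mean(dim=1)  # temporal average pooling
        return self.head(torch.cat([pooled, spec_feats], dim=-1))

if __name__ == "__main__":
    model = EmotionRegressor()
    scores = model(torch.randn(4, 100, 768), torch.randn(4, 512))
    print(scores.shape)                      # torch.Size([4, 10])
    print(ccc(scores[:, 0], torch.rand(4)))  # per-emotion CCC on toy targets

The reported mean CCC is then the per-emotion CCC averaged over the emotion dimensions of the test set.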
Related papers
- EmotionIC: emotional inertia and contagion-driven dependency modeling for emotion recognition in conversation [34.24557248359872]
We propose an emotional inertia and contagion-driven dependency modeling approach (EmotionIC) for the emotion recognition in conversation (ERC) task.
Our EmotionIC consists of three main components: Identity Masked Multi-Head Attention (IMMHA), Dialogue-based Gated Recurrent Unit (DiaGRU), and Skip-chain Conditional Random Field (SkipCRF).
Experimental results show that our method can significantly outperform the state-of-the-art models on four benchmark datasets.
arXiv Detail & Related papers (2023-03-20T13:58:35Z)
- A Hierarchical Regression Chain Framework for Affective Vocal Burst Recognition [72.36055502078193]
We propose a hierarchical framework, based on chain regression models, for affective recognition from vocal bursts.
To address the challenge of data sparsity, we also use self-supervised learning (SSL) representations with layer-wise and temporal aggregation modules.
The proposed systems participated in the ACII Affective Vocal Burst (A-VB) Challenge 2022 and ranked first in the "TWO" and "CULTURE" tasks.
arXiv Detail & Related papers (2023-03-14T16:08:45Z)
- Jointly Learning Visual and Auditory Speech Representations from Raw Data [108.68531445641769]
RAVEn is a self-supervised multi-modal approach to jointly learn visual and auditory speech representations.
Our design is asymmetric with respect to the two modalities, driven by the inherent differences between video and audio.
RAVEn surpasses all self-supervised methods on visual speech recognition.
arXiv Detail & Related papers (2022-12-12T21:04:06Z)
- Continuous Emotion Recognition using Visual-audio-linguistic information: A Technical Report for ABAW3 [15.077019278082673]
A cross-modal co-attention model is used for continuous emotion recognition.
Visual, audio, and linguistic blocks are used to learn the features of the multimodal input.
Cross-validation is carried out on the training and validation sets.
arXiv Detail & Related papers (2022-03-24T12:18:06Z)
- An Attention-based Method for Action Unit Detection at the 3rd ABAW Competition [6.229820412732652]
This paper describes our submission to the third Affective Behavior Analysis in-the-wild (ABAW) competition 2022.
We propose a method for detecting facial action units in videos.
We achieved a macro F1 score of 0.48 on the ABAW challenge validation set compared to 0.39 from the baseline model.
arXiv Detail & Related papers (2022-03-23T14:07:39Z)
- Sentiment-Aware Automatic Speech Recognition pre-training for enhanced Speech Emotion Recognition [11.760166084942908]
We propose a novel multi-task pre-training method for Speech Emotion Recognition (SER)
We pre-train the SER model simultaneously on Automatic Speech Recognition (ASR) and sentiment classification tasks.
Targets for sentiment classification are generated with a text-to-sentiment model trained on publicly available data.
arXiv Detail & Related papers (2022-01-27T22:20:28Z)
- HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units [81.53783563025084]
We propose an offline clustering step to provide aligned target labels for a BERT-like prediction loss.
A key ingredient of our approach is applying the prediction loss over the masked regions only.
HuBERT shows up to 19% and 13% relative WER reduction on the more challenging dev-other and test-other evaluation subsets.
arXiv Detail & Related papers (2021-06-14T14:14:28Z)
- On the use of Self-supervised Pre-trained Acoustic and Linguistic Features for Continuous Speech Emotion Recognition [2.294014185517203]
We use wav2vec and camemBERT as self-supervised learned models to represent our data in order to perform continuous emotion recognition from speech.
To the authors' knowledge, this paper presents the first study showing that the joint use of wav2vec and BERT-like pre-trained features is highly relevant for the continuous SER task.
arXiv Detail & Related papers (2020-11-18T11:10:29Z)
- Emotions Don't Lie: An Audio-Visual Deepfake Detection Method Using Affective Cues [75.1731999380562]
We present a learning-based method for detecting real and fake deepfake multimedia content.
We extract and analyze the similarity between the audio and visual modalities within the same video.
We compare our approach with several SOTA deepfake detection methods and report per-video AUC of 84.4% on the DFDC and 96.6% on the DF-TIMIT datasets.
arXiv Detail & Related papers (2020-03-14T22:07:26Z)
- EmotiCon: Context-Aware Multimodal Emotion Recognition using Frege's Principle [71.47160118286226]
We present EmotiCon, a learning-based algorithm for context-aware perceived human emotion recognition from videos and images.
Motivated by Frege's Context Principle from psychology, our approach combines three interpretations of context for emotion recognition.
We report an Average Precision (AP) score of 35.48 across 26 classes, which is an improvement of 7-8 over prior methods.
arXiv Detail & Related papers (2020-03-14T19:55:21Z)
- End-to-End Automatic Speech Recognition Integrated With CTC-Based Voice Activity Detection [48.80449801938696]
This paper integrates a voice activity detection function with end-to-end automatic speech recognition.
We focus on connectionist temporal classification (CTC) and its extension, the CTC/attention architecture.
We use the labels as a cue for detecting speech segments with simple thresholding.
arXiv Detail & Related papers (2020-02-03T03:36:34Z)
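
The last entry lends itself to a short illustration. Below is a rough NumPy sketch of voice activity detection by simple thresholding of frame-level CTC posteriors, assuming that a high blank probability marks non-speech; the blank index, threshold, and minimum-duration values are illustrative assumptions, not the paper's settings.

# Rough sketch of CTC-based voice activity detection via simple thresholding.
# Assumptions (not from the paper): blank index 0, blank-probability threshold 0.9,
# and a minimum speech-run length of 5 frames.
import numpy as np

def ctc_vad(log_probs: np.ndarray, blank_id: int = 0,
            blank_threshold: float = 0.9, min_speech_frames: int = 5):
    """Return (start, end) frame indices of detected speech segments.

    log_probs: (T, V) frame-level log posteriors from a CTC acoustic model.
    A frame is treated as non-speech when its blank posterior exceeds
    blank_threshold; speech runs shorter than min_speech_frames are dropped.
    """
    blank_prob = np.exp(log_probs[:, blank_id])
    is_speech = blank_prob < blank_threshold

    segments, start = [], None
    for t, speech in enumerate(is_speech):
        if speech and start is None:
            start = t
        elif not speech and start is not None:
            if t - start >= min_speech_frames:
                segments.append((start, t))
            start = None
    if start is not None and len(is_speech) - start >= min_speech_frames:
        segments.append((start, len(is_speech)))
    return segments

if __name__ == "__main__":
    # Toy posteriors: 200 frames, 30 symbols; the blank dominates everywhere
    # except frames 50-119, which emulate a speech region.
    T, V = 200, 30
    probs = np.full((T, V), 0.05 / (V - 1))
    probs[:, 0] = 0.95
    probs[50:120, 1:] = 0.95 / (V - 1)
    probs[50:120, 0] = 0.05
    print(ctc_vad(np.log(probs)))  # -> [(50, 120)]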