Continuous Emotion Recognition using Visual-audio-linguistic information: A Technical Report for ABAW3
- URL: http://arxiv.org/abs/2203.13031v1
- Date: Thu, 24 Mar 2022 12:18:06 GMT
- Title: Continuous Emotion Recognition using Visual-audio-linguistic information: A Technical Report for ABAW3
- Authors: Su Zhang, Ruyi An, Yi Ding, Cuntai Guan
- Abstract summary: Cross-modal co-attention model for continuous emotion recognition.
Visual, audio, and linguistic blocks are used to learn the features of the multimodal input.
Cross-validation is carried out on the training and validation set.
- Score: 15.077019278082673
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We propose a cross-modal co-attention model for continuous emotion
recognition using visual-audio-linguistic information. The model consists of
four blocks. The visual, audio, and linguistic blocks are used to learn the
spatial-temporal features of the multimodal input. A co-attention block is
designed to fuse the learned embeddings with the multihead co-attention
mechanism. The visual encoding from the visual block is concatenated with the
attention feature to emphasize the visual information. To make full use of the
data and alleviate over-fitting, cross-validation is carried out on the
training and validation set. The concordance correlation coefficient (CCC)
centering is used to merge the results from each fold. The achieved CCC on
validation set is 0.450 for valence and 0.651 for arousal, which significantly
outperforms the baseline method with the corresponding CCC of 0.310 and 0.170,
respectively. The code is available at https://github.com/sucv/ABAW3.
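As a rough illustration of the fusion step described above, the sketch below wires a multihead co-attention layer in PyTorch: the visual embedding serves as the query over the concatenated audio and linguistic embeddings, and the visual encoding is then concatenated with the resulting attention feature before a small regression head predicts valence and arousal. The feature dimensions, the single attention layer, and the regression head are illustrative assumptions rather than the authors' exact configuration; see the repository linked above for the official implementation.

    import torch
    import torch.nn as nn

    class CoAttentionFusion(nn.Module):
        """Illustrative co-attention fusion head (assumed dimensions, not the paper's exact design)."""

        def __init__(self, dim: int = 256, num_heads: int = 4):
            super().__init__()
            # Multihead attention used as co-attention: one modality queries the others.
            self.co_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            # The visual encoding is concatenated with the attention feature,
            # so the regression head sees 2 * dim channels per time step.
            self.head = nn.Sequential(
                nn.Linear(2 * dim, dim),
                nn.ReLU(),
                nn.Linear(dim, 2),  # valence and arousal per frame
            )

        def forward(self, visual, audio, linguistic):
            # Each input: (batch, time, dim) spatial-temporal embeddings from its block.
            context = torch.cat([audio, linguistic], dim=1)       # keys/values from the other modalities
            attn_out, _ = self.co_attn(visual, context, context)  # visual queries attend over audio + text
            fused = torch.cat([visual, attn_out], dim=-1)         # emphasize the visual information
            return self.head(fused)                               # (batch, time, 2)

    if __name__ == "__main__":
        vis, aud, lng = (torch.randn(2, 300, 256) for _ in range(3))  # e.g. 300 aligned time steps
        print(CoAttentionFusion()(vis, aud, lng).shape)               # torch.Size([2, 300, 2])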
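For reference, the CCC used above is the standard concordance correlation coefficient: for a label sequence $y$ and prediction $\hat{y}$ with means $\mu_y, \mu_{\hat{y}}$, standard deviations $\sigma_y, \sigma_{\hat{y}}$, and Pearson correlation $\rho$,

    \mathrm{CCC}(y, \hat{y}) = \frac{2 \rho \sigma_y \sigma_{\hat{y}}}{\sigma_y^2 + \sigma_{\hat{y}}^2 + (\mu_y - \mu_{\hat{y}})^2}.

A CCC of 1 requires agreement in correlation, scale, and mean, so unlike Pearson correlation it penalizes constant offsets between predictions and labels.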
Related papers
- KNN Transformer with Pyramid Prompts for Few-Shot Learning [52.735070934075736]
Few-Shot Learning aims to recognize new classes with limited labeled data.
Recent studies have attempted to address the challenge of rare samples with textual prompts to modulate visual features.
arXiv Detail & Related papers (2024-10-14T07:39:30Z)
- Locality-aware Cross-modal Correspondence Learning for Dense Audio-Visual Events Localization [50.122441710500055]
Dense-localization Audio-Visual Events (DAVE) aims to identify time boundaries and corresponding categories for events that can be heard and seen concurrently in an untrimmed video.
Existing methods typically encode audio and visual representation separately without any explicit cross-modal alignment constraint.
We present LOCO, a Locality-aware cross-modal Correspondence learning framework for DAVE.
arXiv Detail & Related papers (2024-09-12T11:54:25Z)
- Jointly Learning Visual and Auditory Speech Representations from Raw Data [108.68531445641769]
RAVEn is a self-supervised multi-modal approach to jointly learn visual and auditory speech representations.
Our design is asymmetric w.r.t. the two modalities, driven by the inherent differences between video and audio.
RAVEn surpasses all self-supervised methods on visual speech recognition.
arXiv Detail & Related papers (2022-12-12T21:04:06Z)
- Self-Relation Attention and Temporal Awareness for Emotion Recognition via Vocal Burst [4.6193503399184275]
The technical report presents our emotion recognition pipeline for the high-dimensional emotion task (A-VB High) in The ACII Affective Vocal Bursts (A-VB) 2022 Workshop & Competition.
In empirical experiments, our proposed method achieves a mean concordance correlation coefficient (CCC) of 0.7295 on the test set, compared to 0.5686 for the baseline model.
arXiv Detail & Related papers (2022-09-15T22:06:42Z)
- Learning Audio-Visual embedding for Wild Person Verification [18.488385598522125]
We propose an audio-visual network that considers the aggregator from a fusion perspective.
We introduce improved attentive statistics pooling to face verification for the first time.
Finally, the modalities are fused with a gated attention mechanism.
arXiv Detail & Related papers (2022-09-09T02:29:47Z)
- Cross-modal Representation Learning for Zero-shot Action Recognition [67.57406812235767]
We present a cross-modal Transformer-based framework, which jointly encodes video data and text labels for zero-shot action recognition (ZSAR).
Our model employs a conceptually new pipeline by which visual representations are learned in conjunction with visual-semantic associations in an end-to-end manner.
Experimental results show our model considerably improves upon the state of the art in ZSAR, reaching encouraging top-1 accuracy on the UCF101, HMDB51, and ActivityNet benchmark datasets.
arXiv Detail & Related papers (2022-05-03T17:39:27Z)
- Audio-visual Generalised Zero-shot Learning with Cross-modal Attention and Language [38.02396786726476]
We propose to learn multi-modal representations from audio-visual data using cross-modal attention.
In our generalised audio-visual zero-shot learning setting, we include all the training classes in the test-time search space.
Due to the lack of a unified benchmark in this domain, we introduce a (generalised) zero-shot learning benchmark on three audio-visual datasets.
arXiv Detail & Related papers (2022-03-07T18:52:13Z)
- Audio-visual Attentive Fusion for Continuous Emotion Recognition [12.211342881526276]
We propose an audio-visual spatial-temporal deep neural network with: (1) a visual block containing a pretrained 2D-CNN followed by a temporal convolutional network (TCN); (2) an aural block containing several parallel TCNs; and (3) a leader-follower attentive fusion block combining the audio-visual information.
arXiv Detail & Related papers (2021-07-02T16:28:55Z)
- End-to-end Audio-visual Speech Recognition with Conformers [65.30276363777514]
We present a hybrid CTC/Attention model based on a ResNet-18 and a Convolution-augmented Transformer (Conformer).
In particular, the audio and visual encoders learn to extract features directly from raw pixels and audio waveforms.
We show that our proposed models raise the state-of-the-art performance by a large margin in audio-only, visual-only, and audio-visual experiments.
arXiv Detail & Related papers (2021-02-12T18:00:08Z)
- X-Linear Attention Networks for Image Captioning [124.48670699658649]
We introduce a unified attention block, the X-Linear attention block, which fully employs bilinear pooling to selectively capitalize on visual information or perform multi-modal reasoning.
X-LAN integrates the X-Linear attention block into the image encoder and sentence decoder of an image captioning model to leverage higher-order intra- and inter-modal interactions.
Experiments on the COCO benchmark demonstrate that X-LAN obtains the best published CIDEr performance to date, 132.0% on the COCO Karpathy test split.
arXiv Detail & Related papers (2020-03-31T10:35:33Z)