Continuous-Time Audiovisual Fusion with Recurrence vs. Attention for
In-The-Wild Affect Recognition
- URL: http://arxiv.org/abs/2203.13285v1
- Date: Thu, 24 Mar 2022 18:22:56 GMT
- Title: Continuous-Time Audiovisual Fusion with Recurrence vs. Attention for
In-The-Wild Affect Recognition
- Authors: Vincent Karas, Mani Kumar Tellamekala, Adria Mallol-Ragolta, Michel
Valstar, Björn W. Schuller
- Abstract summary: We present our submission to the 3rd Affective Behavior Analysis in-the-wild (ABAW) challenge.
Recurrence and attention are the two widely used sequence modelling mechanisms in the literature.
We show that LSTM-RNNs can outperform the attention models when coupled with low-complex CNN backbones.
- Score: 4.14099371030604
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present our submission to the 3rd Affective Behavior
Analysis in-the-wild (ABAW) challenge. Learning complex interactions among multimodal
sequences is critical to recognise dimensional affect from in-the-wild
audiovisual data. Recurrence and attention are the two widely used sequence
modelling mechanisms in the literature. To clearly understand the performance
differences between recurrent and attention models in audiovisual affect
recognition, we present a comprehensive evaluation of fusion models based on
LSTM-RNNs, self-attention and cross-modal attention, trained for valence and
arousal estimation. In particular, we study the impact of key design choices: the
modelling complexity of the CNN backbones that provide features to the temporal
models, with and without end-to-end learning. We trained the audiovisual affect
recognition models on the in-the-wild ABAW corpus by
systematically tuning the hyper-parameters involved in the network architecture
design and training optimisation. Our extensive evaluation of the audiovisual
fusion models shows that LSTM-RNNs can outperform the attention models when
coupled with low-complex CNN backbones and trained in an end-to-end fashion,
implying that attention models may not necessarily be the optimal choice for
continuous-time multimodal emotion recognition.
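To make the comparison concrete, here is a minimal, illustrative sketch (not the authors' implementation) of the two kinds of fusion heads the abstract contrasts: an LSTM-RNN head that models the concatenated per-frame audio and visual features over time, and a cross-modal attention head in which each modality attends to the other. All feature dimensions, layer sizes, and module names are assumptions; in the end-to-end setting described above, the per-frame features would come from trainable CNN backbones rather than being precomputed.

```python
# Hedged sketch of audiovisual fusion heads for frame-level valence/arousal
# regression. Dimensions and names are illustrative assumptions, not the
# authors' code.
import torch
import torch.nn as nn


class LSTMFusion(nn.Module):
    """Concatenate per-frame audio and visual features; model time with a BiLSTM."""
    def __init__(self, d_audio=40, d_video=512, d_hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(d_audio + d_video, d_hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * d_hidden, 2)   # valence and arousal per frame

    def forward(self, audio, video):              # (B, T, d_audio), (B, T, d_video)
        fused, _ = self.lstm(torch.cat([audio, video], dim=-1))
        return torch.tanh(self.head(fused))       # (B, T, 2), values in [-1, 1]


class CrossModalAttentionFusion(nn.Module):
    """Each modality attends to the other; the attended streams are concatenated."""
    def __init__(self, d_audio=40, d_video=512, d_model=128, n_heads=4):
        super().__init__()
        self.proj_a = nn.Linear(d_audio, d_model)
        self.proj_v = nn.Linear(d_video, d_model)
        self.attn_av = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_va = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.head = nn.Linear(2 * d_model, 2)

    def forward(self, audio, video):
        a, v = self.proj_a(audio), self.proj_v(video)
        v_att_a, _ = self.attn_av(query=v, key=a, value=a)   # video queries audio
        a_att_v, _ = self.attn_va(query=a, key=v, value=v)   # audio queries video
        return torch.tanh(self.head(torch.cat([v_att_a, a_att_v], dim=-1)))


if __name__ == "__main__":
    audio = torch.randn(4, 100, 40)    # e.g. 100 frames of log-Mel features
    video = torch.randn(4, 100, 512)   # e.g. per-frame CNN backbone embeddings
    for model in (LSTMFusion(), CrossModalAttentionFusion()):
        print(type(model).__name__, model(audio, video).shape)   # (4, 100, 2)
```

Both heads produce frame-level valence/arousal trajectories and would typically be trained with a concordance-correlation-coefficient or MSE loss; swapping the heads while keeping the backbones fixed is one way to reproduce the kind of recurrence-vs-attention comparison the abstract reports.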
Related papers
- Audio-Visual Person Verification based on Recursive Fusion of Joint Cross-Attention [3.5803801804085347]
We introduce a joint cross-attentional model, where a joint audio-visual feature representation is employed in the cross-attention framework.
We also explore BLSTMs to improve the temporal modeling of audio-visual feature representations.
Results indicate that the proposed model shows promising improvement in fusion performance by adeptly capturing the intra- and inter-modal relationships.
arXiv Detail & Related papers (2024-03-07T16:57:45Z)
- Unveiling the Power of Audio-Visual Early Fusion Transformers with Dense
Interactions through Masked Modeling [24.346868432774453]
Humans possess a remarkable ability to integrate auditory and visual information, enabling a deeper understanding of the surrounding environment.
This early fusion of audio and visual cues, demonstrated through cognitive psychology and neuroscience research, offers promising potential for developing multimodal perception models.
We address training early fusion architectures by leveraging the masked reconstruction framework, previously successful in unimodal settings, to train audio-visual encoders with early fusion.
We propose an attention-based fusion module that captures interactions between local audio and visual representations, enhancing the model's ability to capture fine-grained interactions.
arXiv Detail & Related papers (2023-12-02T03:38:49Z)
- Mapping EEG Signals to Visual Stimuli: A Deep Learning Approach to Match
vs. Mismatch Classification [28.186129896907694]
We propose a "match-vs-mismatch" deep learning model to classify whether a video clip induces excitatory responses in recorded EEG signals.
We demonstrate that the proposed model is able to achieve the highest accuracy on unseen subjects.
These results have the potential to facilitate the development of neural recording-based video reconstruction.
arXiv Detail & Related papers (2023-09-08T06:37:25Z)
- Multimodal Imbalance-Aware Gradient Modulation for Weakly-supervised
Audio-Visual Video Parsing [107.031903351176]
Weakly-supervised audio-visual video parsing (WS-AVVP) aims to localize the temporal extents of audio, visual and audio-visual event instances and to identify the corresponding event categories with only video-level category labels for training.
arXiv Detail & Related papers (2023-07-05T05:55:10Z)
- Recursive Joint Attention for Audio-Visual Fusion in Regression based
Emotion Recognition [15.643176705932396]
In video-based emotion recognition, it is important to leverage the complementary relationship between the audio (A) and visual (V) modalities.
In this paper, we investigate the possibility of exploiting the complementary nature of A and V modalities using a joint cross-attention model.
Our model can efficiently leverage both intra- and inter-modal relationships for the fusion of A and V modalities.
arXiv Detail & Related papers (2023-04-17T02:57:39Z)
- Leaping Into Memories: Space-Time Deep Feature Synthesis [93.10032043225362]
We propose LEAPS, an architecture-independent method for synthesizing videos from internal models.
We quantitatively and qualitatively evaluate the applicability of LEAPS by inverting a range of convolutional and attention-based architectures on Kinetics-400.
arXiv Detail & Related papers (2023-03-17T12:55:22Z)
- Temporal Relevance Analysis for Video Action Models [70.39411261685963]
We first propose a new approach to quantify the temporal relationships between frames captured by CNN-based action models.
We then conduct comprehensive experiments and in-depth analysis to provide a better understanding of how temporal modeling is affected.
arXiv Detail & Related papers (2022-04-25T19:06:48Z)
- Self-Regulated Learning for Egocentric Video Activity Anticipation [147.9783215348252]
Self-Regulated Learning (SRL) aims to regulate the intermediate representation consecutively to produce a representation that emphasizes the novel information in the frame at the current time-stamp.
SRL sharply outperforms the existing state of the art in most cases on two egocentric video datasets and two third-person video datasets.
arXiv Detail & Related papers (2021-11-23T03:29:18Z)
- Efficient Modelling Across Time of Human Actions and Interactions [92.39082696657874]
We argue that the current fixed-sized temporal kernels in 3D convolutional neural networks (CNNs) can be improved to better deal with temporal variations in the input.
We study how we can better discriminate between classes of actions by enhancing their feature differences over different layers of the architecture.
The proposed approaches are evaluated on several benchmark action recognition datasets and show competitive results.
arXiv Detail & Related papers (2021-10-05T15:39:11Z)
- On the benefits of robust models in modulation recognition [53.391095789289736]
Deep Neural Networks (DNNs) using convolutional layers are state-of-the-art in many tasks in communications.
In other domains, like image classification, DNNs have been shown to be vulnerable to adversarial perturbations.
We propose a novel framework to test the robustness of current state-of-the-art models.
arXiv Detail & Related papers (2021-03-27T19:58:06Z)
- The Role of Isomorphism Classes in Multi-Relational Datasets [6.419762264544509]
We show that isomorphism leakage overestimates performance in multi-relational inference.
We propose isomorphism-aware synthetic benchmarks for model evaluation.
We also demonstrate that isomorphism classes can be utilised through a simple prioritisation scheme.
arXiv Detail & Related papers (2020-09-30T12:15:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented (including all of the above) and is not responsible for any consequences arising from its use.