Multi-modal Emotion Estimation for in-the-wild Videos
- URL: http://arxiv.org/abs/2203.13032v3
- Date: Mon, 28 Mar 2022 04:14:32 GMT
- Title: Multi-modal Emotion Estimation for in-the-wild Videos
- Authors: Liyu Meng, Yuchen Liu, Xiaolong Liu, Zhaopei Huang, Wenqiang Jiang,
Tenggan Zhang, Chuanhe Liu and Qin Jin
- Abstract summary: We introduce our submission to the Valence-Arousal Estimation Challenge of the 3rd Affective Behavior Analysis in-the-wild (ABAW) competition.
Our method utilizes multi-modal information, i.e., visual and audio information, and employs a temporal encoder to model the temporal context in the videos.
- Score: 40.292523976091964
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we briefly introduce our submission to the Valence-Arousal
Estimation Challenge of the 3rd Affective Behavior Analysis in-the-wild (ABAW)
competition. Our method utilizes multi-modal information, i.e., visual and
audio information, and employs a temporal encoder to model the temporal
context in the videos. In addition, a smoothing processor is applied to obtain
more reasonable predictions, and a model ensemble strategy is used to improve
the performance of our proposed method. The experimental results show that our
method achieves a CCC of 65.55% for valence and 70.88% for arousal on the
validation set of the Aff-Wild2 dataset, which demonstrates the effectiveness
of our proposed method.
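For reference, the evaluation metric reported above, the Concordance Correlation Coefficient (CCC), and a simple temporal smoothing pass of the kind the abstract alludes to can be sketched as follows. This is a minimal NumPy illustration; the function names, window size, and toy data are assumptions, not the authors' implementation.

```python
import numpy as np

def ccc(preds: np.ndarray, labels: np.ndarray) -> float:
    # CCC = 2*cov(x, y) / (var(x) + var(y) + (mean_x - mean_y)^2)
    mean_p, mean_l = preds.mean(), labels.mean()
    cov = ((preds - mean_p) * (labels - mean_l)).mean()
    return 2.0 * cov / (preds.var() + labels.var() + (mean_p - mean_l) ** 2)

def smooth(preds: np.ndarray, window: int = 5) -> np.ndarray:
    # Moving-average smoothing over the temporal axis; the window size is an
    # assumption, not a value taken from the paper.
    kernel = np.ones(window) / window
    return np.convolve(preds, kernel, mode="same")

# Hypothetical usage: per-frame valence predictions for one video.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 20.0, 1000)
valence_gt = 0.5 * np.sin(t)                                  # slowly varying ground truth
valence_pred = valence_gt + rng.normal(0.0, 0.2, size=1000)   # noisy frame-level predictions
print(f"CCC (raw):      {ccc(valence_pred, valence_gt):.4f}")
print(f"CCC (smoothed): {ccc(smooth(valence_pred), valence_gt):.4f}")
```

In this toy example, smoothing removes high-frequency jitter from the frame-level predictions, which is the role the smoothing processor plays in the pipeline described above.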
Related papers
- Multimodal Fusion Method with Spatiotemporal Sequences and Relationship Learning for Valence-Arousal Estimation [9.93719767430551]
This paper presents our approach for the VA (Valence-Arousal) estimation task in the 6th ABAW competition.
We devised a comprehensive model by preprocessing video frames and audio segments to extract visual and audio features.
We employed a Transformer encoder structure to learn long-range dependencies, thereby enhancing the model's performance and generalization ability.
arXiv Detail & Related papers (2024-03-19T04:25:54Z) - RadOcc: Learning Cross-Modality Occupancy Knowledge through Rendering Assisted Distillation [50.35403070279804]
3D occupancy prediction is an emerging task that aims to estimate the occupancy states and semantics of 3D scenes using multi-view images.
We propose RadOcc, a Rendering assisted distillation paradigm for 3D Occupancy prediction.
arXiv Detail & Related papers (2023-12-19T03:39:56Z) - Ensemble Modeling for Multimodal Visual Action Recognition [50.38638300332429]
We propose an ensemble modeling approach for multimodal action recognition.
We independently train individual modality models using a variant of focal loss tailored to handle the long-tailed distribution of the MECCANO [21] dataset.
arXiv Detail & Related papers (2023-08-10T08:43:20Z) - Sample Less, Learn More: Efficient Action Recognition via Frame Feature
Restoration [59.6021678234829]
We propose a novel method to restore the intermediate features for two sparsely sampled and adjacent video frames.
With the integration of our method, the efficiency of three commonly used baselines has been improved by over 50%, with a mere 0.5% reduction in recognition accuracy.
arXiv Detail & Related papers (2023-07-27T13:52:42Z) - Diffusion Action Segmentation [63.061058214427085]
We propose a novel framework based on denoising diffusion models, which shares the same inherent spirit of iterative refinement.
In this framework, action predictions are iteratively generated from random noise with input video features as conditions.
arXiv Detail & Related papers (2023-03-31T10:53:24Z) - Spatial-temporal Transformer for Affective Behavior Analysis [11.10521339384583]
We propose a Transformer framework with Multi-Head Attention to learn the distribution of both the spatial and temporal features.
The results on the Aff-Wild2 dataset fully demonstrate the effectiveness of our proposed model.
arXiv Detail & Related papers (2023-03-19T04:34:17Z) - Multi-modal Expression Recognition with Ensemble Method [9.880739481276835]
Multimodal feature combinations extracted by several different pre-trained models are applied to capture more effective emotional information.
For these combinations of visual and audio modal features, we utilize two temporal encoders to explore the temporal contextual information in the data.
Our system achieves the average F1 Score of 0.45774 on the validation set.
arXiv Detail & Related papers (2023-03-17T15:03:58Z) - A Multi-modal and Multi-task Learning Method for Action Unit and Expression Recognition [18.478011167414223]
We introduce a multi-modal and multi-task learning method by using both visual and audio information.
We achieve an AU score of 0.712 and an expression score of 0.477 on the validation set.
arXiv Detail & Related papers (2021-07-09T03:28:17Z) - Technical Report for Valence-Arousal Estimation on Affwild2 Dataset [0.0]
We tackle the valence-arousal estimation challenge from the ABAW FG-2020 Competition.
We use the MIMAMO Net (Deng et al., 2020) model to obtain information about micro-motion and macro-motion.
arXiv Detail & Related papers (2021-05-04T14:00:07Z) - Delving into 3D Action Anticipation from Streaming Videos [99.0155538452263]
Action anticipation aims to recognize the action with a partial observation.
We introduce several complementary evaluation metrics and present a basic model based on frame-wise action classification.
We also explore multi-task learning strategies by incorporating auxiliary information from two aspects: the full action representation and the class-agnostic action label.
arXiv Detail & Related papers (2019-06-15T10:30:29Z)