Multi-modal Multi-label Facial Action Unit Detection with Transformer
- URL: http://arxiv.org/abs/2203.13301v2
- Date: Mon, 28 Mar 2022 05:17:31 GMT
- Title: Multi-modal Multi-label Facial Action Unit Detection with Transformer
- Authors: Lingfeng Wang, Shisen Wang, Jin Qi
- Abstract summary: This paper describes our submission to the third Affective Behavior Analysis (ABAW) 2022 competition.
We propose a transformer-based model to detect facial action units (FAUs) in video.
- Score: 7.30287060715476
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Facial Action Coding System is an important approach to facial expression analysis. This paper describes our submission to the third Affective Behavior Analysis (ABAW) 2022 competition. We propose a transformer-based model to detect facial action units (FAUs) in video. Specifically, we first train a multi-modal model to extract both audio and visual features. We then propose an action unit correlation module that learns the relationships among action unit labels and refines the detection results. Experimental results on the validation dataset show that our method outperforms the baseline model, which verifies the effectiveness of the proposed network.
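The pipeline sketched in the abstract (multi-modal feature extraction followed by an action unit correlation module) maps naturally onto a small PyTorch model. The sketch below is one reading of that description, not the authors' released code: the feature dimensions, the choice of learnable per-AU query tokens, and the two-layer encoder are all illustrative assumptions.

```python
# Minimal sketch of the pipeline described in the abstract: fuse per-frame
# audio and visual features, let a transformer encoder model correlations
# among learnable per-AU "label tokens", and emit one logit per action unit.
# All dimensions, layer counts, and names here are illustrative assumptions.
import torch
import torch.nn as nn

class AUCorrelationDetector(nn.Module):
    def __init__(self, visual_dim=512, audio_dim=128, d_model=256, num_aus=12):
        super().__init__()
        # Project each modality into a shared space, then fuse by concatenation.
        self.visual_proj = nn.Linear(visual_dim, d_model)
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.fuse = nn.Linear(2 * d_model, d_model)
        # One learnable query token per action unit; self-attention over
        # [AU tokens; frame features] lets each AU attend both to the video
        # and to the other AUs -- one plausible "AU correlation module".
        self.au_tokens = nn.Parameter(torch.randn(num_aus, d_model))
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, 1)  # one sigmoid logit per AU token

    def forward(self, visual_feats, audio_feats):
        # visual_feats: (B, T, visual_dim); audio_feats: (B, T, audio_dim)
        fused = self.fuse(torch.cat([self.visual_proj(visual_feats),
                                     self.audio_proj(audio_feats)], dim=-1))
        tokens = self.au_tokens.unsqueeze(0).expand(fused.size(0), -1, -1)
        out = self.encoder(torch.cat([tokens, fused], dim=1))
        return self.head(out[:, :self.au_tokens.size(0)]).squeeze(-1)

model = AUCorrelationDetector()
logits = model(torch.randn(2, 16, 512), torch.randn(2, 16, 128))  # (2, 12)
loss = nn.BCEWithLogitsLoss()(logits, torch.randint(0, 2, (2, 12)).float())
```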
Related papers
- MTP: Advancing Remote Sensing Foundation Model via Multi-Task Pretraining [73.81862342673894]
Foundation models have reshaped the landscape of Remote Sensing (RS) by enhancing various image interpretation tasks.
However, transferring these pretrained models to downstream tasks may encounter task discrepancy, because pretraining is formulated as an image classification or object discrimination task.
We conduct multi-task supervised pretraining on the SAMRS dataset, encompassing semantic segmentation, instance segmentation, and rotated object detection.
Our models are finetuned on various RS downstream tasks, such as scene classification, horizontal and rotated object detection, semantic segmentation, and change detection.
arXiv Detail & Related papers (2024-03-20T09:17:22Z) - DiffVein: A Unified Diffusion Network for Finger Vein Segmentation and Authentication [50.017055360261665]
We introduce DiffVein, a unified diffusion model-based framework which simultaneously addresses vein segmentation and authentication tasks.
For better feature interaction between these two branches, we introduce two specialized modules.
In this way, our framework allows for a dynamic interplay between diffusion and segmentation embeddings.
arXiv Detail & Related papers (2024-02-03T06:49:42Z) - Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals.
Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars.
Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
arXiv Detail & Related papers (2023-12-18T18:59:51Z) - RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation [53.4319652364256]
This paper presents the RefSAM model, which explores the potential of SAM for referring video object segmentation.
Our proposed approach adapts the original SAM model to enhance cross-modality learning by employing a lightweight Cross-Modal MLP.
We employ a parameter-efficient tuning strategy to align and fuse the language and vision features effectively.
arXiv Detail & Related papers (2023-07-03T13:21:58Z) - DOAD: Decoupled One Stage Action Detection Network [77.14883592642782]
Localizing people and recognizing their actions from videos is a challenging task towards high-level video understanding.
Existing methods are mostly two-stage based, with one stage for person bounding box generation and the other stage for action recognition.
We present a decoupled one-stage network, dubbed DOAD, to improve the efficiency of spatio-temporal action detection.
arXiv Detail & Related papers (2023-04-01T08:06:43Z) - Unified Visual Relationship Detection with Vision and Language Models [89.77838890788638]
This work focuses on training a single visual relationship detector predicting over the union of label spaces from multiple datasets.
We propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection by leveraging vision and language models.
Empirical results on both human-object interaction detection and scene-graph generation demonstrate the competitive performance of our model.
arXiv Detail & Related papers (2023-03-16T00:06:28Z) - An Ensemble Approach for Multiple Emotion Descriptors Estimation Using Multi-task Learning [12.589338141771385]
This paper illustrates our submission method to the fourth Affective Behavior Analysis in-the-Wild (ABAW) Competition.
Instead of using only face information, we employ the full information from the provided dataset, which contains both the face and the context around it.
The proposed system achieves the performance of 0.917 on the MTL Challenge validation dataset.
arXiv Detail & Related papers (2022-07-22T04:57:56Z) - An Attention-based Method for Action Unit Detection at the 3rd ABAW Competition [6.229820412732652]
This paper describes our submission to the third Affective Behavior Analysis in-the-wild (ABAW) competition 2022.
We propose a method for detecting facial action units in video.
We achieved a macro F1 score of 0.48 on the ABAW challenge validation set compared to 0.39 from the baseline model.
arXiv Detail & Related papers (2022-03-23T14:07:39Z) - A Multi-modal and Multi-task Learning Method for Action Unit and Expression Recognition [18.478011167414223]
We introduce a multi-modal and multi-task learning method by using both visual and audio information.
We achieve an AU score of 0.712 and an expression score of 0.477 on the validation set.
arXiv Detail & Related papers (2021-07-09T03:28:17Z) - Impact of Action Unit Occurrence Patterns on Detection [0.3670422696827526]
We investigate the impact of action unit occurrence patterns on detection of action units.
Our findings suggest that action unit occurrence patterns strongly impact evaluation metrics.
We propose a new approach that explicitly trains deep neural networks with the occurrence patterns to boost the accuracy of action unit detection (a generic illustration appears after this list).
arXiv Detail & Related papers (2020-10-15T19:03:05Z) - Asynchronous Interaction Aggregation for Action Detection [43.34864954534389]
We propose the Asynchronous Interaction Aggregation network (AIA) that leverages different interactions to boost action detection.
There are two key designs in it: one is the Interaction Aggregation structure (IA) adopting a uniform paradigm to model and integrate multiple types of interaction; the other is the Asynchronous Memory Update algorithm (AMU) that enables us to achieve better performance.
arXiv Detail & Related papers (2020-04-16T07:03:20Z)
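As a generic illustration of the occurrence-pattern issue flagged in the "Impact of Action Unit Occurrence Patterns on Detection" entry above: AU base rates are highly skewed, so an unweighted multi-label BCE loss is dominated by the frequent AUs. The sketch below reweights the loss by each AU's negative/positive ratio. This is a common baseline, not necessarily that paper's exact method; all names and numbers are illustrative.

```python
# Reweight a multi-label BCE loss by per-AU occurrence statistics, so that
# rare action units contribute more to the gradient. This is a standard
# mitigation for skewed AU base rates, shown here purely as an illustration.
import torch
import torch.nn as nn

def occurrence_weighted_bce(train_labels: torch.Tensor) -> nn.BCEWithLogitsLoss:
    """train_labels: (N, num_aus) binary matrix of AU occurrences."""
    pos = train_labels.sum(dim=0)              # positive count per AU
    neg = train_labels.size(0) - pos           # negative count per AU
    pos_weight = neg / pos.clamp(min=1.0)      # rare AUs get larger weight
    return nn.BCEWithLogitsLoss(pos_weight=pos_weight)

# Synthetic labels with a 15% base rate, purely for demonstration.
train_labels = (torch.rand(1000, 12) < 0.15).float()
criterion = occurrence_weighted_bce(train_labels)
loss = criterion(torch.randn(8, 12), train_labels[:8])
```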