Hierarchical Space-Time Attention for Micro-Expression Recognition
- URL: http://arxiv.org/abs/2405.03202v1
- Date: Mon, 6 May 2024 07:02:24 GMT
- Title: Hierarchical Space-Time Attention for Micro-Expression Recognition
- Authors: Haihong Hao, Shuo Wang, Huixia Ben, Yanbin Hao, Yansong Wang, Weiwei Wang
- Abstract summary: Micro-expression recognition (MER) aims to recognize the short and subtle facial movements in micro-expression (ME) video clips, which reveal real emotions.
Most recent MER methods use only special frames from ME video clips, or optical flow extracted from those frames, and thus neglect the space-time relationships in which facial cues are hidden.
We propose Hierarchical Space-Time Attention (HSTA) to address this issue.
- Score: 13.85240903497193
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Micro-expression recognition (MER) aims to recognize the short and subtle facial movements in micro-expression (ME) video clips, which reveal real emotions. Recent MER methods mostly use only special frames from ME video clips, or optical flow extracted from these special frames. However, they neglect the relationship between movements and space-time, while facial cues are hidden within these relationships. To solve this issue, we propose Hierarchical Space-Time Attention (HSTA). Specifically, we first process ME video frames and special frames or data in parallel with our cascaded Unimodal Space-Time Attention (USTA) to establish connections between subtle facial movements and specific facial areas. Then, we design Crossmodal Space-Time Attention (CSTA) to achieve a higher-quality fusion of crossmodal data. Finally, we hierarchically integrate USTA and CSTA to grasp deeper facial cues. Our model emphasizes temporal modeling without neglecting the processing of special data, and it fuses the contents of different modalities while maintaining their respective uniqueness. Extensive experiments on four benchmarks show the effectiveness of the proposed HSTA. Specifically, compared with the latest method on the CASME3 dataset, it achieves about a 3% score improvement in seven-category classification.
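Read operationally, the abstract suggests per-modality space-time self-attention followed by crossmodal fusion. Below is a minimal PyTorch sketch of that hierarchy; the module names mirror USTA/CSTA, but all shapes, layer sizes, and the residual/normalization scheme are assumptions rather than the paper's actual design.
```python
# Hedged sketch of the HSTA hierarchy: unimodal space-time attention (USTA)
# per stream, then crossmodal attention (CSTA) for fusion. All dimensions
# and the block layout are illustrative assumptions.
import torch
import torch.nn as nn

class USTA(nn.Module):
    """Self-attention over the flattened space-time tokens of one modality."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):          # x: (B, T*N, dim) space-time tokens
        out, _ = self.attn(x, x, x)
        return self.norm(x + out)  # residual keeps the original stream

class CSTA(nn.Module):
    """Cross-attention: one modality queries the other for fusion."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, q, kv):      # q: video tokens, kv: special-frame tokens
        out, _ = self.attn(q, kv, kv)
        return self.norm(q + out)

class HSTABlock(nn.Module):
    """Hierarchy: per-modality USTA first, then crossmodal CSTA."""
    def __init__(self, dim: int):
        super().__init__()
        self.usta_video = USTA(dim)
        self.usta_special = USTA(dim)
        self.csta = CSTA(dim)

    def forward(self, video_tokens, special_tokens):
        v = self.usta_video(video_tokens)      # subtle movements <-> facial areas
        s = self.usta_special(special_tokens)  # special data keeps its own stream
        return self.csta(v, s)                 # fuse, querying from the video side

block = HSTABlock(dim=64)
fused = block(torch.randn(2, 8 * 49, 64), torch.randn(2, 2 * 49, 64))
print(fused.shape)  # torch.Size([2, 392, 64])
```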
Related papers
- Micro-Expression Recognition by Motion Feature Extraction based on Pre-training [6.015288149235598]
We propose a novel motion extraction strategy (MoExt) for the micro-expression recognition task.
In MoExt, shape features and texture features are first extracted separately from onset and apex frames, and then motion features related to MEs are extracted based on shape features of both frames.
The effectiveness of the proposed method is validated on three commonly used datasets.
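As a rough illustration of this strategy, the sketch below uses separate shape and texture encoders on the onset and apex frames and derives a motion feature from the two shape codes. The encoder layout, feature sizes, and fusion MLP are assumptions for illustration, not MoExt's published design.
```python
# Hedged sketch of the MoExt idea: separate shape/texture encoders for the
# onset and apex frames, then motion features from the two shape codes.
import torch
import torch.nn as nn

def make_backbone(dim: int) -> nn.Module:
    return nn.Sequential(
        nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(32, dim, 3, stride=2, padding=1),
        nn.AdaptiveAvgPool2d(1), nn.Flatten())

class ShapeTextureEncoder(nn.Module):
    def __init__(self, dim: int = 128):
        super().__init__()
        self.shape_net = make_backbone(dim)    # shape features
        self.texture_net = make_backbone(dim)  # texture features

    def forward(self, frame):
        return self.shape_net(frame), self.texture_net(frame)

class MotionHead(nn.Module):
    """ME-related motion from the shape codes of onset and apex frames."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                 nn.Linear(dim, dim))

    def forward(self, shape_onset, shape_apex):
        return self.mlp(torch.cat([shape_onset, shape_apex], dim=-1))

enc, head = ShapeTextureEncoder(), MotionHead()
onset, apex = torch.randn(4, 3, 112, 112), torch.randn(4, 3, 112, 112)
shape_on, _ = enc(onset)
shape_ap, _ = enc(apex)
motion = head(shape_on, shape_ap)  # (4, 128) motion feature
```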
arXiv Detail & Related papers (2024-07-10T03:51:34Z) - FineCLIPER: Multi-modal Fine-grained CLIP for Dynamic Facial Expression Recognition with AdaptERs [5.35588281968644]
We propose a novel framework, named Multi-modal Fine-grained CLIP for Dynamic Facial Expression Recognition with AdaptERs (Fine CLIPER).
Our Fine CLIPER achieves tunable SOTA performance on the DFEW, FERV39k, and MAFW datasets with few parameters.
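The "few parameters" claim rests on the adapter pattern: freeze a large pretrained encoder and train only small bottleneck modules. Below is a generic, hedged sketch of that pattern; the stand-in backbone and sizes are illustrative, not Fine CLIPER's actual CLIP integration.
```python
# Generic adapter sketch: the pretrained backbone stays frozen and only the
# small bottleneck adapter is trained. The backbone here is a stand-in.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual."""
    def __init__(self, dim: int, bottleneck: int = 16):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

dim = 512
backbone = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
for p in backbone.parameters():      # freeze the pretrained weights
    p.requires_grad = False
adapter = Adapter(dim)               # only these few parameters are trained

feats = adapter(backbone(torch.randn(8, dim)))
trainable = sum(p.numel() for p in adapter.parameters())
print(trainable)                     # tiny compared with the frozen backbone
```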
arXiv Detail & Related papers (2024-07-02T10:55:43Z) - Spatial-Temporal Knowledge-Embedded Transformer for Video Scene Graph
Generation [64.85974098314344]
Video scene graph generation (VidSGG) aims to identify objects in visual scenes and infer their relationships for a given video.
Inherently, object pairs and their relationships enjoy spatial co-occurrence correlations within each image and temporal consistency/transition correlations across different images.
We propose a spatial-temporal knowledge-embedded transformer (STKET) that incorporates the prior spatial-temporal knowledge into the multi-head cross-attention mechanism.
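One hedged way to read "incorporates the prior spatial-temporal knowledge into the multi-head cross-attention mechanism" is to store learned prior embeddings and let object-pair queries attend to them, as sketched below; the prior table, shapes, and head count are assumptions, not STKET's exact design.
```python
# Sketch: learned spatial-temporal prior embeddings serve as keys/values in
# multi-head cross-attention over object-pair queries.
import torch
import torch.nn as nn

class KnowledgeCrossAttention(nn.Module):
    def __init__(self, dim: int, num_priors: int, heads: int = 4):
        super().__init__()
        # one learned embedding per prior (e.g., a relationship co-occurrence)
        self.priors = nn.Parameter(torch.randn(num_priors, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, pair_queries):              # (B, P, dim) object pairs
        kv = self.priors.unsqueeze(0).expand(pair_queries.size(0), -1, -1)
        out, _ = self.attn(pair_queries, kv, kv)  # queries attend to priors
        return out

layer = KnowledgeCrossAttention(dim=256, num_priors=50)
refined = layer(torch.randn(2, 10, 256))  # knowledge-refined pair features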
arXiv Detail & Related papers (2023-09-23T02:40:28Z) - Implicit Temporal Modeling with Learnable Alignment for Video
Recognition [95.82093301212964]
We propose a novel Implicit Learnable Alignment (ILA) method, which minimizes the temporal modeling effort while achieving incredibly high performance.
ILA achieves a top-1 accuracy of 88.7% on Kinetics-400 with much fewer FLOPs compared with Swin-L and ViViT-H.
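The blurb gives little mechanical detail, so the sketch below only illustrates the general idea of a cheap, learnable alignment between adjacent frames: a gate predicted from each frame pair softly mixes neighboring features. This is an assumption-laden stand-in, not ILA's actual alignment.
```python
# Sketch of a lightweight learnable alignment: a gate predicted from each
# (frame, previous-frame) pair softly mixes the two feature maps.
import torch
import torch.nn as nn

class ImplicitAlign(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, frames):                 # (B, T, N, dim) frame tokens
        # previous frame per timestep (first wraps to last; fine for a sketch)
        prev = torch.roll(frames, shifts=1, dims=1)
        mask = self.gate(torch.cat([frames, prev], dim=-1))
        return mask * frames + (1 - mask) * prev  # softly aligned tokens

align = ImplicitAlign(dim=96)
aligned = align(torch.randn(2, 8, 49, 96))
print(aligned.shape)  # torch.Size([2, 8, 49, 96])
```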
arXiv Detail & Related papers (2023-04-20T17:11:01Z) - Modeling Continuous Motion for 3D Point Cloud Object Tracking [54.48716096286417]
This paper presents a novel approach that views each tracklet as a continuous stream.
At each timestamp, only the current frame is fed into the network to interact with multi-frame historical features stored in a memory bank.
To enhance the utilization of multi-frame features for robust tracking, a contrastive sequence enhancement strategy is proposed.
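The per-timestamp memory-bank interaction described above can be sketched as follows; the FIFO memory, its capacity, and the cross-attention fusion are illustrative assumptions (the contrastive sequence enhancement is omitted).
```python
# Sketch of the memory-bank pattern: only the current frame is fed in, and it
# cross-attends to multi-frame historical features stored in a FIFO memory.
import torch
import torch.nn as nn
from collections import deque

class MemoryBankTracker(nn.Module):
    def __init__(self, dim: int, capacity: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.memory = deque(maxlen=capacity)   # FIFO of past frame features

    def forward(self, current):                # (B, N, dim) current frame
        if self.memory:
            history = torch.cat(list(self.memory), dim=1)
            current, _ = self.attn(current, history, history)
        self.memory.append(current.detach())   # store for later timestamps
        return current

tracker = MemoryBankTracker(dim=64)
for _ in range(5):                             # one frame per timestamp
    out = tracker(torch.randn(1, 128, 64))
```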
arXiv Detail & Related papers (2023-03-14T02:58:27Z) - Short and Long Range Relation Based Spatio-Temporal Transformer for
Micro-Expression Recognition [61.374467942519374]
We propose a novel spatio-temporal transformer architecture -- to the best of our knowledge, the first purely transformer-based approach for micro-expression recognition.
The architecture comprises a spatial encoder which learns spatial patterns, a temporal aggregator for temporal analysis, and a classification head.
A comprehensive evaluation on three widely used spontaneous micro-expression datasets shows that the proposed approach consistently outperforms the state of the art.
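The three-stage pipeline (spatial encoder, temporal aggregator, classification head) maps naturally onto stacked transformer encoders, as in the hedged sketch below; depths, widths, and pooling choices are assumptions, not the paper's configuration.
```python
# Sketch: per-frame spatial transformer, then a temporal transformer across
# frames, then a linear classification head.
import torch
import torch.nn as nn

class MicroExpressionTransformer(nn.Module):
    def __init__(self, dim: int = 128, num_classes: int = 5):
        super().__init__()
        self.spatial = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), 2)
        self.temporal = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), 2)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                      # (B, T, N, dim) patch tokens
        b, t, n, d = x.shape
        x = self.spatial(x.reshape(b * t, n, d)).mean(dim=1)  # per frame
        x = self.temporal(x.reshape(b, t, d)).mean(dim=1)     # across time
        return self.head(x)

model = MicroExpressionTransformer()
logits = model(torch.randn(2, 10, 49, 128))    # (2, 5) class scores
```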
arXiv Detail & Related papers (2021-12-10T22:10:31Z) - GTA: Global Temporal Attention for Video Action Understanding [51.476605514802806]
We introduce Global Temporal Attention (GTA), which performs global temporal attention on top of spatial attention in a decoupled manner.
Tests on 2D and 3D networks demonstrate that our approach consistently enhances temporal modeling and provides state-of-the-art performance on three video action recognition datasets.
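Decoupled space-then-time attention can be sketched as below: spatial self-attention within each frame, followed by temporal attention across all frames at each spatial location. This shows the decoupling pattern only; GTA's actual global temporal attention differs in detail.
```python
# Sketch of decoupled space-time attention: attend within each frame, then
# attend across all frames at each spatial position.
import torch
import torch.nn as nn

class DecoupledSpaceTimeAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                      # (B, T, N, dim)
        b, t, n, d = x.shape
        s = x.reshape(b * t, n, d)             # attend within each frame
        s, _ = self.spatial(s, s, s)
        tkn = s.reshape(b, t, n, d).permute(0, 2, 1, 3).reshape(b * n, t, d)
        tkn, _ = self.temporal(tkn, tkn, tkn)  # attend across all frames
        return tkn.reshape(b, n, t, d).permute(0, 2, 1, 3)

layer = DecoupledSpaceTimeAttention(dim=64)
out = layer(torch.randn(2, 8, 49, 64))         # same shape, time-mixed
```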
arXiv Detail & Related papers (2020-12-15T18:58:21Z) - MERANet: Facial Micro-Expression Recognition using 3D Residual Attention
Network [14.285700243381537]
We propose a facial micro-expression recognition model using a 3D residual attention network, called MERANet.
The proposed model also encompasses both spatial and temporal information.
Superior performance is observed compared with the state of the art for facial micro-expression recognition.
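The model name suggests 3D residual blocks gated by attention; the sketch below combines 3D convolutions, a squeeze-and-excitation-style channel gate, and a residual connection. The block design is an assumption, not MERANet's published architecture.
```python
# Sketch of a 3D residual attention block over a (B, C, T, H, W) clip.
import torch
import torch.nn as nn

class ResidualAttention3D(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(channels, channels, 3, padding=1),
            nn.BatchNorm3d(channels), nn.ReLU(),
            nn.Conv3d(channels, channels, 3, padding=1),
            nn.BatchNorm3d(channels))
        # squeeze-and-excitation style gate over channels
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
            nn.Linear(channels, channels // 4), nn.ReLU(),
            nn.Linear(channels // 4, channels), nn.Sigmoid())

    def forward(self, x):                      # (B, C, T, H, W) clip
        y = self.conv(x)
        w = self.gate(y).view(x.size(0), -1, 1, 1, 1)
        return torch.relu(x + w * y)           # attended residual output

block = ResidualAttention3D(channels=32)
out = block(torch.randn(2, 32, 8, 28, 28))     # same shape as the input
```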
arXiv Detail & Related papers (2020-12-07T16:41:42Z) - Dynamic and Static Context-aware LSTM for Multi-agent Motion Prediction [40.20696709103593]
This paper designs a new mechanism, i.e., the Dynamic and Static Context-aware Motion Predictor (DSCMP).
It integrates rich contextual information into the long short-term memory (LSTM).
It models the dynamic interactions between agents by learning both their spatial positions and temporal coherence.
It captures the scene context by inferring a latent variable, which enables multimodal predictions with a meaningful semantic scene layout.
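A hedged sketch of the two ingredients named above: an LSTM encodes each agent's past trajectory (dynamic context), and a sampled latent variable injects scene-level context, making repeated predictions multimodal. The Gaussian latent and the fusion rule are assumptions, not DSCMP's exact formulation.
```python
# Sketch: LSTM trajectory encoder plus a sampled scene latent, so repeated
# forward passes on the same history yield multiple prediction modes.
import torch
import torch.nn as nn

class ContextAwarePredictor(nn.Module):
    def __init__(self, hidden: int = 64, latent: int = 16):
        super().__init__()
        self.lstm = nn.LSTM(2, hidden, batch_first=True)  # (x, y) inputs
        self.to_mu = nn.Linear(hidden, latent)            # scene latent
        self.to_logvar = nn.Linear(hidden, latent)
        self.decode = nn.Linear(hidden + latent, 2)       # next position

    def forward(self, traj):                   # (B, T, 2) past positions
        _, (h, _) = self.lstm(traj)
        h = h[-1]                              # final hidden state
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # sample
        return self.decode(torch.cat([h, z], dim=-1))

model = ContextAwarePredictor()
past = torch.randn(4, 8, 2)
preds = [model(past) for _ in range(3)]        # three samples -> multimodal
```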
arXiv Detail & Related papers (2020-08-03T11:03:57Z) - Co-Saliency Spatio-Temporal Interaction Network for Person
Re-Identification in Videos [85.6430597108455]
We propose a novel Co-Saliency Spatio-Temporal Interaction Network (CSTNet) for person re-identification in videos.
It captures the common salient foreground regions among video frames and explores the long-range spatial-temporal context interdependencies of such regions.
Multiple spatial-temporal interaction modules within CSTNet are proposed, which exploit the long-range spatial and temporal context interdependencies of these features and their spatial-temporal correlations.
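The co-saliency step can be illustrated by scoring each spatial location against a video-level descriptor shared across frames, so regions that recur across frames are emphasized; the sketch below is an assumption-based reading, not CSTNet's exact module.
```python
# Sketch of co-saliency weighting: locations similar to the cross-frame
# (video-level) descriptor get higher weight.
import torch
import torch.nn.functional as F

def co_saliency_weight(feats: torch.Tensor) -> torch.Tensor:
    """feats: (B, T, C, H, W) per-frame feature maps -> weighted features."""
    b, t, c, h, w = feats.shape
    video_desc = feats.mean(dim=(1, 3, 4))            # (B, C) shared content
    sal = torch.einsum('btchw,bc->bthw', feats, video_desc)
    sal = F.softmax(sal.reshape(b, t, h * w), dim=-1).reshape(b, t, 1, h, w)
    return feats * sal                                # emphasize common regions

weighted = co_saliency_weight(torch.randn(2, 6, 32, 14, 14))
print(weighted.shape)  # torch.Size([2, 6, 32, 14, 14])
```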
arXiv Detail & Related papers (2020-04-10T10:23:58Z)