Hierarchical Space-Time Attention for Micro-Expression Recognition
- URL: http://arxiv.org/abs/2405.03202v1
- Date: Mon, 6 May 2024 07:02:24 GMT
- Title: Hierarchical Space-Time Attention for Micro-Expression Recognition
- Authors: Haihong Hao, Shuo Wang, Huixia Ben, Yanbin Hao, Yansong Wang, Weiwei Wang
- Abstract summary: Micro-expression recognition (MER) aims to recognize the short and subtle facial movements in micro-expression (ME) video clips, which reveal real emotions.
Most recent MER methods use only special frames from ME video clips, or optical flow extracted from those frames, and thus neglect the space-time relationships in which facial cues are hidden.
We propose Hierarchical Space-Time Attention (HSTA) to address this issue.
- Score: 13.85240903497193
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Micro-expression recognition (MER) aims to recognize the short and subtle facial movements in micro-expression (ME) video clips, which reveal real emotions. Recent MER methods mostly use only special frames from ME video clips, or optical flow extracted from these special frames. However, they neglect the relationship between movements and space-time, while facial cues are hidden within these relationships. To solve this issue, we propose Hierarchical Space-Time Attention (HSTA). Specifically, we first process ME video frames and special frames or data in parallel with our cascaded Unimodal Space-Time Attention (USTA) to establish connections between subtle facial movements and specific facial areas. Then, we design Crossmodal Space-Time Attention (CSTA) to achieve a higher-quality fusion of crossmodal data. Finally, we hierarchically integrate USTA and CSTA to grasp deeper facial cues. Our model emphasizes temporal modeling without neglecting the processing of special data, and it fuses the contents of different modalities while maintaining their respective uniqueness. Extensive experiments on four benchmarks show the effectiveness of the proposed HSTA. Specifically, compared with the latest method on the CASME3 dataset, it achieves about a 3% score improvement in seven-category classification.
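Read operationally, the abstract suggests per-modality space-time self-attention followed by crossmodal fusion. Below is a minimal PyTorch sketch of that hierarchy; the module names mirror USTA/CSTA, but all shapes, layer sizes, and the residual/normalization scheme are assumptions rather than the paper's actual design.
```python
# Hedged sketch of the HSTA hierarchy: unimodal space-time attention (USTA)
# per stream, then crossmodal attention (CSTA) for fusion. All dimensions
# and the block layout are illustrative assumptions.
import torch
import torch.nn as nn

class USTA(nn.Module):
    """Self-attention over the flattened space-time tokens of one modality."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):          # x: (B, T*N, dim) space-time tokens
        out, _ = self.attn(x, x, x)
        return self.norm(x + out)  # residual keeps the original stream

class CSTA(nn.Module):
    """Cross-attention: one modality queries the other for fusion."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, q, kv):      # q: video tokens, kv: special-frame tokens
        out, _ = self.attn(q, kv, kv)
        return self.norm(q + out)

class HSTABlock(nn.Module):
    """Hierarchy: per-modality USTA first, then crossmodal CSTA."""
    def __init__(self, dim: int):
        super().__init__()
        self.usta_video = USTA(dim)
        self.usta_special = USTA(dim)
        self.csta = CSTA(dim)

    def forward(self, video_tokens, special_tokens):
        v = self.usta_video(video_tokens)      # subtle movements <-> facial areas
        s = self.usta_special(special_tokens)  # special data keeps its own stream
        return self.csta(v, s)                 # fuse, querying from the video side

block = HSTABlock(dim=64)
fused = block(torch.randn(2, 8 * 49, 64), torch.randn(2, 2 * 49, 64))
print(fused.shape)  # torch.Size([2, 392, 64])
```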
Related papers
- Micro-Expression Recognition by Motion Feature Extraction based on Pre-training [6.015288149235598]
We propose a novel motion extraction strategy (MoExt) for the micro-expression recognition task.
In MoExt, shape features and texture features are first extracted separately from onset and apex frames, and then motion features related to MEs are extracted based on shape features of both frames.
The effectiveness of the proposed method is validated on three commonly used datasets.
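As a rough illustration of this strategy, the sketch below uses separate shape and texture encoders on the onset and apex frames and derives a motion feature from the two shape codes. The encoder layout, feature sizes, and fusion MLP are assumptions for illustration, not MoExt's published design.
```python
# Hedged sketch of the MoExt idea: separate shape/texture encoders for the
# onset and apex frames, then motion features from the two shape codes.
import torch
import torch.nn as nn

def make_backbone(dim: int) -> nn.Module:
    return nn.Sequential(
        nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(32, dim, 3, stride=2, padding=1),
        nn.AdaptiveAvgPool2d(1), nn.Flatten())

class ShapeTextureEncoder(nn.Module):
    def __init__(self, dim: int = 128):
        super().__init__()
        self.shape_net = make_backbone(dim)    # shape features
        self.texture_net = make_backbone(dim)  # texture features

    def forward(self, frame):
        return self.shape_net(frame), self.texture_net(frame)

class MotionHead(nn.Module):
    """ME-related motion from the shape codes of onset and apex frames."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                 nn.Linear(dim, dim))

    def forward(self, shape_onset, shape_apex):
        return self.mlp(torch.cat([shape_onset, shape_apex], dim=-1))

enc, head = ShapeTextureEncoder(), MotionHead()
onset, apex = torch.randn(4, 3, 112, 112), torch.randn(4, 3, 112, 112)
shape_on, _ = enc(onset)
shape_ap, _ = enc(apex)
motion = head(shape_on, shape_ap)  # (4, 128) motion feature
```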
arXiv Detail & Related papers (2024-07-10T03:51:34Z) - FineCLIPER: Multi-modal Fine-grained CLIP for Dynamic Facial Expression Recognition with AdaptERs [5.35588281968644]
We propose a novel framework, named Multi-modal Fine-grained CLIP for Dynamic Facial Expression Recognition with AdaptERs (Fine CLIPER).
Our Fine CLIPER achieves tunable SOTA performance on the DFEW, FERV39k, and MAFW datasets with few parameters.
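The "few parameters" claim rests on the adapter pattern: freeze a large pretrained encoder and train only small bottleneck modules. Below is a generic, hedged sketch of that pattern; the stand-in backbone and sizes are illustrative, not Fine CLIPER's actual CLIP integration.
```python
# Generic adapter sketch: the pretrained backbone stays frozen and only the
# small bottleneck adapter is trained. The backbone here is a stand-in.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual."""
    def __init__(self, dim: int, bottleneck: int = 16):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

dim = 512
backbone = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
for p in backbone.parameters():      # freeze the pretrained weights
    p.requires_grad = False
adapter = Adapter(dim)               # only these few parameters are trained

feats = adapter(backbone(torch.randn(8, dim)))
trainable = sum(p.numel() for p in adapter.parameters())
print(trainable)                     # tiny compared with the frozen backbone
```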
arXiv Detail & Related papers (2024-07-02T10:55:43Z) - Spatial-Temporal Knowledge-Embedded Transformer for Video Scene Graph
Generation [64.85974098314344]
Video scene graph generation (VidSGG) aims to identify objects in visual scenes and infer their relationships for a given video.
Inherently, object pairs and their relationships enjoy spatial co-occurrence correlations within each image and temporal consistency/transition correlations across different images.
We propose a spatial-temporal knowledge-embedded transformer (STKET) that incorporates the prior spatial-temporal knowledge into the multi-head cross-attention mechanism.
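One hedged way to read "incorporates the prior spatial-temporal knowledge into the multi-head cross-attention mechanism" is to store learned prior embeddings and let object-pair queries attend to them, as sketched below; the prior table, shapes, and head count are assumptions, not STKET's exact design.
```python
# Sketch: learned spatial-temporal prior embeddings serve as keys/values in
# multi-head cross-attention over object-pair queries.
import torch
import torch.nn as nn

class KnowledgeCrossAttention(nn.Module):
    def __init__(self, dim: int, num_priors: int, heads: int = 4):
        super().__init__()
        # one learned embedding per prior (e.g., a relationship co-occurrence)
        self.priors = nn.Parameter(torch.randn(num_priors, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, pair_queries):              # (B, P, dim) object pairs
        kv = self.priors.unsqueeze(0).expand(pair_queries.size(0), -1, -1)
        out, _ = self.attn(pair_queries, kv, kv)  # queries attend to priors
        return out

layer = KnowledgeCrossAttention(dim=256, num_priors=50)
refined = layer(torch.randn(2, 10, 256))  # knowledge-refined pair features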
arXiv Detail & Related papers (2023-09-23T02:40:28Z) - Implicit Temporal Modeling with Learnable Alignment for Video
Recognition [95.82093301212964]
We propose a novel Implicit Learnable Alignment (ILA) method, which minimizes the temporal modeling effort while achieving incredibly high performance.
ILA achieves a top-1 accuracy of 88.7% on Kinetics-400 with much fewer FLOPs compared with Swin-L and ViViT-H.
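The blurb gives little mechanical detail, so the sketch below only illustrates the general idea of a cheap, learnable alignment between adjacent frames: a gate predicted from each frame pair softly mixes neighboring features. This is an assumption-laden stand-in, not ILA's actual alignment.
```python
# Sketch of a lightweight learnable alignment: a gate predicted from each
# (frame, previous-frame) pair softly mixes the two feature maps.
import torch
import torch.nn as nn

class ImplicitAlign(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, frames):                 # (B, T, N, dim) frame tokens
        # previous frame per timestep (first wraps to last; fine for a sketch)
        prev = torch.roll(frames, shifts=1, dims=1)
        mask = self.gate(torch.cat([frames, prev], dim=-1))
        return mask * frames + (1 - mask) * prev  # softly aligned tokens

align = ImplicitAlign(dim=96)
aligned = align(torch.randn(2, 8, 49, 96))
print(aligned.shape)  # torch.Size([2, 8, 49, 96])
```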
arXiv Detail & Related papers (2023-04-20T17:11:01Z) - Modeling Continuous Motion for 3D Point Cloud Object Tracking [54.48716096286417]
This paper presents a novel approach that views each tracklet as a continuous stream.
At each timestamp, only the current frame is fed into the network to interact with multi-frame historical features stored in a memory bank.
To enhance the utilization of multi-frame features for robust tracking, a contrastive sequence enhancement strategy is proposed.
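The per-timestamp memory-bank interaction described above can be sketched as follows; the FIFO memory, its capacity, and the cross-attention fusion are illustrative assumptions (the contrastive sequence enhancement is omitted).
```python
# Sketch of the memory-bank pattern: only the current frame is fed in, and it
# cross-attends to multi-frame historical features stored in a FIFO memory.
import torch
import torch.nn as nn
from collections import deque

class MemoryBankTracker(nn.Module):
    def __init__(self, dim: int, capacity: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.memory = deque(maxlen=capacity)   # FIFO of past frame features

    def forward(self, current):                # (B, N, dim) current frame
        if self.memory:
            history = torch.cat(list(self.memory), dim=1)
            current, _ = self.attn(current, history, history)
        self.memory.append(current.detach())   # store for later timestamps
        return current

tracker = MemoryBankTracker(dim=64)
for _ in range(5):                             # one frame per timestamp
    out = tracker(torch.randn(1, 128, 64))
```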
arXiv Detail & Related papers (2023-03-14T02:58:27Z) - Short and Long Range Relation Based Spatio-Temporal Transformer for
Micro-Expression Recognition [61.374467942519374]
We propose a novel spatio-temporal transformer architecture -- to the best of our knowledge, the first purely transformer-based approach for micro-expression recognition.
The architecture comprises a spatial encoder which learns spatial patterns, a temporal aggregator for temporal analysis, and a classification head.
A comprehensive evaluation on three widely used spontaneous micro-expression datasets shows that the proposed approach consistently outperforms the state of the art.
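The three-stage pipeline (spatial encoder, temporal aggregator, classification head) maps naturally onto stacked transformer encoders, as in the hedged sketch below; depths, widths, and pooling choices are assumptions, not the paper's configuration.
```python
# Sketch: per-frame spatial transformer, then a temporal transformer across
# frames, then a linear classification head.
import torch
import torch.nn as nn

class MicroExpressionTransformer(nn.Module):
    def __init__(self, dim: int = 128, num_classes: int = 5):
        super().__init__()
        self.spatial = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), 2)
        self.temporal = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), 2)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                      # (B, T, N, dim) patch tokens
        b, t, n, d = x.shape
        x = self.spatial(x.reshape(b * t, n, d)).mean(dim=1)  # per frame
        x = self.temporal(x.reshape(b, t, d)).mean(dim=1)     # across time
        return self.head(x)

model = MicroExpressionTransformer()
logits = model(torch.randn(2, 10, 49, 128))    # (2, 5) class scores
```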
arXiv Detail & Related papers (2021-12-10T22:10:31Z) - GTA: Global Temporal Attention for Video Action Understanding [51.476605514802806]
We introduce Global Temporal Attention (GTA), which performs global temporal attention on top of spatial attention in a decoupled manner.
Tests on 2D and 3D networks demonstrate that our approach consistently enhances temporal modeling and provides state-of-the-art performance on three video action recognition datasets.
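Decoupled space-then-time attention can be sketched as below: spatial self-attention within each frame, followed by temporal attention across all frames at each spatial location. This shows the decoupling pattern only; GTA's actual global temporal attention differs in detail.
```python
# Sketch of decoupled space-time attention: attend within each frame, then
# attend across all frames at each spatial position.
import torch
import torch.nn as nn

class DecoupledSpaceTimeAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                      # (B, T, N, dim)
        b, t, n, d = x.shape
        s = x.reshape(b * t, n, d)             # attend within each frame
        s, _ = self.spatial(s, s, s)
        tkn = s.reshape(b, t, n, d).permute(0, 2, 1, 3).reshape(b * n, t, d)
        tkn, _ = self.temporal(tkn, tkn, tkn)  # attend across all frames
        return tkn.reshape(b, n, t, d).permute(0, 2, 1, 3)

layer = DecoupledSpaceTimeAttention(dim=64)
out = layer(torch.randn(2, 8, 49, 64))         # same shape, time-mixed
```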
arXiv Detail & Related papers (2020-12-15T18:58:21Z) - MERANet: Facial Micro-Expression Recognition using 3D Residual Attention
Network [14.285700243381537]
We propose a facial micro-expression recognition model using a 3D residual attention network, called MERANet.
The proposed model also encompasses both spatial and temporal information.
Superior performance is observed compared with the state of the art for facial micro-expression recognition.
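The model name suggests 3D residual blocks gated by attention; the sketch below combines 3D convolutions, a squeeze-and-excitation-style channel gate, and a residual connection. The block design is an assumption, not MERANet's published architecture.
```python
# Sketch of a 3D residual attention block over a (B, C, T, H, W) clip.
import torch
import torch.nn as nn

class ResidualAttention3D(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(channels, channels, 3, padding=1),
            nn.BatchNorm3d(channels), nn.ReLU(),
            nn.Conv3d(channels, channels, 3, padding=1),
            nn.BatchNorm3d(channels))
        # squeeze-and-excitation style gate over channels
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
            nn.Linear(channels, channels // 4), nn.ReLU(),
            nn.Linear(channels // 4, channels), nn.Sigmoid())

    def forward(self, x):                      # (B, C, T, H, W) clip
        y = self.conv(x)
        w = self.gate(y).view(x.size(0), -1, 1, 1, 1)
        return torch.relu(x + w * y)           # attended residual output

block = ResidualAttention3D(channels=32)
out = block(torch.randn(2, 32, 8, 28, 28))     # same shape as the input
```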
arXiv Detail & Related papers (2020-12-07T16:41:42Z) - Dynamic and Static Context-aware LSTM for Multi-agent Motion Prediction [40.20696709103593]
This paper designs a new mechanism, i.e., the Dynamic and Static Context-aware Motion Predictor (DSCMP).
It integrates rich contextual information into the long short-term memory (LSTM).
It models the dynamic interactions between agents by learning both their spatial positions and temporal coherence.
It captures the scene context by inferring a latent variable, which enables multimodal predictions with a meaningful semantic scene layout.
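A hedged sketch of the two ingredients named above: an LSTM encodes each agent's past trajectory (dynamic context), and a sampled latent variable injects scene-level context, making repeated predictions multimodal. The Gaussian latent and the fusion rule are assumptions, not DSCMP's exact formulation.
```python
# Sketch: LSTM trajectory encoder plus a sampled scene latent, so repeated
# forward passes on the same history yield multiple prediction modes.
import torch
import torch.nn as nn

class ContextAwarePredictor(nn.Module):
    def __init__(self, hidden: int = 64, latent: int = 16):
        super().__init__()
        self.lstm = nn.LSTM(2, hidden, batch_first=True)  # (x, y) inputs
        self.to_mu = nn.Linear(hidden, latent)            # scene latent
        self.to_logvar = nn.Linear(hidden, latent)
        self.decode = nn.Linear(hidden + latent, 2)       # next position

    def forward(self, traj):                   # (B, T, 2) past positions
        _, (h, _) = self.lstm(traj)
        h = h[-1]                              # final hidden state
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # sample
        return self.decode(torch.cat([h, z], dim=-1))

model = ContextAwarePredictor()
past = torch.randn(4, 8, 2)
preds = [model(past) for _ in range(3)]        # three samples -> multimodal
```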
arXiv Detail & Related papers (2020-08-03T11:03:57Z) - Co-Saliency Spatio-Temporal Interaction Network for Person
Re-Identification in Videos [85.6430597108455]
We propose a novel Co-Saliency Spatio-Temporal Interaction Network (CSTNet) for person re-identification in videos.
It captures the common salient foreground regions among video frames and explores the long-range spatial-temporal context interdependencies of such regions.
Multiple spatial-temporal interaction modules within CSTNet are proposed, which exploit the long-range spatial and temporal context interdependencies of these features and their spatial-temporal correlations.
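The co-saliency step can be illustrated by scoring each spatial location against a video-level descriptor shared across frames, so regions that recur across frames are emphasized; the sketch below is an assumption-based reading, not CSTNet's exact module.
```python
# Sketch of co-saliency weighting: locations similar to the cross-frame
# (video-level) descriptor get higher weight.
import torch
import torch.nn.functional as F

def co_saliency_weight(feats: torch.Tensor) -> torch.Tensor:
    """feats: (B, T, C, H, W) per-frame feature maps -> weighted features."""
    b, t, c, h, w = feats.shape
    video_desc = feats.mean(dim=(1, 3, 4))            # (B, C) shared content
    sal = torch.einsum('btchw,bc->bthw', feats, video_desc)
    sal = F.softmax(sal.reshape(b, t, h * w), dim=-1).reshape(b, t, 1, h, w)
    return feats * sal                                # emphasize common regions

weighted = co_saliency_weight(torch.randn(2, 6, 32, 14, 14))
print(weighted.shape)  # torch.Size([2, 6, 32, 14, 14])
```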
arXiv Detail & Related papers (2020-04-10T10:23:58Z)