SpotFormer: Multi-Scale Spatio-Temporal Transformer for Facial Expression Spotting
- URL: http://arxiv.org/abs/2407.20799v1
- Date: Tue, 30 Jul 2024 13:02:08 GMT
- Title: SpotFormer: Multi-Scale Spatio-Temporal Transformer for Facial Expression Spotting
- Authors: Yicheng Deng, Hideaki Hayashi, Hajime Nagahara
- Abstract summary: In this paper, we propose an efficient framework for facial expression spotting.
First, we propose a Sliding Window-based Multi-Resolution Optical flow (SW-MRO) feature, which calculates multi-resolution optical flow of the input sequence within compact sliding windows.
Second, we propose SpotFormer, a multi-scale spatio-temporal Transformer that simultaneously encodes spatio-temporal relationships of the SW-MRO features for accurate frame-level probability estimation.
Third, we introduce supervised contrastive learning into SpotFormer to enhance the discriminability between different types of expressions.
- Score: 11.978551396144532
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Facial expression spotting, identifying periods where facial expressions occur in a video, is a significant yet challenging task in facial expression analysis. The issues of irrelevant facial movements and the challenge of detecting subtle motions in micro-expressions remain unresolved, hindering accurate expression spotting. In this paper, we propose an efficient framework for facial expression spotting. First, we propose a Sliding Window-based Multi-Resolution Optical flow (SW-MRO) feature, which calculates multi-resolution optical flow of the input image sequence within compact sliding windows. The window length is tailored to perceive complete micro-expressions and distinguish between general macro- and micro-expressions. SW-MRO can effectively reveal subtle motions while avoiding severe head movement problems. Second, we propose SpotFormer, a multi-scale spatio-temporal Transformer that simultaneously encodes spatio-temporal relationships of the SW-MRO features for accurate frame-level probability estimation. In SpotFormer, our proposed Facial Local Graph Pooling (FLGP) and convolutional layers are applied for multi-scale spatio-temporal feature extraction. We show the validity of the architecture of SpotFormer by comparing it with several model variants. Third, we introduce supervised contrastive learning into SpotFormer to enhance the discriminability between different types of expressions. Extensive experiments on SAMM-LV and CAS(ME)^2 show that our method outperforms state-of-the-art models, particularly in micro-expression spotting.
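The abstract describes the SW-MRO computation but not its implementation. Below is a minimal sketch of the idea, assuming flow is computed between the first and last frame of each compact window; the window length, resolution scales, and the use of OpenCV's Farneback estimator are illustrative assumptions, not the paper's actual settings.

```python
import cv2
import numpy as np

def sw_mro(frames, window_len=12, scales=(1.0, 0.5, 0.25)):
    """Hypothetical SW-MRO sketch: multi-resolution optical flow inside
    compact sliding windows. All defaults are illustrative assumptions.

    frames: sequence of grayscale face crops, each an (H, W) uint8 array.
    Returns one (num_scales, H, W, 2) flow stack per window position.
    """
    h, w = frames[0].shape
    features = []
    for start in range(len(frames) - window_len + 1):
        ref = frames[start]                   # first frame of the window
        tgt = frames[start + window_len - 1]  # last frame of the window
        flows = []
        for s in scales:
            size = (max(1, int(w * s)), max(1, int(h * s)))
            flow = cv2.calcOpticalFlowFarneback(
                cv2.resize(ref, size), cv2.resize(tgt, size),
                None, 0.5, 3, 15, 3, 5, 1.2, 0)
            # Resize back and rescale displacements into full-resolution
            # pixel units so all scales can be stacked together.
            flows.append(cv2.resize(flow, (w, h)) / s)
        features.append(np.stack(flows))
    return features
```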
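For the supervised contrastive component, the abstract does not state the exact objective. One plausible form is the standard supervised contrastive loss (Khosla et al., 2020), which pulls embeddings of the same expression type together and pushes different types apart:

$$
\mathcal{L}_{\text{SupCon}} = \sum_{i \in I} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(z_i \cdot z_p / \tau)}{\sum_{a \in A(i)} \exp(z_i \cdot z_a / \tau)}
$$

where $z_i$ is the normalized embedding of sample $i$, $P(i)$ is the set of other samples in the batch sharing $i$'s expression label, $A(i)$ is the set of all samples other than $i$, and $\tau$ is a temperature hyperparameter. Whether the paper uses exactly this formulation is an assumption.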
Related papers
- MFCLIP: Multi-modal Fine-grained CLIP for Generalizable Diffusion Face Forgery Detection [64.29452783056253]
The rapid development of photo-realistic face generation methods has raised significant concerns in society and academia.
Although existing approaches mainly capture face forgery patterns using the image modality, other modalities like fine-grained noise and text are not fully explored.
We propose a novel multi-modal fine-grained CLIP (MFCLIP) model, which mines comprehensive and fine-grained forgery traces across image-noise modalities.
arXiv Detail & Related papers (2024-09-15T13:08:59Z) - Multi-Scale Spatio-Temporal Graph Convolutional Network for Facial Expression Spotting [11.978551396144532]
We propose a Multi-Scale Spatio-Temporal Graph Convolutional Network (SpoT-CN) for facial expression spotting.
We track both short- and long-term motion of facial muscles in compact sliding windows whose window length adapts to the temporal receptive field of the network.
This network learns both local and global features from multiple scales of facial graph structures using our proposed facial local graph pooling (FLGP).
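As a rough illustration of FLGP-style pooling, the sketch below max-pools landmark node features within local facial regions to produce a coarser facial graph; the region groupings and feature shapes are made-up placeholders, not the paper's actual definitions.

```python
import numpy as np

# Placeholder landmark-to-region grouping; the paper defines its own
# facial graph structure and pooling hierarchy.
REGIONS = {
    "left_eyebrow": [0, 1, 2],  "right_eyebrow": [3, 4, 5],
    "left_eye":     [6, 7, 8],  "right_eye":     [9, 10, 11],
    "nose":         [12, 13],   "mouth":         [14, 15, 16, 17],
}

def flgp(node_feats):
    """FLGP-style pooling sketch: max-pool per-landmark features within
    each local facial region, yielding a coarser graph of region nodes.

    node_feats: (num_landmarks, feat_dim) array.
    Returns a (num_regions, feat_dim) array.
    """
    return np.stack([node_feats[idx].max(axis=0) for idx in REGIONS.values()])
```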
arXiv Detail & Related papers (2024-03-24T03:10:39Z) - SHIELD: An Evaluation Benchmark for Face Spoofing and Forgery Detection with Multimodal Large Language Models [63.946809247201905]
We introduce a new benchmark, namely SHIELD, to evaluate the ability of MLLMs on face spoofing and forgery detection.
We design true/false and multiple-choice questions to evaluate multimodal face data in these two face security tasks.
The results indicate that MLLMs hold substantial potential in the face security domain.
arXiv Detail & Related papers (2024-02-06T17:31:36Z) - Improving Vision Anomaly Detection with the Guidance of Language Modality [64.53005837237754]
This paper tackles the challenges of the vision modality from a multimodal point of view.
We propose Cross-modal Guidance (CMG) to tackle the redundant information issue and sparse space issue.
To learn a more compact latent space for the vision anomaly detector, CMLE learns a correlation structure matrix from the language modality.
arXiv Detail & Related papers (2023-10-04T13:44:56Z) - Multi-Modal Mutual Attention and Iterative Interaction for Referring Image Segmentation [49.6153714376745]
We address the problem of referring image segmentation that aims to generate a mask for the object specified by a natural language expression.
We propose Multi-Modal Mutual Attention ($\mathrm{M^3Att}$) and Multi-Modal Mutual Decoder ($\mathrm{M^3Dec}$) that better fuse information from the two input modalities.
arXiv Detail & Related papers (2023-05-24T16:26:05Z) - Multi-scale multi-modal micro-expression recognition algorithm based on transformer [17.980579727286518]
A micro-expression is a spontaneous unconscious facial muscle movement that can reveal the true emotions people attempt to hide.
We propose a multi-modal multi-scale algorithm based on a transformer network to learn local multi-grained features of micro-expressions.
The results show that the proposed algorithm achieves an accuracy of up to 78.73% on the SMIC database under single measurement and an F1 value of up to 0.9071 on CASME II of the combined database.
arXiv Detail & Related papers (2023-01-08T03:45:23Z) - Lagrangian Motion Magnification with Double Sparse Optical Flow Decomposition [2.1028463367241033]
We propose a novel approach for local Lagrangian motion magnification of facial micro-motions.
Our contribution is three-fold: first, we fine-tune the recurrent all-pairs field transforms (RAFT) deep learning approach for optical flow (OF) estimation on faces.
Second, since facial micro-motions are local in both space and time, we propose to approximate the OF field by components that are sparse in both space and time, leading to a double sparse decomposition.
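As a toy illustration of the double-sparsity idea (not the paper's algorithm), the sketch below fits a single rank-1 component to a flow magnitude field with $\ell_1$ shrinkage on both the spatial map and the temporal profile; the real decomposition uses multiple components and its own penalties and optimizer.

```python
import numpy as np

def soft_threshold(x, lam):
    """Proximal operator of the l1 norm; the standard sparsity-inducing
    shrinkage step."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def double_sparse_rank1(F, lam_s=0.1, lam_t=0.1, iters=50):
    """Toy double-sparse decomposition of a flow magnitude field F with
    shape (T, P): alternately fit a temporal profile t and a spatial
    map s, shrinking both so that F ~ outer(t, s) is sparse in time and
    in space. Penalties and initialization are assumptions.
    """
    T, P = F.shape
    t = np.ones(T)
    s = F.mean(axis=0)
    for _ in range(iters):
        s = soft_threshold(F.T @ t / (t @ t + 1e-8), lam_s)
        t = soft_threshold(F @ s / (s @ s + 1e-8), lam_t)
    return t, s
```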
arXiv Detail & Related papers (2022-04-15T20:24:11Z) - End-to-end Multi-modal Video Temporal Grounding [105.36814858748285]
We propose a multi-modal framework to extract complementary information from videos.
We adopt RGB images for appearance, optical flow for motion, and depth maps for image structure.
We conduct experiments on the Charades-STA and ActivityNet Captions datasets, and show that the proposed method performs favorably against state-of-the-art approaches.
arXiv Detail & Related papers (2021-07-12T17:58:10Z) - Shallow Optical Flow Three-Stream CNN for Macro- and Micro-Expression Spotting from Long Videos [15.322908569777551]
We propose a model to predict a score that captures the likelihood of a frame being in an expression interval.
We demonstrate the efficacy and efficiency of the proposed approach on the recent MEGC 2020 benchmark.
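Frame-level likelihood scores like these are typically converted into spotted intervals by thresholding; the sketch below is a generic post-processing step assumed for illustration, not this paper's or the benchmark's actual procedure.

```python
import numpy as np

def scores_to_intervals(scores, thresh=0.5, min_len=3):
    """Generic spotting post-processing sketch: threshold per-frame
    likelihood scores and keep runs of at least min_len frames as
    predicted expression intervals (inclusive frame indices)."""
    scores = np.asarray(scores)
    above = scores >= thresh
    intervals, start = [], None
    for i, flag in enumerate(above):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_len:
                intervals.append((start, i - 1))
            start = None
    if start is not None and len(scores) - start >= min_len:
        intervals.append((start, len(scores) - 1))
    return intervals
```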
arXiv Detail & Related papers (2021-06-11T16:19:48Z) - AOT: Appearance Optimal Transport Based Identity Swapping for Forgery Detection [76.7063732501752]
We provide a new identity swapping algorithm with large differences in appearance for face forgery detection.
The appearance gaps mainly arise from the large discrepancies in illuminations and skin colors.
A discriminator is introduced to distinguish the fake parts from a mix of real and fake image patches.
arXiv Detail & Related papers (2020-11-05T06:17:04Z) - Micro-Facial Expression Recognition Based on Deep-Rooted Learning Algorithm [0.0]
An effective Micro-Facial Expression Based Deep-Rooted Learning (MFEDRL) classifier is proposed in this paper.
The performance of the algorithm is evaluated using the recognition rate and false measures.
arXiv Detail & Related papers (2020-09-12T12:23:27Z)