RMES: Real-Time Micro-Expression Spotting Using Phase From Riesz Pyramid
- URL: http://arxiv.org/abs/2305.05523v1
- Date: Tue, 9 May 2023 15:22:18 GMT
- Title: RMES: Real-Time Micro-Expression Spotting Using Phase From Riesz Pyramid
- Authors: Yini Fang, Didan Deng, Liang Wu, Frederic Jumelle, Bertram Shi
- Abstract summary: Micro-expressions (MEs) are involuntary and subtle facial expressions that are thought to reveal feelings people are trying to hide.
Recent works leverage detailed facial motion representations, such as the optical flow, leading to high computational complexity.
We propose RMES, a real-time ME spotting framework, to reduce computational complexity and achieve real-time operation.
- Score: 4.449835214520728
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Micro-expressions (MEs) are involuntary and subtle facial expressions that
are thought to reveal feelings people are trying to hide. ME spotting detects
the temporal intervals containing MEs in videos. Detecting such quick and
subtle motions from long videos is difficult. Recent works leverage detailed
facial motion representations, such as the optical flow, and deep learning
models, leading to high computational complexity. To reduce computational
complexity and achieve real-time operation, we propose RMES, a real-time ME
spotting framework. We represent motion using phase computed by Riesz Pyramid,
and feed this motion representation into a three-stream shallow CNN, which
predicts the likelihood of each frame belonging to an ME. In comparison to
optical flow, phase provides more localized motion estimates, which are
essential for ME spotting, resulting in higher performance. Using phase also
reduces the required computation of the ME spotting pipeline by 77.8%. Despite
its relative simplicity and low computational complexity, our framework
achieves state-of-the-art performance on two public datasets: CAS(ME)2 and SAMM
Long Videos.
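A minimal sketch of the pipeline described in the abstract, not the authors' implementation: local phase is taken from one band-passed (Riesz-pyramid-like) level using the standard approximate Riesz-transform taps, and the frame-to-frame phase change feeds a shallow three-stream CNN that outputs a per-frame ME likelihood. The band-pass filter, the choice of channels per stream, and all layer sizes are illustrative assumptions.

```python
# Sketch of phase-based motion features from one Riesz-pyramid level
# plus a shallow three-stream spotter (layer sizes are assumptions).
import numpy as np
import torch
import torch.nn as nn
from scipy.ndimage import convolve, gaussian_filter

def band_pass(gray):
    """Crude stand-in for one band-passed (Laplacian-like) pyramid level."""
    return gaussian_filter(gray, 1.0) - gaussian_filter(gray, 2.0)

def riesz_phase(band):
    """Local amplitude, phase and orientation of a band-passed image."""
    kx = np.array([[0.5, 0.0, -0.5]])           # approximate Riesz-transform taps
    r1 = convolve(band, kx, mode="nearest")      # Riesz component along x
    r2 = convolve(band, kx.T, mode="nearest")    # Riesz component along y
    amp = np.sqrt(band**2 + r1**2 + r2**2) + 1e-8
    phase = np.arccos(np.clip(band / amp, -1.0, 1.0))
    orient = np.arctan2(r2, r1)
    return amp, phase, orient

def motion_channels(prev_gray, cur_gray):
    """Phase change between consecutive frames, split into x/y components."""
    _, p0, _ = riesz_phase(band_pass(prev_gray))
    amp, p1, o1 = riesz_phase(band_pass(cur_gray))
    dphi = p1 - p0                               # temporal phase change ~ local motion
    return np.stack([dphi * np.cos(o1), dphi * np.sin(o1), amp], axis=0)

class ShallowSpotter(nn.Module):
    """Three shallow streams (one per motion channel) fused into a per-frame score."""
    def __init__(self):
        super().__init__()
        self.streams = nn.ModuleList(
            nn.Sequential(nn.Conv2d(1, 3, 5), nn.ReLU(), nn.MaxPool2d(3))
            for _ in range(3))
        self.head = nn.Sequential(nn.Flatten(), nn.LazyLinear(400), nn.ReLU(),
                                  nn.Linear(400, 1), nn.Sigmoid())

    def forward(self, x):                        # x: (B, 3, H, W) motion channels
        feats = [s(x[:, i:i + 1]) for i, s in enumerate(self.streams)]
        return self.head(torch.cat(feats, dim=1))
```

In this reading, each frame's phase-change channels are scored independently; the predicted likelihoods would then be thresholded and grouped into candidate ME intervals.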
Related papers
- Motion-Grounded Video Reasoning: Understanding and Perceiving Motion at Pixel Level [63.18855743293851]
Motion-Grounded Video Reasoning is a new motion understanding task that requires visual answers (video segmentation masks) according to the input question.
This task extends existing grounding work on explicit action/motion grounding to a more general format by enabling implicit reasoning via questions.
We introduce a novel baseline model named Motion-Grounded Video Reasoning Assistant (MORA)
arXiv Detail & Related papers (2024-11-15T03:45:09Z)
- HERMES: temporal-coHERent long-forM understanding with Episodes and Semantics [32.117677036812836]

HERMES is a model that simulates episodic memory accumulation to capture action sequences.
Episodic COmpressor efficiently aggregates crucial representations from micro to semi-macro levels.
Semantic ReTRiever dramatically reduces feature dimensionality while preserving relevant macro-level information.
arXiv Detail & Related papers (2024-08-30T17:52:55Z)
- SpotFormer: Multi-Scale Spatio-Temporal Transformer for Facial Expression Spotting [11.978551396144532]
In this paper, we propose an efficient framework for facial expression spotting.
First, we propose a Sliding Window-based Multi-Resolution Optical flow (SW-MRO) feature, which calculates multi-resolution optical flow of the input sequence within compact sliding windows.
Second, we propose SpotFormer, a multi-scale spatio-temporal Transformer that simultaneously encodes facial spatio-temporal relationships of the SW-MRO features for accurate frame-level probability estimation.
Third, we introduce supervised contrastive learning into SpotFormer to enhance the discriminability between different types of expressions.
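A rough sketch of the SW-MRO idea summarized above, with OpenCV's Farneback estimator standing in for the optical-flow method; the window length, resolution set, and reference-frame choice are assumptions for illustration.

```python
# Hypothetical sliding-window, multi-resolution optical flow (SW-MRO) sketch.
import cv2
import numpy as np

def sw_mro(frames, window=9, scales=(1.0, 0.5, 0.25)):
    """Yield per-window flow at several resolutions w.r.t. the window's first frame."""
    for start in range(len(frames) - window + 1):
        ref, tgt = frames[start], frames[start + window - 1]
        flows = []
        for scale in scales:
            size = (int(ref.shape[1] * scale), int(ref.shape[0] * scale))
            flow = cv2.calcOpticalFlowFarneback(
                cv2.resize(ref, size), cv2.resize(tgt, size),
                None, 0.5, 3, 15, 3, 5, 1.2, 0)
            flows.append(flow)
        yield start, flows
```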
arXiv Detail & Related papers (2024-07-30T13:02:08Z)
- Hierarchical Temporal Context Learning for Camera-based Semantic Scene Completion [57.232688209606515]
We present HTCL, a novel Hierarchical Temporal Context Learning paradigm for improving camera-based semantic scene completion.
Our method ranks 1st on the SemanticKITTI benchmark and even surpasses LiDAR-based methods in terms of mIoU.
arXiv Detail & Related papers (2024-07-02T09:11:17Z)
- AU-aware graph convolutional network for Macro- and Micro-expression spotting [44.507747407072685]
We propose a graph-convolution-based network called the Action-Unit-aWare Graph Convolutional Network (AUW-GCN).
To inject prior information and to cope with the problem of small datasets, AU-related statistics are encoded into the network.
Our results consistently outperform baseline methods and achieve new SOTA performance on two benchmark datasets.
arXiv Detail & Related papers (2023-03-16T07:00:36Z)
- You Can Ground Earlier than See: An Effective and Efficient Pipeline for Temporal Sentence Grounding in Compressed Videos [56.676761067861236]
Given an untrimmed video, temporal sentence grounding aims to locate a target moment semantically according to a sentence query.
Previous works have achieved decent success, but they focus only on high-level visual features extracted from fully decoded frames.
We propose a new setting, compressed-domain TSG, which directly utilizes compressed videos rather than fully-decompressed frames as the visual input.
arXiv Detail & Related papers (2023-03-14T12:53:27Z)
- Masked Motion Encoding for Self-Supervised Video Representation Learning [84.24773072241945]
We present Masked Motion Encoding (MME), a new pre-training paradigm that reconstructs both appearance and motion information to explore temporal clues.
Motivated by the fact that humans can recognize an action by tracking objects' position and shape changes, we propose to reconstruct a motion trajectory that represents these two kinds of change in the masked regions.
Pre-trained with our MME paradigm, the model is able to anticipate long-term and fine-grained motion details.
arXiv Detail & Related papers (2022-10-12T11:19:55Z)
- Lagrangian Motion Magnification with Double Sparse Optical Flow Decomposition [2.1028463367241033]
We propose a novel approach for local Lagrangian motion magnification of facial micro-motions.
Our contribution is threefold: first, we fine-tune RAFT (Recurrent All-Pairs Field Transforms), a deep learning approach for optical flow (OF) estimation, on faces.
Second, since facial micro-motions are both local in space and time, we propose to approximate the OF field by sparse components both in space and time leading to a double sparse decomposition.
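One plausible way to write such a double sparse decomposition (the paper's exact objective may differ): stack the OF field into a space-by-time matrix $F$ and factor it into components that are sparse in both domains,

\[
\min_{U,V}\; \bigl\|F - U V^{\top}\bigr\|_F^2 \;+\; \lambda_1 \sum_k \|u_k\|_1 \;+\; \lambda_2 \sum_k \|v_k\|_1 ,
\]

where each column $u_k$ of $U$ is a spatial component, the corresponding column $v_k$ of $V$ is its temporal activation, and the $\ell_1$ penalties enforce sparsity in space and time respectively.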
arXiv Detail & Related papers (2022-04-15T20:24:11Z)
- Exploring Motion and Appearance Information for Temporal Sentence Grounding [52.01687915910648]
We propose a Motion-Appearance Reasoning Network (MARN) to solve temporal sentence grounding.
We develop separate motion and appearance branches to learn motion-guided and appearance-guided object relations.
Our proposed MARN outperforms previous state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2022-01-03T02:44:18Z)
- Shallow Optical Flow Three-Stream CNN for Macro- and Micro-Expression Spotting from Long Videos [15.322908569777551]
We propose a model to predict a score that captures the likelihood of a frame being in an expression interval.
We demonstrate the efficacy and efficiency of the proposed approach on the recent MEGC 2020 benchmark.
arXiv Detail & Related papers (2021-06-11T16:19:48Z)
- PAN: Towards Fast Action Recognition via Learning Persistence of Appearance [60.75488333935592]
Most state-of-the-art methods heavily rely on dense optical flow as motion representation.
In this paper, we shed light on fast action recognition by lifting the reliance on optical flow.
We design a novel motion cue called Persistence of Appearance (PA)
In contrast to optical flow, our PA focuses more on distilling the motion information at boundaries.
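A hedged sketch of a PA-like motion cue, reading it as per-pixel differences of shallow feature maps between adjacent frames so that responses concentrate at moving boundaries; the feature extractor and normalization here are assumptions, not necessarily the PAN formulation.

```python
# Hypothetical PA-like cue: squared differences of low-level conv features
# between adjacent frames (conv layer and normalization are assumptions).
import torch
import torch.nn as nn

class PersistenceOfAppearance(nn.Module):
    def __init__(self, channels=8):
        super().__init__()
        self.lowlevel = nn.Conv2d(3, channels, kernel_size=3, padding=1)

    def forward(self, frame_t, frame_t1):        # each: (B, 3, H, W)
        f0, f1 = self.lowlevel(frame_t), self.lowlevel(frame_t1)
        pa = (f1 - f0).pow(2).sum(dim=1, keepdim=True)   # (B, 1, H, W)
        return pa / (pa.amax(dim=(2, 3), keepdim=True) + 1e-6)
```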
arXiv Detail & Related papers (2020-08-08T07:09:54Z)