Spatio-Temporal Transformer for Dynamic Facial Expression Recognition in
the Wild
- URL: http://arxiv.org/abs/2205.04749v1
- Date: Tue, 10 May 2022 08:47:15 GMT
- Title: Spatio-Temporal Transformer for Dynamic Facial Expression Recognition in
the Wild
- Authors: Fuyan Ma, Bin Sun, Shutao Li
- Abstract summary: We propose a method to capture discriminative features within each frame and to model contextual relationships among frames.
We utilize a CNN backbone to translate each frame into a visual feature sequence.
Experiments indicate that our method provides an effective way to make use of the spatial and temporal dependencies.
- Score: 19.5702895176141
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Previous methods for dynamic facial expression recognition in the wild are mainly based
on Convolutional Neural Networks (CNNs), whose local operations ignore the
long-range dependencies in videos. To solve this problem, we propose the
spatio-temporal Transformer (STT) to capture discriminative features within
each frame and model contextual relationships among frames. Spatio-temporal
dependencies are captured and integrated by our unified Transformer.
Specifically, given an image sequence consisting of multiple frames as input,
we utilize the CNN backbone to translate each frame into a visual feature
sequence. Subsequently, the spatial attention and the temporal attention within
each block are jointly applied for learning spatio-temporal representations at
the sequence level. In addition, we propose the compact softmax cross entropy
loss to further encourage the learned features to have minimum intra-class
distance and maximum inter-class distance. Experiments on two in-the-wild
dynamic facial expression datasets (i.e., DFEW and AFEW) indicate that our
method provides an effective way to make use of the spatial and temporal
dependencies for dynamic facial expression recognition. The source code and the
training logs will be made publicly available.
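As a concrete illustration of the pipeline described above (per-frame CNN features followed by spatial attention within each frame and temporal attention across frames), here is a minimal PyTorch-style sketch. The module name SpatioTemporalBlock, the token layout, and the pre-norm residual structure are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class SpatioTemporalBlock(nn.Module):
    """One Transformer block applying spatial, then temporal self-attention.

    Hypothetical sketch: names and layer arrangement are assumptions, not the
    authors' code.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, tokens, dim), where tokens are the spatial
        # positions of the CNN feature map for one frame.
        b, t, n, d = x.shape

        # Spatial attention: attend over tokens within each frame.
        xs = x.reshape(b * t, n, d)
        h = self.norm1(xs)
        xs = xs + self.spatial_attn(h, h, h)[0]
        x = xs.reshape(b, t, n, d)

        # Temporal attention: attend over frames at each spatial position.
        xt = x.permute(0, 2, 1, 3).reshape(b * n, t, d)
        h = self.norm2(xt)
        xt = xt + self.temporal_attn(h, h, h)[0]
        x = xt.reshape(b, n, t, d).permute(0, 2, 1, 3)

        # Position-wise feed-forward network with residual connection.
        return x + self.mlp(self.norm3(x))
```

Stacking several such blocks over the CNN feature sequence and pooling the output would yield the sequence-level representation the abstract refers to.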
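The abstract does not give the formula for the compact softmax cross-entropy loss. As a rough stand-in for a loss that encourages small intra-class and large inter-class feature distances, the sketch below combines standard softmax cross-entropy with a center-loss-style penalty; it is a generic construction for illustration, not the authors' loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossEntropyWithCenters(nn.Module):
    """Softmax cross-entropy plus a center-loss-style intra-class penalty."""

    def __init__(self, num_classes: int, feat_dim: int, weight: float = 0.01):
        super().__init__()
        # One learnable center per expression class (illustrative).
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.weight = weight

    def forward(self, logits: torch.Tensor, features: torch.Tensor,
                labels: torch.Tensor) -> torch.Tensor:
        # Standard softmax cross-entropy on the classifier logits.
        ce = F.cross_entropy(logits, labels)
        # Squared distance between each feature and its class center,
        # penalizing large intra-class spread.
        intra = (features - self.centers[labels]).pow(2).sum(dim=1).mean()
        return ce + self.weight * intra
```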
Related papers
- MSSTNet: A Multi-Scale Spatio-Temporal CNN-Transformer Network for Dynamic Facial Expression Recognition [4.512502015606517]
We propose a Multi-Scale Spatio-Temporal CNN-Transformer network (MSSTNet).
Our approach takes spatial features of different scales extracted by a CNN and feeds them into a Multi-scale Embedding Layer (MELayer)
The MELayer extracts multi-scale spatial information and encodes these features before sending them into a Transformer (T-Former)
arXiv Detail & Related papers (2024-04-12T12:30:48Z) - Alignment-free HDR Deghosting with Semantics Consistent Transformer [76.91669741684173]
High dynamic range imaging aims to retrieve information from multiple low-dynamic range inputs to generate realistic output.
Existing methods often focus on the spatial misalignment across input frames caused by the foreground and/or camera motion.
We propose a novel alignment-free network with a Semantics Consistent Transformer (SCTNet) with both spatial and channel attention modules.
arXiv Detail & Related papers (2023-05-29T15:03:23Z) - LOGO-Former: Local-Global Spatio-Temporal Transformer for Dynamic Facial
Expression Recognition [19.5702895176141]
Previous methods for dynamic facial expression recognition (DFER) in the wild are mainly based on Convolutional Neural Networks (CNNs), whose local operations ignore the long-range dependencies in videos.
Transformer-based methods for DFER can achieve better performance but result in higher FLOPs and computational costs; the proposed LOGO-Former addresses this trade-off.
Experiments on two in-the-wild dynamic facial expression datasets (i.e., DFEW and FERV39K) indicate that our method provides an effective way to make use of the spatial and temporal dependencies for DFER.
arXiv Detail & Related papers (2023-05-05T07:53:13Z) - Deeply-Coupled Convolution-Transformer with Spatial-temporal
Complementary Learning for Video-based Person Re-identification [91.56939957189505]
We propose a novel spatial-temporal complementary learning framework named Deeply-Coupled Convolution-Transformer (DCCT) for high-performance video-based person Re-ID.
Our framework could attain better performance than most state-of-the-art methods.
arXiv Detail & Related papers (2023-04-27T12:16:44Z) - Temporal Interpolation Is All You Need for Dynamic Neural Radiance
Fields [4.863916681385349]
We propose a method to train neural fields of dynamic scenes based on temporal interpolation of feature vectors.
In the neural representation, we extract features from space-time inputs via multiple neural network modules and interpolate them based on time frames.
In the grid representation, space-time features are learned via four-dimensional hash grids, which remarkably reduces training time.
arXiv Detail & Related papers (2023-02-18T12:01:23Z) - DynaST: Dynamic Sparse Transformer for Exemplar-Guided Image Generation [56.514462874501675]
We propose a dynamic sparse attention based Transformer model to achieve fine-level matching with favorable efficiency.
The heart of our approach is a novel dynamic-attention unit, dedicated to covering the variation on the optimal number of tokens one position should focus on.
Experiments on three applications, pose-guided person image generation, edge-based face synthesis, and undistorted image style transfer, demonstrate that DynaST achieves superior performance in local details.
arXiv Detail & Related papers (2022-07-13T11:12:03Z) - Spatial-Temporal Transformer for Dynamic Scene Graph Generation [34.190733855032065]
We propose a neural network that consists of two core modules: (1) a spatial encoder that takes an input frame to extract spatial context and reason about the visual relationships within a frame, and (2) a temporal decoder which takes the output of the spatial encoder as input.
Our method is validated on the benchmark dataset Action Genome (AG)
arXiv Detail & Related papers (2021-07-26T16:30:30Z) - Spatial-Temporal Correlation and Topology Learning for Person
Re-Identification in Videos [78.45050529204701]
We propose a novel spatial-temporal Correlation and Topology Learning (CTL) framework to pursue discriminative and robust representations by modeling cross-scale spatial-temporal correlation.
CTL utilizes a CNN backbone and a key-points estimator to extract semantic local features from the human body.
It explores a context-reinforced topology to construct multi-scale graphs by considering both global contextual information and the physical connections of the human body.
arXiv Detail & Related papers (2021-04-15T14:32:12Z) - Coarse-Fine Networks for Temporal Activity Detection in Videos [45.03545172714305]
We introduce 'Coarse-Fine Networks', a two-stream architecture which benefits from different abstractions of temporal resolution to learn better video representations for long-term motion.
We show that our method can outperform the state of the art for action detection on public datasets with a significantly reduced compute and memory footprint.
arXiv Detail & Related papers (2021-03-01T20:48:01Z) - Video-based Facial Expression Recognition using Graph Convolutional
Networks [57.980827038988735]
We introduce a Graph Convolutional Network (GCN) layer into a common CNN-RNN based model for video-based facial expression recognition.
We evaluate our method on three widely-used datasets, CK+, Oulu-CASIA and MMI, and also one challenging wild dataset AFEW8.0.
arXiv Detail & Related papers (2020-10-26T07:31:51Z) - Co-Saliency Spatio-Temporal Interaction Network for Person
Re-Identification in Videos [85.6430597108455]
We propose a novel Co-Saliency Spatio-Temporal Interaction Network (CSTNet) for person re-identification in videos.
It captures the common salient foreground regions among video frames and explores the spatial-temporal long-range context interdependency from such regions.
Multiple spatial-temporal interaction modules within CSTNet are proposed, which exploit the spatial and temporal long-range context interdependencies of such features together with spatial-temporal information correlation.
arXiv Detail & Related papers (2020-04-10T10:23:58Z)