POSTER V2: A simpler and stronger facial expression recognition network
- URL: http://arxiv.org/abs/2301.12149v1
- Date: Sat, 28 Jan 2023 10:23:44 GMT
- Title: POSTER V2: A simpler and stronger facial expression recognition network
- Authors: Jiawei Mao, Rui Xu, Xuesong Yin, Yuanqi Chang, Binling Nie, Aibin
Huang
- Abstract summary: Facial expression recognition (FER) plays an important role in a variety of real-world applications such as human-computer interaction.
POSTER V1 achieves the state-of-the-art (SOTA) performance in FER by effectively combining facial landmark and image features.
In this paper, we propose POSTER V2, which improves POSTER V1 in three directions: cross-fusion, two-stream, and multi-scale feature extraction.
- Score: 8.836565857279052
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Facial expression recognition (FER) plays an important role in a variety of
real-world applications such as human-computer interaction. POSTER V1 achieves
state-of-the-art (SOTA) performance in FER by effectively combining facial
landmark and image features through a two-stream pyramid cross-fusion design.
However, the architecture of POSTER V1 is complex and incurs expensive
computational costs. To relieve this computational pressure, in this paper we
propose POSTER V2, which improves POSTER V1 in three directions: cross-fusion,
two-stream design, and multi-scale feature extraction. In cross-fusion, we
replace the vanilla cross-attention mechanism with a window-based
cross-attention mechanism. In the two-stream design, we remove the
image-to-landmark branch. For multi-scale feature extraction, POSTER V2
combines image features with multi-scale landmark features, replacing POSTER
V1's pyramid design. Extensive experiments on several standard datasets show
that our POSTER V2 achieves SOTA FER performance with the lowest computational
cost. For example, POSTER V2 reaches 92.21% on RAF-DB, 67.49% on AffectNet
(7 cls), and 63.77% on AffectNet (8 cls), using only 8.4G floating point
operations (FLOPs) and 43.7M parameters (Param). This demonstrates the
effectiveness of our improvements. The code and models are available at
https://github.com/Talented-Q/POSTER_V2.
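To illustrate the cross-fusion change described in the abstract, the sketch below restricts cross-attention between landmark and image tokens to non-overlapping local windows. It is a minimal, hypothetical PyTorch sketch (the class and parameter names such as WindowCrossAttention and window_size are illustrative and not taken from the authors' code); the actual implementation is in the linked repository.

    # Minimal sketch of window-based cross-attention between landmark and image
    # tokens (illustrative only; assumes both feature maps share an H x W grid
    # whose sides are divisible by the window size).
    import torch
    import torch.nn as nn

    class WindowCrossAttention(nn.Module):
        def __init__(self, dim, num_heads=8, window_size=7):
            super().__init__()
            self.window_size = window_size
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

        def forward(self, lmk_feat, img_feat):
            # lmk_feat, img_feat: (B, H*W, C) token sequences on the same grid
            B, N, C = img_feat.shape
            H = W = int(N ** 0.5)
            ws = self.window_size

            def to_windows(x):
                # (B, H*W, C) -> (B * num_windows, ws*ws, C)
                x = x.view(B, H // ws, ws, W // ws, ws, C)
                return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

            q = to_windows(lmk_feat)       # landmark tokens act as queries
            kv = to_windows(img_feat)      # image tokens act as keys/values
            out, _ = self.attn(q, kv, kv)  # attention restricted to each window
            # fold the windows back into a (B, H*W, C) sequence
            out = out.view(B, H // ws, W // ws, ws, ws, C)
            return out.permute(0, 1, 3, 2, 4, 5).reshape(B, N, C)

Compared with vanilla cross-attention over all H*W tokens, each query here attends only to the ws*ws tokens in its own window, which reduces the attention cost relative to the vanilla cross-attention described for POSTER V1.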
Related papers
- Qihoo-T2X: An Efficient Proxy-Tokenized Diffusion Transformer for Text-to-Any-Task [42.422925759342874]
We propose the Proxy-Tokenized Diffusion Transformer (PT-DiT) to model global visual information efficiently.
Within each transformer block, we compute an averaging token from each spatial-temporal window to serve as a proxy token for that region.
We also introduce window and shift window attention to address the limitations in detail modeling caused by the sparse attention mechanism.
arXiv Detail & Related papers (2024-09-06T03:13:45Z)
- Retain, Blend, and Exchange: A Quality-aware Spatial-Stereo Fusion Approach for Event Stream Recognition [57.74076383449153]
We propose a novel dual-stream framework for event stream-based pattern recognition via differentiated fusion, termed EFV++.
It models two common event representations simultaneously, i.e., event images and event voxels.
We achieve new state-of-the-art performance on the Bullying10k dataset, i.e., 90.51%, which exceeds the second place by +2.21%.
arXiv Detail & Related papers (2024-06-27T02:32:46Z)
- ParFormer: A Vision Transformer with Parallel Mixer and Sparse Channel Attention Patch Embedding [9.144813021145039]
This paper introduces ParFormer, a vision transformer that incorporates a Parallel Mixer and a Sparse Channel Attention Patch Embedding (SCAPE).
ParFormer improves feature extraction by combining convolutional and attention mechanisms.
For edge device deployment, ParFormer-T excels with a throughput of 278.1 images/sec, which is 1.38 times higher than EdgeNeXt-S.
The larger variant, ParFormer-L, reaches 83.5% Top-1 accuracy, offering a balanced trade-off between accuracy and efficiency.
arXiv Detail & Related papers (2024-03-22T07:32:21Z)
- ROI-Aware Multiscale Cross-Attention Vision Transformer for Pest Image Identification [1.9580473532948401]
We propose a novel ROI-aware multiscale cross-attention vision transformer (ROI-ViT).
The proposed ROI-ViT is designed using dual branches, called Pest and ROI branches, which take different types of maps as input: Pest images and ROI maps.
The experimental results show that the proposed ROI-ViT achieves 81.81%, 99.64%, and 84.66% for IP102, D0, and SauTeg pest datasets, respectively.
arXiv Detail & Related papers (2023-12-28T09:16:27Z)
- MixVPR: Feature Mixing for Visual Place Recognition [3.6739949215165164]
Visual Place Recognition (VPR) is a crucial part of mobile robotics and autonomous driving.
We introduce MixVPR, a new holistic feature aggregation technique that takes feature maps from pre-trained backbones as a set of global features.
We demonstrate the effectiveness of our technique through extensive experiments on multiple large-scale benchmarks.
arXiv Detail & Related papers (2023-03-03T19:24:03Z)
- Joint Spatial-Temporal and Appearance Modeling with Transformer for Multiple Object Tracking [59.79252390626194]
We propose a novel solution named TransSTAM, which leverages Transformer to model both the appearance features of each object and the spatial-temporal relationships among objects.
The proposed method is evaluated on multiple public benchmarks including MOT16, MOT17, and MOT20, and it achieves a clear performance improvement in both IDF1 and HOTA.
arXiv Detail & Related papers (2022-05-31T01:19:18Z)
- POSTER: A Pyramid Cross-Fusion Transformer Network for Facial Expression Recognition [11.525573321175925]
Facial expression recognition (FER) is an important task in computer vision, having practical applications in areas such as human-computer interaction, education, healthcare, and online monitoring.
Three key issues are especially prevalent: inter-class similarity, intra-class discrepancy, and scale sensitivity.
We propose a two-stream Pyramid crOss-fuSion TransformER network (POSTER) that aims to holistically solve all three issues.
arXiv Detail & Related papers (2022-04-08T14:01:41Z)
- Pruning Self-attentions into Convolutional Layers in Single Path [89.55361659622305]
Vision Transformers (ViTs) have achieved impressive performance over various computer vision tasks.
We propose Single-Path Vision Transformer pruning (SPViT) to efficiently and automatically compress the pre-trained ViTs.
Our SPViT can trim 52.0% of the FLOPs of DeiT-B while simultaneously gaining an impressive 0.6% in top-1 accuracy.
arXiv Detail & Related papers (2021-11-23T11:35:54Z)
- PSViT: Better Vision Transformer via Token Pooling and Attention Sharing [114.8051035856023]
We propose PSViT, a ViT with token Pooling and attention Sharing, to reduce redundancy.
Experimental results show that the proposed scheme can achieve up to 6.6% accuracy improvement in ImageNet classification.
arXiv Detail & Related papers (2021-08-07T11:30:54Z)
- Video Super-Resolution Transformer [85.11270760456826]
Video super-resolution (VSR), with the aim to restore a high-resolution video from its corresponding low-resolution version, is a spatial-temporal sequence prediction problem.
Recently, Transformer has been gaining popularity due to its parallel computing ability for sequence-to-sequence modeling.
In this paper, we present a spatial-temporal convolutional self-attention layer with a theoretical understanding to exploit the locality information.
arXiv Detail & Related papers (2021-06-12T20:00:32Z)
- Anchor-free Small-scale Multispectral Pedestrian Detection [88.7497134369344]
We propose a method for effective and efficient multispectral fusion of the two modalities in an adapted single-stage anchor-free base architecture.
We aim at learning pedestrian representations based on object center and scale rather than direct bounding box predictions.
Results show our method's effectiveness in detecting small-scale pedestrians.
arXiv Detail & Related papers (2020-08-19T13:13:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.