Quaternion Orthogonal Transformer for Facial Expression Recognition in the Wild
- URL: http://arxiv.org/abs/2303.07831v1
- Date: Tue, 14 Mar 2023 12:07:48 GMT
- Title: Quaternion Orthogonal Transformer for Facial Expression Recognition in the Wild
- Authors: Yu Zhou, Liyuan Guo, Lianghai Jin
- Abstract summary: We develop a quaternion vision transformer (Q-ViT) for feature classification.
Experimental results on three in-the-wild FER datasets show that the proposed QOT outperforms several state-of-the-art models.
- Score: 3.2898396463438995
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Facial expression recognition (FER) is a challenging topic in artificial
intelligence. Recently, many researchers have attempted to introduce Vision
Transformer (ViT) to the FER task. However, ViT cannot fully exploit the
emotional features extracted from raw images and demands substantial computing
resources.
To overcome these problems, we propose a quaternion orthogonal transformer
(QOT) for FER. Firstly, to reduce redundancy among features extracted from a
pre-trained ResNet-50, we use an orthogonal loss to decompose and compact
these features into three sets of orthogonal sub-features. Secondly, the three
orthogonal sub-features are integrated into a quaternion matrix, which
maintains the correlations between different orthogonal components. Finally, we
develop a quaternion vision transformer (Q-ViT) for feature classification. The
Q-ViT adopts quaternion operations instead of the original operations in ViT,
which improves the final accuracy with fewer parameters. Experimental results
on three in-the-wild FER datasets show that the proposed QOT outperforms
several state-of-the-art models while reducing computation.
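As a concrete illustration of the pipeline described above, the following is a minimal PyTorch sketch of two of its ingredients: an orthogonality penalty that decorrelates three sub-feature sets, and a quaternion linear layer built on the Hamilton product, fed with a purely imaginary quaternion whose three imaginary parts hold the sub-features. The loss formulation, layer name, and dimensions are illustrative assumptions, not the authors' implementation; the paper's exact orthogonal loss and Q-ViT architecture may differ.

```python
import torch
import torch.nn as nn


def orthogonal_loss(f1, f2, f3):
    """Penalty pushing the three sub-feature sets toward mutual orthogonality.

    f1, f2, f3: (batch, dim) sub-features split from a backbone embedding.
    Here the penalty is the mean squared entry of each cross-Gram matrix;
    the paper's exact formulation may differ.
    """
    def cross(a, b):
        return (a.transpose(0, 1) @ b).pow(2).mean()

    return cross(f1, f2) + cross(f1, f3) + cross(f2, f3)


class QuaternionLinear(nn.Module):
    """Linear layer on quaternion-valued features via the Hamilton product x ⊗ W.

    The input is a concatenation [r, i, j, k] of four real blocks of size
    `in_features`. Sharing the four weight blocks across components is what
    gives quaternion layers roughly 4x fewer parameters than a real linear
    layer of the same overall width.
    """

    def __init__(self, in_features, out_features):
        super().__init__()

        def w():
            return nn.Parameter(0.02 * torch.randn(in_features, out_features))

        self.wr, self.wi, self.wj, self.wk = w(), w(), w(), w()

    def forward(self, x):
        r, i, j, k = x.chunk(4, dim=-1)
        out_r = r @ self.wr - i @ self.wi - j @ self.wj - k @ self.wk
        out_i = r @ self.wi + i @ self.wr + j @ self.wk - k @ self.wj
        out_j = r @ self.wj - i @ self.wk + j @ self.wr + k @ self.wi
        out_k = r @ self.wk + i @ self.wj - j @ self.wi + k @ self.wr
        return torch.cat([out_r, out_i, out_j, out_k], dim=-1)


# Toy usage: three orthogonal sub-features become the imaginary parts of a
# purely imaginary quaternion input (real part set to zero).
batch, dim = 8, 512
f1, f2, f3 = torch.randn(batch, dim), torch.randn(batch, dim), torch.randn(batch, dim)
q_input = torch.cat([torch.zeros(batch, dim), f1, f2, f3], dim=-1)  # (8, 4*512)

layer = QuaternionLinear(dim, 128)
hidden = layer(q_input)                  # (8, 4*128) quaternion-valued features
loss_orth = orthogonal_loss(f1, f2, f3)  # scalar regularizer added to the task loss
```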
Related papers
- ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions [4.554319452683839]
Vision Transformer (ViT) has achieved significant success in computer vision, but does not perform well in dense prediction tasks.
We present a plain, pre-training-free, and feature-enhanced ViT backbone with Convolutional Multi-scale feature interaction, named ViT-CoMer.
We propose a simple and efficient CNN-Transformer bidirectional fusion interaction module that performs multi-scale fusion across hierarchical features.
arXiv Detail & Related papers (2024-03-12T07:59:41Z)
- Denoising Vision Transformers [43.03068202384091]
We propose a two-stage denoising approach, termed Denoising Vision Transformers (DVT).
In the first stage, we separate the clean features from those contaminated by positional artifacts by enforcing cross-view feature consistency with neural fields on a per-image basis.
In the second stage, we train a lightweight transformer block to predict clean features from raw ViT outputs, leveraging the derived estimates of the clean features as supervision.
arXiv Detail & Related papers (2024-01-05T18:59:52Z)
- VST++: Efficient and Stronger Visual Saliency Transformer [74.26078624363274]
We develop an efficient and stronger VST++ model to explore global long-range dependencies.
We evaluate our model across various transformer-based backbones on RGB, RGB-D, and RGB-T SOD benchmark datasets.
arXiv Detail & Related papers (2023-10-18T05:44:49Z)
- Key-Value Transformer [47.64219291655723]
The key-value (KV) formulation generates symmetric attention maps (a minimal sketch of this key-only attention appears after this list), along with an asymmetric version that incorporates a 2D positional encoding into the attention matrix.
Experiments encompass three task types: synthetic tasks (such as reversing or sorting a list), vision (MNIST or CIFAR classification), and NLP.
arXiv Detail & Related papers (2023-05-28T20:26:06Z)
- Short Range Correlation Transformer for Occluded Person Re-Identification [4.339510167603376]
We propose a partial feature transformer-based person re-identification framework named PFT.
The proposed PFT utilizes three modules to enhance the efficiency of the vision transformer.
Experimental results over occluded and holistic re-identification datasets demonstrate that the proposed PFT network achieves superior performance consistently.
arXiv Detail & Related papers (2022-01-04T11:12:39Z)
- AdaViT: Adaptive Tokens for Efficient Vision Transformer [91.88404546243113]
We introduce AdaViT, a method that adaptively adjusts the inference cost of vision transformer (ViT) for images of different complexity.
AdaViT achieves this by automatically reducing the number of vision-transformer tokens that are processed as inference proceeds through the network.
arXiv Detail & Related papers (2021-12-14T18:56:07Z)
- Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)
- Transformers Solve the Limited Receptive Field for Monocular Depth Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper to apply transformers to pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z)
- Contrastive Triple Extraction with Generative Transformer [72.21467482853232]
We introduce a novel model, contrastive triple extraction with a generative transformer.
Specifically, we introduce a single shared transformer module for encoder-decoder-based generation.
To generate faithful results, we propose a novel triplet contrastive training objective.
arXiv Detail & Related papers (2020-09-14T05:29:24Z)
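The Key-Value Transformer entry above notes that the key-value formulation yields symmetric attention maps. As a point of reference, here is a minimal, self-contained sketch of one way to obtain such maps: reusing the key projection in place of the query, so the pre-softmax score matrix K Kᵀ is symmetric. The module name and dimensions are illustrative assumptions, not the paper's reference implementation, and the row-wise softmax still breaks exact symmetry of the final weights.

```python
import torch
import torch.nn as nn


class SymmetricKVAttention(nn.Module):
    """Single-head attention without a separate query projection: the keys are
    reused as queries, so the pre-softmax score matrix K @ K^T is symmetric."""

    def __init__(self, dim):
        super().__init__()
        self.key = nn.Linear(dim, dim, bias=False)
        self.value = nn.Linear(dim, dim, bias=False)
        self.scale = dim ** -0.5

    def forward(self, x):
        # x: (batch, tokens, dim)
        k = self.key(x)
        v = self.value(x)
        scores = (k @ k.transpose(-2, -1)) * self.scale  # symmetric (batch, T, T)
        attn = scores.softmax(dim=-1)                    # row-wise normalization
        return attn @ v


tokens = torch.randn(2, 16, 64)
out = SymmetricKVAttention(64)(tokens)   # (2, 16, 64)
```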
This list is automatically generated from the titles and abstracts of the papers on this site.