Vision Transformer Equipped with Neural Resizer on Facial Expression
Recognition Task
- URL: http://arxiv.org/abs/2204.02181v1
- Date: Tue, 5 Apr 2022 13:04:04 GMT
- Title: Vision Transformer Equipped with Neural Resizer on Facial Expression
Recognition Task
- Authors: Hyeonbin Hwang, Soyeon Kim, Wei-Jin Park, Jiho Seo, Kyungtae Ko, Hyeon
Yeo
- Abstract summary: We propose a novel training framework, Neural Resizer, that supports the Transformer by compensating for lost information while downscaling in a data-driven manner.
Experiments show that our Neural Resizer with the F-PDLS loss function generally improves performance across Transformer variants.
- Score: 1.3048920509133808
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In wild conditions, Facial Expression Recognition is often
challenged by low-quality data and imbalanced, ambiguous labels. The field
has benefited greatly from CNN-based approaches; however, CNN models are
structurally limited in relating distant facial regions. As a remedy,
Transformers with their global receptive field have been introduced to
vision tasks, but they require the input spatial size to be adjusted to that
of the pretrained models in order to exploit their strong inductive biases.
We therefore raise the question of whether deterministic interpolation is
sufficient for feeding low-resolution data to a Transformer. In this work,
we propose a novel training framework, Neural Resizer, which supports the
Transformer by compensating for lost information and downscaling in a
data-driven manner, trained with a loss function that balances noisiness and
imbalance. Experiments show that our Neural Resizer with the F-PDLS loss
function improves performance across Transformer variants in general and
nearly achieves state-of-the-art performance.
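The core idea can be pictured with a minimal sketch: a small learnable CNN replaces deterministic bilinear interpolation when mapping low-resolution face crops to the input size a pretrained ViT expects. The PyTorch code below is an illustrative assumption, not the authors' exact architecture; the layer widths, the residual design, and the placeholder backbone are ours, and the F-PDLS loss is not implemented here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeuralResizer(nn.Module):
    """Illustrative learnable resizer: bilinear resizing to the ViT input size
    plus a small CNN that learns a residual correction, compensating for
    information lost by plain interpolation (sketch, not the paper's design)."""

    def __init__(self, out_size: int = 224, width: int = 16):
        super().__init__()
        self.out_size = out_size
        self.refine = nn.Sequential(
            nn.Conv2d(3, width, 3, padding=1),
            nn.BatchNorm2d(width),
            nn.ReLU(inplace=True),
            nn.Conv2d(width, width, 3, padding=1),
            nn.BatchNorm2d(width),
            nn.ReLU(inplace=True),
            nn.Conv2d(width, 3, 3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        base = F.interpolate(x, size=(self.out_size, self.out_size),
                             mode="bilinear", align_corners=False)
        return base + self.refine(base)

# Usage sketch: resize low-resolution faces, then classify with a pretrained
# ViT backbone (here a stand-in module named `vit`, 7 expression classes).
resizer = NeuralResizer(out_size=224)
vit = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 7))  # placeholder backbone
faces = torch.randn(8, 3, 96, 96)                                # low-resolution inputs
logits = vit(resizer(faces))                                     # 8 x 7 expression logits
```

Because the resizer is differentiable, it can be trained end-to-end with the downstream classification loss, so the downscaling itself becomes data-driven rather than fixed.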
Related papers
- Unveil Benign Overfitting for Transformer in Vision: Training Dynamics, Convergence, and Generalization [88.5582111768376]
We study the optimization of a Transformer composed of a self-attention layer with softmax followed by a fully connected layer under gradient descent on a certain data distribution model.
Our results establish a sharp condition that can distinguish between the small test error phase and the large test error regime, based on the signal-to-noise ratio in the data model.
arXiv Detail & Related papers (2024-09-28T13:24:11Z)
- Training Transformer Models by Wavelet Losses Improves Quantitative and Visual Performance in Single Image Super-Resolution [6.367865391518726]
Transformer-based models have achieved remarkable results in low-level vision tasks, including image super-resolution (SR).
To activate more input pixels globally, hybrid attention models have been proposed.
We employ wavelet losses to train Transformer models to improve quantitative and subjective performance.
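As a rough illustration of the idea (not the cited paper's exact formulation), a wavelet loss can be sketched as an L1 penalty between single-level Haar sub-bands of the super-resolved and ground-truth images; the function names and weighting below are our own.

```python
import torch

def haar_dwt(x: torch.Tensor):
    """One-level 2D Haar transform of an (N, C, H, W) tensor with even H, W.
    Returns the LL, LH, HL, HH sub-bands."""
    a = x[..., 0::2, 0::2]
    b = x[..., 0::2, 1::2]
    c = x[..., 1::2, 0::2]
    d = x[..., 1::2, 1::2]
    ll = (a + b + c + d) / 2
    lh = (a + b - c - d) / 2
    hl = (a - b + c - d) / 2
    hh = (a - b - c + d) / 2
    return ll, lh, hl, hh

def wavelet_l1_loss(sr: torch.Tensor, hr: torch.Tensor) -> torch.Tensor:
    """Sum of L1 distances between corresponding Haar sub-bands, so that
    errors in high-frequency detail are penalised explicitly."""
    return sum(torch.mean(torch.abs(s - h))
               for s, h in zip(haar_dwt(sr), haar_dwt(hr)))

sr = torch.rand(4, 3, 64, 64, requires_grad=True)  # super-resolved prediction
hr = torch.rand(4, 3, 64, 64)                      # ground-truth high-resolution image
loss = wavelet_l1_loss(sr, hr)
loss.backward()
```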
arXiv Detail & Related papers (2024-04-17T11:25:19Z)
- In Search of a Data Transformation That Accelerates Neural Field Training [37.39915075581319]
We focus on how permuting pixel locations affects the convergence speed of SGD.
Counter-intuitively, we find that randomly permuting the pixel locations can considerably accelerate training.
Our analyses suggest that random pixel permutations remove easy-to-fit patterns, which facilitate optimization in the early stage but hinder capturing fine details of the signal.
arXiv Detail & Related papers (2023-11-28T06:17:49Z)
- Local Distortion Aware Efficient Transformer Adaptation for Image Quality Assessment [62.074473976962835]
We show that with proper injection of local distortion features, a larger pretrained and fixed foundation model performs better in IQA tasks.
Specifically, since the vision transformer (ViT) lacks local distortion structure and inductive bias, we use another pretrained convolutional neural network (CNN).
We propose a local distortion extractor to obtain local distortion features from the pretrained CNN and a local distortion injector to inject the local distortion features into ViT.
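One simple way to picture such injection (an assumption-heavy sketch, not necessarily the paper's injector design) is to pool CNN feature maps onto the ViT patch-token grid, project them to the token dimension, and add them to the patch tokens before the transformer blocks:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalDistortionInjector(nn.Module):
    """Illustrative sketch: pool CNN feature maps to the ViT patch grid,
    project them to the token dimension, and add them to the patch tokens."""

    def __init__(self, cnn_channels: int = 256, vit_dim: int = 768):
        super().__init__()
        self.proj = nn.Linear(cnn_channels, vit_dim)

    def forward(self, tokens: torch.Tensor, cnn_feat: torch.Tensor) -> torch.Tensor:
        # tokens: (N, P*P, D) patch tokens; cnn_feat: (N, C, H, W) CNN features.
        grid = int(tokens.shape[1] ** 0.5)                      # patch grid side, e.g. 14
        pooled = F.adaptive_avg_pool2d(cnn_feat, (grid, grid))  # (N, C, P, P)
        pooled = pooled.flatten(2).transpose(1, 2)              # (N, P*P, C)
        return tokens + self.proj(pooled)                       # inject local features

tokens = torch.randn(2, 14 * 14, 768)     # ViT patch tokens (no CLS token here)
cnn_feat = torch.randn(2, 256, 56, 56)    # features from a pretrained CNN
fused = LocalDistortionInjector()(tokens, cnn_feat)  # (2, 196, 768)
```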
arXiv Detail & Related papers (2023-08-23T08:41:21Z)
- Image Deblurring by Exploring In-depth Properties of Transformer [86.7039249037193]
We leverage deep features extracted from a pretrained vision transformer (ViT) to encourage recovered images to be sharp without sacrificing the performance measured by the quantitative metrics.
By comparing the transformer features of the recovered image and the target one, the pretrained transformer provides high-resolution, blur-sensitive semantic information.
One approach regards the features as vectors and computes the discrepancy between the representations extracted from the recovered image and the target in Euclidean space.
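The feature-space term can be sketched as a Euclidean distance between features of the restored and sharp images from a frozen extractor. In the sketch below, the extractor is a stand-in module (the cited work uses a pretrained ViT), and the 0.1 weighting is an arbitrary assumption:

```python
import torch
import torch.nn as nn

def feature_space_loss(feature_extractor: nn.Module,
                       restored: torch.Tensor,
                       target: torch.Tensor) -> torch.Tensor:
    """Euclidean distance between frozen-extractor features of the restored
    image and the sharp target; added on top of a pixel-space loss."""
    with torch.no_grad():
        target_feat = feature_extractor(target)
    restored_feat = feature_extractor(restored)
    return torch.mean((restored_feat - target_feat) ** 2)

# Placeholder extractor standing in for a frozen pretrained ViT.
extractor = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128))
restored = torch.rand(2, 3, 64, 64, requires_grad=True)
target = torch.rand(2, 3, 64, 64)
total_loss = (torch.mean(torch.abs(restored - target))
              + 0.1 * feature_space_loss(extractor, restored, target))
total_loss.backward()
```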
arXiv Detail & Related papers (2023-03-24T14:14:25Z)
- AdaViT: Adaptive Tokens for Efficient Vision Transformer [91.88404546243113]
We introduce AdaViT, a method that adaptively adjusts the inference cost of vision transformer (ViT) for images of different complexity.
AdaViT achieves this by automatically reducing the number of tokens in vision transformers that are processed in the network as inference proceeds.
arXiv Detail & Related papers (2021-12-14T18:56:07Z)
- Transformers Solve the Limited Receptive Field for Monocular Depth Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper to apply transformers to pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z)
- Bayesian Transformer Language Models for Speech Recognition [59.235405107295655]
State-of-the-art neural language models (LMs) represented by Transformers are highly complex.
This paper proposes a full Bayesian learning framework for Transformer LM estimation.
arXiv Detail & Related papers (2021-02-09T10:55:27Z)
- Toward Transformer-Based Object Detection [12.704056181392415]
Vision Transformers can be used as a backbone by a common detection task head to produce competitive COCO results.
ViT-FRCNN demonstrates several known properties associated with transformers, including large pretraining capacity and fast fine-tuning performance.
We view ViT-FRCNN as an important stepping stone toward a pure-transformer solution of complex vision tasks such as object detection.
arXiv Detail & Related papers (2020-12-17T22:33:14Z)
- Probabilistic Spatial Transformer Networks [0.6999740786886537]
We propose a probabilistic extension that estimates a stochastic transformation rather than a deterministic one.
We show that these two properties lead to improved classification performance, robustness and model calibration.
We further demonstrate that the approach generalizes to non-visual domains by improving model performance on time-series data.
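The gist can be sketched as follows (our own simplification under assumed network sizes and parameterisation, not the authors' exact model): a localisation network predicts a mean and log-variance over the six affine parameters, a transformation is sampled with the reparameterisation trick, and the input is warped with that sample.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProbabilisticAffineSTN(nn.Module):
    """Illustrative sketch: the localisation net outputs 6 means and
    6 log-variances over affine parameters; a transform is sampled per image."""

    def __init__(self):
        super().__init__()
        self.loc = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 64), nn.ReLU(),
                                 nn.Linear(64, 12))  # 6 means + 6 log-variances
        # Initialise towards the identity transform with small variance.
        identity = torch.tensor([1., 0., 0., 0., 1., 0.])
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data = torch.cat([identity, torch.full((6,), -4.0)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mu, log_var = self.loc(x).chunk(2, dim=1)
        theta = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)  # reparameterise
        grid = F.affine_grid(theta.view(-1, 2, 3), x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)

x = torch.rand(4, 1, 28, 28)           # e.g. grayscale digits
warped = ProbabilisticAffineSTN()(x)   # one sampled transformation per image
```

At test time, predictions can be averaged over several sampled transformations per input, which is where the reported gains in robustness and calibration come from.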
arXiv Detail & Related papers (2020-04-07T18:22:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.