From Saliency to DINO: Saliency-guided Vision Transformer for Few-shot
Keypoint Detection
- URL: http://arxiv.org/abs/2304.03140v1
- Date: Thu, 6 Apr 2023 15:22:34 GMT
- Title: From Saliency to DINO: Saliency-guided Vision Transformer for Few-shot
Keypoint Detection
- Authors: Changsheng Lu, Hao Zhu, Piotr Koniusz
- Abstract summary: Few-shot keypoint detection (FSKD) attempts to localize any keypoints, including novel or base keypoints, depending on the reference samples.
FSKD requires semantically meaningful relations for keypoint similarity learning to overcome ubiquitous noise and ambiguous local patterns.
We present a novel saliency-guided vision transformer, dubbed SalViT, for few-shot keypoint detection.
- Score: 36.9781808268263
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Unlike current deep keypoint detectors that are trained to recognize a
limited number of body parts, few-shot keypoint detection (FSKD) attempts to
localize any keypoints, including novel or base keypoints, depending on the
reference samples. FSKD requires semantically meaningful relations for keypoint
similarity learning to overcome ubiquitous noise and ambiguous local patterns.
One rescue comes with the vision transformer (ViT), as it captures long-range
relations well. However, a ViT may model irrelevant features outside the region
of interest due to its global attention matrix, thus degrading similarity
learning between support and query features. In this paper, we present a novel
saliency-guided vision transformer, dubbed SalViT, for few-shot keypoint
detection. Our SalViT enjoys a uniquely designed masked self-attention and a
morphology learner: the former introduces the saliency map as a soft mask to
constrain self-attention to foregrounds, while the latter leverages the
so-called power normalization to adjust the morphology of the saliency map,
realizing a ``dynamically changing receptive field''. Moreover, as saliency
detectors add computation, we show that the attentive masks of the DINO
transformer can replace saliency. On top of SalViT, we also investigate i)
transductive FSKD, which enhances keypoint representations with unlabelled
data, and ii) FSKD under occlusions. We show that our model performs well on
five public datasets and achieves ~10% higher PCK than the normally trained
model under severe occlusions.
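To make the two ingredients above concrete, the following is a minimal PyTorch sketch of a saliency-masked self-attention layer with a power-normalized soft mask. It illustrates the idea as described in the abstract and is not the authors' released SalViT code: the class name, the single exponent `gamma` standing in for the learned morphology parameters, and the log-mask bias on the attention logits are assumptions made for readability.

```python
# Hypothetical sketch of saliency-guided masked self-attention (not the SalViT release).
# Assumes a per-image saliency map in [0, 1], resized to the ViT token grid, and a
# power-normalization exponent `gamma` that reshapes ("morphs") the soft mask.
import torch
import torch.nn as nn


class SaliencyMaskedAttention(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, saliency, gamma=0.5):
        # x:        (B, N, C) patch tokens
        # saliency: (B, N) soft foreground scores in [0, 1], from a saliency
        #           detector or, as the paper suggests, DINO attention maps
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)           # each (B, heads, N, head_dim)

        # Power normalization: saliency ** gamma with gamma < 1 dilates the mask
        # (wider effective receptive field), gamma > 1 sharpens it onto salient regions.
        soft_mask = saliency.clamp(min=1e-6) ** gamma  # (B, N)

        attn = (q @ k.transpose(-2, -1)) * self.scale  # (B, heads, N, N)
        # Add the log of the soft mask over key positions, so attention to
        # low-saliency (background) tokens is suppressed rather than hard-zeroed.
        attn = attn + soft_mask.log()[:, None, None, :]
        attn = attn.softmax(dim=-1)

        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

In practice the soft mask could come from an off-the-shelf saliency detector or, as the abstract notes, from the attention maps of a pretrained DINO ViT, which avoids the extra saliency computation.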
Related papers
- MimiQ: Low-Bit Data-Free Quantization of Vision Transformers with Encouraging Inter-Head Attention Similarity [22.058051526676998]
Data-free quantization (DFQ) is a technique that creates a lightweight network from its full-precision counterpart without the original training data, often through a synthetic dataset.
Several DFQ methods have been proposed for vision transformer (ViT) architectures, but they are not effective in low-bit settings.
We propose MimiQ, a novel DFQ method designed for ViTs that focuses on inter-head attention similarity.
arXiv Detail & Related papers (2024-07-29T13:57:40Z) - Generalized Face Forgery Detection via Adaptive Learning for Pre-trained Vision Transformer [54.32283739486781]
We present a Forgery-aware Adaptive Vision Transformer (FA-ViT) under the adaptive learning paradigm.
FA-ViT achieves 93.83% and 78.32% AUC scores on Celeb-DF and DFDC datasets in the cross-dataset evaluation.
arXiv Detail & Related papers (2023-09-20T06:51:11Z) - Laplacian-Former: Overcoming the Limitations of Vision Transformers in
Local Texture Detection [3.784298636620067]
Vision Transformer (ViT) models have demonstrated a breakthrough in a wide range of computer vision tasks.
These models struggle to capture high-frequency components of images, which can limit their ability to detect local textures and edge information.
We propose a new technique, Laplacian-Former, that enhances the self-attention map by adaptively re-calibrating the frequency information in a Laplacian pyramid.
arXiv Detail & Related papers (2023-08-31T19:56:14Z) - Spatial Transform Decoupling for Oriented Object Detection [43.44237345360947]
Vision Transformers (ViTs) have achieved remarkable success in computer vision tasks.
We present a novel approach, termed Spatial Transform Decoupling (STD), providing a simple-yet-effective solution for oriented object detection with ViTs.
arXiv Detail & Related papers (2023-08-21T08:36:23Z) - Learning Feature Matching via Matchable Keypoint-Assisted Graph Neural
Network [52.29330138835208]
Accurately matching local features between a pair of images is a challenging computer vision task.
Previous studies typically use attention-based graph neural networks (GNNs) with fully-connected graphs over keypoints within/across images.
We propose MaKeGNN, a sparse attention-based GNN architecture which bypasses non-repeatable keypoints and leverages matchable ones to guide message passing.
arXiv Detail & Related papers (2023-07-04T02:50:44Z) - ViT-Calibrator: Decision Stream Calibration for Vision Transformer [49.60474757318486]
We propose a new paradigm dubbed Decision Stream that boosts the performance of general Vision Transformers.
We shed light on the information propagation mechanism in the learning procedure by exploring the correlation between different tokens and the relevance coefficient of multiple dimensions.
arXiv Detail & Related papers (2023-04-10T02:40:24Z) - Self-Promoted Supervision for Few-Shot Transformer [178.52948452353834]
Self-promoted sUpervisioN (SUN) is a few-shot learning framework for vision transformers (ViTs).
SUN pretrains the ViT on the few-shot learning dataset and then uses it to generate individual location-specific supervision for guiding each patch token.
Experiments show that SUN using ViTs significantly surpasses other few-shot learning frameworks with ViTs and is the first to achieve higher performance than CNN state-of-the-art methods.
arXiv Detail & Related papers (2022-03-14T12:53:27Z) - Intriguing Properties of Vision Transformers [114.28522466830374]
Vision transformers (ViT) have demonstrated impressive performance across various machine vision problems.
We systematically study this question via an extensive set of experiments and comparisons with a high-performing convolutional neural network (CNN).
We show that the effective features of ViTs are due to flexible and dynamic receptive fields made possible by the self-attention mechanism.
arXiv Detail & Related papers (2021-05-21T17:59:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.