Attribute Surrogates Learning and Spectral Tokens Pooling in
Transformers for Few-shot Learning
- URL: http://arxiv.org/abs/2203.09064v1
- Date: Thu, 17 Mar 2022 03:49:58 GMT
- Title: Attribute Surrogates Learning and Spectral Tokens Pooling in
Transformers for Few-shot Learning
- Authors: Yangji He, Weihan Liang, Dongyang Zhao, Hong-Yu Zhou, Weifeng Ge,
Yizhou Yu, and Wenqiang Zhang
- Abstract summary: Vision transformers have been thought of as a promising alternative to convolutional neural networks for visual recognition.
This paper presents hierarchically cascaded transformers that exploit intrinsic image structures through spectral tokens pooling.
HCTransformers surpass the DINO baseline by a large margin of 9.7% 5-way 1-shot accuracy and 9.17% 5-way 5-shot accuracy on miniImageNet.
- Score: 50.95116994162883
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents new hierarchically cascaded transformers that can improve
data efficiency through attribute surrogates learning and spectral tokens
pooling. Vision transformers have recently been thought of as a promising
alternative to convolutional neural networks for visual recognition. But when
there is insufficient data, they get stuck in overfitting and show inferior
performance. To improve data efficiency, we propose hierarchically cascaded
transformers that exploit intrinsic image structures through spectral tokens
pooling and optimize the learnable parameters through latent attribute
surrogates. Spectral tokens pooling exploits the intrinsic image structure to
reduce the ambiguity between foreground content and background noise, and the
attribute surrogate learning scheme is designed to benefit from the rich
visual information in image-label pairs rather than the simple visual concepts
assigned by their labels. Our Hierarchically Cascaded Transformers, called
HCTransformers, are built upon the self-supervised learning framework DINO and
are tested on several popular few-shot learning benchmarks.
In the inductive setting, HCTransformers surpass the DINO baseline by a large
margin of 9.7% 5-way 1-shot accuracy and 9.17% 5-way 5-shot accuracy on
miniImageNet, which demonstrates that HCTransformers are effective at extracting
discriminative features. Also, HCTransformers show clear advantages over SOTA
few-shot classification methods in both 5-way 1-shot and 5-way 5-shot settings
on four popular benchmark datasets, including miniImageNet, tieredImageNet,
FC100, and CIFAR-FS. The trained weights and code are available at
https://github.com/StomachCold/HCTransformers.
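To make the idea of spectral tokens pooling concrete, below is a minimal, hypothetical PyTorch sketch, not the authors' released implementation: it builds a cosine-affinity graph over patch tokens, embeds the tokens with the eigenvectors of the normalized graph Laplacian, assigns each token to a group (a crude stand-in for k-means on that embedding), and average-pools each group into a coarser token. The function name, cluster count, and assignment rule are illustrative assumptions.

```python
# A minimal, hypothetical sketch of spectral tokens pooling (NOT the authors'
# released implementation). It groups ViT patch tokens by spectral clustering
# of a token-affinity graph and average-pools each group into a coarser token.
import torch
import torch.nn.functional as F

def spectral_token_pooling(tokens: torch.Tensor, num_clusters: int) -> torch.Tensor:
    """tokens: (N, D) patch embeddings; returns (num_clusters, D) pooled tokens."""
    # Cosine-similarity affinity graph between tokens.
    normed = F.normalize(tokens, dim=-1)
    affinity = (normed @ normed.t()).clamp(min=0)                      # (N, N)

    # Symmetric normalized graph Laplacian: L = I - D^{-1/2} A D^{-1/2}.
    deg = affinity.sum(dim=-1)
    d_inv_sqrt = deg.clamp(min=1e-8).rsqrt()
    lap = torch.eye(tokens.size(0)) - d_inv_sqrt[:, None] * affinity * d_inv_sqrt[None, :]

    # Spectral embedding: eigenvectors of the smallest Laplacian eigenvalues.
    _, eigvecs = torch.linalg.eigh(lap)
    embedding = eigvecs[:, :num_clusters]                              # (N, num_clusters)

    # Assign each token to the embedding axis it loads on most heavily
    # (a crude stand-in for running k-means on the spectral embedding).
    assignment = embedding.abs().argmax(dim=-1)                        # (N,)

    # Average-pool the tokens inside each cluster; fall back to the global
    # mean if a cluster happens to be empty.
    pooled = torch.stack([
        tokens[assignment == c].mean(dim=0) if (assignment == c).any() else tokens.mean(dim=0)
        for c in range(num_clusters)
    ])
    return pooled                                                      # (num_clusters, D)

# Example: pool 196 ViT patch tokens (a 14x14 grid) down to 49 coarser tokens.
patch_tokens = torch.randn(196, 384)
print(spectral_token_pooling(patch_tokens, num_clusters=49).shape)  # torch.Size([49, 384])
```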
Related papers
- SpectFormer: Frequency and Attention is what you need in a Vision Transformer [28.01996628113975]
Vision transformers have been applied successfully to image recognition tasks.
We hypothesize that both spectral and multi-headed attention play a major role.
We propose the novel SpectFormer architecture for transformers that combines spectral and multi-headed attention layers.
arXiv Detail & Related papers (2023-04-13T12:27:17Z)
- Making Vision Transformers Efficient from A Token Sparsification View [26.42498120556985]
We propose a novel Semantic Token ViT (STViT) for efficient global and local vision transformers.
Our method can achieve competitive results compared to the original networks in object detection and instance segmentation, with over 30% FLOPs reduction for the backbone.
In addition, we design an STViT-R(ecover) network, built on STViT, to restore the detailed spatial information and make the method work for downstream tasks.
arXiv Detail & Related papers (2023-03-15T15:12:36Z)
- Explicitly Increasing Input Information Density for Vision Transformers on Small Datasets [26.257612622358614]
Vision transformers have attracted considerable attention since the successful application of the Vision Transformer (ViT) to vision tasks.
This paper proposes to explicitly increase the input information density in the frequency domain.
Experiments demonstrate the effectiveness of the proposed approach on five small-scale datasets.
arXiv Detail & Related papers (2022-10-25T20:24:53Z)
- DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification [134.9393799043401]
We propose a dynamic token sparsification framework to prune redundant tokens based on the input (a minimal sketch of this idea appears after this list).
By hierarchically pruning 66% of the input tokens, our method reduces FLOPs by 31%-37% and improves the throughput by over 40%.
DynamicViT models can achieve very competitive complexity/accuracy trade-offs compared to state-of-the-art CNNs and vision transformers on ImageNet.
arXiv Detail & Related papers (2021-06-03T17:57:41Z)
- Transformer-Based Deep Image Matching for Generalizable Person Re-identification [114.56752624945142]
We investigate the possibility of applying Transformers for image matching and metric learning given pairs of images.
We find that the Vision Transformer (ViT) and the vanilla Transformer with decoders are not adequate for image matching due to their lack of image-to-image attention.
We propose a new simplified decoder, which drops the full attention implementation with the softmax weighting, keeping only the query-key similarity.
arXiv Detail & Related papers (2021-05-30T05:38:33Z)
- Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)
- Token Labeling: Training a 85.4% Top-1 Accuracy Vision Transformer with 56M Parameters on ImageNet [86.95679590801494]
We explore the potential of vision transformers in ImageNet classification by developing a bag of training techniques.
We show that by slightly tuning the structure of vision transformers and introducing token labeling, our models are able to achieve better results than their CNN counterparts.
arXiv Detail & Related papers (2021-04-22T04:43:06Z)
- Visual Transformers: Token-based Image Representation and Processing for Computer Vision [67.55770209540306]
The Visual Transformer (VT) operates in a semantic token space, judiciously attending to different image parts based on context.
Using an advanced training recipe, our VTs significantly outperform their convolutional counterparts.
For semantic segmentation on LIP and COCO-stuff, VT-based feature pyramid networks (FPN) achieve 0.35 points higher mIoU while reducing the FPN module's FLOPs by 6.5x.
arXiv Detail & Related papers (2020-06-05T20:49:49Z)
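For the DynamicViT entry above, the following is a minimal, hypothetical sketch of input-dependent token pruning, not the paper's code: a lightweight scoring head rates each patch token and only the top-scoring fraction is kept before the next transformer block. The hard top-k selection stands in for the differentiable masking used during training, and the module name, scoring head, and keep ratio are illustrative assumptions.

```python
# A minimal, hypothetical sketch of input-dependent token pruning in the
# spirit of dynamic token sparsification (NOT the paper's implementation):
# score each token with a small head and keep only the top-k tokens.
import torch
import torch.nn as nn

class TokenPruner(nn.Module):
    def __init__(self, dim: int, keep_ratio: float = 0.7):
        super().__init__()
        self.keep_ratio = keep_ratio
        # Lightweight scoring head that predicts a keep-score per token.
        self.score = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 1))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, D) patch tokens (class token handled separately).
        scores = self.score(tokens).squeeze(-1)                  # (B, N)
        k = max(1, int(tokens.size(1) * self.keep_ratio))
        keep_idx = scores.topk(k, dim=1).indices                 # (B, k)
        keep_idx = keep_idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
        # Gather only the highest-scoring tokens for the next block.
        return tokens.gather(dim=1, index=keep_idx)              # (B, k, D)

# Example: prune 196 patch tokens down to 137 before the next block.
pruner = TokenPruner(dim=384, keep_ratio=0.7)
x = torch.randn(2, 196, 384)
print(pruner(x).shape)  # torch.Size([2, 137, 384])
```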