SERE: Exploring Feature Self-relation for Self-supervised Transformer
- URL: http://arxiv.org/abs/2206.05184v3
- Date: Mon, 18 Sep 2023 02:04:02 GMT
- Title: SERE: Exploring Feature Self-relation for Self-supervised Transformer
- Authors: Zhong-Yu Li, Shanghua Gao, Ming-Ming Cheng
- Abstract summary: Vision transformers (ViT) have strong representation ability with spatial self-attention and channel-level feedforward networks.
Recent works reveal that self-supervised learning helps unleash the great potential of ViT.
We observe that relational modeling on spatial and channel dimensions distinguishes ViT from other networks.
- Score: 79.5769147071757
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning representations with self-supervision for convolutional networks
(CNN) has been validated to be effective for vision tasks. As an alternative to
CNN, vision transformers (ViT) have strong representation ability with spatial
self-attention and channel-level feedforward networks. Recent works reveal that
self-supervised learning helps unleash the great potential of ViT. Still, most
works follow self-supervised strategies designed for CNN, e.g., instance-level
discrimination of samples, but they ignore the properties of ViT. We observe
that relational modeling on spatial and channel dimensions distinguishes ViT
from other networks. To enforce this property, we explore the feature
SElf-RElation (SERE) for training self-supervised ViT. Specifically, instead of
conducting self-supervised learning solely on feature embeddings from multiple
views, we utilize the feature self-relations, i.e., spatial/channel
self-relations, for self-supervised learning. Self-relation based learning
further enhances the relation modeling ability of ViT, resulting in stronger
representations that stably improve performance on multiple downstream tasks.
Our source code is publicly available at: https://github.com/MCG-NKU/SERE.
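As a rough illustration of the idea only (the official implementation lives in the repository above; the similarity measure, temperature, and KL-based alignment below are assumptions of this sketch), spatial and channel self-relations can be computed from a ViT's patch features and aligned across two augmented views:

```python
import torch
import torch.nn.functional as F

def spatial_self_relation(x, tau=0.1):
    """Relation among patch tokens: x is (B, N, D) -> (B, N, N)."""
    x = F.normalize(x, dim=-1)
    return F.softmax(x @ x.transpose(1, 2) / tau, dim=-1)

def channel_self_relation(x, tau=0.1):
    """Relation among feature channels: x is (B, N, D) -> (B, D, D)."""
    x = F.normalize(x, dim=1)
    return F.softmax(x.transpose(1, 2) @ x / tau, dim=-1)

def relation_alignment_loss(rel_a, rel_b):
    """Cross-view consistency between two relation maps (KL divergence)."""
    return F.kl_div(rel_a.clamp_min(1e-8).log(), rel_b.detach(),
                    reduction="batchmean")

# Usage: feats_v1 / feats_v2 are patch embeddings of two augmented views,
# e.g. produced by a ViT backbone (shape B x N x D).
feats_v1, feats_v2 = torch.randn(2, 196, 384), torch.randn(2, 196, 384)
loss = (relation_alignment_loss(spatial_self_relation(feats_v1),
                                spatial_self_relation(feats_v2)) +
        relation_alignment_loss(channel_self_relation(feats_v1),
                                channel_self_relation(feats_v2)))
```

The spatial relation compares patch tokens with one another while the channel relation compares feature channels; aligning both across views is what the abstract refers to as self-relation based learning.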
Related papers
- What do Vision Transformers Learn? A Visual Exploration [68.50771218442776]
Vision transformers (ViTs) are quickly becoming the de-facto architecture for computer vision.
This paper addresses the obstacles to performing visualizations on ViTs and explores the underlying differences between ViTs and CNNs.
We also conduct large-scale visualizations on a range of ViT variants, including DeiT, CoaT, ConViT, PiT, Swin, and Twin.
arXiv Detail & Related papers (2022-12-13T16:55:12Z) - Teaching Matters: Investigating the Role of Supervision in Vision Transformers [32.79398665600664]
We show that Vision Transformers (ViTs) learn a diverse range of behaviors in terms of their attention, representations, and downstream performance.
We also discover ViT behaviors that are consistent across supervision, including the emergence of Offset Local Attention Heads.
Our analysis shows that ViTs are highly flexible and learn to process local and global information in different orders depending on their training method.
arXiv Detail & Related papers (2022-12-07T18:59:45Z) - Attention Distillation: self-supervised vision transformer students need more guidance [37.37272042838411]
We study knowledge distillation of self-supervised vision transformers (ViT-SSKD).
In experiments on ImageNet-Subset and ImageNet-1K, we show that our method AttnDistill outperforms existing self-supervised knowledge distillation (SSKD) methods.
We are also the first to apply the tiny ViT-T model to self-supervised learning.
arXiv Detail & Related papers (2022-10-03T14:01:46Z) - Pretraining the Vision Transformer using self-supervised methods for vision based Deep Reinforcement Learning [0.0]
We study pretraining a Vision Transformer using several state-of-the-art self-supervised methods and assess the quality of the learned representations.
Our results show that all methods are effective in learning useful representations and avoiding representational collapse.
The encoder pretrained with the temporal order verification task shows the best results across all experiments.
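As a hedged sketch of the temporal order verification pretext task mentioned above (the encoder interface, clip shapes, and classifier head are assumptions of this illustration, not the paper's code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OrderVerificationHead(nn.Module):
    """Predict whether a short clip's frames are in their original temporal
    order. `encoder` is assumed to map a batch of images to one feature
    vector per image (hypothetical interface)."""

    def __init__(self, encoder, feat_dim, num_frames=3):
        super().__init__()
        self.encoder = encoder
        self.classifier = nn.Linear(feat_dim * num_frames, 2)

    def forward(self, clips, shuffled_labels):
        # clips: (B, T, C, H, W), shuffled or kept in order by the data
        # pipeline; shuffled_labels: (B,) with 1 = shuffled, 0 = ordered.
        B, T = clips.shape[:2]
        feats = self.encoder(clips.flatten(0, 1))     # (B*T, feat_dim)
        feats = feats.reshape(B, -1)                  # concat frame features
        return F.cross_entropy(self.classifier(feats), shuffled_labels)
```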
arXiv Detail & Related papers (2022-09-22T10:18:59Z) - Self-Promoted Supervision for Few-Shot Transformer [178.52948452353834]
Self-promoted sUpervisioN (SUN) is a few-shot learning framework for vision transformers (ViTs).
SUN pretrains the ViT on the few-shot learning dataset and then uses it to generate individual location-specific supervision for guiding each patch token.
Experiments show that SUN with ViTs significantly surpasses other ViT-based few-shot learning frameworks and is the first to outperform state-of-the-art CNN counterparts.
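A minimal sketch of such location-specific supervision, assuming the pretrained teacher emits a soft pseudo label per patch location (shapes, temperature, and names below are illustrative, not SUN's actual code):

```python
import torch
import torch.nn.functional as F

def patch_supervision_loss(student_patch_logits, teacher_patch_logits, tau=0.1):
    """Each student patch token is trained to match the teacher's soft
    pseudo label at the same location.

    *_patch_logits: (B, N, K) logits over K pseudo classes for N patches."""
    with torch.no_grad():
        targets = F.softmax(teacher_patch_logits / tau, dim=-1)
    log_probs = F.log_softmax(student_patch_logits, dim=-1)
    return -(targets * log_probs).sum(dim=-1).mean()
```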
arXiv Detail & Related papers (2022-03-14T12:53:27Z) - Self-slimmed Vision Transformer [52.67243496139175]
Vision transformers (ViTs) have become popular architectures and have outperformed convolutional neural networks (CNNs) on various vision tasks.
We propose a generic self-slimmed learning approach for vanilla ViTs, namely SiT.
Specifically, we first design a novel Token Slimming Module (TSM), which can boost the inference efficiency of ViTs.
arXiv Detail & Related papers (2021-11-24T16:48:57Z) - Do Vision Transformers See Like Convolutional Neural Networks? [45.69780772718875]
Recent work has shown that (Vision) Transformer models (ViT) can achieve comparable or even superior performance to CNNs on image classification tasks.
Are they acting like convolutional networks, or learning entirely different visual representations?
We find striking differences between the two architectures, such as ViT having more uniform representations across all layers.
arXiv Detail & Related papers (2021-08-19T17:27:03Z) - Refiner: Refining Self-attention for Vision Transformers [85.80887884154427]
Vision Transformers (ViTs) have shown competitive accuracy in image classification tasks compared with CNNs.
We introduce a conceptually simple scheme, called refiner, to directly refine the self-attention maps of ViTs.
Refiner applies convolutions to augment local patterns of the attention maps, which we show is equivalent to a distributed local attention: features are aggregated locally with learnable kernels and then globally aggregated with self-attention.
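A minimal sketch of this refinement, assuming a depthwise convolution applied over the (heads, N, N) attention map; the attention expansion and exact distributed-local-attention formulation in the paper are not reproduced here:

```python
import torch
import torch.nn as nn

class RefinedAttention(nn.Module):
    """Self-attention whose attention map is locally refined with a
    learnable convolution before being applied to the values."""

    def __init__(self, dim, num_heads=6, kernel_size=3):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.refine = nn.Conv2d(num_heads, num_heads, kernel_size,
                                padding=kernel_size // 2, groups=num_heads)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                           # x: (B, N, D)
        B, N, D = x.shape
        qkv = (self.qkv(x).reshape(B, N, 3, self.num_heads, -1)
               .permute(2, 0, 3, 1, 4))
        q, k, v = qkv[0], qkv[1], qkv[2]            # (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        attn = self.refine(attn)                    # conv over the N x N map
        out = (attn @ v).transpose(1, 2).reshape(B, N, D)
        return self.proj(out)
```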
arXiv Detail & Related papers (2021-06-07T15:24:54Z) - Emerging Properties in Self-Supervised Vision Transformers [57.36837447500544]
We show that self-supervised learning provides Vision Transformers (ViT) with new properties that stand out compared to convolutional networks (convnets).
We implement our findings into a simple self-supervised method, called DINO, which we interpret as a form of self-distillation with no labels.
We show the synergy between DINO and ViTs by achieving 80.1% top-1 on ImageNet in linear evaluation with ViT-Base.
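For reference, DINO's self-distillation objective is a cross-entropy between the teacher's centered, sharpened output and the student's output, with the teacher maintained as an exponential moving average of the student; a minimal sketch (temperatures and update rates below are illustrative):

```python
import torch
import torch.nn.functional as F

def dino_loss(student_out, teacher_out, center, tau_s=0.1, tau_t=0.04):
    """Student matches the teacher's centered and sharpened distribution;
    the teacher receives no gradients."""
    t = F.softmax((teacher_out - center) / tau_t, dim=-1).detach()
    log_s = F.log_softmax(student_out / tau_s, dim=-1)
    return -(t * log_s).sum(dim=-1).mean()

# Teacher and center updates (momentum values are illustrative):
#   teacher_param.data.mul_(m).add_(student_param.data, alpha=1 - m)
#   center = 0.9 * center + 0.1 * teacher_out.mean(dim=0)
```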
arXiv Detail & Related papers (2021-04-29T12:28:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences arising from its use.