SERE: Exploring Feature Self-relation for Self-supervised Transformer
- URL: http://arxiv.org/abs/2206.05184v3
- Date: Mon, 18 Sep 2023 02:04:02 GMT
- Title: SERE: Exploring Feature Self-relation for Self-supervised Transformer
- Authors: Zhong-Yu Li, Shanghua Gao, Ming-Ming Cheng
- Abstract summary: Vision transformers (ViT) have strong representation ability with spatial self-attention and channel-level feedforward networks.
Recent works reveal that self-supervised learning helps unleash the great potential of ViT.
We observe that relational modeling on spatial and channel dimensions distinguishes ViT from other networks.
- Score: 79.5769147071757
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning representations with self-supervision for convolutional networks
(CNN) has been validated to be effective for vision tasks. As an alternative to
CNN, vision transformers (ViT) have strong representation ability with spatial
self-attention and channel-level feedforward networks. Recent works reveal that
self-supervised learning helps unleash the great potential of ViT. Still, most
works follow self-supervised strategies designed for CNN, e.g., instance-level
discrimination of samples, but they ignore the properties of ViT. We observe
that relational modeling on spatial and channel dimensions distinguishes ViT
from other networks. To enforce this property, we explore the feature
SElf-RElation (SERE) for training self-supervised ViT. Specifically, instead of
conducting self-supervised learning solely on feature embeddings from multiple
views, we utilize the feature self-relations, i.e., spatial/channel
self-relations, for self-supervised learning. Self-relation based learning
further enhances the relation modeling ability of ViT, resulting in stronger
representations that stably improve performance on multiple downstream tasks.
Our source code is publicly available at: https://github.com/MCG-NKU/SERE.
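As a rough illustration of the idea only (the official implementation lives in the repository above; the similarity measure, temperature, and KL-based alignment below are assumptions of this sketch), spatial and channel self-relations can be computed from a ViT's patch features and aligned across two augmented views:

```python
import torch
import torch.nn.functional as F

def spatial_self_relation(x, tau=0.1):
    """Relation among patch tokens: x is (B, N, D) -> (B, N, N)."""
    x = F.normalize(x, dim=-1)
    return F.softmax(x @ x.transpose(1, 2) / tau, dim=-1)

def channel_self_relation(x, tau=0.1):
    """Relation among feature channels: x is (B, N, D) -> (B, D, D)."""
    x = F.normalize(x, dim=1)
    return F.softmax(x.transpose(1, 2) @ x / tau, dim=-1)

def relation_alignment_loss(rel_a, rel_b):
    """Cross-view consistency between two relation maps (KL divergence)."""
    return F.kl_div(rel_a.clamp_min(1e-8).log(), rel_b.detach(),
                    reduction="batchmean")

# Usage: feats_v1 / feats_v2 are patch embeddings of two augmented views,
# e.g. produced by a ViT backbone (shape B x N x D).
feats_v1, feats_v2 = torch.randn(2, 196, 384), torch.randn(2, 196, 384)
loss = (relation_alignment_loss(spatial_self_relation(feats_v1),
                                spatial_self_relation(feats_v2)) +
        relation_alignment_loss(channel_self_relation(feats_v1),
                                channel_self_relation(feats_v2)))
```

The spatial relation compares patch tokens with one another while the channel relation compares feature channels; aligning both across views is what the abstract refers to as self-relation based learning.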
Related papers
- What do Vision Transformers Learn? A Visual Exploration [68.50771218442776]
Vision transformers (ViTs) are quickly becoming the de-facto architecture for computer vision.
This paper addresses the obstacles to performing visualizations on ViTs and explores the underlying differences between ViTs and CNNs.
We also conduct large-scale visualizations on a range of ViT variants, including DeiT, CoaT, ConViT, PiT, Swin, and Twin.
arXiv Detail & Related papers (2022-12-13T16:55:12Z) - Teaching Matters: Investigating the Role of Supervision in Vision Transformers [32.79398665600664]
We show that Vision Transformers (ViTs) learn a diverse range of behaviors in terms of their attention, representations, and downstream performance.
We also discover ViT behaviors that are consistent across supervision, including the emergence of Offset Local Attention Heads.
Our analysis shows that ViTs are highly flexible and learn to process local and global information in different orders depending on their training method.
arXiv Detail & Related papers (2022-12-07T18:59:45Z) - Attention Distillation: self-supervised vision transformer students need more guidance [37.37272042838411]
We study knowledge distillation of self-supervised vision transformers (ViT-SSKD).
In experiments on ImageNet-Subset and ImageNet-1K, we show that our method AttnDistill outperforms existing self-supervised knowledge distillation (SSKD) methods.
We are also the first to apply the tiny ViT-T model to self-supervised learning.
arXiv Detail & Related papers (2022-10-03T14:01:46Z) - Pretraining the Vision Transformer using self-supervised methods for vision based Deep Reinforcement Learning [0.0]
We study pretraining a Vision Transformer using several state-of-the-art self-supervised methods and assess the quality of the learned representations.
Our results show that all methods are effective in learning useful representations and avoiding representational collapse.
The encoder pretrained with the temporal order verification task shows the best results across all experiments.
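As a hedged sketch of the temporal order verification pretext task mentioned above (the encoder interface, clip shapes, and classifier head are assumptions of this illustration, not the paper's code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OrderVerificationHead(nn.Module):
    """Predict whether a short clip's frames are in their original temporal
    order. `encoder` is assumed to map a batch of images to one feature
    vector per image (hypothetical interface)."""

    def __init__(self, encoder, feat_dim, num_frames=3):
        super().__init__()
        self.encoder = encoder
        self.classifier = nn.Linear(feat_dim * num_frames, 2)

    def forward(self, clips, shuffled_labels):
        # clips: (B, T, C, H, W), shuffled or kept in order by the data
        # pipeline; shuffled_labels: (B,) with 1 = shuffled, 0 = ordered.
        B, T = clips.shape[:2]
        feats = self.encoder(clips.flatten(0, 1))     # (B*T, feat_dim)
        feats = feats.reshape(B, -1)                  # concat frame features
        return F.cross_entropy(self.classifier(feats), shuffled_labels)
```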
arXiv Detail & Related papers (2022-09-22T10:18:59Z) - Self-Promoted Supervision for Few-Shot Transformer [178.52948452353834]
Self-promoted sUpervisioN (SUN) is a few-shot learning framework for vision transformers (ViTs).
SUN pretrains the ViT on the few-shot learning dataset and then uses it to generate individual location-specific supervision for guiding each patch token.
Experiments show that SUN with ViTs significantly surpasses other ViT-based few-shot learning frameworks and is the first to outperform state-of-the-art CNN counterparts.
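A minimal sketch of such location-specific supervision, assuming the pretrained teacher emits a soft pseudo label per patch location (shapes, temperature, and names below are illustrative, not SUN's actual code):

```python
import torch
import torch.nn.functional as F

def patch_supervision_loss(student_patch_logits, teacher_patch_logits, tau=0.1):
    """Each student patch token is trained to match the teacher's soft
    pseudo label at the same location.

    *_patch_logits: (B, N, K) logits over K pseudo classes for N patches."""
    with torch.no_grad():
        targets = F.softmax(teacher_patch_logits / tau, dim=-1)
    log_probs = F.log_softmax(student_patch_logits, dim=-1)
    return -(targets * log_probs).sum(dim=-1).mean()
```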
arXiv Detail & Related papers (2022-03-14T12:53:27Z) - Self-slimmed Vision Transformer [52.67243496139175]
Vision transformers (ViTs) have become popular architectures and have outperformed convolutional neural networks (CNNs) on various vision tasks.
We propose a generic self-slimmed learning approach for vanilla ViTs, namely SiT.
Specifically, we first design a novel Token Slimming Module (TSM), which can boost the inference efficiency of ViTs.
arXiv Detail & Related papers (2021-11-24T16:48:57Z) - Do Vision Transformers See Like Convolutional Neural Networks? [45.69780772718875]
Recent work has shown that (Vision) Transformer models (ViT) can achieve comparable or even superior performance to CNNs on image classification tasks.
Are they acting like convolutional networks, or learning entirely different visual representations?
We find striking differences between the two architectures, such as ViT having more uniform representations across all layers.
arXiv Detail & Related papers (2021-08-19T17:27:03Z) - Refiner: Refining Self-attention for Vision Transformers [85.80887884154427]
Vision Transformers (ViTs) have shown competitive accuracy in image classification tasks compared with CNNs.
We introduce a conceptually simple scheme, called refiner, to directly refine the self-attention maps of ViTs.
Refiner applies convolutions to augment local patterns of the attention maps, which we show is equivalent to a distributed local attention: features are aggregated locally with learnable kernels and then globally aggregated with self-attention.
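A minimal sketch of this refinement, assuming a depthwise convolution applied over the (heads, N, N) attention map; the attention expansion and exact distributed-local-attention formulation in the paper are not reproduced here:

```python
import torch
import torch.nn as nn

class RefinedAttention(nn.Module):
    """Self-attention whose attention map is locally refined with a
    learnable convolution before being applied to the values."""

    def __init__(self, dim, num_heads=6, kernel_size=3):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.refine = nn.Conv2d(num_heads, num_heads, kernel_size,
                                padding=kernel_size // 2, groups=num_heads)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                           # x: (B, N, D)
        B, N, D = x.shape
        qkv = (self.qkv(x).reshape(B, N, 3, self.num_heads, -1)
               .permute(2, 0, 3, 1, 4))
        q, k, v = qkv[0], qkv[1], qkv[2]            # (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        attn = self.refine(attn)                    # conv over the N x N map
        out = (attn @ v).transpose(1, 2).reshape(B, N, D)
        return self.proj(out)
```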
arXiv Detail & Related papers (2021-06-07T15:24:54Z) - Emerging Properties in Self-Supervised Vision Transformers [57.36837447500544]
We show that self-supervised learning provides Vision Transformers (ViT) with new properties that stand out compared to convolutional networks (convnets).
We implement our findings into a simple self-supervised method, called DINO, which we interpret as a form of self-distillation with no labels.
We show the synergy between DINO and ViTs by achieving 80.1% top-1 on ImageNet in linear evaluation with ViT-Base.
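For reference, DINO's self-distillation objective is a cross-entropy between the teacher's centered, sharpened output and the student's output, with the teacher maintained as an exponential moving average of the student; a minimal sketch (temperatures and update rates below are illustrative):

```python
import torch
import torch.nn.functional as F

def dino_loss(student_out, teacher_out, center, tau_s=0.1, tau_t=0.04):
    """Student matches the teacher's centered and sharpened distribution;
    the teacher receives no gradients."""
    t = F.softmax((teacher_out - center) / tau_t, dim=-1).detach()
    log_s = F.log_softmax(student_out / tau_s, dim=-1)
    return -(t * log_s).sum(dim=-1).mean()

# Teacher and center updates (momentum values are illustrative):
#   teacher_param.data.mul_(m).add_(student_param.data, alpha=1 - m)
#   center = 0.9 * center + 0.1 * teacher_out.mean(dim=0)
```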
arXiv Detail & Related papers (2021-04-29T12:28:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences arising from its use.