Meta-attention for ViT-backed Continual Learning
- URL: http://arxiv.org/abs/2203.11684v1
- Date: Tue, 22 Mar 2022 12:58:39 GMT
- Title: Meta-attention for ViT-backed Continual Learning
- Authors: Mengqi Xue, Haofei Zhang, Jie Song, Mingli Song
- Abstract summary: Vision transformers (ViTs) are gradually dominating the field of computer vision.
CNN-based continual learning methods can suffer from severe performance degradation if straightforwardly applied to ViTs.
We propose MEta-ATtention (MEAT) to adapt a pre-trained ViT to new tasks without sacrificing performance on already learned tasks.
- Score: 35.31816553097367
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Continual learning is a longstanding research topic due to its crucial role
in tackling continually arriving tasks. Up to now, the study of continual
learning in computer vision is mainly restricted to convolutional neural
networks (CNNs). However, recently there is a tendency that the newly emerging
vision transformers (ViTs) are gradually dominating the field of computer
vision, which leaves CNN-based continual learning methods lagging behind, as
they can suffer from severe performance degradation when straightforwardly
applied to ViTs. In this paper, we study ViT-backed continual learning to strive for
higher performance riding on recent advances of ViTs. Inspired by mask-based
continual learning methods in CNNs, where a mask is learned per task to adapt
the pre-trained model to the new task, we propose MEta-ATtention (MEAT), i.e.,
attention to self-attention, to adapt a pre-trained ViT to new tasks without
sacrificing performance on already learned tasks. Unlike prior mask-based
methods like Piggyback, where all parameters are associated with corresponding
masks, MEAT leverages the characteristics of ViTs and only masks a portion of
its parameters. It renders MEAT more efficient and effective with less overhead
and higher accuracy. Extensive experiments demonstrate that MEAT exhibits
significant superiority to its state-of-the-art CNN counterparts, with 4.0~6.0%
absolute boosts in accuracy. Our code has been released at
https://github.com/zju-vipa/MEAT-TIL.
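To make the masking idea concrete, below is a minimal PyTorch sketch of per-task masking applied to a frozen, pre-trained self-attention block, in the spirit of "attention to self-attention". The module name, the mask granularity (a straight-through binarized mask over token-to-token attention), and all shapes are illustrative assumptions rather than the released implementation; see the repository above for the authors' code.

```python
import torch
import torch.nn as nn

class MaskedSelfAttention(nn.Module):
    """Illustrative sketch: frozen pre-trained self-attention whose attention
    map is gated by a learnable per-task binary mask (hypothetical design)."""

    def __init__(self, dim, num_heads, num_tokens, num_tasks):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)   # pre-trained weights, kept frozen
        self.proj = nn.Linear(dim, dim)      # pre-trained weights, kept frozen
        for p in list(self.qkv.parameters()) + list(self.proj.parameters()):
            p.requires_grad = False
        # one real-valued mask per task over token-to-token attention,
        # initialized positive so every connection starts switched on
        self.task_masks = nn.Parameter(
            0.01 * torch.ones(num_tasks, num_tokens, num_tokens))

    def forward(self, x, task_id):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)            # each (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale   # (B, heads, N, N)
        attn = attn.softmax(dim=-1)
        # binarize the task mask with a straight-through estimator so the
        # real-valued mask still receives gradients
        m = self.task_masks[task_id][:N, :N]
        gate = (m > 0).float() + m - m.detach()
        attn = attn * gate                              # drop masked connections
        attn = attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-6)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

Only the per-task masks are trained for a new task, so previously learned tasks remain untouched as long as their masks are stored.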
Related papers
- Limited Data, Unlimited Potential: A Study on ViTs Augmented by Masked Autoencoders [32.2455570714414]
Vision Transformers (ViTs) have become ubiquitous in computer vision.
ViTs lack inductive biases, which can make it difficult to train them with limited data.
We propose a technique that enables ViTs to leverage the unique characteristics of both the self-supervised and primary tasks.
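One simple way to read this is joint optimization of the primary supervised loss with an MAE-style reconstruction loss on masked patches. The sketch below assumes a model that already exposes both a classification head and a masked-reconstruction branch; the interface and the weighting factor alpha are hypothetical.

```python
import torch.nn.functional as F

def joint_training_step(model, images, labels, alpha=0.5):
    """Sketch: supervised loss plus a masked-reconstruction auxiliary loss.
    `model` is assumed (hypothetically) to return class logits, reconstructed
    patches, target patches, and a binary mask marking the hidden patches."""
    logits, recon, target, mask = model(images)
    loss_cls = F.cross_entropy(logits, labels)
    per_patch_mse = ((recon - target) ** 2).mean(dim=-1)   # (B, num_patches)
    loss_rec = (per_patch_mse * mask).sum() / mask.sum()   # masked patches only
    return loss_cls + alpha * loss_rec
```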
arXiv Detail & Related papers (2023-10-31T17:59:07Z)
- What do Vision Transformers Learn? A Visual Exploration [68.50771218442776]
Vision transformers (ViTs) are quickly becoming the de-facto architecture for computer vision.
This paper addresses the obstacles to performing visualizations on ViTs and explores the underlying differences between ViTs and CNNs.
We also conduct large-scale visualizations on a range of ViT variants, including DeiT, CoaT, ConViT, PiT, Swin, and Twins.
arXiv Detail & Related papers (2022-12-13T16:55:12Z)
- When Adversarial Training Meets Vision Transformers: Recipes from Training to Architecture [32.260596998171835]
Adversarial training is still required for ViTs to defend against adversarial attacks.
We find that pre-training and SGD are necessary for ViTs' adversarial training.
Our code is available at https://github.com/mo666666/When-Adversarial-Training-Meets-Vision-Transformers.
arXiv Detail & Related papers (2022-10-14T05:37:20Z)
- Towards Efficient Adversarial Training on Vision Transformers [41.6396577241957]
Adversarial training is one of the most effective ways to obtain robust CNNs.
We propose an efficient Attention Guided Adversarial Training mechanism.
With only 65% of the fast adversarial training time, we match the state-of-the-art results on the challenging ImageNet benchmark.
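Both adversarial-training entries above build on the standard PGD inner loop; a minimal sketch of that base procedure follows. The hyperparameters are conventional illustrative choices, and the attention-guided speed-up itself is not reproduced here.

```python
import torch
import torch.nn.functional as F

def pgd_adversarial_loss(model, x, y, eps=8 / 255, step=2 / 255, iters=10):
    """Standard PGD adversarial-training step (images assumed in [0, 1]).
    The papers above study recipes (pre-training, SGD, attention-guided
    token selection) on top of this basic scheme."""
    delta = torch.empty_like(x).uniform_(-eps, eps)
    for _ in range(iters):
        delta.requires_grad_(True)
        loss = F.cross_entropy(model((x + delta).clamp(0, 1)), y)
        grad, = torch.autograd.grad(loss, delta)
        delta = (delta + step * grad.sign()).clamp(-eps, eps).detach()
    # outer step: train the model on the crafted adversarial examples
    return F.cross_entropy(model((x + delta).clamp(0, 1)), y)
```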
arXiv Detail & Related papers (2022-07-21T14:23:50Z)
- SERE: Exploring Feature Self-relation for Self-supervised Transformer [79.5769147071757]
Vision transformers (ViT) have strong representation ability with spatial self-attention and channel-level feedforward networks.
Recent works reveal that self-supervised learning helps unleash the great potential of ViT.
We observe that relational modeling on spatial and channel dimensions distinguishes ViT from other networks.
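As a rough illustration of feature self-relation, the snippet below builds token-to-token (spatial) and channel-to-channel similarity matrices from ViT features and aligns the spatial one between a student view and a stop-gradient teacher view; the temperature, normalization, and loss are assumptions, not SERE's exact formulation.

```python
import torch
import torch.nn.functional as F

def spatial_self_relation(tokens, temperature=0.1):
    """Token-to-token similarity, (B, N, N), from features of shape (B, N, D)."""
    t = F.normalize(tokens, dim=-1)
    return F.softmax(t @ t.transpose(1, 2) / temperature, dim=-1)

def channel_self_relation(tokens, temperature=0.1):
    """Channel-to-channel similarity, (B, D, D)."""
    t = F.normalize(tokens, dim=1)
    return F.softmax(t.transpose(1, 2) @ t / temperature, dim=-1)

def self_relation_loss(student_tokens, teacher_tokens):
    """Match the student's spatial self-relation to a stop-gradient teacher."""
    s = spatial_self_relation(student_tokens)
    with torch.no_grad():
        t = spatial_self_relation(teacher_tokens)
    return F.kl_div(s.clamp_min(1e-8).log(), t, reduction="batchmean")
```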
arXiv Detail & Related papers (2022-06-10T15:25:00Z)
- DeiT III: Revenge of the ViT [56.46810490275699]
A Vision Transformer (ViT) is a simple neural architecture amenable to serving several computer vision tasks.
Recent works show that ViTs benefit from self-supervised pre-training, in particular BERT-like pre-training such as BeiT.
arXiv Detail & Related papers (2022-04-14T17:13:44Z)
- Self-Promoted Supervision for Few-Shot Transformer [178.52948452353834]
Self-promoted sUpervisioN (SUN) is a few-shot learning framework for vision transformers (ViTs).
SUN pretrains the ViT on the few-shot learning dataset and then uses it to generate individual location-specific supervision for guiding each patch token.
Experiments show that SUN significantly surpasses other ViT-based few-shot learning frameworks and is the first to achieve higher performance than state-of-the-art CNN counterparts.
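One plausible reading of "location-specific supervision" is that a frozen, pretrained teacher ViT produces a soft target for every patch token, which then guides the corresponding student token. The sketch below encodes that reading; the shared head, the temperature, and all names are assumptions rather than SUN's exact design.

```python
import torch
import torch.nn.functional as F

def patch_token_supervision(student_tokens, teacher_tokens, head, T=2.0):
    """Sketch: per-patch soft targets from a frozen teacher guide the student.
    `head` (hypothetical) maps (B, N, D) token features to (B, N, num_classes)."""
    with torch.no_grad():
        targets = F.softmax(head(teacher_tokens) / T, dim=-1)
    log_probs = F.log_softmax(head(student_tokens) / T, dim=-1)
    return -(targets * log_probs).sum(dim=-1).mean()
```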
arXiv Detail & Related papers (2022-03-14T12:53:27Z)
- Bootstrapping ViTs: Towards Liberating Vision Transformers from Pre-training [29.20567759071523]
Vision Transformers (ViTs) are developing rapidly and starting to challenge the domination of convolutional neural networks (CNNs) in computer vision.
This paper introduces CNNs' inductive biases back into ViTs while preserving their network architectures, aiming for a higher performance upper bound.
Experiments on CIFAR-10/100 and ImageNet-1k with limited training data have shown encouraging results.
arXiv Detail & Related papers (2021-12-07T07:56:50Z)
- Self-slimmed Vision Transformer [52.67243496139175]
Vision transformers (ViTs) have become popular architectures and have outperformed convolutional neural networks (CNNs) on various vision tasks.
We propose a generic self-slimmed learning approach for vanilla ViTs, namely SiT.
Specifically, we first design a novel Token Slimming Module (TSM), which can boost the inference efficiency of ViTs.
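To illustrate the general idea of token slimming, the sketch below aggregates N input tokens into a smaller set of M tokens through a dynamically predicted, softmax-normalized mixing matrix; the module name and layer sizes are illustrative and not SiT's exact TSM.

```python
import torch
import torch.nn as nn

class TokenSlimming(nn.Module):
    """Sketch: reduce (B, N, D) tokens to (B, M, D) by mixing input tokens
    with predicted, softmax-normalized weights (hypothetical design)."""

    def __init__(self, dim, num_out_tokens):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(dim, dim // 2),
            nn.GELU(),
            nn.Linear(dim // 2, num_out_tokens),
        )

    def forward(self, x):                        # x: (B, N, D)
        weights = self.score(x).transpose(1, 2)  # (B, M, N)
        weights = weights.softmax(dim=-1)        # each output token is a convex
        return weights @ x                       # mix of the inputs -> (B, M, D)
```

Passing fewer tokens to later blocks reduces the quadratic cost of self-attention at inference time.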
arXiv Detail & Related papers (2021-11-24T16:48:57Z)