ViR: the Vision Reservoir
- URL: http://arxiv.org/abs/2112.13545v2
- Date: Wed, 29 Dec 2021 06:30:56 GMT
- Title: ViR: the Vision Reservoir
- Authors: Xian Wei, Bin Wang, Mingsong Chen, Ji Yuan, Hai Lan, Jiehuang Shi,
Xuan Tang, Bo Jin, Guozhang Chen, Dongping Yang
- Abstract summary: Vision Reservoir computing (ViR) is proposed here for image classification, as a parallel to the Vision Transformer (ViT).
By splitting each image into a sequence of tokens with fixed length, the ViR constructs a pure reservoir with a nearly fully connected topology to replace the Transformer module in ViT.
The number of parameters of the ViR is about 15%, or even as little as 5%, of that of the ViT, and the memory footprint is about 20% to 40% of the ViT's.
- Score: 10.881974985012839
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The past year has witnessed the success of applying the Vision
Transformer (ViT) to image classification. However, there is still evidence
that ViT often suffers from two issues: i) the high computational and memory
burden of applying multiple Transformer layers during pre-training on
large-scale datasets, and ii) over-fitting when training on small datasets
from scratch. To address these problems, a novel method,
namely, Vision Reservoir computing (ViR), is proposed here for image
classification, as a parallel to ViT. By splitting each image into a sequence
of tokens with fixed length, the ViR constructs a pure reservoir with a nearly
fully connected topology to replace the Transformer module in ViT. Two kinds of
deep ViR models are subsequently proposed to enhance the network performance.
Comparative experiments between the ViR and the ViT are carried out on several
image classification benchmarks. Without any pre-training process, the ViR
outperforms the ViT in terms of both model and computational complexity.
Specifically, the ViR has about 15%, or even as little as 5%, of the
parameters of the ViT, and its memory footprint is about 20% to 40% of the
ViT's. The superiority
of the ViR performance is explained by Small-World characteristics, Lyapunov
exponents, and memory capacity.
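The abstract outlines the core ViR recipe at a high level: split each image into a fixed-length sequence of patch tokens, drive the sequence through a fixed, nearly fully connected reservoir, and classify from the reservoir state, in place of the Transformer module of ViT. The sketch below is a minimal, hypothetical PyTorch illustration of that recipe, not the authors' implementation; the class name `ReservoirClassifier`, all sizes, the leaky-tanh update, and the spectral-radius scaling are assumptions chosen for clarity.

```python
# Minimal, hypothetical sketch of a patch-token reservoir classifier.
# Not the paper's implementation: names, sizes, and the update rule are
# assumptions made for illustration only.
import torch
import torch.nn as nn


class ReservoirClassifier(nn.Module):
    def __init__(self, patch_size=4, in_chans=3, reservoir_size=512,
                 num_classes=10, spectral_radius=0.9, leak_rate=0.3):
        super().__init__()
        self.patch_size = patch_size
        patch_dim = in_chans * patch_size * patch_size

        # Fixed (untrained) input and recurrent weights, as in a standard
        # echo state network; the recurrent matrix is densely connected.
        w_in = torch.empty(reservoir_size, patch_dim).uniform_(-0.5, 0.5)
        w_res = torch.randn(reservoir_size, reservoir_size)
        # Rescale so the spectral radius is below 1 (echo state property).
        radius = torch.linalg.eigvals(w_res).abs().max()
        w_res = w_res * (spectral_radius / radius)
        self.register_buffer("w_in", w_in)
        self.register_buffer("w_res", w_res)
        self.leak = leak_rate

        # The linear readout is the only trained component.
        self.readout = nn.Linear(reservoir_size, num_classes)

    def forward(self, images):
        # 1) Split each image into non-overlapping patches -> token sequence.
        b, c, _, _ = images.shape
        p = self.patch_size
        patches = images.unfold(2, p, p).unfold(3, p, p)  # B,C,H/p,W/p,p,p
        tokens = patches.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)

        # 2) Drive the reservoir with the token sequence (leaky tanh update).
        state = torch.zeros(b, self.w_res.shape[0], device=images.device)
        for t in range(tokens.shape[1]):
            pre = tokens[:, t] @ self.w_in.T + state @ self.w_res.T
            state = (1 - self.leak) * state + self.leak * torch.tanh(pre)

        # 3) Classify from the final reservoir state.
        return self.readout(state)


# Example usage with CIFAR-sized inputs.
model = ReservoirClassifier()
logits = model(torch.randn(2, 3, 32, 32))
print(logits.shape)  # torch.Size([2, 10])
```

Because only the linear readout is trained in a sketch like this, the trainable parameter count is far smaller than that of a comparable ViT, which is consistent with the parameter and memory savings reported above.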
Related papers
- ViR: Towards Efficient Vision Retention Backbones [97.93707844681893]
We propose a new class of computer vision models, dubbed Vision Retention Networks (ViR)
ViR has dual parallel and recurrent formulations that balance fast inference and parallel training while maintaining competitive performance.
We have validated the effectiveness of ViR through extensive experiments with different dataset sizes and various image resolutions.
arXiv Detail & Related papers (2023-10-30T16:55:50Z)
- RaViTT: Random Vision Transformer Tokens [0.41776442767736593]
Vision Transformers (ViTs) have successfully been applied to image classification problems where large annotated datasets are available.
We propose Random Vision Transformer Tokens (RaViTT), a random patch sampling strategy that can be incorporated into existing ViTs.
arXiv Detail & Related papers (2023-06-19T14:24:59Z)
- Coarse-to-Fine Vision Transformer [83.45020063642235]
We propose a coarse-to-fine vision transformer (CF-ViT) to relieve computational burden while retaining performance.
Our proposed CF-ViT is motivated by two important observations in modern ViT models.
Our CF-ViT reduces the FLOPs of LV-ViT by 53% and achieves a 2.01x throughput improvement.
arXiv Detail & Related papers (2022-03-08T02:57:49Z)
- Auto-scaling Vision Transformers without Training [84.34662535276898]
We propose As-ViT, an auto-scaling framework for Vision Transformers (ViTs) without training.
As-ViT automatically discovers and scales up ViTs in an efficient and principled manner.
As a unified framework, As-ViT achieves strong performance on classification and detection.
arXiv Detail & Related papers (2022-02-24T06:30:55Z)
- A Unified Pruning Framework for Vision Transformers [40.7622551128182]
The Vision Transformer (ViT) and its variants have achieved promising performance on various computer vision tasks.
We propose a unified framework, namely UP-ViTs, for structural pruning of ViTs and their variants.
Our method prunes all ViT components while maintaining the consistency of the model structure.
arXiv Detail & Related papers (2021-11-30T05:01:02Z)
- Discrete Representations Strengthen Vision Transformer Robustness [43.821734467553554]
Vision Transformer (ViT) is emerging as the state-of-the-art architecture for image recognition.
We present a simple and effective architecture modification to ViT's input layer by adding discrete tokens produced by a vector-quantized encoder.
Experimental results demonstrate that adding discrete representations to four architecture variants strengthens ViT robustness by up to 12% across seven ImageNet robustness benchmarks.
arXiv Detail & Related papers (2021-11-20T01:49:56Z)
- Vision Transformer with Progressive Sampling [73.60630716500154]
We propose an iterative and progressive sampling strategy to locate discriminative regions.
When trained from scratch on ImageNet, PS-ViT achieves 3.8% higher top-1 accuracy than the vanilla ViT.
arXiv Detail & Related papers (2021-08-03T18:04:31Z)
- Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)
- Scalable Visual Transformers with Hierarchical Pooling [61.05787583247392]
We propose a Hierarchical Visual Transformer (HVT) which progressively pools visual tokens to shrink the sequence length.
This brings the benefit of scaling the depth, width, resolution, and patch-size dimensions without introducing extra computational complexity.
Our HVT outperforms the competitive baselines on ImageNet and CIFAR-100 datasets.
arXiv Detail & Related papers (2021-03-19T03:55:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.