Related papers: Structured Initialization for Vision Transformers

Structured Initialization for Vision Transformers

URL: http://arxiv.org/abs/2505.19985v1
Date: Mon, 26 May 2025 13:42:31 GMT
Title: Structured Initialization for Vision Transformers
Authors: Jianqiao Zheng, Xueqian Li, Hemanth Saratchandran, Simon Lucey,
Abstract summary: We develop a ViT that can enjoy strong CNN-like performance when data assets are small, but can still scale to ViT-like performance as the data expands.<n>Our approach is motivated by our empirical results that random impulse filters can achieve commensurate performance to learned filters within a CNN.
Score: 29.32921916396698
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Convolutional Neural Networks (CNNs) inherently encode strong inductive biases, enabling effective generalization on small-scale datasets. In this paper, we propose integrating this inductive bias into ViTs, not through an architectural intervention but solely through initialization. The motivation here is to have a ViT that can enjoy strong CNN-like performance when data assets are small, but can still scale to ViT-like performance as the data expands. Our approach is motivated by our empirical results that random impulse filters can achieve commensurate performance to learned filters within a CNN. We improve upon current ViT initialization strategies, which typically rely on empirical heuristics such as using attention weights from pretrained models or focusing on the distribution of attention weights without enforcing structures. Empirical results demonstrate that our method significantly outperforms standard ViT initialization across numerous small and medium-scale benchmarks, including Food-101, CIFAR-10, CIFAR-100, STL-10, Flowers, and Pets, while maintaining comparative performance on large-scale datasets such as ImageNet-1K. Moreover, our initialization strategy can be easily integrated into various transformer-based architectures such as Swin Transformer and MLP-Mixer with consistent improvements in performance.

Related papers

SMPLest-X: Ultimate Scaling for Expressive Human Pose and Shape Estimation [81.36747103102459]
Expressive human pose and shape estimation (EHPS) unifies body, hands, and face motion capture with numerous applications.<n>Current state-of-the-art methods focus on training innovative architectural designs on confined datasets.<n>We investigate the impact of scaling up EHPS towards a family of generalist foundation models.
arXiv Detail & Related papers (2025-01-16T18:59:46Z)
Powerful Design of Small Vision Transformer on CIFAR10 [0.0]
Vision Transformers (ViTs) have demonstrated remarkable success on large-scale datasets, but their performance on smaller datasets often falls short of CNNs.<n>This paper explores the design and optimization of Tiny ViTs for small datasets, using CIFAR-10 as a benchmark.
arXiv Detail & Related papers (2025-01-07T00:41:34Z)
An Experimental Study on Exploring Strong Lightweight Vision Transformers via Masked Image Modeling Pre-Training [51.622652121580394]
Masked image modeling (MIM) pre-training for large-scale vision transformers (ViTs) has enabled promising downstream performance on top of the learned self-supervised ViT features. In this paper, we question if the textitextremely simple lightweight ViTs' fine-tuning performance can also benefit from this pre-training paradigm. Our pre-training with distillation on pure lightweight ViTs with vanilla/hierarchical design ($5.7M$/$6.5M$) can achieve $79.4%$/$78.9%$ top-1 accuracy on ImageNet-1
arXiv Detail & Related papers (2024-04-18T14:14:44Z)
Structured Initialization for Attention in Vision Transformers [34.374054040300805]
convolutional neural networks (CNNs) have an architectural inductive bias enabling them to perform well on small-scale problems. We argue that the architectural bias inherent to CNNs can be reinterpreted as an initialization bias within ViT. This insight is significant as it empowers ViTs to perform equally well on small-scale problems while maintaining their flexibility for large-scale applications.
arXiv Detail & Related papers (2024-04-01T14:34:47Z)
Convolutional Initialization for Data-Efficient Vision Transformers [38.63299194992718]
Training vision transformer networks on small datasets poses challenges. CNNs can achieve state-of-the-art performance by leveraging their architectural inductive bias. Our approach is motivated by the finding that random impulse filters can achieve almost comparable performance to learned filters in CNNs.
arXiv Detail & Related papers (2024-01-23T06:03:16Z)
Enhancing Performance of Vision Transformers on Small Datasets through Local Inductive Bias Incorporation [13.056764072568749]
Vision transformers (ViTs) achieve remarkable performance on large datasets, but tend to perform worse than convolutional neural networks (CNNs) on smaller datasets. We propose a module called Local InFormation Enhancer (LIFE) that extracts patch-level local information and incorporates it into the embeddings used in the self-attention block of ViTs. Our proposed module is memory and efficient, as well as flexible enough to process auxiliary tokens such as the classification and distillation tokens.
arXiv Detail & Related papers (2023-05-15T11:23:18Z)
GOHSP: A Unified Framework of Graph and Optimization-based Heterogeneous Structured Pruning for Vision Transformer [76.2625311630021]
Vision transformers (ViTs) have shown very impressive empirical performance in various computer vision tasks. To mitigate this challenging problem, structured pruning is a promising solution to compress model size and enable practical efficiency. We propose GOHSP, a unified framework of Graph and Optimization-based Structured Pruning for ViT models.
arXiv Detail & Related papers (2023-01-13T00:40:24Z)
How to Train Vision Transformer on Small-scale Datasets? [4.56717163175988]
In contrast to convolutional neural networks, Vision Transformer lacks inherent inductive biases. We show that self-supervised inductive biases can be learned directly from small-scale datasets. This allows to train these models without large-scale pre-training, changes to model architecture or loss functions.
arXiv Detail & Related papers (2022-10-13T17:59:19Z)
EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers [88.52500757894119]
Self-attention based vision transformers (ViTs) have emerged as a very competitive architecture alternative to convolutional neural networks (CNNs) in computer vision. We introduce EdgeViTs, a new family of light-weight ViTs that, for the first time, enable attention-based vision models to compete with the best light-weight CNNs.
arXiv Detail & Related papers (2022-05-06T18:17:19Z)
Global Vision Transformer Pruning with Hessian-Aware Saliency [93.33895899995224]
This work challenges the common design philosophy of the Vision Transformer (ViT) model with uniform dimension across all the stacked blocks in a model stage. We derive a novel Hessian-based structural pruning criteria comparable across all layers and structures, with latency-aware regularization for direct latency reduction. Performing iterative pruning on the DeiT-Base model leads to a new architecture family called NViT (Novel ViT), with a novel parameter that utilizes parameters more efficiently.
arXiv Detail & Related papers (2021-10-10T18:04:59Z)
When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations [111.44860506703307]
Vision Transformers (ViTs) and existing VisionNets signal efforts on replacing hand-wired features or inductive throughputs with general-purpose neural architectures. This paper investigates ViTs and Res-Mixers from the lens of loss geometry, intending to improve the models' data efficiency at training and inference. We show that the improved robustness attributes to sparser active neurons in the first few layers. The resultant ViTs outperform Nets of similar size and smoothness when trained from scratch on ImageNet without large-scale pretraining or strong data augmentations.
arXiv Detail & Related papers (2021-06-03T02:08:03Z)
Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples. We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)

This list is automatically generated from the titles and abstracts of the papers in this site.