IBiT: Utilizing Inductive Biases to Create a More Data Efficient Attention Mechanism
- URL: http://arxiv.org/abs/2509.22719v1
- Date: Wed, 24 Sep 2025 17:19:23 GMT
- Title: IBiT: Utilizing Inductive Biases to Create a More Data Efficient Attention Mechanism
- Authors: Adithya Giri
- Abstract summary: In recent years, Transformer-based architectures have become the dominant method for Computer Vision applications. While Transformers are explainable and scale well with dataset size, they lack the inductive biases of Convolutional Neural Networks. We show that introducing these inductive biases through learned masks allows Vision Transformers to learn on much smaller datasets without Knowledge Distillation.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, Transformer-based architectures have become the dominant method for Computer Vision applications. While Transformers are explainable and scale well with dataset size, they lack the inductive biases of Convolutional Neural Networks. While these biases may be learned on large datasets, we show that introducing them through learned masks allows Vision Transformers to learn on much smaller datasets without Knowledge Distillation. These Transformers, which we call Inductively Biased Image Transformers (IBiT), are significantly more accurate on small datasets while retaining the explainability of Transformers.
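The abstract does not spell out how the learned masks are parameterized, so the sketch below is only one plausible reading rather than the authors' released IBiT code: multi-head self-attention over image patches whose logits receive a learned, distance-bucketed additive bias, giving the model a convolution-like locality prior it can keep or unlearn during training. The class name `LocallyBiasedAttention`, the `bias_table`, and all hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Hedged sketch, not the IBiT release: patch self-attention whose logits are
# offset by a learned bias indexed by the grid distance between patches.
class LocallyBiasedAttention(torch.nn.Module):
    def __init__(self, dim, grid_size, num_heads=4):
        super().__init__()
        self.num_heads = num_heads
        self.qkv = torch.nn.Linear(dim, 3 * dim)
        self.proj = torch.nn.Linear(dim, dim)
        # Pairwise patch distances on the grid, bucketed into a small table of
        # learned per-head biases: the "learned mask" in this sketch.
        coords = torch.stack(torch.meshgrid(
            torch.arange(grid_size), torch.arange(grid_size), indexing="ij"
        ), dim=-1).reshape(-1, 2).float()
        dist = torch.cdist(coords, coords).round().long()          # (N, N) bucket ids
        self.register_buffer("dist", dist)
        self.bias_table = torch.nn.Parameter(torch.zeros(num_heads, int(dist.max()) + 1))

    def forward(self, x):                                           # x: (B, N, dim), N = grid_size**2
        B, N, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(B, N, self.num_heads, -1).transpose(1, 2) for t in (q, k, v))
        attn = (q @ k.transpose(-2, -1)) / (D // self.num_heads) ** 0.5
        attn = attn + self.bias_table[:, self.dist]                 # learned locality mask
        out = F.softmax(attn, dim=-1) @ v
        return self.proj(out.transpose(1, 2).reshape(B, N, D))
```

A module like this would drop into a standard ViT block in place of vanilla multi-head self-attention; whether IBiT uses distance buckets, per-pair parameters, or a different mask parameterization entirely is not stated in the abstract.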
Related papers
- An Introduction to Transformers [23.915718146956355]
The transformer is a neural network component that can be used to learn useful representations of sequences or sets of data-points.
In this note we aim for a mathematically precise, intuitive, and clean description of the transformer architecture.
arXiv Detail & Related papers (2023-04-20T14:54:19Z)
- Holistically Explainable Vision Transformers [136.27303006772294]
We propose B-cos transformers, which inherently provide holistic explanations for their decisions.
Specifically, we formulate each model component - such as the multi-layer perceptrons, attention layers, and the tokenisation module - to be dynamic linear.
We apply our proposed design to Vision Transformers (ViTs) and show that the resulting models, dubbed Bcos-ViTs, are highly interpretable and perform competitively with baseline ViTs.
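The "dynamic linear" idea is that every component computes an input-dependent but exactly linear map, so the network's decision can be unrolled into a single linear explanation of the input. The snippet below is a minimal sketch in the B-cos spirit, assuming the commonly cited formulation in which a unit-norm linear response is rescaled by a power of its cosine alignment; it is not the Bcos-ViT implementation.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of a dynamic-linear (B-cos style) unit: the effective weights
# depend on the input, yet the output remains an exact linear function of x.
class BcosLinear(torch.nn.Module):
    def __init__(self, d_in, d_out, B=2):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(d_out, d_in) / d_in ** 0.5)
        self.B = B

    def forward(self, x):                       # x: (batch, d_in)
        w = F.normalize(self.weight, dim=1)     # unit-norm rows
        lin = x @ w.t()                         # plain linear response, (batch, d_out)
        cos = lin / (x.norm(dim=1, keepdim=True) + 1e-8)
        return cos.abs().pow(self.B - 1) * lin  # down-weights poorly aligned inputs
```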
arXiv Detail & Related papers (2023-01-20T16:45:34Z)
- What Makes for Good Tokenizers in Vision Transformer? [62.44987486771936]
Transformers are capable of extracting pairwise relationships among tokens using self-attention.
What makes for a good tokenizer has not been well understood in computer vision.
Modulation across Tokens (MoTo) incorporates inter-token modeling capability through normalization.
The regularization objective TokenProp is incorporated into the standard training regime.
arXiv Detail & Related papers (2022-12-21T15:51:43Z)
- Transformers learn in-context by gradient descent [58.24152335931036]
Training Transformers on auto-regressive objectives is closely related to gradient-based meta-learning formulations.
We show how trained Transformers become mesa-optimizers, i.e., they learn models by gradient descent in their forward pass.
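The mesa-optimizer claim is easiest to see on in-context linear regression: one gradient-descent step on the context examples is a sum of per-token outer-product updates, exactly the kind of accumulation a (linear) self-attention layer can perform in its forward pass. The numpy snippet below is an illustrative identity check, not code from the paper; the data and learning rate are arbitrary.

```python
import numpy as np

# One GD step on 0.5 * sum_j (w @ x_j - y_j)^2 versus the same update written
# as an attention-like accumulation of per-token contributions.
rng = np.random.default_rng(0)
d, n = 4, 32
X = rng.normal(size=(n, d))           # in-context inputs x_j
y = X @ rng.normal(size=d)            # in-context targets y_j
w, eta = np.zeros(d), 0.1             # "model" carried through the forward pass

w_gd = w - eta * X.T @ (X @ w - y)                                # explicit gradient step
w_attn = w + sum(-eta * (w @ x - t) * x for x, t in zip(X, y))    # token-by-token writes

print(np.allclose(w_gd, w_attn))      # True: the two views coincide
```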
arXiv Detail & Related papers (2022-12-15T09:21:21Z)
- Explicitly Increasing Input Information Density for Vision Transformers on Small Datasets [26.257612622358614]
Vision Transformers have attracted a lot of attention recently, following the successful application of the Vision Transformer (ViT) to vision tasks.
This paper proposes to explicitly increase the input information density in the frequency domain.
Experiments demonstrate the effectiveness of the proposed approach on five small-scale datasets.
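The summary does not say how the frequency-domain information is injected, so the sketch below is a loose, assumed reading rather than the paper's method: each patch is augmented with the magnitude of its 2D Fourier spectrum before tokenization, so every token carries both spatial and frequency content. The function name `frequency_augmented_patches` and the use of an FFT (rather than, say, a DCT) are illustrative choices.

```python
import torch

# Hedged sketch: augment image patches with their Fourier magnitudes so the
# tokenizer sees a denser, frequency-aware representation of each patch.
def frequency_augmented_patches(images, patch_size=4):
    # images: (B, C, H, W) with H and W divisible by patch_size
    B, C, H, W = images.shape
    patches = images.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
    patches = patches.contiguous().view(B, C, -1, patch_size, patch_size)   # (B, C, N, p, p)
    spectra = torch.fft.fft2(patches).abs()                                 # per-patch frequency magnitudes
    tokens = torch.cat([patches, spectra], dim=1)                           # stack as extra channels
    return tokens.permute(0, 2, 1, 3, 4).reshape(B, patches.shape[2], -1)   # (B, N, 2*C*p*p) token features
```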
arXiv Detail & Related papers (2022-10-25T20:24:53Z)
- On the Surprising Effectiveness of Transformers in Low-Labeled Video Recognition [18.557920268145818]
Video vision transformers have been shown to be competitive with convolution-based methods (CNNs) broadly across multiple vision tasks.
Our work empirically explores the low data regime for video classification and discovers that, surprisingly, transformers perform extremely well in the low-labeled video setting.
We even show that using just the labeled data, transformers significantly outperform complex semi-supervised CNN methods that leverage large-scale unlabeled data as well.
arXiv Detail & Related papers (2022-09-15T17:12:30Z)
- Semi-Supervised Vision Transformers [76.83020291497895]
We study the training of Vision Transformers for semi-supervised image classification.
We find Vision Transformers perform poorly in a semi-supervised ImageNet setting.
CNNs achieve superior results in the small labeled data regime.
arXiv Detail & Related papers (2021-11-22T09:28:13Z)
- On the Power of Saturated Transformers: A View from Circuit Complexity [87.20342701232869]
We show that saturated transformers transcend the limitations of hard-attention transformers.
The jump from hard to saturated attention can be understood as increasing the transformer's effective circuit depth by a factor of $O(\log n)$.
arXiv Detail & Related papers (2021-06-30T17:09:47Z)
- A Survey on Visual Transformer [126.56860258176324]
The Transformer is a type of deep neural network mainly based on the self-attention mechanism.
In this paper, we review these vision transformer models by categorizing them in different tasks and analyzing their advantages and disadvantages.
arXiv Detail & Related papers (2020-12-23T09:37:54Z)