Multi-Dimensional Hyena for Spatial Inductive Bias
- URL: http://arxiv.org/abs/2309.13600v1
- Date: Sun, 24 Sep 2023 10:22:35 GMT
- Title: Multi-Dimensional Hyena for Spatial Inductive Bias
- Authors: Itamar Zimerman and Lior Wolf
- Abstract summary: We present a data-efficient vision transformer that does not rely on self-attention.
Instead, it employs a novel multi-axis generalization of the recent Hyena layer.
We show that a hybrid approach, which uses Hyena N-D for the first layers of a ViT followed by layers with conventional attention, consistently boosts the performance of various vision transformer architectures.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, Vision Transformers have attracted increasing interest from computer vision researchers. However, the advantage of these transformers over CNNs is fully manifested only when they are trained on large datasets, mainly due to the reduced inductive bias towards spatial locality in the transformer's self-attention mechanism. In this work, we present a data-efficient vision transformer that does not rely on self-attention. Instead, it employs a novel multi-axis generalization of the recent Hyena layer. We propose several alternative approaches for obtaining this generalization and examine their unique distinctions and considerations from both empirical and theoretical perspectives.
Our empirical findings indicate that the proposed Hyena N-D layer boosts the performance of various Vision Transformer architectures, such as ViT, Swin, and DeiT, across multiple datasets. Furthermore, in the small-dataset regime, our Hyena-based ViT compares favorably to ViT variants from the recent literature that are specifically designed to address the same challenge, i.e., working with small datasets or incorporating image-specific inductive bias into the self-attention mechanism. Finally, we show that a hybrid approach, which uses Hyena N-D for the first layers of the ViT followed by layers with conventional attention, consistently boosts the performance of various vision transformer architectures.
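To make the idea concrete, the sketch below shows one way a Hyena-style layer could be generalized from sequences to a 2-D grid of patch tokens: global convolutions along both spatial axes, applied via a 2-D FFT and interleaved with elementwise gating. This is an illustrative sketch under simplifying assumptions, not the paper's Hyena N-D implementation; the class name Hyena2DSketch, the explicit per-channel filters (the paper uses implicitly parameterized filters), the gating order, and the DeiT-Tiny-like dimensions in the usage example are all hypothetical choices made for brevity.

```python
# A minimal, illustrative sketch of a Hyena-style layer generalized to two
# spatial axes. NOT the authors' Hyena N-D implementation: the explicit filter
# parameterization, gating order, and circular (FFT) convolution are
# simplifying assumptions chosen to keep the example short.
import torch
import torch.nn as nn


class Hyena2DSketch(nn.Module):
    def __init__(self, dim: int, height: int, width: int, order: int = 2):
        super().__init__()
        self.order = order
        # Project the input into one "value" stream and `order` gating streams.
        self.in_proj = nn.Linear(dim, dim * (order + 1))
        self.out_proj = nn.Linear(dim, dim)
        # Explicit per-channel 2-D global filters, one per recurrence step
        # (an assumption; the paper parameterizes long filters implicitly).
        self.filters = nn.Parameter(0.01 * torch.randn(order, dim, height, width))

    def global_conv2d(self, x, h):
        # x: (B, C, H, W), h: (C, H, W); circular convolution via 2-D FFT.
        Hf = torch.fft.rfft2(h, s=x.shape[-2:])
        Xf = torch.fft.rfft2(x, s=x.shape[-2:])
        return torch.fft.irfft2(Xf * Hf, s=x.shape[-2:])

    def forward(self, x):
        # x: (B, H, W, C) token grid, e.g. reshaped ViT patch embeddings.
        B, H, W, C = x.shape
        streams = self.in_proj(x).permute(0, 3, 1, 2).chunk(self.order + 1, dim=1)
        v, gates = streams[0], streams[1:]
        for i, g in enumerate(gates):
            # Alternate global 2-D convolution with elementwise gating,
            # mirroring the Hyena recurrence along both spatial axes.
            v = g * self.global_conv2d(v, self.filters[i])
        return self.out_proj(v.permute(0, 2, 3, 1))


# Hypothetical usage: a 14x14 grid of 192-dim patch tokens (DeiT-Tiny-like).
if __name__ == "__main__":
    layer = Hyena2DSketch(dim=192, height=14, width=14)
    tokens = torch.randn(2, 14, 14, 192)
    print(layer(tokens).shape)  # torch.Size([2, 14, 14, 192])
```

In the hybrid configuration described in the abstract, blocks built around such a layer would replace self-attention only in the first few ViT layers, with conventional multi-head attention retained in the remaining ones.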
Related papers
- Denoising Vision Transformers [43.03068202384091]
We propose a two-stage denoising approach, termed Denoising Vision Transformers (DVT).
In the first stage, we separate the clean features from those contaminated by positional artifacts by enforcing cross-view feature consistency with neural fields on a per-image basis.
In the second stage, we train a lightweight transformer block to predict clean features from raw ViT outputs, leveraging the derived estimates of the clean features as supervision.
arXiv Detail & Related papers (2024-01-05T18:59:52Z)
- Accelerating Vision Transformers Based on Heterogeneous Attention Patterns [89.86293867174324]
Vision Transformers (ViTs) have attracted a lot of attention in the field of computer vision.
We propose an integrated compression pipeline based on observed heterogeneous attention patterns across layers.
Experimentally, the integrated compression pipeline of DGSSA and GLAD improves run-time throughput by up to 121%.
arXiv Detail & Related papers (2023-10-11T17:09:19Z)
- ViTs are Everywhere: A Comprehensive Study Showcasing Vision Transformers in Different Domain [0.0]
Vision Transformers (ViTs) are becoming more popular and dominant solutions for many vision problems.
ViTs can overcome several possible difficulties with convolutional neural networks (CNNs).
arXiv Detail & Related papers (2023-10-09T12:31:30Z)
- 2-D SSM: A General Spatial Layer for Visual Transformers [79.4957965474334]
A central objective in computer vision is to design models with appropriate 2-D inductive bias.
We leverage an expressive variation of the multidimensional State Space Model.
Our approach introduces efficient parameterization, accelerated computation, and a suitable normalization scheme.
arXiv Detail & Related papers (2023-06-11T09:41:37Z)
- A survey of the Vision Transformers and their CNN-Transformer based Variants [0.48163317476588563]
Vision transformers have become popular as a possible substitute for convolutional neural networks (CNNs) in a variety of computer vision applications.
These transformers, with their ability to focus on global relationships in images, offer large learning capacity.
Recently, hybrid vision transformers that combine the convolution operation with the self-attention mechanism have emerged to exploit both local and global image representations.
arXiv Detail & Related papers (2023-05-17T01:27:27Z)
- Can Vision Transformers Perform Convolution? [78.42076260340869]
We prove that a single ViT layer with image patches as the input can perform any convolution operation constructively.
We provide a lower bound on the number of heads for Vision Transformers to express CNNs.
arXiv Detail & Related papers (2021-11-02T03:30:17Z)
- Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)
- A Survey on Visual Transformer [126.56860258176324]
Transformer is a type of deep neural network mainly based on the self-attention mechanism.
In this paper, we review these vision transformer models by categorizing them in different tasks and analyzing their advantages and disadvantages.
arXiv Detail & Related papers (2020-12-23T09:37:54Z)