PaCa-ViT: Learning Patch-to-Cluster Attention in Vision Transformers
- URL: http://arxiv.org/abs/2203.11987v2
- Date: Fri, 7 Apr 2023 00:46:43 GMT
- Title: PaCa-ViT: Learning Patch-to-Cluster Attention in Vision Transformers
- Authors: Ryan Grainger, Thomas Paniagua, Xi Song, Naresh Cuntoor, Mun Wai Lee,
Tianfu Wu
- Abstract summary: This paper proposes to learn Patch-to-Cluster attention (PaCa) in Vision Transformers (ViT).
The proposed PaCa module is used in designing efficient and interpretable ViT backbones and semantic segmentation head networks.
It is significantly more efficient than PVT models in MS-COCO and MIT-ADE20k due to the linear complexity.
- Score: 9.63371509052453
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Transformers (ViTs) are built on the assumption of treating image
patches as "visual tokens" and learning patch-to-patch attention. The patch
embedding based tokenizer has a semantic gap with respect to its counterpart,
the textual tokenizer. The patch-to-patch attention suffers from the quadratic
complexity issue, and also makes it non-trivial to explain learned ViTs. To
address these issues in ViT, this paper proposes to learn Patch-to-Cluster
attention (PaCa) in ViT. Queries in our PaCa-ViT start with patches, while
keys and values are directly based on clustering (with a predefined small
number of clusters). The clusters are learned end-to-end, leading to better
tokenizers and inducing joint clustering-for-attention and
attention-for-clustering for better and interpretable models. The quadratic
complexity is relaxed to linear complexity. The proposed PaCa module is used in
designing efficient and interpretable ViT backbones and semantic segmentation
head networks. In experiments, the proposed methods are tested on ImageNet-1k
image classification, MS-COCO object detection and instance segmentation and
MIT-ADE20k semantic segmentation. Compared with the prior art, it outperforms
Swin and PVT on all three benchmarks, by significant margins on ImageNet-1k and
MIT-ADE20k. It is also significantly
more efficient than PVT models in MS-COCO and MIT-ADE20k due to the linear
complexity. The learned clusters are semantically meaningful. Code and model
checkpoints are available at https://github.com/iVMCL/PaCaViT.
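To make the patch-to-cluster mechanism concrete, below is a minimal PyTorch-style sketch of one PaCa attention block. It assumes that soft cluster assignments are predicted by a small linear head and that keys/values come from the resulting cluster tokens; the class and parameter names (PaCaAttention, num_clusters, ...) are illustrative assumptions, not the authors' implementation, which is available at the repository linked above.

```python
import torch
import torch.nn as nn


class PaCaAttention(nn.Module):
    """Illustrative patch-to-cluster attention (not the official code)."""

    def __init__(self, dim, num_heads=8, num_clusters=49):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        # Lightweight clustering head: soft assignment of N patches to M
        # clusters, learned end-to-end with the rest of the network.
        self.to_clusters = nn.Linear(dim, num_clusters)
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                     # x: (B, N, C) patch tokens
        B, N, C = x.shape
        # Softmax over the patch axis so each cluster token is a convex
        # combination of patch features: (B, N, M) -> (B, M, C).
        assign = self.to_clusters(x).softmax(dim=1)
        clusters = torch.einsum('bnm,bnc->bmc', assign, x)

        q = self.q(x).reshape(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        k, v = self.kv(clusters).reshape(B, -1, 2, self.num_heads,
                                         self.head_dim).permute(2, 0, 3, 1, 4)

        # Patch-to-cluster attention: the score matrix is (N x M), not (N x N),
        # so the cost grows linearly with the number of patches.
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

With the number of clusters fixed to a small constant M, the attention matrix is N x M rather than N x N, which is where the linear complexity claimed in the abstract comes from; the soft assignments also give each cluster a directly visualizable weighting over the image patches.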
Related papers
- Accelerating Transformers with Spectrum-Preserving Token Merging [43.463808781808645]
PiToMe prioritizes the preservation of informative tokens using an additional metric termed the energy score.
Experimental findings demonstrate that PiToMe saves 40-60% of the FLOPs of the base models.
arXiv Detail & Related papers (2024-05-25T09:37:01Z)
- Homogeneous Tokenizer Matters: Homogeneous Visual Tokenizer for Remote Sensing Image Understanding [13.920198434637223]
The tokenizer, as one of the fundamental components of large models, has long been overlooked or even misunderstood in visual tasks.
We design a simple HOmogeneous visual tOKenizer: HOOK.
To achieve homogeneity, the OPM splits the image into 4×4 pixel seeds and then uses the attention mechanism to perceive SIRs.
The OVM defines a variable number of learnable vectors as cross-attention queries, allowing the token quantity to be adjusted; a generic sketch of this learnable-query cross-attention follows below.
arXiv Detail & Related papers (2024-03-27T14:18:09Z)
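The learnable-query cross-attention described for the OVM above can be pictured with a short, generic sketch. The module below is a Perceiver-style resampler written purely as an illustration; the class name, default token count, and dimensions are assumptions and are not taken from the HOOK paper or its code.

```python
import torch
import torch.nn as nn


class LearnableQueryResampler(nn.Module):
    """Generic sketch: a configurable set of learnable query vectors
    cross-attends to an arbitrary number of input seed tokens, so the
    output token count is decoupled from the input resolution."""

    def __init__(self, dim, num_tokens=64, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, seeds):                      # seeds: (B, N, C)
        q = self.queries.unsqueeze(0).expand(seeds.size(0), -1, -1)
        out, _ = self.attn(q, seeds, seeds)        # (B, num_tokens, C)
        return self.norm(out)
```

Because the query count is a hyperparameter rather than a function of the input resolution, the number of output tokens can be adjusted independently of the image size.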
- S^2MVTC: a Simple yet Efficient Scalable Multi-View Tensor Clustering [38.35594663863098]
Experimental results on six large-scale multi-view datasets demonstrate that S^2MVTC significantly outperforms state-of-the-art algorithms in terms of clustering performance and CPU execution time.
arXiv Detail & Related papers (2024-03-14T05:00:29Z)
- SegViTv2: Exploring Efficient and Continual Semantic Segmentation with Plain Vision Transformers [76.13755422671822]
This paper investigates the capability of plain Vision Transformers (ViTs) for semantic segmentation using the encoder-decoder framework.
We introduce a novel Attention-to-Mask (ATM) module to design a lightweight decoder effective for plain ViT.
Our decoder outperforms the popular UPerNet decoder with various ViT backbones while consuming only about 5% of the computational cost.
arXiv Detail & Related papers (2023-06-09T22:29:56Z)
- Image as Set of Points [60.30495338399321]
Context clusters (CoCs) view an image as a set of unorganized points and extract features via a simplified clustering algorithm.
Our CoCs are convolution- and attention-free, relying only on a clustering algorithm for spatial interaction.
arXiv Detail & Related papers (2023-03-02T18:56:39Z)
- DilateFormer: Multi-Scale Dilated Transformer for Visual Recognition [62.95223898214866]
We explore effective Vision Transformers to pursue a preferable trade-off between computational complexity and the size of the attended receptive field.
With a pyramid architecture, we construct a Multi-Scale Dilated Transformer (DilateFormer) by stacking MSDA blocks at low-level stages and global multi-head self-attention blocks at high-level stages.
Our experiment results show that our DilateFormer achieves state-of-the-art performance on various vision tasks.
arXiv Detail & Related papers (2023-02-03T14:59:31Z)
- Vision Transformers: From Semantic Segmentation to Dense Prediction [139.15562023284187]
We explore the global context learning potentials of vision transformers (ViTs) for dense visual prediction.
Our motivation is that, by learning global context at the full receptive field layer by layer, ViTs may capture stronger long-range dependency information.
We formulate a family of Hierarchical Local-Global (HLG) Transformers, characterized by local attention within windows and global attention across windows in a pyramidal architecture.
arXiv Detail & Related papers (2022-07-19T15:49:35Z)
- Patch-level Representation Learning for Self-supervised Vision Transformers [68.8862419248863]
Vision Transformers (ViTs) have gained much attention recently as a better architectural choice, often outperforming convolutional networks for various visual tasks.
Inspired by this, we design a simple yet effective visual pretext task, coined SelfPatch, for learning better patch-level representations.
We demonstrate that SelfPatch can significantly improve the performance of existing SSL methods for various visual tasks.
arXiv Detail & Related papers (2022-06-16T08:01:19Z)
- Green Hierarchical Vision Transformer for Masked Image Modeling [54.14989750044489]
We present an efficient approach for Masked Image Modeling with hierarchical Vision Transformers (ViTs).
We design a Group Window Attention scheme following the Divide-and-Conquer strategy.
We further improve the grouping strategy via the Dynamic Programming algorithm to minimize the overall cost of the attention on the grouped patches.
arXiv Detail & Related papers (2022-05-26T17:34:42Z)
- So-ViT: Mind Visual Tokens for Vision Transformer [27.243241133304785]
We propose a new classification paradigm in which second-order, cross-covariance pooling of visual tokens is combined with the class token for the final classification; a generic sketch of such pooling follows this entry.
We develop a lightweight, hierarchical module based on off-the-shelf convolutions for visual token embedding.
The results show our models, when trained from scratch, outperform the competing ViT variants, while being on par with or better than state-of-the-art CNN models.
arXiv Detail & Related papers (2021-04-22T09:05:09Z)
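As an illustration of the second-order idea in the So-ViT entry above, the sketch below computes a cross-covariance of projected visual tokens and fuses it with a class-token head. The projection size and the additive fusion are assumptions made for the sketch, not So-ViT's exact design.

```python
import torch
import torch.nn as nn


class SecondOrderPooling(nn.Module):
    """Generic sketch of second-order (cross-covariance) pooling of visual
    tokens combined with a class-token classifier."""

    def __init__(self, dim, proj_dim=32, num_classes=1000):
        super().__init__()
        self.proj_a = nn.Linear(dim, proj_dim)
        self.proj_b = nn.Linear(dim, proj_dim)
        self.token_head = nn.Linear(proj_dim * proj_dim, num_classes)
        self.cls_head = nn.Linear(dim, num_classes)

    def forward(self, cls_token, tokens):          # (B, C), (B, N, C)
        a = self.proj_a(tokens)                    # (B, N, p)
        b = self.proj_b(tokens)                    # (B, N, p)
        # Cross-covariance of the two projections, averaged over tokens.
        cov = torch.einsum('bnp,bnq->bpq', a, b) / tokens.size(1)
        second_order = self.token_head(cov.flatten(1))
        return second_order + self.cls_head(cls_token)   # fused logits
```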