SWAT: Spatial Structure Within and Among Tokens
- URL: http://arxiv.org/abs/2111.13677v3
- Date: Mon, 20 Nov 2023 16:37:05 GMT
- Title: SWAT: Spatial Structure Within and Among Tokens
- Authors: Kumara Kahatapitiya and Michael S. Ryoo
- Abstract summary: We argue that models can achieve significant gains when spatial structure is preserved during tokenization.
We propose two key contributions: (1) Structure-aware Tokenization and (2) Structure-aware Mixing.
- Score: 53.525469741515884
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modeling visual data as tokens (i.e., image patches) using attention
mechanisms, feed-forward networks or convolutions has been highly effective in
recent years. Such methods usually have a common pipeline: a tokenization
method, followed by a set of layers/blocks for information mixing, both within
and among tokens. When image patches are converted into tokens, they are often
flattened, discarding the spatial structure within each patch. As a result, any
processing that follows (e.g., multi-head self-attention) may fail to recover
and/or benefit from such information. In this paper, we argue that models can
achieve significant gains when spatial structure is preserved during
tokenization and explicitly used during the mixing stage. We propose two key
contributions: (1) Structure-aware Tokenization and (2) Structure-aware
Mixing, both of which can be combined with existing models with minimal effort.
We introduce a family of models (SWAT), showing improvements over the likes of
DeiT, MLP-Mixer and Swin Transformer, across multiple benchmarks including
ImageNet classification and ADE20K segmentation. Our code is available at
https://github.com/kkahatapitiya/SWAT.
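To make the flattening issue concrete, below is a minimal PyTorch sketch written for this page, not taken from the SWAT repository: the module names, sizes, and the depthwise-convolution mixer are illustrative assumptions. It contrasts standard ViT-style tokenization, which flattens each patch into a vector, with a tokenizer that keeps a small 2D grid inside every token so information can still be mixed spatially within it.
```python
# A self-contained sketch contrasting "flattening" tokenization with a
# structure-preserving alternative. Illustration only, not the SWAT code;
# all names and sizes here are invented for the example.
import torch
import torch.nn as nn

class FlatteningTokenizer(nn.Module):
    """Standard ViT-style tokenization: each PxP patch becomes one flat vector."""
    def __init__(self, in_ch=3, patch=16, dim=192):
        super().__init__()
        # kernel = stride = patch collapses every patch into `dim` channels
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, x):                    # x: (B, 3, H, W)
        t = self.proj(x)                     # (B, dim, H/P, W/P)
        return t.flatten(2).transpose(1, 2)  # (B, N, dim); layout inside each patch is gone

class StructureAwareTokenizer(nn.Module):
    """Sketch of structure-preserving tokenization: keep a 2D grid per token."""
    def __init__(self, in_ch=3, patch=16, grid=4, dim=192):
        super().__init__()
        assert dim % (grid * grid) == 0
        self.grid = grid
        d = dim // (grid * grid)             # channels per sub-patch
        sub = patch // grid                  # sub-patch size inside a token
        # embed at a finer stride so each token keeps a grid x grid layout
        self.proj = nn.Conv2d(in_ch, d, kernel_size=sub, stride=sub)
        # cheap spatial mixing on the intact 2D grid (a depthwise conv);
        # flattening at tokenization time would forfeit this step
        self.mix = nn.Conv2d(d, d, kernel_size=3, padding=1, groups=d)

    def forward(self, x):                    # x: (B, 3, H, W)
        f = self.mix(self.proj(x))           # (B, d, H/sub, W/sub), structure intact
        B, d, Hs, Ws = f.shape
        g = self.grid
        # regroup the fine grid into tokens only at the very end: (B, N, d*g*g)
        f = f.reshape(B, d, Hs // g, g, Ws // g, g)
        return f.permute(0, 2, 4, 1, 3, 5).reshape(B, (Hs // g) * (Ws // g), d * g * g)

x = torch.randn(2, 3, 224, 224)
print(FlatteningTokenizer()(x).shape)      # torch.Size([2, 196, 192])
print(StructureAwareTokenizer()(x).shape)  # torch.Size([2, 196, 192])
```
Both variants emit tokens of the same shape, so a structure-preserving tokenizer can slot into an existing pipeline; the difference is that it keeps the layout inside each patch long enough to exploit it.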
Related papers
- Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding [54.532578213126065]
Most document understanding methods preserve all tokens within sub-images and treat them equally.
This neglects their different informativeness and leads to a significant increase in the number of image tokens.
We propose Token-level Correlation-guided Compression, a parameter-free and plug-and-play methodology to optimize token processing.
arXiv Detail & Related papers (2024-07-19T16:11:15Z)
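As a toy illustration of the compression idea in the entry above, a hedged sketch: the cosine-similarity redundancy score and the keep ratio are invented stand-ins, not the paper's correlation measure.
```python
# Toy sketch of correlation-guided token compression: rank image tokens by
# how redundant they are with the rest and keep only the least redundant.
# The scoring rule and ratio are invented for illustration.
import torch

def compress(tokens, keep_ratio=0.5):
    """tokens: (N, D) -> (M, D); drop tokens highly correlated with others."""
    xn = torch.nn.functional.normalize(tokens, dim=1)
    sim = xn @ xn.T                            # (N, N) cosine similarities
    sim.fill_diagonal_(0)                      # ignore self-similarity
    redundancy = sim.max(dim=1).values         # strongest match elsewhere
    n_keep = max(1, int(keep_ratio * len(tokens)))
    keep = redundancy.topk(n_keep, largest=False).indices
    return tokens[keep]

print(compress(torch.randn(576, 768)).shape)   # torch.Size([288, 768])
```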
- LeMeViT: Efficient Vision Transformer with Learnable Meta Tokens for Remote Sensing Image Interpretation [37.72775203647514]
This paper proposes to use learnable meta tokens to formulate sparse tokens, which effectively learn key information and improve inference speed.
By employing Dual Cross-Attention (DCA) in the early stages with dense visual tokens, we obtain the hierarchical architecture LeMeViT with various sizes.
Experimental results in classification and dense prediction tasks show that LeMeViT has a significant $1.7\times$ speedup, fewer parameters, and competitive performance compared to the baseline models.
arXiv Detail & Related papers (2024-05-16T03:26:06Z)
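A rough sketch of the meta-token idea from the LeMeViT entry above, with invented sizes and one shared attention module standing in for the paper's Dual Cross-Attention (DCA) blocks.
```python
# Toy sketch: a small set of learnable meta tokens attends to the dense
# image tokens to summarize them, then the dense tokens read back from
# the (much cheaper) meta tokens. Sizes are invented; a real DCA block
# would use separate, per-stage attention weights.
import torch
import torch.nn as nn

d, n_meta = 256, 16
attn = nn.MultiheadAttention(d, num_heads=8, batch_first=True)
meta = nn.Parameter(torch.randn(1, n_meta, d))   # learnable meta tokens

dense = torch.randn(2, 196, d)                   # dense visual tokens
m = meta.expand(2, -1, -1)
m, _ = attn(m, dense, dense)        # meta summarizes dense tokens
dense, _ = attn(dense, m, m)        # dense tokens attend to 16 meta tokens only
print(m.shape, dense.shape)         # (2, 16, 256) (2, 196, 256)
```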
- Information Flow Routes: Automatically Interpreting Language Models at Scale [9.156549818722581]
Information flows along routes inside the network via mechanisms implemented in the model.
We build these graphs in a top-down manner, for each prediction leaving only the most important nodes and edges.
We show that some model components can be specialized for domains such as coding or multilingual text.
arXiv Detail & Related papers (2024-02-27T00:24:42Z)
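A toy version of the route extraction described above, under a big assumption: per-edge importance scores are already computed. The traversal details are invented.
```python
# Toy sketch: starting from the output node, walk backwards through a
# precomputed importance matrix, keeping only each node's top-k inputs.
# The matrix here is random; nothing about real attribution is implied.
import torch

def top_routes(attrib, start, k=2, depth=3):
    """attrib[i, j] = importance of edge j -> i; returns the kept edges."""
    edges, frontier = [], [start]
    for _ in range(depth):
        nxt = []
        for node in frontier:
            top = attrib[node].topk(k).indices.tolist()  # strongest inputs
            edges += [(j, node) for j in top]
            nxt += top
        frontier = nxt
    return edges

attrib = torch.rand(10, 10)          # fake component-to-component importances
print(top_routes(attrib, start=9))   # sparse graph: most edges are discarded
```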
- Learning to Embed Time Series Patches Independently [5.752266579415516]
Masked time series modeling has recently gained much attention as a self-supervised representation learning strategy for time series.
We argue that capturing such patch dependencies might not be an optimal strategy for time series representation learning.
We propose to use 1) the simple patch reconstruction task, which autoencodes each patch without looking at other patches, and 2) the simple patch-wise MLP that embeds each patch independently.
arXiv Detail & Related papers (2023-12-27T06:23:29Z)
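A minimal sketch of the two proposals above, with made-up dimensions and a plain MLP standing in for the paper's architecture.
```python
# Toy sketch: a shared MLP embeds every time-series patch independently
# (no attention across patches), and a linear head autoencodes each patch
# without looking at the others. Dimensions are invented.
import torch
import torch.nn as nn

patch_len, d_model = 12, 128
embed = nn.Sequential(nn.Linear(patch_len, d_model), nn.ReLU(),
                      nn.Linear(d_model, d_model))  # shared per-patch MLP
decode = nn.Linear(d_model, patch_len)              # per-patch reconstruction

series = torch.randn(32, 30, patch_len)    # (batch, num_patches, patch_len)
z = embed(series)                          # embeds each patch independently
loss = (decode(z) - series).pow(2).mean()  # reconstruct each patch alone
loss.backward()
print(z.shape, loss.item())
```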
- Token Fusion: Bridging the Gap between Token Pruning and Token Merging [71.84591084401458]
Vision Transformers (ViTs) have emerged as powerful backbones in computer vision, outperforming many traditional CNNs.
However, their computational overhead, largely attributed to the self-attention mechanism, makes deployment on resource-constrained edge devices challenging.
We introduce "Token Fusion" (ToFu), a method that amalgamates the benefits of both token pruning and token merging.
arXiv Detail & Related papers (2023-12-02T04:29:19Z)
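A hedged sketch of the pruning-plus-merging idea from the Token Fusion entry above; the norm-based importance score, the neighbor-pair merging rule, and the thresholds are invented, not ToFu's actual procedure.
```python
# Toy sketch: prune the least important tokens, then merge surviving
# neighbors that are nearly identical. Scores and thresholds are made up.
import torch

def fuse_tokens(x, keep=0.75, sim_thresh=0.9):
    """x: (N, D) tokens -> fewer tokens via pruning, then pairwise merging."""
    n_keep = max(1, int(keep * len(x)))
    x = x[x.norm(dim=1).topk(n_keep).indices]      # prune by a stand-in score
    xn = torch.nn.functional.normalize(x, dim=1)
    sim = (xn[:-1] * xn[1:]).sum(dim=1)            # neighbor cosine similarity
    out, i = [], 0
    while i < len(x):
        if i + 1 < len(x) and sim[i] > sim_thresh:
            out.append((x[i] + x[i + 1]) / 2)      # merge the similar pair
            i += 2
        else:
            out.append(x[i])                       # keep as-is
            i += 1
    return torch.stack(out)

tokens = torch.randn(196, 384)
print(fuse_tokens(tokens).shape)                   # at most (147, 384)
```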
- UMIFormer: Mining the Correlations between Similar Tokens for Multi-View 3D Reconstruction [9.874357856580447]
We propose a novel transformer network for Unstructured Multiple Images (UMIFormer).
It exploits transformer blocks for decoupled intra-view encoding and designed blocks for token rectification.
All tokens acquired from various branches are compressed into a fixed-size compact representation.
arXiv Detail & Related papers (2023-02-27T17:27:45Z)
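One way to picture the fixed-size compression step from the UMIFormer entry above, using an invented soft-assignment mechanism rather than the paper's designed rectification blocks.
```python
# Toy sketch: squeeze a variable number of view tokens into a fixed-size
# compact representation by soft-assigning them to a fixed set of slots.
# The mechanism and sizes are invented for illustration.
import torch
import torch.nn as nn

d, n_slots = 256, 64
assign = nn.Linear(d, n_slots)            # token -> slot affinities

tokens = torch.randn(2, 5 * 196, d)       # tokens gathered from 5 views
w = assign(tokens).softmax(dim=1)         # (B, N, n_slots), normalized over N
compact = torch.einsum('bns,bnd->bsd', w, tokens)
print(compact.shape)                      # torch.Size([2, 64, 256]), fixed size
```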
- CenterCLIP: Token Clustering for Efficient Text-Video Retrieval [67.21528544724546]
In CLIP, the essential visual tokenization process, which produces discrete visual token sequences, generates many homogeneous tokens due to the redundant nature of consecutive frames in videos.
This significantly increases computation costs and hinders the deployment of video retrieval models in web applications.
In this paper, we design a multi-segment token clustering algorithm to find the most representative tokens and drop the non-essential ones.
arXiv Detail & Related papers (2022-05-02T12:02:09Z)
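A toy single-segment version of the token clustering described in the CenterCLIP entry above; the paper clusters within multiple temporal segments, and the plain k-means below is our assumption.
```python
# Toy sketch: cluster the visual tokens of one video segment and keep one
# representative (medoid) per cluster, dropping the redundant rest.
import torch

def cluster_tokens(tokens, k=49, iters=10):
    """tokens: (N, D) -> (k, D) representatives via simple k-means."""
    centers = tokens[torch.randperm(len(tokens))[:k]].clone()
    for _ in range(iters):
        assign = torch.cdist(tokens, centers).argmin(dim=1)
        for j in range(k):
            members = tokens[assign == j]
            if len(members) > 0:
                centers[j] = members.mean(dim=0)
    # snap each center to its closest real token so outputs stay real tokens
    keep = torch.cdist(centers, tokens).argmin(dim=1)
    return tokens[keep]

frames, per_frame, dim = 12, 196, 512
video_tokens = torch.randn(frames * per_frame, dim)   # one segment's tokens
print(cluster_tokens(video_tokens).shape)             # torch.Size([49, 512])
```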
- PSViT: Better Vision Transformer via Token Pooling and Attention Sharing [114.8051035856023]
We propose PSViT: a ViT with token Pooling and attention Sharing to reduce redundancy.
Experimental results show that the proposed scheme can achieve up to 6.6% accuracy improvement in ImageNet classification.
arXiv Detail & Related papers (2021-08-07T11:30:54Z)
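A minimal sketch of the token-pooling half of the PSViT entry above, with invented grid sizes; the attention-sharing half is omitted.
```python
# Toy sketch: between stages, fold the token sequence back into its 2D
# grid and average-pool it, so later stages see 4x fewer tokens.
import torch
import torch.nn.functional as F

def pool_tokens(x, h, w):
    """x: (B, h*w, D) -> (B, (h//2)*(w//2), D) via 2x2 average pooling."""
    B, N, D = x.shape
    grid = x.transpose(1, 2).reshape(B, D, h, w)  # back to a 2D token grid
    grid = F.avg_pool2d(grid, kernel_size=2)      # (B, D, h/2, w/2)
    return grid.flatten(2).transpose(1, 2)

x = torch.randn(2, 196, 384)
print(pool_tokens(x, 14, 14).shape)               # torch.Size([2, 49, 384])
```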
- VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning [82.09856883441044]
Video understanding relies on perceiving the global content and modeling its internal connections.
We propose a block-wise strategy where we mask neighboring video tokens in both spatial and temporal domains.
We also add an augmentation-free contrastive learning method to further capture global content.
arXiv Detail & Related papers (2021-06-21T16:48:19Z)
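A toy sketch of the block-wise masking strategy from the VIMPAC entry above; the block shape and mask ratio below are invented for illustration.
```python
# Toy sketch: instead of masking video tokens independently, mask
# contiguous 3D blocks spanning time and space, so a model cannot cheat
# by copying an immediate neighbor. Block size and ratio are made up.
import torch

def blockwise_mask(t=8, h=14, w=14, block=(2, 4, 4), ratio=0.5):
    """Return a (t, h, w) bool mask; True marks masked token positions."""
    mask = torch.zeros(t, h, w, dtype=torch.bool)
    bt, bh, bw = block
    while mask.sum() < ratio * t * h * w:
        # sample the corner of one contiguous spatiotemporal block
        t0 = torch.randint(0, t - bt + 1, (1,)).item()
        h0 = torch.randint(0, h - bh + 1, (1,)).item()
        w0 = torch.randint(0, w - bw + 1, (1,)).item()
        mask[t0:t0 + bt, h0:h0 + bh, w0:w0 + bw] = True
    return mask

m = blockwise_mask()
print(m.shape, m.float().mean().item())  # torch.Size([8, 14, 14]), ~0.5 masked
```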
This list is automatically generated from the titles and abstracts of the papers on this site.