SWAT: Spatial Structure Within and Among Tokens
- URL: http://arxiv.org/abs/2111.13677v3
- Date: Mon, 20 Nov 2023 16:37:05 GMT
- Title: SWAT: Spatial Structure Within and Among Tokens
- Authors: Kumara Kahatapitiya and Michael S. Ryoo
- Abstract summary: We argue that models can achieve significant gains when spatial structure is preserved during tokenization.
We propose two key contributions: (1) Structure-aware Tokenization and (2) Structure-aware Mixing.
- Score: 53.525469741515884
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modeling visual data as tokens (i.e., image patches) using attention
mechanisms, feed-forward networks or convolutions has been highly effective in
recent years. Such methods usually have a common pipeline: a tokenization
method, followed by a set of layers/blocks for information mixing, both within
and among tokens. When image patches are converted into tokens, they are often
flattened, discarding the spatial structure within each patch. As a result, any
processing that follows (e.g., multi-head self-attention) may fail to recover
and/or benefit from such information. In this paper, we argue that models can
achieve significant gains when spatial structure is preserved during
tokenization and explicitly used during the mixing stage. We propose two key
contributions: (1) Structure-aware Tokenization and (2) Structure-aware
Mixing, both of which can be combined with existing models with minimal effort.
We introduce a family of models (SWAT), showing improvements over the likes of
DeiT, MLP-Mixer and Swin Transformer, across multiple benchmarks including
ImageNet classification and ADE20K segmentation. Our code is available at
https://github.com/kkahatapitiya/SWAT.
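To make the flattening issue concrete, below is a minimal PyTorch sketch written for this page, not taken from the SWAT repository: the module names, sizes, and the depthwise-convolution mixer are illustrative assumptions. It contrasts standard ViT-style tokenization, which flattens each patch into a vector, with a tokenizer that keeps a small 2D grid inside every token so information can still be mixed spatially within it.
```python
# A self-contained sketch contrasting "flattening" tokenization with a
# structure-preserving alternative. Illustration only, not the SWAT code;
# all names and sizes here are invented for the example.
import torch
import torch.nn as nn

class FlatteningTokenizer(nn.Module):
    """Standard ViT-style tokenization: each PxP patch becomes one flat vector."""
    def __init__(self, in_ch=3, patch=16, dim=192):
        super().__init__()
        # kernel = stride = patch collapses every patch into `dim` channels
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, x):                    # x: (B, 3, H, W)
        t = self.proj(x)                     # (B, dim, H/P, W/P)
        return t.flatten(2).transpose(1, 2)  # (B, N, dim); layout inside each patch is gone

class StructureAwareTokenizer(nn.Module):
    """Sketch of structure-preserving tokenization: keep a 2D grid per token."""
    def __init__(self, in_ch=3, patch=16, grid=4, dim=192):
        super().__init__()
        assert dim % (grid * grid) == 0
        self.grid = grid
        d = dim // (grid * grid)             # channels per sub-patch
        sub = patch // grid                  # sub-patch size inside a token
        # embed at a finer stride so each token keeps a grid x grid layout
        self.proj = nn.Conv2d(in_ch, d, kernel_size=sub, stride=sub)
        # cheap spatial mixing on the intact 2D grid (a depthwise conv);
        # flattening at tokenization time would forfeit this step
        self.mix = nn.Conv2d(d, d, kernel_size=3, padding=1, groups=d)

    def forward(self, x):                    # x: (B, 3, H, W)
        f = self.mix(self.proj(x))           # (B, d, H/sub, W/sub), structure intact
        B, d, Hs, Ws = f.shape
        g = self.grid
        # regroup the fine grid into tokens only at the very end: (B, N, d*g*g)
        f = f.reshape(B, d, Hs // g, g, Ws // g, g)
        return f.permute(0, 2, 4, 1, 3, 5).reshape(B, (Hs // g) * (Ws // g), d * g * g)

x = torch.randn(2, 3, 224, 224)
print(FlatteningTokenizer()(x).shape)      # torch.Size([2, 196, 192])
print(StructureAwareTokenizer()(x).shape)  # torch.Size([2, 196, 192])
```
Both variants emit tokens of the same shape, so a structure-preserving tokenizer can slot into an existing pipeline; the difference is that it keeps the layout inside each patch long enough to exploit it.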
Related papers
- Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding [54.532578213126065]
Most document understanding methods preserve all tokens within sub-images and treat them equally.
This neglects their different informativeness and leads to a significant increase in the number of image tokens.
We propose Token-level Correlation-guided Compression, a parameter-free and plug-and-play methodology to optimize token processing.
arXiv Detail & Related papers (2024-07-19T16:11:15Z)
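As a toy illustration of the compression idea in the entry above, a hedged sketch: the cosine-similarity redundancy score and the keep ratio are invented stand-ins, not the paper's correlation measure.
```python
# Toy sketch of correlation-guided token compression: rank image tokens by
# how redundant they are with the rest and keep only the least redundant.
# The scoring rule and ratio are invented for illustration.
import torch

def compress(tokens, keep_ratio=0.5):
    """tokens: (N, D) -> (M, D); drop tokens highly correlated with others."""
    xn = torch.nn.functional.normalize(tokens, dim=1)
    sim = xn @ xn.T                            # (N, N) cosine similarities
    sim.fill_diagonal_(0)                      # ignore self-similarity
    redundancy = sim.max(dim=1).values         # strongest match elsewhere
    n_keep = max(1, int(keep_ratio * len(tokens)))
    keep = redundancy.topk(n_keep, largest=False).indices
    return tokens[keep]

print(compress(torch.randn(576, 768)).shape)   # torch.Size([288, 768])
```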
- LeMeViT: Efficient Vision Transformer with Learnable Meta Tokens for Remote Sensing Image Interpretation [37.72775203647514]
This paper proposes to use learnable meta tokens to formulate sparse tokens, which effectively learn key information and improve inference speed.
By employing Dual Cross-Attention (DCA) in the early stages with dense visual tokens, we obtain the hierarchical architecture LeMeViT with various sizes.
Experimental results in classification and dense prediction tasks show that LeMeViT has a significant $1.7\times$ speedup, fewer parameters, and competitive performance compared to the baseline models.
arXiv Detail & Related papers (2024-05-16T03:26:06Z)
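A rough sketch of the meta-token idea from the LeMeViT entry above, with invented sizes and one shared attention module standing in for the paper's Dual Cross-Attention (DCA) blocks.
```python
# Toy sketch: a small set of learnable meta tokens attends to the dense
# image tokens to summarize them, then the dense tokens read back from
# the (much cheaper) meta tokens. Sizes are invented; a real DCA block
# would use separate, per-stage attention weights.
import torch
import torch.nn as nn

d, n_meta = 256, 16
attn = nn.MultiheadAttention(d, num_heads=8, batch_first=True)
meta = nn.Parameter(torch.randn(1, n_meta, d))   # learnable meta tokens

dense = torch.randn(2, 196, d)                   # dense visual tokens
m = meta.expand(2, -1, -1)
m, _ = attn(m, dense, dense)        # meta summarizes dense tokens
dense, _ = attn(dense, m, m)        # dense tokens attend to 16 meta tokens only
print(m.shape, dense.shape)         # (2, 16, 256) (2, 196, 256)
```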
- Information Flow Routes: Automatically Interpreting Language Models at Scale [9.156549818722581]
Information flows along routes inside the network via mechanisms implemented in the model.
We build these graphs in a top-down manner, for each prediction leaving only the most important nodes and edges.
We show that some model components can be specialized for domains such as coding or multilingual text.
arXiv Detail & Related papers (2024-02-27T00:24:42Z)
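A toy version of the route extraction described above, under a big assumption: per-edge importance scores are already computed. The traversal details are invented.
```python
# Toy sketch: starting from the output node, walk backwards through a
# precomputed importance matrix, keeping only each node's top-k inputs.
# The matrix here is random; nothing about real attribution is implied.
import torch

def top_routes(attrib, start, k=2, depth=3):
    """attrib[i, j] = importance of edge j -> i; returns the kept edges."""
    edges, frontier = [], [start]
    for _ in range(depth):
        nxt = []
        for node in frontier:
            top = attrib[node].topk(k).indices.tolist()  # strongest inputs
            edges += [(j, node) for j in top]
            nxt += top
        frontier = nxt
    return edges

attrib = torch.rand(10, 10)          # fake component-to-component importances
print(top_routes(attrib, start=9))   # sparse graph: most edges are discarded
```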
- Learning to Embed Time Series Patches Independently [5.752266579415516]
Masked time series modeling has recently gained much attention as a self-supervised representation learning strategy for time series.
We argue that capturing such patch dependencies might not be an optimal strategy for time series representation learning.
We propose to use 1) the simple patch reconstruction task, which autoencodes each patch without looking at other patches, and 2) the simple patch-wise MLP that embeds each patch independently.
arXiv Detail & Related papers (2023-12-27T06:23:29Z)
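A minimal sketch of the two proposals above, with made-up dimensions and a plain MLP standing in for the paper's architecture.
```python
# Toy sketch: a shared MLP embeds every time-series patch independently
# (no attention across patches), and a linear head autoencodes each patch
# without looking at the others. Dimensions are invented.
import torch
import torch.nn as nn

patch_len, d_model = 12, 128
embed = nn.Sequential(nn.Linear(patch_len, d_model), nn.ReLU(),
                      nn.Linear(d_model, d_model))  # shared per-patch MLP
decode = nn.Linear(d_model, patch_len)              # per-patch reconstruction

series = torch.randn(32, 30, patch_len)    # (batch, num_patches, patch_len)
z = embed(series)                          # embeds each patch independently
loss = (decode(z) - series).pow(2).mean()  # reconstruct each patch alone
loss.backward()
print(z.shape, loss.item())
```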
- Token Fusion: Bridging the Gap between Token Pruning and Token Merging [71.84591084401458]
Vision Transformers (ViTs) have emerged as powerful backbones in computer vision, outperforming many traditional CNNs.
However, their computational overhead, largely attributed to the self-attention mechanism, makes deployment on resource-constrained edge devices challenging.
We introduce "Token Fusion" (ToFu), a method that amalgamates the benefits of both token pruning and token merging.
arXiv Detail & Related papers (2023-12-02T04:29:19Z)
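A hedged sketch of the pruning-plus-merging idea from the Token Fusion entry above; the norm-based importance score, the neighbor-pair merging rule, and the thresholds are invented, not ToFu's actual procedure.
```python
# Toy sketch: prune the least important tokens, then merge surviving
# neighbors that are nearly identical. Scores and thresholds are made up.
import torch

def fuse_tokens(x, keep=0.75, sim_thresh=0.9):
    """x: (N, D) tokens -> fewer tokens via pruning, then pairwise merging."""
    n_keep = max(1, int(keep * len(x)))
    x = x[x.norm(dim=1).topk(n_keep).indices]      # prune by a stand-in score
    xn = torch.nn.functional.normalize(x, dim=1)
    sim = (xn[:-1] * xn[1:]).sum(dim=1)            # neighbor cosine similarity
    out, i = [], 0
    while i < len(x):
        if i + 1 < len(x) and sim[i] > sim_thresh:
            out.append((x[i] + x[i + 1]) / 2)      # merge the similar pair
            i += 2
        else:
            out.append(x[i])                       # keep as-is
            i += 1
    return torch.stack(out)

tokens = torch.randn(196, 384)
print(fuse_tokens(tokens).shape)                   # at most (147, 384)
```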
- UMIFormer: Mining the Correlations between Similar Tokens for Multi-View 3D Reconstruction [9.874357856580447]
We propose a novel transformer network for Unstructured Multiple Images (UMIFormer).
It exploits transformer blocks for decoupled intra-view encoding and designed blocks for token rectification.
All tokens acquired from various branches are compressed into a fixed-size compact representation.
arXiv Detail & Related papers (2023-02-27T17:27:45Z)
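One way to picture the fixed-size compression step from the UMIFormer entry above, using an invented soft-assignment mechanism rather than the paper's designed rectification blocks.
```python
# Toy sketch: squeeze a variable number of view tokens into a fixed-size
# compact representation by soft-assigning them to a fixed set of slots.
# The mechanism and sizes are invented for illustration.
import torch
import torch.nn as nn

d, n_slots = 256, 64
assign = nn.Linear(d, n_slots)            # token -> slot affinities

tokens = torch.randn(2, 5 * 196, d)       # tokens gathered from 5 views
w = assign(tokens).softmax(dim=1)         # (B, N, n_slots), normalized over N
compact = torch.einsum('bns,bnd->bsd', w, tokens)
print(compact.shape)                      # torch.Size([2, 64, 256]), fixed size
```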
- CenterCLIP: Token Clustering for Efficient Text-Video Retrieval [67.21528544724546]
In CLIP, the essential visual tokenization process, which produces discrete visual token sequences, generates many homogeneous tokens due to the redundant nature of consecutive frames in videos.
This significantly increases computation costs and hinders the deployment of video retrieval models in web applications.
In this paper, we design a multi-segment token clustering algorithm to find the most representative tokens and drop the non-essential ones.
arXiv Detail & Related papers (2022-05-02T12:02:09Z)
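A toy single-segment version of the token clustering described in the CenterCLIP entry above; the paper clusters within multiple temporal segments, and the plain k-means below is our assumption.
```python
# Toy sketch: cluster the visual tokens of one video segment and keep one
# representative (medoid) per cluster, dropping the redundant rest.
import torch

def cluster_tokens(tokens, k=49, iters=10):
    """tokens: (N, D) -> (k, D) representatives via simple k-means."""
    centers = tokens[torch.randperm(len(tokens))[:k]].clone()
    for _ in range(iters):
        assign = torch.cdist(tokens, centers).argmin(dim=1)
        for j in range(k):
            members = tokens[assign == j]
            if len(members) > 0:
                centers[j] = members.mean(dim=0)
    # snap each center to its closest real token so outputs stay real tokens
    keep = torch.cdist(centers, tokens).argmin(dim=1)
    return tokens[keep]

frames, per_frame, dim = 12, 196, 512
video_tokens = torch.randn(frames * per_frame, dim)   # one segment's tokens
print(cluster_tokens(video_tokens).shape)             # torch.Size([49, 512])
```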
- PSViT: Better Vision Transformer via Token Pooling and Attention Sharing [114.8051035856023]
We propose PSViT: a ViT with token Pooling and attention Sharing to reduce redundancy.
Experimental results show that the proposed scheme can achieve up to 6.6% accuracy improvement in ImageNet classification.
arXiv Detail & Related papers (2021-08-07T11:30:54Z)
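A minimal sketch of the token-pooling half of the PSViT entry above, with invented grid sizes; the attention-sharing half is omitted.
```python
# Toy sketch: between stages, fold the token sequence back into its 2D
# grid and average-pool it, so later stages see 4x fewer tokens.
import torch
import torch.nn.functional as F

def pool_tokens(x, h, w):
    """x: (B, h*w, D) -> (B, (h//2)*(w//2), D) via 2x2 average pooling."""
    B, N, D = x.shape
    grid = x.transpose(1, 2).reshape(B, D, h, w)  # back to a 2D token grid
    grid = F.avg_pool2d(grid, kernel_size=2)      # (B, D, h/2, w/2)
    return grid.flatten(2).transpose(1, 2)

x = torch.randn(2, 196, 384)
print(pool_tokens(x, 14, 14).shape)               # torch.Size([2, 49, 384])
```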
- VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning [82.09856883441044]
Video understanding relies on perceiving the global content and modeling its internal connections.
We propose a block-wise strategy where we mask neighboring video tokens in both spatial and temporal domains.
We also add an augmentation-free contrastive learning method to further capture global content.
arXiv Detail & Related papers (2021-06-21T16:48:19Z)
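A toy sketch of the block-wise masking strategy from the VIMPAC entry above; the block shape and mask ratio below are invented for illustration.
```python
# Toy sketch: instead of masking video tokens independently, mask
# contiguous 3D blocks spanning time and space, so a model cannot cheat
# by copying an immediate neighbor. Block size and ratio are made up.
import torch

def blockwise_mask(t=8, h=14, w=14, block=(2, 4, 4), ratio=0.5):
    """Return a (t, h, w) bool mask; True marks masked token positions."""
    mask = torch.zeros(t, h, w, dtype=torch.bool)
    bt, bh, bw = block
    while mask.sum() < ratio * t * h * w:
        # sample the corner of one contiguous spatiotemporal block
        t0 = torch.randint(0, t - bt + 1, (1,)).item()
        h0 = torch.randint(0, h - bh + 1, (1,)).item()
        w0 = torch.randint(0, w - bw + 1, (1,)).item()
        mask[t0:t0 + bt, h0:h0 + bh, w0:w0 + bw] = True
    return mask

m = blockwise_mask()
print(m.shape, m.float().mean().item())  # torch.Size([8, 14, 14]), ~0.5 masked
```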
This list is automatically generated from the titles and abstracts of the papers on this site.