What Makes for Good Tokenizers in Vision Transformer?
- URL: http://arxiv.org/abs/2212.11115v1
- Date: Wed, 21 Dec 2022 15:51:43 GMT
- Title: What Makes for Good Tokenizers in Vision Transformer?
- Authors: Shengju Qian, Yi Zhu, Wenbo Li, Mu Li, Jiaya Jia
- Abstract summary: Transformers extract pairwise relationships among tokens using self-attention.
What makes for a good tokenizer has not been well understood in computer vision.
Modulation across Tokens (MoTo) incorporates inter-token modeling capability through normalization.
The regularization objective TokenProp is incorporated into the standard training regime.
- Score: 62.44987486771936
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The architecture of transformers, which has recently witnessed booming
applications in vision tasks, pivots away from the widespread convolutional paradigm.
Relying on a tokenization process that splits inputs into multiple tokens,
transformers extract pairwise relationships among these tokens using
self-attention. Although tokenization is a fundamental building block of
transformers, what makes for a good tokenizer has not been well understood in
computer vision. In
this work, we investigate this uncharted problem from an information trade-off
perspective. In addition to unifying and understanding existing structural
modifications, our derivation leads to better design strategies for vision
tokenizers. The proposed Modulation across Tokens (MoTo) incorporates
inter-token modeling capability through normalization. Furthermore, a
regularization objective TokenProp is embraced in the standard training regime.
Through extensive experiments on various transformer architectures, we observe
both improved performance and intriguing properties of these two plug-and-play
designs with negligible computational overhead. These observations further
indicate the importance of the commonly-omitted designs of tokenizers in vision
transformer.
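The abstract describes MoTo (inter-token modulation through normalization) and TokenProp (a regularization objective added to standard training) only at a high level; the exact formulations are given in the paper. Below is a minimal PyTorch sketch of what such components could look like. The names MoToNorm, token_prop_loss, and lambda_reg, the choice of normalization statistics, and the similarity-based penalty are illustrative assumptions, not the paper's definitions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoToNorm(nn.Module):
    """Hypothetical cross-token modulation via normalization.

    Standard LayerNorm in ViTs normalizes each token independently over the
    channel dimension; here statistics are taken over tokens and channels
    jointly, so every token is modulated by the others. This sketches the
    idea stated in the abstract, not the paper's exact MoTo layer.
    """

    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(dim))
        self.beta = nn.Parameter(torch.zeros(dim))

    def forward(self, x):  # x: (batch, num_tokens, dim)
        # statistics over both tokens and channels -> inter-token coupling
        mean = x.mean(dim=(1, 2), keepdim=True)
        var = x.var(dim=(1, 2), unbiased=False, keepdim=True)
        x = (x - mean) / torch.sqrt(var + self.eps)
        return x * self.gamma + self.beta


def token_prop_loss(tokens):
    """Illustrative token-level regularizer (name and form are assumptions).

    Penalizes pairwise token similarity so that tokens stay diverse; the
    paper's TokenProp objective may be defined differently.
    """
    t = F.normalize(tokens, dim=-1)                  # (B, N, D)
    sim = t @ t.transpose(1, 2)                      # (B, N, N) cosine similarity
    diag = torch.diag_embed(torch.diagonal(sim, dim1=1, dim2=2))
    return (sim - diag).pow(2).mean()


# hypothetical usage inside a standard training step:
# loss = criterion(logits, labels) + lambda_reg * token_prop_loss(tokens)
```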
Related papers
- Token Transformation Matters: Towards Faithful Post-hoc Explanation for Vision Transformer [16.97186100288621]
Vision Transformers extract visual information by representing regions as transformed tokens and integrating them via attention weights.
Existing post-hoc explanation methods merely consider these attention weights, neglecting crucial information from the transformed tokens.
We propose TokenTM, a novel post-hoc explanation method that utilizes our introduced measurement of token transformation effects.
arXiv Detail & Related papers (2024-03-21T16:52:27Z)
- Mitigating Over-smoothing in Transformers via Regularized Nonlocal Functionals [31.328766460487355]
We show that self-attention layers in transformers minimize a functional which promotes smoothness, thereby causing token uniformity.
We propose a novel regularizer that penalizes the norm of the difference between the smooth output tokens from self-attention and the input tokens to preserve the fidelity of the tokens.
We empirically demonstrate the advantages of NeuTRENO over the baseline transformers and state-of-the-art methods in reducing the over-smoothing of token representations.
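The described regularizer penalizes the norm of the difference between the self-attention output tokens and the corresponding input tokens. A minimal PyTorch sketch of such a fidelity penalty, assuming a per-token L2 norm and a hypothetical weighting coefficient alpha (NeuTRENO's exact norm and weighting may differ):

```python
import torch

def fidelity_regularizer(attn_output: torch.Tensor, attn_input: torch.Tensor) -> torch.Tensor:
    """Penalize the deviation of self-attention outputs from their inputs.

    Both tensors have shape (batch, num_tokens, dim); this sketch uses the
    mean per-token L2 norm of the difference.
    """
    return (attn_output - attn_input).norm(dim=-1).mean()

# hypothetical usage, summed over self-attention layers:
# loss = task_loss + alpha * sum(fidelity_regularizer(out_l, in_l)
#                                for out_l, in_l in zip(attn_outputs, attn_inputs))
```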
arXiv Detail & Related papers (2023-12-01T17:52:47Z)
- Vision Transformer with Quadrangle Attention [76.35955924137986]
We propose a novel quadrangle attention (QA) method that extends the window-based attention to a general quadrangle formulation.
Our method employs an end-to-end learnable quadrangle regression module that predicts a transformation matrix to transform default windows into target quadrangles.
We integrate QA into plain and hierarchical vision transformers to create a new architecture named QFormer, which requires only minor code modifications and negligible extra computational cost.
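A minimal PyTorch sketch of the quadrangle-regression idea: a pooled per-window descriptor predicts an 8-parameter projective transform (initialized to the identity), the default square sampling grid is warped by it, and features are resampled with grid_sample. The module name, the projective parameterization, and the pooling interface are assumptions for illustration; QFormer's actual QA module is defined in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuadrangleRegressionSketch(nn.Module):
    """Turn a default square window into a learned quadrangle (illustrative)."""

    def __init__(self, dim, window_size):
        super().__init__()
        self.window_size = window_size
        self.head = nn.Linear(dim, 8)      # regress 8 projective parameters
        nn.init.zeros_(self.head.weight)
        nn.init.zeros_(self.head.bias)     # start from the identity transform

        # default sampling grid of the square window in [-1, 1] coordinates
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, window_size),
            torch.linspace(-1, 1, window_size),
            indexing="ij",
        )
        grid = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1)   # (w, w, 3)
        self.register_buffer("base_grid", grid.reshape(-1, 3))      # (w*w, 3)

    def forward(self, window_feats, window_descriptors):
        # window_feats:       (num_windows, dim, w, w) local feature patches
        # window_descriptors: (num_windows, dim)       pooled per-window features
        theta = self.head(window_descriptors)                        # (n, 8)
        mat = torch.cat([theta, theta.new_zeros(theta.size(0), 1)], dim=1)
        mat = mat.view(-1, 3, 3) + torch.eye(3, device=theta.device)

        pts = self.base_grid @ mat.transpose(1, 2)                   # (n, w*w, 3)
        pts = pts[..., :2] / (pts[..., 2:3] + 1e-6)                  # perspective divide
        grid = pts.view(-1, self.window_size, self.window_size, 2)

        # resample features inside the predicted quadrangle
        return F.grid_sample(window_feats, grid, align_corners=True)
```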
arXiv Detail & Related papers (2023-03-27T11:13:50Z)
- Holistically Explainable Vision Transformers [136.27303006772294]
We propose B-cos transformers, which inherently provide holistic explanations for their decisions.
Specifically, we formulate each model component - such as the multi-layer perceptrons, attention layers, and the tokenisation module - to be dynamic linear.
We apply our proposed design to Vision Transformers (ViTs) and show that the resulting models, dubbed Bcos-ViTs, are highly interpretable and perform competitively to baseline ViTs.
arXiv Detail & Related papers (2023-01-20T16:45:34Z)
- Visualizing and Understanding Patch Interactions in Vision Transformer [96.70401478061076]
Vision Transformer (ViT) has become a leading tool in various computer vision tasks.
We propose a novel explainable visualization approach to analyze and interpret the crucial attention interactions among patches in vision transformers.
arXiv Detail & Related papers (2022-03-11T13:48:11Z)
- Blending Anti-Aliasing into Vision Transformer [57.88274087198552]
The discontinuous patch-wise tokenization process implicitly introduces jagged artifacts into attention maps.
The aliasing effect occurs when discrete patterns are used to represent high-frequency or continuous information, resulting in indistinguishable distortions.
We propose a plug-and-play Aliasing-Reduction Module (ARM) to alleviate the aforementioned issue.
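The summary above does not spell out how ARM reduces aliasing. Purely as a hedged illustration of the general anti-aliasing principle (low-pass filtering discretely sampled features), the PyTorch sketch below blurs a token feature map with a fixed depthwise binomial kernel; this is a generic signal-processing stand-in, not the paper's ARM design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AliasingReductionSketch(nn.Module):
    """Depthwise low-pass (blur) filtering of a token feature map (illustrative)."""

    def __init__(self, dim):
        super().__init__()
        self.dim = dim
        # fixed 3x3 binomial blur kernel, one copy per channel
        k = torch.tensor([1.0, 2.0, 1.0])
        kernel = torch.outer(k, k)
        kernel = (kernel / kernel.sum()).expand(dim, 1, 3, 3).clone()
        self.register_buffer("kernel", kernel)

    def forward(self, x, h, w):
        # x: (batch, num_tokens, dim) with num_tokens == h * w
        b, n, d = x.shape
        feat = x.transpose(1, 2).reshape(b, d, h, w)
        feat = F.conv2d(feat, self.kernel, padding=1, groups=self.dim)
        return feat.reshape(b, d, n).transpose(1, 2)
```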
arXiv Detail & Related papers (2021-10-28T14:30:02Z)
- Glance-and-Gaze Vision Transformer [13.77016463781053]
We propose a new vision Transformer, named Glance-and-Gaze Transformer (GG-Transformer).
It is motivated by the Glance and Gaze behavior of human beings when recognizing objects in natural scenes.
We empirically demonstrate our method achieves consistently superior performance over previous state-of-the-art Transformers.
arXiv Detail & Related papers (2021-06-04T06:13:47Z)
- Transformers Solve the Limited Receptive Field for Monocular Depth Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper that applies transformers to pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.