Mitigating Over-smoothing in Transformers via Regularized Nonlocal
Functionals
- URL: http://arxiv.org/abs/2312.00751v1
- Date: Fri, 1 Dec 2023 17:52:47 GMT
- Title: Mitigating Over-smoothing in Transformers via Regularized Nonlocal
Functionals
- Authors: Tam Nguyen, Tan M. Nguyen, Richard G. Baraniuk
- Abstract summary: We show that self-attention layers in transformers minimize a functional which promotes smoothness, thereby causing token uniformity.
We propose a novel regularizer that penalizes the norm of the difference between the smooth output tokens from self-attention and the input tokens to preserve the fidelity of the tokens.
We empirically demonstrate the advantages of the resulting model, NeuTRENO, over baseline transformers and state-of-the-art methods in reducing the over-smoothing of token representations.
- Score: 31.328766460487355
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Transformers have achieved remarkable success in a wide range of natural
language processing and computer vision applications. However, the
representation capacity of a deep transformer model is degraded by the
over-smoothing issue, in which token representations become identical as
the model's depth grows. In this work, we show that self-attention layers in
transformers minimize a functional which promotes smoothness, thereby causing
token uniformity. We then propose a novel regularizer that penalizes the norm
of the difference between the smooth output tokens from self-attention and the
input tokens to preserve the fidelity of the tokens. Minimizing the resulting
regularized energy functional, we derive the Neural Transformer with a
Regularized Nonlocal Functional (NeuTRENO), a novel class of transformer models
that can mitigate the over-smoothing issue. We empirically demonstrate the
advantages of NeuTRENO over the baseline transformers and state-of-the-art
methods in reducing the over-smoothing of token representations on various
practical tasks, including object classification, image segmentation, and
language modeling.
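The abstract suggests a simple reading of the proposed update: take the smoothed tokens produced by self-attention and pull them back toward the layer's input tokens. The sketch below is a minimal PyTorch illustration of that idea, assuming the correction takes the form attn_out + lambda * (x - attn_out); the coefficient `lambda_fid` and the exact placement of the correction are illustrative assumptions, and the precise NeuTRENO formulation is given in the paper.

```python
# Minimal sketch (not the official NeuTRENO implementation): a self-attention
# block whose output is pulled back toward the input tokens by a fidelity term.
# The coefficient lambda_fid and the exact form of the correction are assumed
# for illustration only.
import torch
import torch.nn as nn


class FidelityRegularizedAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8, lambda_fid: float = 0.5):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.lambda_fid = lambda_fid  # strength of the fidelity term (assumed)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Standard self-attention produces the "smooth" output tokens.
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        # Fidelity correction: penalizing the difference between the smooth
        # output and the input tokens corresponds, roughly, to mixing the
        # input back in, which counteracts over-smoothing as depth grows.
        return attn_out + self.lambda_fid * (x - attn_out)


tokens = torch.randn(2, 16, 64)          # (batch, sequence, dim)
layer = FidelityRegularizedAttention(64)
print(layer(tokens).shape)               # torch.Size([2, 16, 64])
```

In a full transformer block this correction would sit alongside the usual residual connection, feed-forward sublayer, and normalization, which are omitted here for brevity.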
Related papers
- CageViT: Convolutional Activation Guided Efficient Vision Transformer [90.69578999760206]
This paper presents an efficient vision Transformer, called CageViT, that is guided by convolutional activation to reduce computation.
Our CageViT, unlike current Transformers, utilizes a new encoder to handle the rearranged tokens.
Experimental results demonstrate that the proposed CageViT outperforms the most recent state-of-the-art backbones by a large margin in terms of efficiency.
arXiv Detail & Related papers (2023-05-17T03:19:18Z) - What Makes for Good Tokenizers in Vision Transformer? [62.44987486771936]
Transformers are capable of extracting pairwise relationships among tokens using self-attention.
What makes for a good tokenizer, however, has not been well understood in computer vision.
Modulation across Tokens (MoTo) incorporates inter-token modeling capability through normalization.
A regularization objective, TokenProp, is adopted in the standard training regime.
arXiv Detail & Related papers (2022-12-21T15:51:43Z) - Addressing Token Uniformity in Transformers via Singular Value
Transformation [24.039280291845706]
Token uniformity is commonly observed in transformer-based models.
We show that a less skewed singular value distribution can alleviate the token uniformity problem (a rough spectral diagnostic of token uniformity is sketched after this list).
arXiv Detail & Related papers (2022-08-24T22:44:09Z) - Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
The Less attention vIsion Transformer (LIT) builds upon the fact that convolutions, fully-connected layers, and self-attention layers have almost equivalent mathematical expressions for processing image patch sequences.
The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
arXiv Detail & Related papers (2021-05-29T05:26:07Z) - Transformers Solve the Limited Receptive Field for Monocular Depth
Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper to apply transformers to pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z) - Incorporating Convolution Designs into Visual Transformers [24.562955955312187]
We propose a new Convolution-enhanced image Transformer (CeiT) which combines the advantages of CNNs in extracting low-level features and strengthening locality with the advantages of Transformers in establishing long-range dependencies.
Experimental results on ImageNet and seven downstream tasks show the effectiveness and generalization ability of CeiT compared with previous Transformers and state-of-the-art CNNs, without requiring a large amount of training data and extra CNN teachers.
arXiv Detail & Related papers (2021-03-22T13:16:12Z) - Applying the Transformer to Character-level Transduction [68.91664610425114]
The transformer has been shown to outperform recurrent neural network-based sequence-to-sequence models in various word-level NLP tasks.
We show that with a large enough batch size, the transformer does indeed outperform recurrent models for character-level tasks.
arXiv Detail & Related papers (2020-05-20T17:25:43Z) - Addressing Some Limitations of Transformers with Feedback Memory [51.94640029417114]
Transformers have been successfully applied to sequential, auto-regressive tasks despite being feedforward networks.
We propose the Feedback Transformer architecture that exposes all previous representations to all future representations.
We demonstrate on a variety of benchmarks in language modeling, machine translation, and reinforcement learning that the increased representation capacity can create small, shallow models with much stronger performance than comparable Transformers.
arXiv Detail & Related papers (2020-02-21T16:37:57Z)
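As a rough companion to the singular-value entry above ("Addressing Token Uniformity in Transformers via Singular Value Transformation"), the sketch below measures how skewed the singular value spectrum of a token matrix is: a spectrum dominated by its leading singular value indicates that token representations have nearly collapsed onto a single direction. The function name and the particular skew measure are illustrative choices, not the method proposed in that paper.

```python
# Illustrative diagnostic of token uniformity via the singular value spectrum.
# This is an assumed measure for demonstration, not the transformation method
# from the paper listed above.
import torch


def leading_spectral_fraction(tokens: torch.Tensor) -> torch.Tensor:
    """tokens: (sequence_length, dim) matrix of token representations."""
    s = torch.linalg.svdvals(tokens)
    # Fraction of spectral energy carried by the top singular value; values
    # near 1.0 indicate a highly skewed, near-rank-1 spectrum, i.e. tokens
    # that have nearly collapsed onto a single direction.
    return (s[0] ** 2) / (s ** 2).sum()


nearly_uniform = torch.randn(1, 64).repeat(32, 1) + 0.01 * torch.randn(32, 64)
diverse_tokens = torch.randn(32, 64)
print(leading_spectral_fraction(nearly_uniform))  # close to 1.0
print(leading_spectral_fraction(diverse_tokens))  # noticeably smaller
```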