Centered Self-Attention Layers
- URL: http://arxiv.org/abs/2306.01610v1
- Date: Fri, 2 Jun 2023 15:19:08 GMT
- Title: Centered Self-Attention Layers
- Authors: Ameen Ali and Tomer Galanti and Lior Wolf
- Abstract summary: The self-attention mechanism in transformers and the message-passing mechanism in graph neural networks are repeatedly applied.
We show that this application inevitably leads to oversmoothing, i.e., to similar representations at the deeper layers.
We present a correction term to the aggregating operator of these mechanisms.
- Score: 89.21791761168032
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The self-attention mechanism in transformers and the message-passing
mechanism in graph neural networks are repeatedly applied within deep learning
architectures. We show that this application inevitably leads to oversmoothing,
i.e., to similar representations at the deeper layers for different tokens in
transformers and different nodes in graph neural networks. Based on our
analysis, we present a correction term to the aggregating operator of these
mechanisms. Empirically, this simple term eliminates much of the oversmoothing
problem in visual transformers, obtaining performance in weakly supervised
segmentation that surpasses elaborate baseline methods that introduce multiple
auxiliary networks and training phases. In graph neural networks, the
correction term enables the training of very deep architectures more
effectively than many recent solutions to the same problem.
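The oversmoothing effect described in the abstract can be illustrated with a small sketch in plain Python. It applies one fixed softmax attention matrix repeatedly to random token features and tracks how far apart the tokens remain. The abstract does not state the form of the correction term, so the `center` function below is an assumption: it subtracts the uniform averaging matrix (1/n) 1 1^T so that every row sums to zero, which is one possible reading of "centered" in the title. All names (`softmax_rows`, `center`, `coord_range`) are hypothetical, and the sketch omits learned projections, residual connections, and normalization.

```python
import math
import random


def softmax_rows(scores):
    """Row-wise softmax: every row becomes strictly positive and sums to 1."""
    out = []
    for row in scores:
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def matmul(a, b):
    """Multiply an (n x m) matrix by an (m x d) matrix, both as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def center(attn):
    """ASSUMED correction (not taken from the abstract): subtract 1/n from
    every entry, so each row sums to 0 instead of 1."""
    n = len(attn)
    return [[a - 1.0 / n for a in row] for row in attn]


def coord_range(x):
    """Largest per-coordinate gap between any two tokens (0 means identical tokens)."""
    return max(max(col) - min(col) for col in zip(*x))


rng = random.Random(0)
n_tokens, dim, depth = 6, 4, 30

x0 = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
attn = softmax_rows([[rng.gauss(0, 1) for _ in range(n_tokens)]
                     for _ in range(n_tokens)])

# Plain aggregation: each layer replaces every token by a convex combination
# of all tokens, so the tokens are pulled toward one another.
x = x0
for _ in range(depth):
    x = matmul(attn, x)

spread_before = coord_range(x0)
spread_after = coord_range(x)
print("token spread before:", spread_before)
print("token spread after :", spread_after)

# Centered aggregation: the component shared by all tokens is removed,
# so identical tokens are mapped to (numerically) zero.
x_const = [[1.0, -2.0, 0.5, 3.0] for _ in range(n_tokens)]
centered_out = matmul(center(attn), x_const)
print("centered output on identical tokens:", centered_out[0])
```

Because every entry of a softmax matrix is positive, each application strictly shrinks the per-coordinate range of the token features, which is the collapse toward similar representations that the abstract calls oversmoothing. Under the assumed centering, the operator instead annihilates the shared component; whether this matches the paper's exact formulation should be checked against the paper itself.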
Related papers
- The Topos of Transformer Networks [0.6629765271909505]
We provide a theoretical analysis of the expressivity of the transformer architecture through the lens of topos theory.
We show that many common neural network architectures can be embedded in a pretopos of piecewise-linear functions, but that the transformer necessarily lives in its topos completion.
arXiv Detail & Related papers (2024-03-27T10:06:33Z)
- Graph Neural Networks for Learning Equivariant Representations of Neural Networks [55.04145324152541]
We propose to represent neural networks as computational graphs of parameters.
Our approach enables a single model to encode neural computational graphs with diverse architectures.
We showcase the effectiveness of our method on a wide range of tasks, including classification and editing of implicit neural representations.
arXiv Detail & Related papers (2024-03-18T18:01:01Z)
- Graph Metanetworks for Processing Diverse Neural Architectures [33.686728709734105]
Graph Metanetworks (GMNs) generalize to neural architectures where competing methods struggle.
We prove that GMNs are expressive and equivariant to parameter permutation symmetries that leave the input neural network functions unchanged.
arXiv Detail & Related papers (2023-12-07T18:21:52Z)
- Dynamics-aware Adversarial Attack of Adaptive Neural Networks [75.50214601278455]
We investigate the dynamics-aware adversarial attack problem of adaptive neural networks.
We propose a Leaded Gradient Method (LGM) and show the significant effects of the lagged gradient.
Our LGM achieves impressive adversarial attack performance compared with the dynamic-unaware attack methods.
arXiv Detail & Related papers (2022-10-15T01:32:08Z)
- Improving the Trainability of Deep Neural Networks through Layerwise Batch-Entropy Regularization [1.3999481573773072]
We introduce and evaluate the batch-entropy, which quantifies the flow of information through each layer of a neural network.
We show that we can train a "vanilla" fully connected network and convolutional neural network with 500 layers by simply adding the batch-entropy regularization term to the loss function.
arXiv Detail & Related papers (2022-08-01T20:31:58Z)
- Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
The Less attention vIsion Transformer (LIT) builds upon the fact that convolutions, fully-connected layers, and self-attentions have almost equivalent mathematical expressions for processing image patch sequences.
The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
arXiv Detail & Related papers (2021-05-29T05:26:07Z)
- An error-propagation spiking neural network compatible with neuromorphic processors [2.432141667343098]
We present a spike-based learning method that approximates back-propagation using local weight update mechanisms.
We introduce a network architecture that enables synaptic weight update mechanisms to back-propagate error signals.
This work represents a first step towards the design of ultra-low power mixed-signal neuromorphic processing systems.
arXiv Detail & Related papers (2021-04-12T07:21:08Z)
- Dynamic Hierarchical Mimicking Towards Consistent Optimization Objectives [73.15276998621582]
We propose a generic feature learning mechanism to advance CNN training with enhanced generalization ability.
Partially inspired by DSN, we fork delicately designed side branches from the intermediate layers of a given neural network.
Experiments on both category and instance recognition tasks demonstrate the substantial improvements of our proposed method.
arXiv Detail & Related papers (2020-03-24T09:56:13Z)
- Molecule Property Prediction and Classification with Graph Hypernetworks [113.38181979662288]
We show that the replacement of the underlying networks with hypernetworks leads to a boost in performance.
A major difficulty in the application of hypernetworks is their lack of stability.
A recent work has tackled the training instability of hypernetworks in the context of error correcting codes.
arXiv Detail & Related papers (2020-02-01T16:44:34Z)