Related papers: Centered Self-Attention Layers

Centered Self-Attention Layers

URL: http://arxiv.org/abs/2306.01610v1
Date: Fri, 2 Jun 2023 15:19:08 GMT
Title: Centered Self-Attention Layers
Authors: Ameen Ali and Tomer Galanti and Lior Wolf
Abstract summary: The self-attention mechanism in transformers and the message-passing mechanism in graph neural networks are repeatedly applied. We show that this application inevitably leads to oversmoothing, i.e., to similar representations at the deeper layers. We present a correction term to the aggregating operator of these mechanisms.
Score: 89.21791761168032
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The self-attention mechanism in transformers and the message-passing mechanism in graph neural networks are repeatedly applied within deep learning architectures. We show that this application inevitably leads to oversmoothing, i.e., to similar representations at the deeper layers for different tokens in transformers and different nodes in graph neural networks. Based on our analysis, we present a correction term to the aggregating operator of these mechanisms. Empirically, this simple term eliminates much of the oversmoothing problem in visual transformers, obtaining performance in weakly supervised segmentation that surpasses elaborate baseline methods that introduce multiple auxiliary networks and training phrases. In graph neural networks, the correction term enables the training of very deep architectures more effectively than many recent solutions to the same problem.

Related papers

Auto-Compressing Networks [51.221103189527014]
We introduce Auto-compression Networks (ACNs), an architectural variant where long feedforward connections from each layer replace traditional short residual connections.<n>We show that ACNs exhibit enhanced noise compared to residual networks, superior performance in low-data settings, and mitigate catastrophic forgetting.<n>These findings establish ACNs as a practical approach to developing efficient neural architectures.
arXiv Detail & Related papers (2025-06-11T13:26:09Z)
The Topos of Transformer Networks [0.6629765271909505]
We provide a theoretical analysis of the expressivity of the transformer architecture through the lens of topos theory. We show that many common neural network architectures can be embedded in a pretopos of piecewise-linear functions, but that the transformer necessarily lives in its topos completion.
arXiv Detail & Related papers (2024-03-27T10:06:33Z)
Graph Neural Networks for Learning Equivariant Representations of Neural Networks [55.04145324152541]
We propose to represent neural networks as computational graphs of parameters. Our approach enables a single model to encode neural computational graphs with diverse architectures. We showcase the effectiveness of our method on a wide range of tasks, including classification and editing of implicit neural representations.
arXiv Detail & Related papers (2024-03-18T18:01:01Z)
An end-to-end attention-based approach for learning on graphs [8.552020965470113]
transformer-based architectures for learning on graphs are motivated by attention as an effective learning mechanism. We propose a purely attention-based approach consisting of an encoder and an attention pooling mechanism. Despite its simplicity, the approach outperforms fine-tuned message passing baselines and recently proposed transformer-based methods on more than 70 node and graph-level tasks.
arXiv Detail & Related papers (2024-02-16T16:20:11Z)
Graph Metanetworks for Processing Diverse Neural Architectures [33.686728709734105]
Graph Metanetworks (GMNs) generalizes to neural architectures where competing methods struggle. We prove that GMNs are expressive and equivariant to parameter permutation symmetries that leave the input neural network functions.
arXiv Detail & Related papers (2023-12-07T18:21:52Z)
Dynamics-aware Adversarial Attack of Adaptive Neural Networks [75.50214601278455]
We investigate the dynamics-aware adversarial attack problem of adaptive neural networks. We propose a Leaded Gradient Method (LGM) and show the significant effects of the lagged gradient. Our LGM achieves impressive adversarial attack performance compared with the dynamic-unaware attack methods.
arXiv Detail & Related papers (2022-10-15T01:32:08Z)
Improving the Trainability of Deep Neural Networks through Layerwise Batch-Entropy Regularization [1.3999481573773072]
We introduce and evaluate the batch-entropy which quantifies the flow of information through each layer of a neural network. We show that we can train a "vanilla" fully connected network and convolutional neural network with 500 layers by simply adding the batch-entropy regularization term to the loss function.
arXiv Detail & Related papers (2022-08-01T20:31:58Z)
Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
Less attention vIsion Transformer builds upon the fact that convolutions, fully-connected layers, and self-attentions have almost equivalent mathematical expressions for processing image patch sequences. The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
arXiv Detail & Related papers (2021-05-29T05:26:07Z)
An error-propagation spiking neural network compatible with neuromorphic processors [2.432141667343098]
We present a spike-based learning method that approximates back-propagation using local weight update mechanisms. We introduce a network architecture that enables synaptic weight update mechanisms to back-propagate error signals. This work represents a first step towards the design of ultra-low power mixed-signal neuromorphic processing systems.
arXiv Detail & Related papers (2021-04-12T07:21:08Z)
Dynamic Hierarchical Mimicking Towards Consistent Optimization Objectives [73.15276998621582]
We propose a generic feature learning mechanism to advance CNN training with enhanced generalization ability. Partially inspired by DSN, we fork delicately designed side branches from the intermediate layers of a given neural network. Experiments on both category and instance recognition tasks demonstrate the substantial improvements of our proposed method.
arXiv Detail & Related papers (2020-03-24T09:56:13Z)
Molecule Property Prediction and Classification with Graph Hypernetworks [113.38181979662288]
We show that the replacement of the underlying networks with hypernetworks leads to a boost in performance. A major difficulty in the application of hypernetworks is their lack of stability. A recent work has tackled the training instability of hypernetworks in the context of error correcting codes.
arXiv Detail & Related papers (2020-02-01T16:44:34Z)

This list is automatically generated from the titles and abstracts of the papers in this site.