Untangling tradeoffs between recurrence and self-attention in neural
networks
- URL: http://arxiv.org/abs/2006.09471v2
- Date: Thu, 10 Dec 2020 09:58:29 GMT
- Title: Untangling tradeoffs between recurrence and self-attention in neural
networks
- Authors: Giancarlo Kerg, Bhargav Kanuparthi, Anirudh Goyal, Kyle Goyette,
Yoshua Bengio, Guillaume Lajoie
- Abstract summary: We present a formal analysis of how self-attention affects gradient propagation in recurrent networks.
We prove that it mitigates the problem of vanishing gradients when trying to capture long-term dependencies.
We propose a relevancy screening mechanism that allows for a scalable use of sparse self-attention with recurrence.
- Score: 81.30894993852813
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Attention and self-attention mechanisms are now central to state-of-the-art
deep learning on sequential tasks. However, most recent progress hinges on
heuristic approaches with limited understanding of attention's role in model
optimization and computation, and relies on considerable memory and computational
resources that scale poorly. In this work, we present a formal analysis of how
self-attention affects gradient propagation in recurrent networks, and prove
that it mitigates the problem of vanishing gradients when trying to capture
long-term dependencies by establishing concrete bounds for gradient norms.
Building on these results, we propose a relevancy screening mechanism, inspired
by the cognitive process of memory consolidation, that allows for a scalable
use of sparse self-attention with recurrence. While providing guarantees to
avoid vanishing gradients, we use simple numerical experiments to demonstrate
the tradeoffs in performance and computational resources by efficiently
balancing attention and recurrence. Based on our results, we propose a concrete
direction of research to improve scalability of attentive networks.
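
For intuition, here is a minimal sketch of the kind of architecture the abstract describes: a recurrent cell that attends over a small, screened memory of past hidden states so that self-attention stays sparse while recurrence handles local structure. The class name, the linear relevance scorer, and the fixed memory budget are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only (not the paper's code): GRU recurrence plus dot-product
# attention over a screened set of past hidden states. The relevance scorer and
# top-k screening rule stand in for the paper's relevancy screening mechanism.
import torch
import torch.nn as nn

class SparseAttentiveRNN(nn.Module):
    def __init__(self, input_size, hidden_size, mem_size=8):
        super().__init__()
        self.cell = nn.GRUCell(input_size, hidden_size)
        self.relevance = nn.Linear(hidden_size, 1)  # assumed screening score
        self.mem_size = mem_size                    # assumed fixed memory budget
        self.hidden_size = hidden_size

    def forward(self, x):                           # x: (batch, time, input_size)
        batch, steps, _ = x.shape
        h = x.new_zeros(batch, self.hidden_size)
        memory, outputs = [], []
        for t in range(steps):
            if memory:
                mem = torch.stack(memory, dim=1)    # (batch, m, hidden)
                scores = (mem @ h.unsqueeze(-1)).squeeze(-1) / self.hidden_size ** 0.5
                attn = torch.softmax(scores, dim=1)
                context = (attn.unsqueeze(-1) * mem).sum(dim=1)
            else:
                context = torch.zeros_like(h)
            h = self.cell(x[:, t], h + context)     # recurrence + attended context
            outputs.append(h)
            memory.append(h)
            if len(memory) > self.mem_size:         # screening: keep most relevant states
                rel = self.relevance(torch.stack(memory, dim=1)).squeeze(-1).mean(dim=0)
                keep = rel.topk(self.mem_size).indices.tolist()
                memory = [memory[i] for i in sorted(keep)]
        return torch.stack(outputs, dim=1)

# Example: SparseAttentiveRNN(16, 32)(torch.randn(4, 50, 16)) -> (4, 50, 32)
```

Keeping the attention memory at a fixed size is what bounds the per-step cost; the screening rule here (mean learned relevance) is only one possible choice.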
Related papers
- On the Markov Property of Neural Algorithmic Reasoning: Analyses and Methods [94.72563337153268]
We present ForgetNet, which does not use historical embeddings and thus is consistent with the Markov nature of the tasks.
We also introduce G-ForgetNet, which uses a gating mechanism to allow for the selective integration of historical embeddings.
Our experiments, based on the CLRS-30 algorithmic reasoning benchmark, demonstrate that both ForgetNet and G-ForgetNet achieve better generalization capability than existing methods.
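
For intuition only, a small, generic gating sketch in the spirit of the second point above; the class name, shapes, and sigmoid gate are assumptions rather than the G-ForgetNet architecture.

```python
# Generic gating sketch: a learned gate decides how much of a stored historical
# embedding to mix back into the current one (g near 0 keeps the update Markovian).
import torch
import torch.nn as nn

class HistoryGate(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, current, historical):        # both: (batch, dim)
        g = torch.sigmoid(self.gate(torch.cat([current, historical], dim=-1)))
        return g * historical + (1.0 - g) * current
```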
arXiv Detail & Related papers (2024-03-07T22:35:22Z)
- Easy attention: A simple attention mechanism for temporal predictions with transformers [2.172584429650463]
We show that the keys, queries and softmax are not necessary for obtaining the attention score required to capture long-term dependencies in temporal sequences.
Our proposed easy-attention method directly treats the attention scores as learnable parameters.
This approach produces excellent results when reconstructing and predicting the temporal dynamics of chaotic systems.
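
A minimal sketch of that idea under assumed shapes (not the paper's implementation; whether values are still projected from the input is also an assumption):

```python
# "Easy attention" sketch: the attention matrix itself is a learnable parameter,
# so no keys, queries, or softmax are computed from the input.
import torch
import torch.nn as nn

class EasyAttention(nn.Module):
    def __init__(self, seq_len, dim):
        super().__init__()
        self.scores = nn.Parameter(torch.randn(seq_len, seq_len) / seq_len ** 0.5)
        self.value = nn.Linear(dim, dim)            # assumed value projection

    def forward(self, x):                           # x: (batch, seq_len, dim)
        return self.scores @ self.value(x)          # fixed, learned mixing over time
```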
arXiv Detail & Related papers (2023-08-24T15:54:32Z)
- Learning Dynamics and Generalization in Reinforcement Learning [59.530058000689884]
We show theoretically that temporal difference learning encourages agents to fit non-smooth components of the value function early in training.
We show that neural networks trained using temporal difference algorithms on dense reward tasks exhibit weaker generalization between states than randomly initialized networks and networks trained with policy gradient methods.
arXiv Detail & Related papers (2022-06-05T08:49:16Z)
- Continual Attentive Fusion for Incremental Learning in Semantic Segmentation [43.98082955427662]
Deep architectures trained with gradient-based techniques suffer from catastrophic forgetting.
We introduce a novel attentive feature distillation approach to mitigate catastrophic forgetting.
We also introduce a novel strategy to account for the background class in the distillation loss, thus preventing biased predictions.
arXiv Detail & Related papers (2022-02-01T14:38:53Z)
- Reducing Catastrophic Forgetting in Self Organizing Maps with Internally-Induced Generative Replay [67.50637511633212]
A lifelong learning agent is able to continually learn from potentially infinite streams of pattern sensory data.
One major historic difficulty in building agents that adapt is that neural systems struggle to retain previously-acquired knowledge when learning from new samples.
This problem is known as catastrophic forgetting (interference) and remains an unsolved problem in the domain of machine learning to this day.
arXiv Detail & Related papers (2021-12-09T07:11:14Z)
- Bayesian Attention Belief Networks [59.183311769616466]
Attention-based neural networks have achieved state-of-the-art results on a wide range of tasks.
This paper introduces Bayesian attention belief networks, which construct a decoder network by modeling unnormalized attention weights.
We show that our method outperforms deterministic attention and state-of-the-art stochastic attention in accuracy, uncertainty estimation, generalization across domains, and robustness to adversarial attacks.
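
As a generic illustration of attention with stochastic unnormalized weights (the paper's distributional choices and belief-network construction differ; the log-normal noise below is only an assumption for the sketch):

```python
# Generic stochastic-attention sketch: sample positive, unnormalized attention weights
# with reparameterized noise, then normalize; repeated forward passes give uncertainty.
import torch
import torch.nn as nn

class StochasticAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.q, self.k = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.log_sigma = nn.Parameter(torch.zeros(1))

    def forward(self, x):                           # x: (batch, seq, dim)
        logits = self.q(x) @ self.k(x).transpose(1, 2) / x.shape[-1] ** 0.5
        noise = self.log_sigma.exp() * torch.randn_like(logits)
        weights = torch.exp(logits + noise)         # unnormalized, strictly positive
        attn = weights / weights.sum(dim=-1, keepdim=True)
        return attn @ x
```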
arXiv Detail & Related papers (2021-06-09T17:46:22Z)
- Schematic Memory Persistence and Transience for Efficient and Robust Continual Learning [8.030924531643532]
Continual learning is considered a promising step towards next-generation Artificial Intelligence (AI).
It is still quite primitive, with existing works focusing primarily on avoiding (catastrophic) forgetting.
We propose a novel framework for continual learning with external memory that builds on recent advances in neuroscience.
arXiv Detail & Related papers (2021-05-05T14:32:47Z)
- Gradient Starvation: A Learning Proclivity in Neural Networks [97.02382916372594]
Gradient Starvation arises when cross-entropy loss is minimized by capturing only a subset of features relevant for the task.
This work provides a theoretical explanation for the emergence of such feature imbalance in neural networks.
arXiv Detail & Related papers (2020-11-18T18:52:08Z)
- Exploiting the Full Capacity of Deep Neural Networks while Avoiding Overfitting by Targeted Sparsity Regularization [1.3764085113103217]
Overfitting is one of the most common problems when training deep neural networks on comparatively small datasets.
We propose novel targeted sparsity visualization and regularization strategies to counteract overfitting.
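
A generic sketch of activation-sparsity regularization (the layer choice and penalty weight are assumptions; the paper's targeting strategy is more specific):

```python
# Add an L1 penalty on the activations of chosen ("targeted") layers to the task loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

def sparsity_penalty(activations, weight=1e-4):
    return weight * sum(a.abs().mean() for a in activations)

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))
hidden = model[1](model[0](x))                      # activations of the targeted layer
loss = F.cross_entropy(model[2](hidden), y) + sparsity_penalty([hidden])
loss.backward()
```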
arXiv Detail & Related papers (2020-02-21T11:38:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.