Untangling tradeoffs between recurrence and self-attention in neural
networks
- URL: http://arxiv.org/abs/2006.09471v2
- Date: Thu, 10 Dec 2020 09:58:29 GMT
- Title: Untangling tradeoffs between recurrence and self-attention in neural
networks
- Authors: Giancarlo Kerg, Bhargav Kanuparthi, Anirudh Goyal, Kyle Goyette,
Yoshua Bengio, Guillaume Lajoie
- Abstract summary: We present a formal analysis of how self-attention affects gradient propagation in recurrent networks.
We prove that it mitigates the problem of vanishing gradients when trying to capture long-term dependencies.
We propose a relevancy screening mechanism that allows for a scalable use of sparse self-attention with recurrence.
- Score: 81.30894993852813
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Attention and self-attention mechanisms are now central to state-of-the-art
deep learning on sequential tasks. However, most recent progress hinges on
heuristic approaches with limited understanding of attention's role in model
optimization and computation, and relies on considerable memory and computational
resources that scale poorly. In this work, we present a formal analysis of how
self-attention affects gradient propagation in recurrent networks, and prove
that it mitigates the problem of vanishing gradients when trying to capture
long-term dependencies by establishing concrete bounds for gradient norms.
Building on these results, we propose a relevancy screening mechanism, inspired
by the cognitive process of memory consolidation, that allows for a scalable
use of sparse self-attention with recurrence. While providing guarantees to
avoid vanishing gradients, we use simple numerical experiments to demonstrate
the tradeoffs in performance and computational resources by efficiently
balancing attention and recurrence. Based on our results, we propose a concrete
direction of research to improve scalability of attentive networks.
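
For intuition, here is a minimal sketch of the kind of architecture the abstract describes: a recurrent cell that attends over a small, screened memory of past hidden states so that self-attention stays sparse while recurrence handles local structure. The class name, the linear relevance scorer, and the fixed memory budget are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only (not the paper's code): GRU recurrence plus dot-product
# attention over a screened set of past hidden states. The relevance scorer and
# top-k screening rule stand in for the paper's relevancy screening mechanism.
import torch
import torch.nn as nn

class SparseAttentiveRNN(nn.Module):
    def __init__(self, input_size, hidden_size, mem_size=8):
        super().__init__()
        self.cell = nn.GRUCell(input_size, hidden_size)
        self.relevance = nn.Linear(hidden_size, 1)  # assumed screening score
        self.mem_size = mem_size                    # assumed fixed memory budget
        self.hidden_size = hidden_size

    def forward(self, x):                           # x: (batch, time, input_size)
        batch, steps, _ = x.shape
        h = x.new_zeros(batch, self.hidden_size)
        memory, outputs = [], []
        for t in range(steps):
            if memory:
                mem = torch.stack(memory, dim=1)    # (batch, m, hidden)
                scores = (mem @ h.unsqueeze(-1)).squeeze(-1) / self.hidden_size ** 0.5
                attn = torch.softmax(scores, dim=1)
                context = (attn.unsqueeze(-1) * mem).sum(dim=1)
            else:
                context = torch.zeros_like(h)
            h = self.cell(x[:, t], h + context)     # recurrence + attended context
            outputs.append(h)
            memory.append(h)
            if len(memory) > self.mem_size:         # screening: keep most relevant states
                rel = self.relevance(torch.stack(memory, dim=1)).squeeze(-1).mean(dim=0)
                keep = rel.topk(self.mem_size).indices.tolist()
                memory = [memory[i] for i in sorted(keep)]
        return torch.stack(outputs, dim=1)

# Example: SparseAttentiveRNN(16, 32)(torch.randn(4, 50, 16)) -> (4, 50, 32)
```

Keeping the attention memory at a fixed size is what bounds the per-step cost; the screening rule here (mean learned relevance) is only one possible choice.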
Related papers
- On the Markov Property of Neural Algorithmic Reasoning: Analyses and Methods [94.72563337153268]
We present ForgetNet, which does not use historical embeddings and thus is consistent with the Markov nature of the tasks.
We also introduce G-ForgetNet, which uses a gating mechanism to allow for the selective integration of historical embeddings.
Our experiments, based on the CLRS-30 algorithmic reasoning benchmark, demonstrate that both ForgetNet and G-ForgetNet achieve better generalization capability than existing methods.
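
For intuition only, a small, generic gating sketch in the spirit of the second point above; the class name, shapes, and sigmoid gate are assumptions rather than the G-ForgetNet architecture.

```python
# Generic gating sketch: a learned gate decides how much of a stored historical
# embedding to mix back into the current one (g near 0 keeps the update Markovian).
import torch
import torch.nn as nn

class HistoryGate(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, current, historical):        # both: (batch, dim)
        g = torch.sigmoid(self.gate(torch.cat([current, historical], dim=-1)))
        return g * historical + (1.0 - g) * current
```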
arXiv Detail & Related papers (2024-03-07T22:35:22Z)
- Easy attention: A simple attention mechanism for temporal predictions with transformers [2.172584429650463]
We show that the keys, queries and softmax are not necessary for obtaining the attention score required to capture long-term dependencies in temporal sequences.
Our proposed easy-attention method directly treats the attention scores as learnable parameters.
This approach produces excellent results when reconstructing and predicting the temporal dynamics of chaotic systems.
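
A minimal sketch of that idea under assumed shapes (not the paper's implementation; whether values are still projected from the input is also an assumption):

```python
# "Easy attention" sketch: the attention matrix itself is a learnable parameter,
# so no keys, queries, or softmax are computed from the input.
import torch
import torch.nn as nn

class EasyAttention(nn.Module):
    def __init__(self, seq_len, dim):
        super().__init__()
        self.scores = nn.Parameter(torch.randn(seq_len, seq_len) / seq_len ** 0.5)
        self.value = nn.Linear(dim, dim)            # assumed value projection

    def forward(self, x):                           # x: (batch, seq_len, dim)
        return self.scores @ self.value(x)          # fixed, learned mixing over time
```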
arXiv Detail & Related papers (2023-08-24T15:54:32Z)
- Learning Dynamics and Generalization in Reinforcement Learning [59.530058000689884]
We show theoretically that temporal difference learning encourages agents to fit non-smooth components of the value function early in training.
We show that neural networks trained using temporal difference algorithms on dense reward tasks exhibit weaker generalization between states than randomly initialized networks and networks trained with policy gradient methods.
arXiv Detail & Related papers (2022-06-05T08:49:16Z)
- Continual Attentive Fusion for Incremental Learning in Semantic Segmentation [43.98082955427662]
Deep architectures trained with gradient-based techniques suffer from catastrophic forgetting.
We introduce a novel attentive feature distillation approach to mitigate catastrophic forgetting.
We also introduce a novel strategy to account for the background class in the distillation loss, thus preventing biased predictions.
arXiv Detail & Related papers (2022-02-01T14:38:53Z)
- Reducing Catastrophic Forgetting in Self Organizing Maps with Internally-Induced Generative Replay [67.50637511633212]
A lifelong learning agent is able to continually learn from potentially infinite streams of pattern sensory data.
One major historic difficulty in building agents that adapt is that neural systems struggle to retain previously-acquired knowledge when learning from new samples.
This problem is known as catastrophic forgetting (interference) and remains an unsolved problem in the domain of machine learning to this day.
arXiv Detail & Related papers (2021-12-09T07:11:14Z)
- Bayesian Attention Belief Networks [59.183311769616466]
Attention-based neural networks have achieved state-of-the-art results on a wide range of tasks.
This paper introduces Bayesian attention belief networks, which construct a decoder network by modeling unnormalized attention weights.
We show that our method outperforms deterministic attention and state-of-the-art stochastic attention in accuracy, uncertainty estimation, generalization across domains, and robustness to adversarial attacks.
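
As a generic illustration of attention with stochastic unnormalized weights (the paper's distributional choices and belief-network construction differ; the log-normal noise below is only an assumption for the sketch):

```python
# Generic stochastic-attention sketch: sample positive, unnormalized attention weights
# with reparameterized noise, then normalize; repeated forward passes give uncertainty.
import torch
import torch.nn as nn

class StochasticAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.q, self.k = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.log_sigma = nn.Parameter(torch.zeros(1))

    def forward(self, x):                           # x: (batch, seq, dim)
        logits = self.q(x) @ self.k(x).transpose(1, 2) / x.shape[-1] ** 0.5
        noise = self.log_sigma.exp() * torch.randn_like(logits)
        weights = torch.exp(logits + noise)         # unnormalized, strictly positive
        attn = weights / weights.sum(dim=-1, keepdim=True)
        return attn @ x
```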
arXiv Detail & Related papers (2021-06-09T17:46:22Z)
- Schematic Memory Persistence and Transience for Efficient and Robust Continual Learning [8.030924531643532]
Continual learning is considered a promising step towards next-generation Artificial Intelligence (AI).
It is still quite primitive, with existing works focusing primarily on avoiding (catastrophic) forgetting.
We propose a novel framework for continual learning with external memory that builds on recent advances in neuroscience.
arXiv Detail & Related papers (2021-05-05T14:32:47Z)
- Gradient Starvation: A Learning Proclivity in Neural Networks [97.02382916372594]
Gradient Starvation arises when cross-entropy loss is minimized by capturing only a subset of features relevant for the task.
This work provides a theoretical explanation for the emergence of such feature imbalance in neural networks.
arXiv Detail & Related papers (2020-11-18T18:52:08Z)
- Exploiting the Full Capacity of Deep Neural Networks while Avoiding Overfitting by Targeted Sparsity Regularization [1.3764085113103217]
Overfitting is one of the most common problems when training deep neural networks on comparatively small datasets.
We propose novel targeted sparsity visualization and regularization strategies to counteract overfitting.
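
A generic sketch of activation-sparsity regularization (the layer choice and penalty weight are assumptions; the paper's targeting strategy is more specific):

```python
# Add an L1 penalty on the activations of chosen ("targeted") layers to the task loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

def sparsity_penalty(activations, weight=1e-4):
    return weight * sum(a.abs().mean() for a in activations)

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))
hidden = model[1](model[0](x))                      # activations of the targeted layer
loss = F.cross_entropy(model[2](hidden), y) + sparsity_penalty([hidden])
loss.backward()
```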
arXiv Detail & Related papers (2020-02-21T11:38:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.