The Dual Form of Neural Networks Revisited: Connecting Test Time
Predictions to Training Patterns via Spotlights of Attention
- URL: http://arxiv.org/abs/2202.05798v1
- Date: Fri, 11 Feb 2022 17:49:22 GMT
- Title: The Dual Form of Neural Networks Revisited: Connecting Test Time
Predictions to Training Patterns via Spotlights of Attention
- Authors: Kazuki Irie, Róbert Csordás, Jürgen Schmidhuber
- Abstract summary: Linear layers in neural networks (NNs) trained by gradient descent can be expressed as a key-value memory system.
No prior work has effectively studied the operations of NNs in such a form.
We conduct experiments on small scale supervised image classification tasks in single-task, multi-task, and continual learning settings.
- Score: 8.131130865777344
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Linear layers in neural networks (NNs) trained by gradient descent can be
expressed as a key-value memory system which stores all training datapoints and
the initial weights, and produces outputs using unnormalised dot attention over
the entire training experience. While this has been technically known since the
'60s, no prior work has effectively studied the operations of NNs in such a
form, presumably due to prohibitive time and space complexities and impractical
model sizes, all of them growing linearly with the number of training patterns
which may get very large. However, this dual formulation offers a possibility
of directly visualizing how an NN makes use of training patterns at test time,
by examining the corresponding attention weights. We conduct experiments on
small scale supervised image classification tasks in single-task, multi-task,
and continual learning settings, as well as language modelling, and discuss
potentials and limits of this view for better understanding and interpreting
how NNs exploit training patterns. Our code is public.
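To make the dual form concrete: a linear layer trained by gradient descent obeys the update W <- W + e_t x_t^T, where x_t is a training input and e_t is its learning-rate-scaled error signal, so the trained layer can be written as W = W_0 + Σ_t e_t x_t^T, and a test-time output becomes W x = W_0 x + Σ_t e_t (x_t · x), i.e. unnormalised dot attention over all stored training patterns. The NumPy sketch below checks this equivalence on a toy online-regression run; the dimensions, learning rate, and squared-error loss are illustrative assumptions, not the paper's setup or its released code.

```python
# Minimal sketch: a gradient-descent-trained linear layer (primal form) gives the
# same test-time output as initial weights plus unnormalised dot attention over
# the stored training patterns (dual form). Toy sizes and loss are assumptions.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_steps, lr = 8, 4, 100, 0.05

W0 = rng.normal(size=(d_out, d_in))   # initial weights
W = W0.copy()                         # "primal" weight matrix, updated in place
keys, values = [], []                 # "dual" storage: training inputs and error signals

for _ in range(n_steps):
    x = rng.normal(size=d_in)         # training pattern
    t = rng.normal(size=d_out)        # toy regression target
    e = -lr * (W @ x - t)             # error signal of a squared-error loss
    W += np.outer(e, x)               # primal update: W <- W + e x^T
    keys.append(x)                    # dual form stores the pattern ("key") ...
    values.append(e)                  # ... and the error signal ("value")

# Test-time prediction in both forms.
x_test = rng.normal(size=d_in)
primal = W @ x_test
attention = np.array(keys) @ x_test                   # unnormalised dot attention over training patterns
dual = W0 @ x_test + np.array(values).T @ attention   # W0 x + sum_t e_t (x_t . x)

assert np.allclose(primal, dual)                      # the two forms agree
print("training pattern with largest |attention|:", int(np.argmax(np.abs(attention))))
```

Inspecting the attention weights x_t · x alongside the stored error signals is exactly the kind of per-test-point visualisation of training-pattern usage that the abstract describes.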
Related papers
- Towards Scalable and Versatile Weight Space Learning [51.78426981947659]
This paper introduces the SANE approach to weight-space learning.
Our method extends the idea of hyper-representations towards sequential processing of subsets of neural network weights.
arXiv Detail & Related papers (2024-06-14T13:12:07Z)
- Provable Multi-Task Representation Learning by Two-Layer ReLU Neural Networks [69.38572074372392]
We present the first results proving that feature learning occurs during training with a nonlinear model on multiple tasks.
Our key insight is that multi-task pretraining induces a pseudo-contrastive loss that favors representations that align points that typically have the same label across tasks.
arXiv Detail & Related papers (2023-07-13T16:39:08Z)
- IF2Net: Innately Forgetting-Free Networks for Continual Learning [49.57495829364827]
Continual learning aims to incrementally absorb new concepts without interfering with previously learned knowledge.
Motivated by the characteristics of neural networks, we investigated how to design an Innately Forgetting-Free Network (IF2Net)
IF2Net allows a single network to inherently learn unlimited mapping rules without telling task identities at test time.
arXiv Detail & Related papers (2023-06-18T05:26:49Z)
- How neural networks learn to classify chaotic time series [77.34726150561087]
We study the inner workings of neural networks trained to classify regular-versus-chaotic time series.
We find that the relation between input periodicity and activation periodicity is key for the performance of LKCNN models.
arXiv Detail & Related papers (2023-06-04T08:53:27Z)
- Continual Learning with Invertible Generative Models [15.705568893476947]
Catastrophic forgetting (CF) happens whenever a neural network overwrites past knowledge while being trained on new tasks.
We propose a novel method that combines the strengths of regularization and generative-based rehearsal approaches.
arXiv Detail & Related papers (2022-02-11T15:28:30Z)
- Rethinking Nearest Neighbors for Visual Classification [56.00783095670361]
k-NN is a lazy learning method that aggregates the distances between a test image and its top-k neighbors in the training set.
We adopt k-NN with pre-trained visual representations produced by either supervised or self-supervised methods in two steps; a minimal sketch of this idea appears after this list.
Via extensive experiments on a wide range of classification tasks, our study reveals the generality and flexibility of k-NN integration.
arXiv Detail & Related papers (2021-12-15T20:15:01Z)
- Training Deep Neural Networks with Joint Quantization and Pruning of Weights and Activations [5.17729871332369]
State-of-the-art quantization techniques are currently applied to both the weights and activations of deep neural networks.
In this work, we jointly apply novel uniform quantization and unstructured pruning methods to both the weights and activations of deep neural networks during training.
arXiv Detail & Related papers (2021-10-15T16:14:36Z)
- Learning Neural Network Subspaces [74.44457651546728]
Recent observations have advanced our understanding of the neural network optimization landscape.
With a similar computational cost as training one model, we learn lines, curves, and simplexes of high-accuracy neural networks.
arXiv Detail & Related papers (2021-02-20T23:26:58Z)
- The Surprising Simplicity of the Early-Time Learning Dynamics of Neural Networks [43.860358308049044]
In this work, we show that these common perceptions can be completely false in the early phase of learning.
We argue that this surprising simplicity can persist in networks with more layers and with convolutional architectures.
arXiv Detail & Related papers (2020-06-25T17:42:49Z)
- On the interplay between physical and content priors in deep learning for computational imaging [5.486833154281385]
We use the Phase Extraction Neural Network (PhENN) for quantitative phase retrieval in a lensless phase imaging system.
We show that the two questions are related and share a common crux: the choice of the training examples.
We also discover that a weaker regularization effect leads to better learning of the underlying propagation model.
arXiv Detail & Related papers (2020-04-14T08:36:46Z)
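As referenced in the "Rethinking Nearest Neighbors for Visual Classification" entry above, classification with k-NN over pre-trained features can be sketched in a few lines. The cosine-similarity vote, the toy feature arrays, and k=5 below are illustrative assumptions, not that paper's actual two-step procedure or configuration; feature extraction by a supervised or self-supervised backbone is assumed to have happened already.

```python
# Minimal sketch: k-NN classification over pre-computed (pre-trained) features.
import numpy as np

def knn_predict(train_feats, train_labels, test_feats, k=5):
    """Classify each test feature by a cosine-similarity-weighted vote
    over its k nearest training features."""
    # L2-normalise so that dot products are cosine similarities.
    train = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    test = test_feats / np.linalg.norm(test_feats, axis=1, keepdims=True)
    sims = test @ train.T                            # (n_test, n_train) similarities
    topk = np.argsort(-sims, axis=1)[:, :k]          # indices of the k nearest neighbours
    n_classes = int(train_labels.max()) + 1
    preds = np.empty(len(test_feats), dtype=int)
    for i, neighbours in enumerate(topk):
        # Weighted vote: each neighbour contributes its similarity to its class.
        votes = np.bincount(train_labels[neighbours],
                            weights=sims[i, neighbours],
                            minlength=n_classes)
        preds[i] = int(np.argmax(votes))
    return preds

# Toy usage with random "features" standing in for backbone outputs.
rng = np.random.default_rng(0)
train_feats = rng.normal(size=(200, 64))
train_labels = rng.integers(0, 10, size=200)
test_feats = rng.normal(size=(8, 64))
print(knn_predict(train_feats, train_labels, test_feats, k=5))
```

Because the backbone stays fixed, the only "training" is storing labelled features, which mirrors the lazy-learning description in that entry and echoes the main paper's view of predictions as attention over stored training patterns.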
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.