Neural networks with late-phase weights
- URL: http://arxiv.org/abs/2007.12927v4
- Date: Mon, 11 Apr 2022 13:55:43 GMT
- Title: Neural networks with late-phase weights
- Authors: Johannes von Oswald, Seijin Kobayashi, Alexander Meulemans, Christian Henning, Benjamin F. Grewe, João Sacramento
- Abstract summary: We show that the solutions found by SGD can be further improved by ensembling a subset of the weights in late stages of learning.
At the end of learning, we obtain back a single model by taking a spatial average in weight space.
- Score: 66.72777753269658
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The largely successful method of training neural networks is to learn their
weights using some variant of stochastic gradient descent (SGD). Here, we show
that the solutions found by SGD can be further improved by ensembling a subset
of the weights in late stages of learning. At the end of learning, we obtain
back a single model by taking a spatial average in weight space. To avoid
incurring increased computational costs, we investigate a family of
low-dimensional late-phase weight models which interact multiplicatively with
the remaining parameters. Our results show that augmenting standard models with
late-phase weights improves generalization in established benchmarks such as
CIFAR-10/100, ImageNet and enwik8. These findings are complemented with a
theoretical analysis of a noisy quadratic problem which provides a simplified
picture of the late phases of neural network learning.
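To make the mechanism concrete, below is a minimal, hypothetical PyTorch sketch (not the authors' released code) in which the late-phase weights are per-output multiplicative gains: several copies of these small parameters are trained on top of shared base weights during the late phase, and at the end they are averaged in weight space to recover a single model.

```python
# Minimal, hypothetical sketch of late-phase weight ensembling (not the authors' code).
# Base weights are shared; K small "late-phase" multiplicative gains are ensembled
# during late training and finally averaged in weight space into a single model.
import torch
import torch.nn as nn


class LatePhaseLinear(nn.Module):
    def __init__(self, in_features, out_features, k=4):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)   # shared base weights
        # K low-dimensional late-phase components: per-output multiplicative gains.
        self.gains = nn.Parameter(torch.ones(k, out_features))
        self.active = 0                                     # index of the active component

    def forward(self, x):
        return self.base(x) * self.gains[self.active]       # multiplicative interaction

    def average_late_phase(self):
        # Collapse the ensemble into one model via a spatial average in weight space.
        with torch.no_grad():
            self.gains.copy_(self.gains.mean(dim=0, keepdim=True).expand_as(self.gains))


model = LatePhaseLinear(32, 10, k=4)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

for step in range(100):                                     # late phase of training
    x, y = torch.randn(64, 32), torch.randint(0, 10, (64,))
    model.active = step % 4                                  # cycle through ensemble members
    loss = nn.functional.cross_entropy(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

model.average_late_phase()                                   # single model at the end
```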
Related papers
- Randomized Forward Mode Gradient for Spiking Neural Networks in Scientific Machine Learning [4.178826560825283]
Spiking neural networks (SNNs) represent a promising approach in machine learning, combining the hierarchical learning capabilities of deep neural networks with the energy efficiency of spike-based computations.
Traditional end-to-end training of SNNs is often based on back-propagation, where weight updates are derived from gradients computed through the chain rule.
This method encounters challenges due to its limited biological plausibility and inefficiencies on neuromorphic hardware.
In this study, we introduce an alternative training approach for SNNs. Instead of using back-propagation, we leverage weight perturbation methods within a forward-mode gradient framework.
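To illustrate the general idea of forward-mode, weight-perturbation gradients (a generic sketch, not the paper's SNN implementation; a plain MLP stands in for the spiking network), one can estimate the gradient from a single Jacobian-vector product along a random perturbation direction:

```python
# Hypothetical sketch of a randomized forward-mode (weight-perturbation) gradient step.
import torch
import torch.nn as nn
from torch.func import functional_call, jvp

model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 2))
params = dict(model.named_parameters())
x, y = torch.randn(16, 20), torch.randint(0, 2, (16,))


def loss_fn(p):
    out = functional_call(model, p, (x,))
    return nn.functional.cross_entropy(out, y)


# Sample a random perturbation direction v and compute the directional derivative
# (a Jacobian-vector product) in a single forward pass -- no back-propagation needed.
v = {k: torch.randn_like(p) for k, p in params.items()}
loss, dir_deriv = jvp(loss_fn, (params,), (v,))

# g = (grad . v) * v is an unbiased estimate of the true gradient.
lr = 1e-2
with torch.no_grad():
    for k, p in params.items():
        p -= lr * dir_deriv * v[k]
print(float(loss))
```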
arXiv Detail & Related papers (2024-11-11T15:20:54Z)
- Generative Feature Training of Thin 2-Layer Networks [0.0]
We consider the approximation of functions by 2-layer neural networks with a small number of hidden weights based on squared loss and small datasets.
We initialize the hidden weights with samples from a learned proposal distribution.
We refine the sampled weights with gradient-based post-processing in the latent space.
arXiv Detail & Related papers (2024-11-11T10:32:33Z)
- HyperSparse Neural Networks: Shifting Exploration to Exploitation through Adaptive Regularization [18.786142528591355]
Sparse neural networks are a key factor in developing resource-efficient machine learning applications.
We propose the novel and powerful sparse learning method Adaptive Regularized Training (ART) to compress dense networks into sparse ones.
Our method compresses the pre-trained model knowledge into the weights of highest magnitude.
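The following is a hedged sketch of this style of adaptive regularized training: a penalty that acts only on the smallest-magnitude weights is ramped up over training before a final magnitude prune. The exact HyperSparse regularizer and schedule in the paper may differ.

```python
# Hedged sketch of adaptive regularized training for sparsification (generic;
# the paper's exact regularizer and schedule may differ).
import torch
import torch.nn as nn

model = nn.Linear(64, 10)
opt = torch.optim.SGD(model.parameters(), lr=0.05)
target_sparsity = 0.9                                 # keep 10% of the weights


def small_weight_penalty(w, sparsity):
    # Penalize only the weights *outside* the top-(1 - sparsity) fraction by magnitude,
    # nudging low-magnitude weights toward zero while the largest weights stay free.
    k = int(w.numel() * (1.0 - sparsity))
    threshold = w.abs().flatten().kthvalue(w.numel() - k).values
    mask = w.abs() <= threshold
    return (w[mask] ** 2).sum()


for epoch in range(20):
    lam = min(1e-4 * (2.0 ** epoch), 1.0)             # increasing strength, capped for stability
    x, y = torch.randn(128, 64), torch.randint(0, 10, (128,))
    loss = nn.functional.cross_entropy(model(x), y)
    loss = loss + lam * small_weight_penalty(model.weight, target_sparsity)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Final hard prune: keep only the highest-magnitude weights.
with torch.no_grad():
    w = model.weight
    k = int(w.numel() * (1.0 - target_sparsity))
    threshold = w.abs().flatten().kthvalue(w.numel() - k).values
    w *= (w.abs() > threshold).float()
```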
arXiv Detail & Related papers (2023-08-14T14:18:11Z)
- Slimmable Networks for Contrastive Self-supervised Learning [69.9454691873866]
Self-supervised learning has made significant progress in pre-training large models, but it still struggles with small models.
We introduce another one-stage solution to obtain pre-trained small models without the need for extra teachers.
A slimmable network consists of a full network and several weight-sharing sub-networks, which can be pre-trained once to obtain various networks.
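A minimal sketch of the weight-sharing idea, assuming sub-networks are formed by slicing one weight matrix to different widths (illustrative only; the paper applies this to contrastive self-supervised pre-training):

```python
# Hedged sketch of a weight-sharing "slimmable" layer: sub-networks of different
# widths reuse slices of one weight matrix.
import torch
import torch.nn as nn


class SlimmableLinear(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x, width_ratio=1.0):
        # A sub-network simply uses the first `width_ratio` fraction of output units.
        n_out = max(1, int(self.weight.shape[0] * width_ratio))
        return nn.functional.linear(x, self.weight[:n_out], self.bias[:n_out])


layer = SlimmableLinear(128, 64)
x = torch.randn(8, 128)
full = layer(x, width_ratio=1.0)    # full network: 64 output units
half = layer(x, width_ratio=0.5)    # weight-sharing sub-network: 32 output units
print(full.shape, half.shape)       # torch.Size([8, 64]) torch.Size([8, 32])
```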
arXiv Detail & Related papers (2022-09-30T15:15:05Z)
- BiTAT: Neural Network Binarization with Task-dependent Aggregated Transformation [116.26521375592759]
Quantization aims to transform high-precision weights and activations of a given neural network into low-precision weights/activations for reduced memory usage and computation.
Extreme quantization (1-bit weight/1-bit activations) of compactly-designed backbone architectures results in severe performance degeneration.
This paper proposes a novel Quantization-Aware Training (QAT) method that can effectively alleviate performance degeneration.
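For context, a standard QAT building block for 1-bit weights is binarization with a straight-through estimator; the sketch below shows that generic mechanism and does not reproduce BiTAT's task-dependent aggregated transformation.

```python
# Hedged sketch of 1-bit quantization-aware training with a straight-through estimator.
import torch
import torch.nn as nn


class BinarizeSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        return torch.sign(w)                  # 1-bit weights in the forward pass

    @staticmethod
    def backward(ctx, grad_out):
        (w,) = ctx.saved_tensors
        # Straight-through estimator: pass gradients through where |w| <= 1.
        return grad_out * (w.abs() <= 1).float()


class BinaryLinear(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)

    def forward(self, x):
        return nn.functional.linear(x, BinarizeSTE.apply(self.weight))


model = BinaryLinear(32, 10)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y = torch.randn(64, 32), torch.randint(0, 10, (64,))
loss = nn.functional.cross_entropy(model(x), y)
opt.zero_grad()
loss.backward()
opt.step()                                    # latent full-precision weights are updated
```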
arXiv Detail & Related papers (2022-07-04T13:25:49Z)
- Dynamic Neural Diversification: Path to Computationally Sustainable Neural Networks [68.8204255655161]
Small neural networks with a constrained number of trainable parameters can be suitable resource-efficient candidates for many simple tasks.
We explore the diversity of the neurons within the hidden layer during the learning process.
We analyze how the diversity of the neurons affects predictions of the model.
arXiv Detail & Related papers (2021-09-20T15:12:16Z)
- Inverse-Dirichlet Weighting Enables Reliable Training of Physics Informed Neural Networks [2.580765958706854]
We describe and remedy a failure mode that may arise from multi-scale dynamics with scale imbalances during training of deep neural networks.
PINNs are popular machine-learning templates that allow for seamless integration of physical equation models with data.
For inverse modeling using sequential training, we find that inverse-Dirichlet weighting protects a PINN against catastrophic forgetting.
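A rough sketch of the underlying recipe, assuming each loss term is weighted by the inverse of its gradient variance (the paper's exact estimator and averaging details differ, and the loss terms below are placeholders, not a real PDE):

```python
# Rough sketch of weighting multi-term losses by inverse gradient variance, the
# general idea behind inverse-Dirichlet weighting.
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)


def grad_variance(loss):
    # Variance of the gradient entries of one loss term w.r.t. all network parameters.
    grads = torch.autograd.grad(loss, net.parameters(), retain_graph=True)
    return torch.cat([g.flatten() for g in grads]).var()


for step in range(200):
    x = torch.rand(256, 2)
    u = net(x)
    loss_residual = (u ** 2).mean()                      # stand-in for a PDE residual term
    loss_data = ((u - torch.sin(x[:, :1])) ** 2).mean()  # stand-in for a data/boundary term

    # Balance the terms: each weight is the largest gradient variance divided by the
    # term's own gradient variance, so small-gradient terms are not drowned out.
    v_r = grad_variance(loss_residual)
    v_d = grad_variance(loss_data)
    v_max = torch.maximum(v_r, v_d)
    w_r = (v_max / (v_r + 1e-12)).detach()
    w_d = (v_max / (v_d + 1e-12)).detach()

    loss = w_r * loss_residual + w_d * loss_data
    opt.zero_grad()
    loss.backward()
    opt.step()
```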
arXiv Detail & Related papers (2021-07-02T10:01:37Z)
- Effective Model Sparsification by Scheduled Grow-and-Prune Methods [73.03533268740605]
We propose a novel scheduled grow-and-prune (GaP) methodology without pre-training the dense models.
Experiments have shown that such models can match or beat the quality of highly optimized dense models at 80% sparsity on a variety of tasks.
arXiv Detail & Related papers (2021-06-18T01:03:13Z)
- Training Sparse Neural Networks using Compressed Sensing [13.84396596420605]
We develop and test a novel method based on compressed sensing which combines the pruning and training into a single step.
Specifically, we utilize an adaptively weighted $\ell_1$ penalty on the weights during training, which we combine with a generalization of the regularized dual averaging (RDA) algorithm in order to train sparse neural networks.
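A hedged sketch of the adaptively weighted $\ell_1$ idea using simple proximal (soft-thresholding) steps; this ISTA-style stand-in is not the paper's generalized RDA algorithm.

```python
# Hedged sketch of training with an adaptively weighted l1 penalty via proximal
# (soft-thresholding) steps.
import torch
import torch.nn as nn

model = nn.Linear(64, 10)
lr, lam, eps = 0.05, 1e-2, 1e-8

for step in range(200):
    x, y = torch.randn(128, 64), torch.randint(0, 10, (128,))
    loss = nn.functional.cross_entropy(model(x), y)
    model.zero_grad()
    loss.backward()

    with torch.no_grad():
        for p in model.parameters():
            p -= lr * p.grad                             # gradient step
            # Adaptive weights: penalize small-magnitude entries more strongly,
            # so large weights are shrunk less (one common reweighting choice).
            adaptive = lam / (p.abs() + eps)
            p.copy_(torch.sign(p) * torch.clamp(p.abs() - lr * adaptive, min=0.0))

sparsity = (model.weight == 0).float().mean().item()
print(f"weight sparsity: {sparsity:.2f}")
```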
arXiv Detail & Related papers (2020-08-21T19:35:54Z)
- Revisiting Initialization of Neural Networks [72.24615341588846]
We propose a rigorous estimation of the global curvature of weights across layers by approximating and controlling the norm of their Hessian matrix.
Our experiments on Word2Vec and the MNIST/CIFAR image classification tasks confirm that tracking the Hessian norm is a useful diagnostic tool.
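One common way to track such a diagnostic is to estimate the Hessian's spectral norm with Hessian-vector products and power iteration; the sketch below shows that generic approach (the paper's exact estimator may differ).

```python
# Hedged sketch of a Hessian-norm diagnostic via Hessian-vector products and
# power iteration.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
x, y = torch.randn(256, 32), torch.randint(0, 10, (256,))
loss = nn.functional.cross_entropy(model(x), y)

params = [p for p in model.parameters() if p.requires_grad]
grads = torch.autograd.grad(loss, params, create_graph=True)   # keep graph for HVPs


def hvp(vec):
    # Hessian-vector product via a second backward pass through the gradient graph.
    return torch.autograd.grad(grads, params, grad_outputs=vec, retain_graph=True)


# Power iteration on the Hessian to estimate its spectral norm.
v = [torch.randn_like(p) for p in params]
for _ in range(20):
    hv = hvp(v)
    norm = torch.sqrt(sum((h ** 2).sum() for h in hv))
    v = [h / (norm + 1e-12) for h in hv]

print(f"estimated Hessian spectral norm: {norm.item():.3f}")
```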
arXiv Detail & Related papers (2020-04-20T18:12:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.