Module-wise Training of Neural Networks via the Minimizing Movement
Scheme
- URL: http://arxiv.org/abs/2309.17357v3
- Date: Thu, 5 Oct 2023 14:53:57 GMT
- Title: Module-wise Training of Neural Networks via the Minimizing Movement
Scheme
- Authors: Skander Karkar and Ibrahim Ayed and Emmanuel de Bézenac and Patrick
Gallinari
- Abstract summary: Greedy layer-wise or module-wise training of neural networks is compelling in constrained and on-device settings where memory is limited.
We propose a module-wise regularization inspired by the minimizing movement scheme for gradient flows in distribution space.
We show improved accuracy of module-wise training of various architectures such as ResNets, Transformers and VGG, when our regularization is added.
- Score: 15.315147138002153
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Greedy layer-wise or module-wise training of neural networks is compelling in
constrained and on-device settings where memory is limited, as it circumvents a
number of problems of end-to-end back-propagation. However, it suffers from a
stagnation problem, whereby early layers overfit and deeper layers stop
increasing the test accuracy after a certain depth. We propose to solve this
issue by introducing a module-wise regularization inspired by the minimizing
movement scheme for gradient flows in distribution space. We call the method
TRGL for Transport Regularized Greedy Learning and study it theoretically,
proving that it leads to greedy modules that are regular and that progressively
solve the task. Experimentally, we show improved accuracy of module-wise
training of various architectures such as ResNets, Transformers and VGG, when
our regularization is added, superior to that of other module-wise training
methods and often to end-to-end training, with as much as 60% less memory
usage.
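As a rough illustration (not the authors' released implementation), the sketch below shows how a transport-regularized, module-wise training loop in the spirit of TRGL could be written in PyTorch. It assumes residual-style modules that map features to features of the same shape, a local auxiliary classifier per module, and the squared displacement ||f_k(x) - x||^2 as the transport penalty standing in for the minimizing-movement proximal term; the names train_module_wise, heads and lam, and the weighting of the penalty, are illustrative placeholders.

```python
import torch
import torch.nn.functional as F

def train_module_wise(modules, heads, loader, lam=0.1, epochs=1, lr=1e-3, device="cpu"):
    """Greedily train each module with a local head while earlier modules stay frozen.

    `modules`: list of blocks mapping features to same-shaped features (residual-style).
    `heads`: matching auxiliary classifiers providing the local task loss.
    The lam * ||f_k(x) - x||^2 term is the transport (kinetic-energy) penalty that
    plays the role of the minimizing-movement proximal term described in the abstract.
    """
    trained = []
    for block, head in zip(modules, heads):
        block, head = block.to(device), head.to(device)
        opt = torch.optim.Adam(list(block.parameters()) + list(head.parameters()), lr=lr)
        for _ in range(epochs):
            for x, y in loader:
                x, y = x.to(device), y.to(device)
                with torch.no_grad():                    # earlier modules are frozen
                    for prev in trained:
                        x = prev(x)
                z = block(x)                             # current module's output
                task_loss = F.cross_entropy(head(z), y)  # local auxiliary loss
                transport = (z - x).flatten(1).pow(2).sum(1).mean()  # ||f_k(x) - x||^2
                loss = task_loss + lam * transport
                opt.zero_grad()
                loss.backward()
                opt.step()
        block.requires_grad_(False)                      # freeze before the next module
        trained.append(block.eval())
    return trained, heads
```

Freezing each module before moving on mirrors the memory-light, greedy setting the abstract targets: only one module and one small head need activations and gradients stored at any time.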
Related papers
- Classifier-guided Gradient Modulation for Enhanced Multimodal Learning [50.7008456698935]
Classifier-Guided Gradient Modulation (CGGM) is a novel method to balance multimodal learning with gradients.
We conduct extensive experiments on four multimodal datasets: UPMC-Food 101, CMU-MOSI, IEMOCAP and BraTS.
CGGM outperforms all the baselines and other state-of-the-art methods consistently.
arXiv Detail & Related papers (2024-11-03T02:38:43Z) - Towards Interpretable Deep Local Learning with Successive Gradient Reconciliation [70.43845294145714]
Relieving the reliance of neural network training on global back-propagation (BP) has emerged as a notable research topic.
We propose a local training strategy that successively regularizes the gradient reconciliation between neighboring modules.
Our method can be integrated into both local-BP and BP-free settings.
arXiv Detail & Related papers (2024-06-07T19:10:31Z) - Take A Shortcut Back: Mitigating the Gradient Vanishing for Training Spiking Neural Networks [15.691263438655842]
Spiking Neural Network (SNN) is a biologically inspired neural network infrastructure that has recently garnered significant attention.
Training an SNN directly poses a challenge due to the undefined gradient of the firing spike process.
We propose a shortcut back-propagation method in our paper, which advocates for transmitting the gradient directly from the loss to the shallow layers.
arXiv Detail & Related papers (2024-01-09T10:54:41Z) - Go beyond End-to-End Training: Boosting Greedy Local Learning with
Context Supply [0.12187048691454236]
Greedy local learning partitions the network into gradient-isolated modules and trains them in a supervised manner using local preliminary losses.
As the number of gradient-isolated modules increases, the performance of the local learning scheme degrades substantially.
We propose ContSup, a scheme that incorporates context supply between isolated modules to compensate for information loss.
arXiv Detail & Related papers (2023-12-12T10:25:31Z) - Decouple Graph Neural Networks: Train Multiple Simple GNNs Simultaneously Instead of One [60.5818387068983]
Graph neural networks (GNNs) suffer from severe training inefficiency.
We propose to decouple a multi-layer GNN into multiple simple modules for more efficient training.
We show that the proposed framework is highly efficient with reasonable performance.
arXiv Detail & Related papers (2023-04-20T07:21:32Z) - Scaling Forward Gradient With Local Losses [117.22685584919756]
Forward learning is a biologically plausible alternative to backprop for learning deep neural networks.
We show that it is possible to substantially reduce the variance of the forward gradient by applying perturbations to activations rather than weights (a minimal sketch of this estimator appears after this list).
Our approach matches backprop on MNIST and CIFAR-10 and significantly outperforms previously proposed backprop-free algorithms on ImageNet.
arXiv Detail & Related papers (2022-10-07T03:52:27Z) - Block-wise Training of Residual Networks via the Minimizing Movement
Scheme [10.342408668490975]
We develop a layer-wise training method, particularly well suited to ResNets, inspired by the minimizing movement scheme for gradient flows in distribution space.
The method amounts to a kinetic energy regularization of each block that makes the blocks optimal transport maps and endows them with regularity.
It works by alleviating the stagnation problem observed in layer-wise training, whereby greedily-trained early layers overfit and deeper layers stop increasing test accuracy after a certain depth.
arXiv Detail & Related papers (2022-10-03T14:03:56Z) - BackLink: Supervised Local Training with Backward Links [2.104758015212034]
This work proposes a novel local training algorithm, BackLink, which introduces inter-module backward dependency and allows errors to flow between modules.
Our method can reduce memory cost by up to 79% and simulation runtime by 52% in ResNet110 compared to standard BP.
arXiv Detail & Related papers (2022-05-14T21:49:47Z) - Short-Term Memory Optimization in Recurrent Neural Networks by
Autoencoder-based Initialization [79.42778415729475]
We explore an alternative solution based on explicit memorization using linear autoencoders for sequences.
We show how such pretraining can better support solving hard classification tasks with long sequences.
We show that the proposed approach achieves a much lower reconstruction error for long sequences and a better gradient propagation during the finetuning phase.
arXiv Detail & Related papers (2020-11-05T14:57:16Z) - Regularizing Meta-Learning via Gradient Dropout [102.29924160341572]
Meta-learning models are prone to overfitting when there are not enough training tasks for the meta-learners to generalize.
We introduce a simple yet effective method to alleviate the risk of overfitting for gradient-based meta-learning.
arXiv Detail & Related papers (2020-04-13T10:47:02Z)
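For the "Scaling Forward Gradient With Local Losses" entry above, here is the minimal sketch it refers to. It assumes a single linear layer with a purely local loss, PyTorch's torch.func.jvp for forward-mode differentiation, and illustrative names; the paper itself combines many such local losses with further variance-reduction machinery, so this is only one plausible reading of the basic estimator.

```python
import torch
import torch.nn.functional as F

def activity_perturbed_forward_grad(W, b, x, y, loss_fn=F.cross_entropy):
    """Forward-gradient estimate for a linear layer via activity perturbation.

    z = x @ W.T + b is the pre-activation. A random tangent u in activation space
    is pushed through the local loss with forward-mode AD to get the directional
    derivative d = <dL/dz, u>; d * u is then an unbiased surrogate for dL/dz, and
    the chain rule for the linear layer gives the weight gradients in closed form,
    with no backward pass required.
    """
    z = x @ W.T + b                                    # local pre-activation, shape (B, out)
    u = torch.randn_like(z)                            # perturbation in activation space
    _, d = torch.func.jvp(lambda zz: loss_fn(zz, y), (z,), (u,))  # d = <dL/dz, u>
    g_z = d * u                                        # estimate of dL/dz
    return g_z.T @ x, g_z.sum(0)                       # grads w.r.t. W (out, in) and b (out,)

# example: grad_W, grad_b = activity_perturbed_forward_grad(W, b, x, y)
```

Because the random direction lives in activation space, which is typically far smaller than weight space, the estimator's variance is much lower than with weight perturbations, which is the reduction that entry describes.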