AlterSGD: Finding Flat Minima for Continual Learning by Alternative
Training
- URL: http://arxiv.org/abs/2107.05804v1
- Date: Tue, 13 Jul 2021 01:43:51 GMT
- Title: AlterSGD: Finding Flat Minima for Continual Learning by Alternative
Training
- Authors: Zhongzhan Huang, Mingfu Liang, Senwei Liang, Wei He
- Abstract summary: We propose a simple yet effective optimization method, called AlterSGD, to search for a flat minimum in the loss landscape.
We prove that such a strategy can encourage the optimization to converge to a flat minimum.
We verify AlterSGD on a continual learning benchmark for semantic segmentation, and the empirical results show that it significantly mitigates forgetting.
- Score: 11.521519687645428
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep neural networks suffer from catastrophic forgetting when learning
multiple pieces of knowledge sequentially, and a growing number of approaches have
been proposed to mitigate this problem. Some of these methods achieve considerable
performance by associating flat local minima with forgetting mitigation in
continual learning. However, they inevitably require (1) tedious hyperparameter
tuning and (2) additional computational cost. To alleviate these problems, in
this paper we propose a simple yet effective optimization method, called
AlterSGD, to search for a flat minimum in the loss landscape. In AlterSGD, we
conduct gradient descent and ascent alternately when the network tends to
converge at each session of learning new knowledge. Moreover, we theoretically
prove that such a strategy can encourage the optimization to converge to a flat
minimum. We verify AlterSGD on a continual learning benchmark for semantic
segmentation, and the empirical results show that it significantly mitigates
forgetting and outperforms state-of-the-art methods by a large margin under
challenging continual learning protocols.
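The core update is simple enough to sketch. Below is a minimal, hedged PyTorch illustration of the alternating rule described in the abstract: plain SGD until the loss on the current session plateaus, then descent steps interleaved with occasional gradient-ascent steps. The plateau heuristic, the alternation period, and all hyperparameters here are illustrative assumptions, not the authors' exact settings.

```python
import torch

def altersgd_session(model, loss_fn, data_loader, lr=0.01, ascent_every=2,
                     plateau_tol=1e-3, epochs=10, device="cpu"):
    """Hypothetical sketch of one continual-learning session trained with an
    AlterSGD-style rule: plain SGD until the loss plateaus, then alternating
    descent and ascent steps (assumed schedule, not the paper's exact one)."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    prev_epoch_loss, alternating, step = float("inf"), False, 0

    for epoch in range(epochs):
        epoch_loss = 0.0
        for x, y in data_loader:
            x, y = x.to(device), y.to(device)
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()

            # Once near convergence, flip the gradient sign on every
            # `ascent_every`-th step: a gradient-ascent step that, alternated
            # with descent, is argued to bias the iterates toward flat minima.
            if alternating and step % ascent_every == 0:
                for p in model.parameters():
                    if p.grad is not None:
                        p.grad.neg_()
            opt.step()
            epoch_loss += loss.item()
            step += 1

        epoch_loss /= max(len(data_loader), 1)
        # Crude plateau heuristic (assumption): start alternating when the
        # per-epoch loss stops improving by more than `plateau_tol`.
        if prev_epoch_loss - epoch_loss < plateau_tol:
            alternating = True
        prev_epoch_loss = epoch_loss
    return model
```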
Related papers
- Non-convergence to global minimizers in data driven supervised deep learning: Adam and stochastic gradient descent optimization provably fail to converge to global minimizers in the training of deep neural networks with ReLU activation [3.6185342807265415]
It remains an open research problem to explain the success and the limitations of SGD methods in rigorous theoretical terms.
In this work we prove, for a large class of SGD methods, that the considered optimizer fails, with high probability, to converge to global minimizers of the optimization problem.
The general non-convergence results of this work do not only apply to the plain vanilla standard SGD method but also to a large class of accelerated and adaptive SGD methods.
arXiv Detail & Related papers (2024-10-14T14:11:37Z) - Scaling Forward Gradient With Local Losses [117.22685584919756]
Forward learning is a biologically plausible alternative to backprop for learning deep neural networks.
We show that it is possible to substantially reduce the variance of the forward gradient by applying perturbations to activations rather than weights (a toy sketch of this estimator appears after this list).
Our approach matches backprop on MNIST and CIFAR-10 and significantly outperforms previously proposed backprop-free algorithms on ImageNet.
arXiv Detail & Related papers (2022-10-07T03:52:27Z) - An Accelerated Doubly Stochastic Gradient Method with Faster Explicit
Model Identification [97.28167655721766]
We propose a novel accelerated doubly stochastic gradient descent (ADSGD) method for sparsity regularized loss minimization problems.
We first prove that ADSGD can achieve a linear convergence rate and lower overall computational complexity.
arXiv Detail & Related papers (2022-08-11T22:27:22Z) - Questions for Flat-Minima Optimization of Modern Neural Networks [28.12506392321345]
Two methods for finding flat minima stand out: 1. Averaging methods (i.e., Stochastic Weight Averaging, SWA) and 2. Minimax methods (i.e., Sharpness-Aware Minimization, SAM); a minimal sketch of both appears after this list.
We investigate the loss surfaces from a systematic benchmarking of these approaches across computer vision, natural language processing, and graph learning tasks.
arXiv Detail & Related papers (2022-02-01T18:56:15Z) - Local AdaGrad-Type Algorithm for Stochastic Convex-Concave Minimax
Problems [80.46370778277186]
Large scale convex-concave minimax problems arise in numerous applications, including game theory, robust training, and training of generative adversarial networks.
We develop a communication-efficient distributed extragradient algorithm, LocalAdaSient, with an adaptive learning rate suitable for solving convex-concave minimax problems in the Parameter-Server model (a simplified single-machine sketch appears after this list).
We demonstrate its efficacy through several experiments in both the homogeneous and heterogeneous settings.
arXiv Detail & Related papers (2021-06-18T09:42:05Z) - Tunable Subnetwork Splitting for Model-parallelism of Neural Network
Training [12.755664985045582]
We propose a Tunable Subnetwork Splitting Method (TSSM) to tune the decomposition of deep neural networks.
Our proposed TSSM can achieve significant speedup without observable loss of training accuracy.
arXiv Detail & Related papers (2020-09-09T01:05:12Z) - Reparameterized Variational Divergence Minimization for Stable Imitation [57.06909373038396]
We study the extent to which variations in the choice of probabilistic divergence may yield more performant ILO algorithms.
We contribute a reparameterization trick for adversarial imitation learning to alleviate the challenges of the promising $f$-divergence minimization framework.
Empirically, we demonstrate that our design choices allow for ILO algorithms that outperform baseline approaches and more closely match expert performance in low-dimensional continuous-control tasks.
arXiv Detail & Related papers (2020-06-18T19:04:09Z) - AdaS: Adaptive Scheduling of Stochastic Gradients [50.80697760166045]
We introduce the notions of "knowledge gain" and "mapping condition" and propose a new algorithm called Adaptive Scheduling (AdaS).
Experimentation reveals that, using the derived metrics, AdaS exhibits: (a) faster convergence and superior generalization over existing adaptive learning methods; and (b) lack of dependence on a validation set to determine when to stop training.
arXiv Detail & Related papers (2020-06-11T16:36:31Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z) - Robust Learning Rate Selection for Stochastic Optimization via Splitting
Diagnostic [5.395127324484869]
SplitSGD is a new dynamic learning rate schedule for stochastic optimization.
The method decreases the learning rate for better adaptation to the local geometry of the objective function.
It incurs essentially no additional computational cost compared to standard SGD.
arXiv Detail & Related papers (2019-10-18T19:38:53Z)
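For the "Scaling Forward Gradient With Local Losses" entry above, the following NumPy toy sketches the forward-gradient estimator it builds on: sample a random direction, take the exact directional derivative (a forward-mode JVP), and scale the direction by it. The single linear layer, the squared-error loss, and the sample count are illustrative assumptions; the point is only that perturbing the low-dimensional activation, as the paper advocates, gives an unbiased gradient estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy layer: a = W x with scalar loss L = 0.5 * ||a - t||^2 (illustrative assumption)
W = rng.normal(size=(3, 10))
x = rng.normal(size=10)
t = rng.normal(size=3)

def loss_grad_wrt_activation(a):
    """Exact gradient of L = 0.5 * ||a - t||^2 with respect to the activation a."""
    return a - t

a = W @ x
true_grad_a = loss_grad_wrt_activation(a)

# Forward-gradient estimator: sample a random direction v over the activation,
# take the exact directional derivative (a forward-mode JVP), and scale v by it.
estimates = []
for _ in range(10000):
    v = rng.normal(size=a.shape)      # perturbation of the 3-dim activation
    directional = true_grad_a @ v     # d/deps L(a + eps * v) at eps = 0
    estimates.append(directional * v) # unbiased estimate of dL/da
    # The weight gradient would then follow locally as np.outer(dL/da, x).

print("true grad  :", true_grad_a)
print("mean of est:", np.mean(estimates, axis=0))  # close to the true gradient
```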
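For the "Questions for Flat-Minima Optimization of Modern Neural Networks" entry, here is a schematic NumPy comparison of the two families it benchmarks, run on a toy quadratic loss (an illustrative assumption): SWA averages the weights visited by SGD, while SAM first takes an ascent probe step of radius rho and uses the gradient at that probe point for the actual update.

```python
import numpy as np

def loss(w):   # toy quadratic loss (illustrative assumption)
    return 0.5 * w @ w

def grad(w):   # its exact gradient
    return w

def swa(w0, lr=0.1, steps=100, avg_start=50):
    """Stochastic Weight Averaging: run SGD, then average the iterates
    visited after `avg_start` steps."""
    w, avg, n = w0.copy(), np.zeros_like(w0), 0
    for t in range(steps):
        w -= lr * grad(w)
        if t >= avg_start:
            avg, n = avg + w, n + 1
    return avg / max(n, 1)

def sam(w0, lr=0.1, rho=0.05, steps=100):
    """Sharpness-Aware Minimization: evaluate the gradient at the probe
    point w + rho * g/||g|| and apply that gradient at w."""
    w = w0.copy()
    for _ in range(steps):
        g = grad(w)                                  # in practice a minibatch gradient
        eps = rho * g / (np.linalg.norm(g) + 1e-12)  # ascent probe step
        w -= lr * grad(w + eps)                      # descend with the probed gradient
    return w

w0 = np.array([3.0, -2.0])
print("SWA solution:", swa(w0))
print("SAM solution:", sam(w0))
```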
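For the "Local AdaGrad-Type Algorithm for Stochastic Convex-Concave Minimax Problems" entry, the sketch below shows a single-machine extragradient update with AdaGrad-style step sizes on the bilinear saddle problem min_x max_y xy, where plain gradient descent-ascent cycles instead of converging. The distributed, communication-efficient parts of LocalAdaSient are omitted, and the problem and step sizes are assumptions for illustration.

```python
import numpy as np

def adaptive_extragradient(steps=500, eta=0.5):
    """Extragradient with AdaGrad-style step sizes on f(x, y) = x * y,
    a bilinear saddle problem whose saddle point is (0, 0)."""
    x, y = 1.0, 1.0
    gx_acc, gy_acc = 1e-8, 1e-8              # accumulated squared gradients
    for _ in range(steps):
        gx, gy = y, x                        # df/dx = y, df/dy = x
        gx_acc += gx * gx
        gy_acc += gy * gy
        sx, sy = eta / np.sqrt(gx_acc), eta / np.sqrt(gy_acc)

        # extrapolation (look-ahead) point
        x_half = x - sx * gx                 # descent step on x
        y_half = y + sy * gy                 # ascent step on y

        # update using gradients evaluated at the look-ahead point
        x -= sx * y_half
        y += sy * x_half
    return x, y

print(adaptive_extragradient())   # approaches the saddle point (0, 0)
```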