Contrastive Weight Regularization for Large Minibatch SGD
- URL: http://arxiv.org/abs/2011.08968v1
- Date: Tue, 17 Nov 2020 22:07:38 GMT
- Title: Contrastive Weight Regularization for Large Minibatch SGD
- Authors: Qiwei Yuan, Weizhe Hua, Yi Zhou, Cunxi Yu
- Abstract summary: We introduce a novel regularization technique, namely distinctive regularization (DReg).
DReg replicates a certain layer of the deep network and encourages the parameters of both layers to be diverse.
We empirically show that optimizing the neural network with DReg using large-batch SGD significantly improves convergence and generalization performance.
- Score: 8.927483136015283
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The minibatch stochastic gradient descent method (SGD) is widely applied in
deep learning due to its efficiency and scalability that enable training deep
networks with a large volume of data. In the distributed setting in particular,
SGD is usually applied with a large batch size. However, in contrast to
small-batch SGD, neural networks trained with large-batch SGD tend to
generalize poorly, i.e., they reach lower validation accuracy. In this work, we
introduce a novel regularization technique, namely distinctive regularization
(DReg), which replicates a certain layer of the deep network and encourages the
parameters of both layers to be diverse. The DReg technique introduces very
little computation overhead. Moreover, we empirically show that optimizing the
neural network with DReg using large-batch SGD significantly boosts convergence
and improves generalization performance. We also demonstrate
that DReg can boost the convergence of large-batch SGD with momentum. We
believe that DReg can be used as a simple regularization trick to accelerate
large-batch training in deep learning.
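The abstract describes DReg only at a high level: a chosen layer is replicated, and the parameters of the two copies are encouraged to stay diverse. The sketch below is a minimal illustration of that idea, assuming a cosine-similarity penalty between the two weight matrices and an averaged forward pass; the names DRegLinear, dreg_penalty, and lambda_dreg are hypothetical and not taken from the paper.

```python
# Illustrative sketch only: the exact DReg loss is not spelled out in the abstract,
# so this assumes a cosine-similarity penalty between a layer and its replica.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DRegLinear(nn.Module):
    """A linear layer plus a replicated copy whose weights are pushed to differ."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.primary = nn.Linear(in_features, out_features)
        self.replica = nn.Linear(in_features, out_features)  # replicated layer

    def forward(self, x):
        # Average the two branches so both sets of parameters receive gradients.
        return 0.5 * (self.primary(x) + self.replica(x))

    def dreg_penalty(self):
        # Encourage diversity: penalize cosine similarity between flattened weights.
        w1 = self.primary.weight.flatten()
        w2 = self.replica.weight.flatten()
        return F.cosine_similarity(w1, w2, dim=0) ** 2

# Hypothetical usage inside a training step (lambda_dreg is a tunable weight):
# loss = criterion(model(x), y) + lambda_dreg * sum(
#     m.dreg_penalty() for m in model.modules() if isinstance(m, DRegLinear))
```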
Related papers
- Incremental Gauss-Newton Descent for Machine Learning [0.0]
We present a modification of the Gradient Descent algorithm exploiting approximate second-order information based on the Gauss-Newton approach.
The new method, which we call Incremental Gauss-Newton Descent (IGND), has essentially the same computational burden as standard SGD.
IGND can significantly outperform SGD while performing at least as well as SGD in the worst case.
arXiv Detail & Related papers (2024-08-10T13:52:40Z)
- Implicit Bias in Noisy-SGD: With Applications to Differentially Private Training [9.618473763561418]
Training Deep Neural Networks (DNNs) with small batches using Stochastic Gradient Descent (SGD) yields superior test performance compared to larger batches.
DP-SGD, used to ensure differential privacy (DP) in DNNs' training, adds Gaussian noise to the clipped gradients.
Surprisingly, large-batch training still results in a significant decrease in performance, which poses an important challenge because strong DP guarantees necessitate the use of massive batches.
arXiv Detail & Related papers (2024-02-13T10:19:33Z)
- Decentralized SGD and Average-direction SAM are Asymptotically Equivalent [101.37242096601315]
Decentralized stochastic gradient descent (D-SGD) allows collaborative learning on massive devices simultaneously without the control of a central server.
Existing theories claim that decentralization invariably undermines generalization.
arXiv Detail & Related papers (2023-06-05T14:19:52Z)
- Loss Landscape Dependent Self-Adjusting Learning Rates in Decentralized Stochastic Gradient Descent [37.52828820578212]
Distributed Deep Learning (DDL) is essential for large-scale Deep Learning (DL) training.
In a large batch setting, the learning rate must be increased to compensate for the reduced number of parameter updates.
Recently, Decentralized Parallel SGD (DPSGD) has been proposed to improve training speed.
arXiv Detail & Related papers (2021-12-02T17:23:25Z)
- Stochastic Training is Not Necessary for Generalization [57.04880404584737]
It is widely believed that the implicit regularization of stochastic gradient descent (SGD) is fundamental to the impressive generalization behavior we observe in neural networks.
In this work, we demonstrate that non-stochastic full-batch training can achieve strong performance on CIFAR-10 that is on-par with SGD.
arXiv Detail & Related papers (2021-09-29T00:50:00Z)
- DecentLaM: Decentralized Momentum SGD for Large-batch Deep Training [30.574484395380043]
Decentralized momentum SGD (DmSGD) is more communication-efficient than parallel momentum SGD, which incurs a global average across all computing nodes.
We propose DecentLaM, a decentralized momentum SGD method for large-batch deep training.
arXiv Detail & Related papers (2021-04-24T16:21:01Z)
- Direction Matters: On the Implicit Bias of Stochastic Gradient Descent with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z)
- Stochastic Normalized Gradient Descent with Momentum for Large-Batch Training [9.964630991617764]
Stochastic gradient descent (SGD) and its variants have been the dominant optimization methods in machine learning.
In this paper, we propose a simple yet effective method, called stochastic normalized gradient descent with momentum (SNGM), for large-batch training.
arXiv Detail & Related papers (2020-07-28T04:34:43Z)
- Learning Low-rank Deep Neural Networks via Singular Vector Orthogonality Regularization and Singular Value Sparsification [53.50708351813565]
We propose SVD training, the first method to explicitly achieve low-rank DNNs during training without applying SVD on every step.
We empirically show that SVD training can significantly reduce the rank of DNN layers and achieve higher reduction on computation load under the same accuracy.
arXiv Detail & Related papers (2020-04-20T02:40:43Z)
- Detached Error Feedback for Distributed SGD with Random Sparsification [98.98236187442258]
The communication bottleneck has been a critical problem in large-scale deep learning.
We propose a new detached error feedback (DEF) algorithm, which shows better convergence than error feedback for non-convex distributed problems.
We also propose DEFA to accelerate the generalization of DEF, which shows better generalization bounds than DEF.
arXiv Detail & Related papers (2020-04-11T03:50:59Z)
- On the Generalization of Stochastic Gradient Descent with Momentum [84.54924994010703]
Momentum-based accelerated variants of stochastic gradient descent (SGD) are widely used when training machine learning models.
We first show that there exists a convex loss function for which the stability gap for multiple epochs of SGD with standard heavy-ball momentum (SGDM) becomes unbounded.
For smooth Lipschitz loss functions, we analyze a modified momentum-based update rule, i.e., SGD with early momentum (SGDEM) under a broad range of step-sizes.
arXiv Detail & Related papers (2018-09-12T17:02:08Z)
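As a point of reference for the last entry above, the heavy-ball update that SGDM denotes is v <- mu*v - lr*g followed by w <- w + v. The sketch below is a generic textbook implementation on a toy quadratic loss, not the SGDEM variant analyzed in that paper; the function name and the toy objective are illustrative assumptions.

```python
import numpy as np

def sgd_heavy_ball(grad_fn, w, lr=0.01, momentum=0.9, steps=100):
    """Plain SGD with heavy-ball momentum: v <- mu*v - lr*g; w <- w + v."""
    v = np.zeros_like(w)
    for _ in range(steps):
        g = grad_fn(w)             # (stochastic) gradient at the current iterate
        v = momentum * v - lr * g  # accumulate a velocity term
        w = w + v                  # heavy-ball parameter update
    return w

# Toy usage on the quadratic loss 0.5*||w||^2, whose gradient is w itself:
w_final = sgd_heavy_ball(lambda w: w, w=np.array([1.0, -2.0]))
```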
This list is automatically generated from the titles and abstracts of the papers on this site.