Contrastive Weight Regularization for Large Minibatch SGD
- URL: http://arxiv.org/abs/2011.08968v1
- Date: Tue, 17 Nov 2020 22:07:38 GMT
- Title: Contrastive Weight Regularization for Large Minibatch SGD
- Authors: Qiwei Yuan, Weizhe Hua, Yi Zhou, Cunxi Yu
- Abstract summary: We introduce a novel regularization technique, namely distinctive regularization (DReg).
DReg replicates a certain layer of the deep network and encourages the parameters of both layers to be diverse.
We empirically show that optimizing the neural network with DReg using large-batch SGD significantly improves convergence and generalization performance.
- Score: 8.927483136015283
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The minibatch stochastic gradient descent method (SGD) is widely applied in
deep learning due to its efficiency and scalability that enable training deep
networks with a large volume of data. In the distributed setting in particular,
SGD is usually applied with a large batch size. However, in contrast to
small-batch SGD, neural networks trained with large-batch SGD tend to
generalize poorly, i.e., they reach lower validation accuracy. In this work, we
introduce a novel regularization technique, namely distinctive regularization
(DReg), which replicates a certain layer of the deep network and encourages the
parameters of both layers to be diverse. The DReg technique introduces very
little computation overhead. Moreover, we empirically show that optimizing the
neural network with DReg using large-batch SGD significantly boosts convergence
and improves generalization performance. We also demonstrate
that DReg can boost the convergence of large-batch SGD with momentum. We
believe that DReg can be used as a simple regularization trick to accelerate
large-batch training in deep learning.
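The abstract describes DReg only at a high level: a chosen layer is replicated, and the parameters of the two copies are encouraged to stay diverse. The sketch below is a minimal illustration of that idea, assuming a cosine-similarity penalty between the two weight matrices and an averaged forward pass; the names DRegLinear, dreg_penalty, and lambda_dreg are hypothetical and not taken from the paper.

```python
# Illustrative sketch only: the exact DReg loss is not spelled out in the abstract,
# so this assumes a cosine-similarity penalty between a layer and its replica.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DRegLinear(nn.Module):
    """A linear layer plus a replicated copy whose weights are pushed to differ."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.primary = nn.Linear(in_features, out_features)
        self.replica = nn.Linear(in_features, out_features)  # replicated layer

    def forward(self, x):
        # Average the two branches so both sets of parameters receive gradients.
        return 0.5 * (self.primary(x) + self.replica(x))

    def dreg_penalty(self):
        # Encourage diversity: penalize cosine similarity between flattened weights.
        w1 = self.primary.weight.flatten()
        w2 = self.replica.weight.flatten()
        return F.cosine_similarity(w1, w2, dim=0) ** 2

# Hypothetical usage inside a training step (lambda_dreg is a tunable weight):
# loss = criterion(model(x), y) + lambda_dreg * sum(
#     m.dreg_penalty() for m in model.modules() if isinstance(m, DRegLinear))
```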
Related papers
- Incremental Gauss-Newton Descent for Machine Learning [0.0]
We present a modification of the Gradient Descent algorithm exploiting approximate second-order information based on the Gauss-Newton approach.
The new method, which we call Incremental Gauss-Newton Descent (IGND), has essentially the same computational burden as standard SGD.
IGND can significantly outperform SGD while performing at least as well as SGD in the worst case.
arXiv Detail & Related papers (2024-08-10T13:52:40Z)
- Implicit Bias in Noisy-SGD: With Applications to Differentially Private Training [9.618473763561418]
Training Deep Neural Networks (DNNs) with small batches using Stochastic Gradient Descent (SGD) yields superior test performance compared to larger batches.
DP-SGD, used to ensure differential privacy (DP) in DNNs' training, adds Gaussian noise to the clipped gradients.
Surprisingly, large-batch training still results in a significant decrease in performance, which poses an important challenge because strong DP guarantees necessitate the use of massive batches.
arXiv Detail & Related papers (2024-02-13T10:19:33Z)
- Decentralized SGD and Average-direction SAM are Asymptotically Equivalent [101.37242096601315]
Decentralized stochastic gradient descent (D-SGD) allows collaborative learning on massive devices simultaneously without the control of a central server.
Existing theories claim that decentralization invariably undermines generalization.
arXiv Detail & Related papers (2023-06-05T14:19:52Z)
- Loss Landscape Dependent Self-Adjusting Learning Rates in Decentralized Stochastic Gradient Descent [37.52828820578212]
Distributed Deep Learning (DDL) is essential for large-scale Deep Learning (DL) training.
In a large batch setting, the learning rate must be increased to compensate for the reduced number of parameter updates.
Recently, Decentralized Parallel SGD (DPSGD) has been proposed to improve training speed.
arXiv Detail & Related papers (2021-12-02T17:23:25Z)
- Stochastic Training is Not Necessary for Generalization [57.04880404584737]
It is widely believed that the implicit regularization of stochastic gradient descent (SGD) is fundamental to the impressive generalization behavior we observe in neural networks.
In this work, we demonstrate that non-stochastic full-batch training can achieve strong performance on CIFAR-10 that is on-par with SGD.
arXiv Detail & Related papers (2021-09-29T00:50:00Z)
- DecentLaM: Decentralized Momentum SGD for Large-batch Deep Training [30.574484395380043]
Decentralized momentum SGD (DmSGD) is more communication-efficient than parallel momentum SGD, which incurs a global average across all computing nodes.
We propose DecentLaM, a decentralized momentum SGD method for large-batch deep training.
arXiv Detail & Related papers (2021-04-24T16:21:01Z)
- Direction Matters: On the Implicit Bias of Stochastic Gradient Descent with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z)
- Stochastic Normalized Gradient Descent with Momentum for Large-Batch Training [9.964630991617764]
Stochastic gradient descent (SGD) and its variants have been the dominant optimization methods in machine learning.
In this paper, we propose a simple yet effective method, called stochastic normalized gradient descent with momentum (SNGM), for large-batch training.
arXiv Detail & Related papers (2020-07-28T04:34:43Z)
- Learning Low-rank Deep Neural Networks via Singular Vector Orthogonality Regularization and Singular Value Sparsification [53.50708351813565]
We propose SVD training, the first method to explicitly achieve low-rank DNNs during training without applying SVD on every step.
We empirically show that SVD training can significantly reduce the rank of DNN layers and achieve higher reduction on computation load under the same accuracy.
arXiv Detail & Related papers (2020-04-20T02:40:43Z)
- Detached Error Feedback for Distributed SGD with Random Sparsification [98.98236187442258]
The communication bottleneck has been a critical problem in large-scale deep learning.
We propose a new detached error feedback (DEF) algorithm, which shows better convergence than error feedback for non-convex distributed problems.
We also propose DEFA to accelerate the generalization of DEF, which shows better generalization bounds than DEF.
arXiv Detail & Related papers (2020-04-11T03:50:59Z)
- On the Generalization of Stochastic Gradient Descent with Momentum [84.54924994010703]
Momentum-based accelerated variants of stochastic gradient descent (SGD) are widely used when training machine learning models.
We first show that there exists a convex loss function for which the stability gap for multiple epochs of SGD with standard heavy-ball momentum (SGDM) becomes unbounded.
For smooth Lipschitz loss functions, we analyze a modified momentum-based update rule, i.e., SGD with early momentum (SGDEM) under a broad range of step-sizes.
arXiv Detail & Related papers (2018-09-12T17:02:08Z)
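As a point of reference for the last entry above, the heavy-ball update that SGDM denotes is v <- mu*v - lr*g followed by w <- w + v. The sketch below is a generic textbook implementation on a toy quadratic loss, not the SGDEM variant analyzed in that paper; the function name and the toy objective are illustrative assumptions.

```python
import numpy as np

def sgd_heavy_ball(grad_fn, w, lr=0.01, momentum=0.9, steps=100):
    """Plain SGD with heavy-ball momentum: v <- mu*v - lr*g; w <- w + v."""
    v = np.zeros_like(w)
    for _ in range(steps):
        g = grad_fn(w)             # (stochastic) gradient at the current iterate
        v = momentum * v - lr * g  # accumulate a velocity term
        w = w + v                  # heavy-ball parameter update
    return w

# Toy usage on the quadratic loss 0.5*||w||^2, whose gradient is w itself:
w_final = sgd_heavy_ball(lambda w: w, w=np.array([1.0, -2.0]))
```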
This list is automatically generated from the titles and abstracts of the papers on this site.