A Coefficient Makes SVRG Effective
- URL: http://arxiv.org/abs/2311.05589v1
- Date: Thu, 9 Nov 2023 18:47:44 GMT
- Title: A Coefficient Makes SVRG Effective
- Authors: Yida Yin, Zhiqiu Xu, Zhiyuan Li, Trevor Darrell, Zhuang Liu
- Abstract summary: Stochastic Variance Reduced Gradient (SVRG) is a theoretically compelling optimization method.
In this work, we demonstrate the potential of SVRG in optimizing real-world neural networks.
Our analysis finds that, for deeper networks, the strength of the variance reduction term in SVRG should be smaller and decrease as training progresses.
- Score: 55.104068027239656
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Stochastic Variance Reduced Gradient (SVRG), introduced by Johnson & Zhang
(2013), is a theoretically compelling optimization method. However, as Defazio
& Bottou (2019) highlight, its effectiveness in deep learning is yet to be
proven. In this work, we demonstrate the potential of SVRG in optimizing
real-world neural networks. Our analysis finds that, for deeper networks, the
strength of the variance reduction term in SVRG should be smaller and decrease
as training progresses. Inspired by this, we introduce a multiplicative
coefficient $\alpha$ to control the strength and adjust it through a linear
decay schedule. We name our method $\alpha$-SVRG. Our results show
$\alpha$-SVRG better optimizes neural networks, consistently reducing training
loss compared to both the baseline and standard SVRG across various
architectures and image classification datasets. We hope our findings encourage
further exploration into variance reduction techniques in deep learning. Code
is available at https://github.com/davidyyd/alpha-SVRG.
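As a concrete illustration, below is a minimal NumPy sketch of the $\alpha$-SVRG update on a toy least-squares problem. The objective, the starting coefficient alpha0, and all hyperparameters here are hypothetical; the authors' actual implementation lives in the repository linked above.

```python
# Minimal sketch of the alpha-SVRG update described in the abstract, on a
# toy least-squares objective. Illustrative reconstruction only; see the
# linked repository for the authors' code. alpha0 and all hyperparameters
# below are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
A, b = rng.normal(size=(256, 10)), rng.normal(size=256)
n = len(b)

def grad_i(w, i):                 # per-example gradient of 0.5 * (a_i.w - b_i)^2
    return (A[i] @ w - b[i]) * A[i]

def full_grad(w):                 # full-batch gradient, recomputed at each snapshot
    return A.T @ (A @ w - b) / n

w, lr, epochs, alpha0 = np.zeros(10), 0.01, 20, 0.75
for epoch in range(epochs):
    w_snap = w.copy()             # snapshot taken at the start of the epoch
    mu = full_grad(w_snap)
    alpha = alpha0 * (1 - epoch / epochs)   # linear decay schedule for alpha
    for i in rng.permutation(n):
        # alpha scales the variance reduction term: alpha = 1 recovers
        # standard SVRG, alpha = 0 recovers plain SGD.
        g = grad_i(w, i) - alpha * (grad_i(w_snap, i) - mu)
        w -= lr * g
```

Starting alpha0 below 1 and decaying it over training follows the qualitative finding in the abstract; the exact schedule the authors use is in their code.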
Related papers
- Can Graph Reordering Speed Up Graph Neural Network Training? An Experimental Study [13.354505458409957]
Graph neural networks (GNNs) are capable of learning on graph-structured data.
The sparsity of graphs results in suboptimal memory access patterns and longer training time.
We show that graph reordering is effective in reducing training time for CPU- and GPU-based training.
arXiv Detail & Related papers (2024-09-17T12:28:02Z)
- A Framework for Provably Stable and Consistent Training of Deep Feedforward Networks [4.21061712600981]
We present a novel algorithm for training deep neural networks in supervised (classification and regression) and unsupervised (reinforcement learning) scenarios.
This algorithm combines standard gradient descent with gradient clipping.
We show, in theory and through experiments, that our algorithm's updates have low variance and that the training loss decreases smoothly; a generic sketch of the clipping ingredient follows below.
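As a rough illustration of the clipping ingredient named above, here is a minimal NumPy sketch of gradient descent with norm clipping on a toy quadratic. The threshold and objective are hypothetical; this is not the paper's exact algorithm.

```python
# Hedged sketch: plain gradient descent combined with gradient-norm clipping,
# the two ingredients named in the summary. Threshold c and the toy objective
# are hypothetical choices.
import numpy as np

def clip_by_norm(g, c):
    """Rescale g so that its Euclidean norm is at most c."""
    norm = np.linalg.norm(g)
    return g if norm <= c else g * (c / norm)

# toy quadratic f(w) = 0.5 * ||w||^2, whose gradient is w itself
w, lr, c = np.array([10.0, -8.0]), 0.1, 1.0
for _ in range(200):
    g = clip_by_norm(w, c)        # clipping bounds the update magnitude
    w -= lr * g
print(w)                          # near-zero: converged to the minimizer at the origin
```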
arXiv Detail & Related papers (2023-05-20T07:18:06Z)
- Globally Optimal Training of Neural Networks with Threshold Activation Functions [63.03759813952481]
We study weight decay regularized training problems of deep neural networks with threshold activations.
We derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network.
arXiv Detail & Related papers (2023-03-06T18:59:13Z)
- Closing the gap between SVRG and TD-SVRG with Gradient Splitting [17.071971639540976]
Temporal difference (TD) learning is a policy evaluation method in reinforcement learning whose performance can be enhanced by variance reduction.
This work utilizes a recent interpretation of TD-learning as the splitting of the gradient of an appropriately chosen function, simplifying the algorithm and fusing TD with SVRG.
Our main result is a geometric convergence bound with a predetermined learning rate of $1/8$, identical to the convergence bound available for SVRG in the convex setting.
arXiv Detail & Related papers (2022-11-29T14:21:34Z)
- Why Approximate Matrix Square Root Outperforms Accurate SVD in Global Covariance Pooling? [59.820507600960745]
We propose a new GCP meta-layer that uses SVD in the forward pass and Padé approximants in the backward propagation to compute the gradients.
The proposed meta-layer has been integrated into different CNN models and achieves state-of-the-art performance on both large-scale and fine-grained datasets; the forward computation is sketched below.
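For reference, here is a minimal NumPy sketch of the GCP forward computation the summary describes: covariance pooling of a feature matrix followed by a matrix square root via eigendecomposition (equivalent to SVD for a symmetric PSD matrix). The backward pass via Padé approximants is the paper's contribution and is not reproduced here; the shapes are hypothetical.

```python
# Forward pass of global covariance pooling (GCP) with a spectral matrix
# square root. The Pade-approximant backward pass from the paper is omitted.
import numpy as np

def gcp_sqrt(X):
    """X: (n, d) features -> (d, d) square root of their covariance."""
    Xc = X - X.mean(axis=0, keepdims=True)    # center the features
    C = Xc.T @ Xc / (X.shape[0] - 1)          # d x d covariance matrix
    s, U = np.linalg.eigh(C)                  # spectral decomposition
    s = np.clip(s, 0.0, None)                 # guard tiny negative eigenvalues
    return (U * np.sqrt(s)) @ U.T             # U diag(sqrt(s)) U^T

X = np.random.default_rng(0).normal(size=(64, 16))
S = gcp_sqrt(X)
Xc = X - X.mean(axis=0, keepdims=True)
C = Xc.T @ Xc / (X.shape[0] - 1)
assert np.allclose(S @ S, C)                  # S is indeed the square root of C
```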
arXiv Detail & Related papers (2021-05-06T08:03:45Z)
- RNN Training along Locally Optimal Trajectories via Frank-Wolfe Algorithm [50.76576946099215]
We propose a novel and efficient training method for RNNs that iteratively seeks a local minimum on the loss surface within a small region.
Surprisingly, even with this additional cost, the overall training cost is empirically observed to be lower than that of back-propagation.
arXiv Detail & Related papers (2020-10-12T01:59:18Z)
- A Novel Neural Network Training Framework with Data Assimilation [2.948167339160823]
A gradient-free training framework based on data assimilation is proposed to avoid the calculation of gradients.
The results show that the proposed training framework performed better than the gradient descent method.
arXiv Detail & Related papers (2020-10-06T11:12:23Z)
- Learning Low-rank Deep Neural Networks via Singular Vector Orthogonality Regularization and Singular Value Sparsification [53.50708351813565]
We propose SVD training, the first method to explicitly achieve low-rank DNNs during training without applying SVD on every step.
We empirically show that SVD training can significantly reduce the rank of DNN layers and achieve a greater reduction in computational load at the same accuracy; a sketch of the two regularizers follows below.
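As a hedged sketch of the two regularizers named in the title, the snippet below keeps a layer in factored form W = U diag(s) V^T, penalizes non-orthogonal singular-vector factors, and applies an L1 penalty that sparsifies the singular values. The penalty weights and shapes are hypothetical, not the paper's settings.

```python
# Illustrative regularization term for SVD training: orthogonality penalties
# on the factors U and V plus an L1 sparsity penalty on the singular values s.
import numpy as np

def svd_training_penalty(U, s, V, lam_orth=1.0, lam_sparse=1e-3):
    """Added to the task loss; drives U, V toward orthonormal columns
    and small singular values toward zero (hence low rank)."""
    orth = (np.sum((U.T @ U - np.eye(U.shape[1])) ** 2)
            + np.sum((V.T @ V - np.eye(V.shape[1])) ** 2))
    sparse = np.sum(np.abs(s))
    return lam_orth * orth + lam_sparse * sparse

rng = np.random.default_rng(0)
U, s, V = rng.normal(size=(128, 32)), rng.normal(size=32), rng.normal(size=(64, 32))
W = U @ np.diag(s) @ V.T          # the layer's effective weight matrix
print(svd_training_penalty(U, s, V))
```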
arXiv Detail & Related papers (2020-04-20T02:40:43Z)
- Gradient Centralization: A New Optimization Technique for Deep Neural Networks [74.935141515523]
Gradient centralization (GC) operates directly on gradients by centralizing the gradient vectors to have zero mean.
GC can be viewed as a projected gradient descent method with a constrained loss function.
GC is very simple to implement and can be easily embedded into existing gradient-based DNNs with only one line of code (sketched after this entry).
arXiv Detail & Related papers (2020-04-03T10:25:00Z)
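The GC operation above is simple enough to sketch in full: subtract the mean from each weight-gradient vector before the optimizer step. A minimal NumPy version follows; the shape convention (first axis = output units) is an assumption, not necessarily the paper's.

```python
# Gradient centralization (GC): zero-center each gradient vector. The core
# operation is the single line inside the function.
import numpy as np

def centralize_gradient(g):
    """Zero-mean each output unit's gradient (weights with ndim >= 2 only)."""
    if g.ndim < 2:
        return g                                  # leave bias gradients untouched
    return g - g.mean(axis=tuple(range(1, g.ndim)), keepdims=True)

g = np.random.default_rng(0).normal(size=(4, 8))  # toy FC-layer gradient
print(centralize_gradient(g).mean(axis=1))        # ~0 for every row
```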
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.