A Coefficient Makes SVRG Effective
- URL: http://arxiv.org/abs/2311.05589v2
- Date: Mon, 17 Mar 2025 11:14:58 GMT
- Title: A Coefficient Makes SVRG Effective
- Authors: Yida Yin, Zhiqiu Xu, Zhiyuan Li, Trevor Darrell, Zhuang Liu
- Abstract summary: Stochastic Variance Reduced Gradient (SVRG) is a theoretically compelling optimization method. In this work, we demonstrate the potential of SVRG in optimizing real-world neural networks.
- Score: 51.36251650664215
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Stochastic Variance Reduced Gradient (SVRG), introduced by Johnson & Zhang (2013), is a theoretically compelling optimization method. However, as Defazio & Bottou (2019) highlight, its effectiveness in deep learning is yet to be proven. In this work, we demonstrate the potential of SVRG in optimizing real-world neural networks. Our empirical analysis finds that, for deeper neural networks, the strength of the variance reduction term in SVRG should be smaller and decrease as training progresses. Inspired by this, we introduce a multiplicative coefficient $\alpha$ to control the strength and adjust it through a linear decay schedule. We name our method $\alpha$-SVRG. Our results show $\alpha$-SVRG better optimizes models, consistently reducing training loss compared to the baseline and standard SVRG across various model architectures and multiple image classification datasets. We hope our findings encourage further exploration into variance reduction techniques in deep learning. Code is available at github.com/davidyyd/alpha-SVRG.
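The $\alpha$-SVRG update described in the abstract is simple enough to sketch. Below is a minimal, self-contained illustration on a toy least-squares problem, with the coefficient $\alpha$ decayed linearly over epochs; the toy problem, hyperparameters, and plain-SGD base update are illustrative assumptions rather than the paper's exact training recipe (see the linked repository for the authors' implementation).

```python
# Minimal alpha-SVRG sketch on a toy least-squares problem (illustrative only).
# The variance reduction term is scaled by a coefficient alpha that decays
# linearly over training, as described in the abstract.
import numpy as np

rng = np.random.default_rng(0)
n, d = 256, 10
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def grad_i(w, i):                      # per-example gradient of 0.5 * (x_i.w - y_i)^2
    return (X[i] @ w - y[i]) * X[i]

def full_grad(w):                      # full-batch gradient
    return X.T @ (X @ w - y) / n

w = np.zeros(d)
lr, alpha0, epochs = 0.01, 0.75, 20

for epoch in range(epochs):
    alpha = alpha0 * (1 - epoch / epochs)   # linear decay schedule
    snapshot = w.copy()                     # snapshot point for variance reduction
    mu = full_grad(snapshot)                # full gradient at the snapshot
    for i in rng.permutation(n):
        # alpha-SVRG gradient estimate: alpha controls the strength of the
        # variance reduction term (alpha = 1 recovers standard SVRG,
        # alpha = 0 recovers plain SGD).
        g = grad_i(w, i) - alpha * (grad_i(snapshot, i) - mu)
        w -= lr * g

print("final training loss:", 0.5 * np.mean((X @ w - y) ** 2))
```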
Related papers
- Convergence Analysis of alpha-SVRG under Strong Convexity [17.360026829881487]
The variance-reduction technique alpha-SVRG allows fine-grained control of the residual noise in learning dynamics.
We show that alpha-SVRG has a faster convergence rate than SGD and SVRG under a suitable choice of alpha.
arXiv Detail & Related papers (2025-03-16T11:17:35Z) - Can Graph Reordering Speed Up Graph Neural Network Training? An Experimental Study [13.354505458409957]
Graph neural networks (GNNs) are capable of learning on graph-structured data.
The sparsity of graphs results in suboptimal memory access patterns and longer training time.
We show that graph reordering is effective in reducing training time for CPU- and GPU-based training.
arXiv Detail & Related papers (2024-09-17T12:28:02Z) - A Framework for Provably Stable and Consistent Training of Deep Feedforward Networks [4.21061712600981]
We present a novel algorithm for training deep neural networks in supervised (classification and regression) and unsupervised (reinforcement learning) scenarios.
This algorithm combines standard gradient descent with the gradient clipping method.
We show, in theory and through experiments, that our algorithm updates have low variance, and the training loss reduces in a smooth manner.
arXiv Detail & Related papers (2023-05-20T07:18:06Z) - Globally Optimal Training of Neural Networks with Threshold Activation Functions [63.03759813952481]
We study weight decay regularized training problems of deep neural networks with threshold activations.
We derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network.
arXiv Detail & Related papers (2023-03-06T18:59:13Z) - Orthogonal SVD Covariance Conditioning and Latent Disentanglement [65.67315418971688]
Inserting an SVD meta-layer into neural networks is prone to making the covariance matrix ill-conditioned.
We propose the Nearest Orthogonal Gradient (NOG) and the Optimal Learning Rate (OLR).
Experiments on visual recognition demonstrate that our methods can simultaneously improve covariance conditioning and generalization.
arXiv Detail & Related papers (2022-12-11T20:31:31Z) - Closing the gap between SVRG and TD-SVRG with Gradient Splitting [17.071971639540976]
Temporal difference (TD) learning is a policy evaluation method in reinforcement learning whose performance can be enhanced by variance reduction methods.
We utilize a recent interpretation of TD-learning as the splitting of the gradient of an appropriately chosen function, simplifying the algorithm and fusing TD with SVRG.
Our main result is a geometric convergence bound with a predetermined learning rate of $1/8$, identical to the convergence bound available for SVRG in the convex setting.
arXiv Detail & Related papers (2022-11-29T14:21:34Z) - An Empirical Analysis of Recurrent Learning Algorithms In Neural Lossy Image Compression Systems [73.48927855855219]
Recent advances in deep learning have resulted in image compression algorithms that outperform JPEG and JPEG 2000 on the standard Kodak benchmark.
In this paper, we perform the first large-scale comparison of recent state-of-the-art hybrid neural compression algorithms.
arXiv Detail & Related papers (2022-01-27T19:47:51Z) - Why Approximate Matrix Square Root Outperforms Accurate SVD in Global Covariance Pooling? [59.820507600960745]
We propose a new GCP meta-layer that uses SVD in the forward pass and Padé approximants in the backward propagation to compute the gradients.
The proposed meta-layer has been integrated into different CNN models and achieves state-of-the-art performance on both large-scale and fine-grained datasets.
arXiv Detail & Related papers (2021-05-06T08:03:45Z) - SVRG Meets AdaGrad: Painless Variance Reduction [34.42463428418348]
We propose a fully adaptive variant of SVRG, a common VR method.
AdaSVRG uses AdaGrad in the inner loop of SVRG, making it robust to the choice of step-size.
We validate the robustness and effectiveness of AdaSVRG, demonstrating its superior performance over other "tune-free" VR methods.
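The AdaSVRG entry above combines SVRG's variance-reduced gradient estimate with AdaGrad step sizes in the inner loop. A rough sketch of that combination on a toy least-squares problem might look as follows; the hyperparameters, the per-outer-loop reset of the AdaGrad accumulator, and the fixed inner-loop length are assumptions for illustration, not the authors' exact algorithm.

```python
# Rough AdaSVRG-style sketch: SVRG gradient estimates combined with
# AdaGrad-style adaptive step sizes in the inner loop (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
n, d = 256, 10
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

w = np.zeros(d)
eta, eps = 0.1, 1e-8

for outer in range(10):
    snapshot = w.copy()
    mu = X.T @ (X @ snapshot - y) / n          # full gradient at the snapshot
    G = np.zeros(d)                            # AdaGrad accumulator (reset here by assumption)
    for i in rng.permutation(n):
        gi = (X[i] @ w - y[i]) * X[i]          # stochastic gradient at w
        gs = (X[i] @ snapshot - y[i]) * X[i]   # stochastic gradient at snapshot
        v = gi - gs + mu                       # SVRG variance-reduced estimate
        G += v * v                             # accumulate squared gradients
        w -= eta * v / (np.sqrt(G) + eps)      # AdaGrad-scaled update

print("final training loss:", 0.5 * np.mean((X @ w - y) ** 2))
```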
arXiv Detail & Related papers (2021-02-18T22:26:19Z) - RNN Training along Locally Optimal Trajectories via Frank-Wolfe Algorithm [50.76576946099215]
We propose a novel and efficient training method for RNNs by iteratively seeking a local minimum on the loss surface within a small region.
Surprisingly, even with the additional per-step cost, the overall training cost of this method is empirically observed to be lower than that of back-propagation.
arXiv Detail & Related papers (2020-10-12T01:59:18Z) - A Novel Neural Network Training Framework with Data Assimilation [2.948167339160823]
A gradient-free training framework based on data assimilation is proposed to avoid the calculation of gradients.
The results show that the proposed training framework performed better than the gradient descent method.
arXiv Detail & Related papers (2020-10-06T11:12:23Z) - Learning Low-rank Deep Neural Networks via Singular Vector Orthogonality Regularization and Singular Value Sparsification [53.50708351813565]
We propose SVD training, the first method to explicitly achieve low-rank DNNs during training without applying SVD on every step.
We empirically show that SVD training can significantly reduce the rank of DNN layers and achieve a greater reduction in computation load at the same accuracy.
arXiv Detail & Related papers (2020-04-20T02:40:43Z) - Gradient Centralization: A New Optimization Technique for Deep Neural Networks [74.935141515523]
Gradient Centralization (GC) operates directly on gradients by centralizing the gradient vectors to have zero mean.
GC can be viewed as a projected gradient descent method with a constrained loss function.
GC is very simple to implement and can be easily embedded into existing gradient-based DNNs with only one line of code (a minimal sketch appears after this list).
arXiv Detail & Related papers (2020-04-03T10:25:00Z)
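As a small illustration of the Gradient Centralization entry above, the following hedged sketch shows the core operation: removing the mean from each weight gradient before the optimizer update. The toy weight matrix, the stand-in gradient, and the exact centering axis are assumptions for illustration; see the GC paper for the authors' formulation.

```python
# Minimal Gradient Centralization (GC) sketch (illustrative only):
# center each weight-matrix gradient to zero mean before the optimizer update.
import numpy as np

def centralize_gradient(grad):
    # For a weight matrix of shape (out_features, in_features), remove the
    # mean over the input dimensions from each row's gradient. Applying GC
    # only to multi-dimensional weights (not biases) follows the usual
    # convention; the exact axis choice here is an assumption.
    if grad.ndim > 1:
        return grad - grad.mean(axis=tuple(range(1, grad.ndim)), keepdims=True)
    return grad

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))          # toy weight matrix
grad_W = rng.normal(size=W.shape)    # toy gradient (stand-in for a backprop gradient)

lr = 0.1
W -= lr * centralize_gradient(grad_W)            # the "one line" added to a plain SGD step
print(centralize_gradient(grad_W).mean(axis=1))  # each row now has (near-)zero mean
```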