Understanding Gradient Regularization in Deep Learning: Efficient
Finite-Difference Computation and Implicit Bias
- URL: http://arxiv.org/abs/2210.02720v1
- Date: Thu, 6 Oct 2022 07:12:54 GMT
- Title: Understanding Gradient Regularization in Deep Learning: Efficient
Finite-Difference Computation and Implicit Bias
- Authors: Ryo Karakida, Tomoumi Takase, Tomohiro Hayase, Kazuki Osawa
- Abstract summary: Gradient regularization (GR) is a method that penalizes the gradient norm of the training loss during training.
We show that a specific finite-difference computation, composed of both gradient ascent and descent steps, reduces the computational cost for GR.
We demonstrate that finite-difference GR is closely related to some other algorithms based on iterative ascent and descent steps for exploring flat minima.
- Score: 15.739122088062793
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Gradient regularization (GR) is a method that penalizes the gradient norm of
the training loss during training. Although some studies have reported that GR
improves generalization performance in deep learning, little attention has been
paid to it from the algorithmic perspective, that is, the algorithms of GR that
efficiently improve performance. In this study, we first reveal that a specific
finite-difference computation, composed of both gradient ascent and descent
steps, reduces the computational cost for GR. In addition, this computation
empirically achieves better generalization performance. Next, we theoretically
analyze a solvable model, a diagonal linear network, and clarify that GR has a
desirable implicit bias in a certain problem. In particular, learning with the
finite-difference GR chooses better minima as the ascent step size becomes
larger. Finally, we demonstrate that finite-difference GR is closely related to
some other algorithms based on iterative ascent and descent steps for exploring
flat minima: sharpness-aware minimization and the flooding method. We reveal
that flooding performs finite-difference GR in an implicit way. Thus, this work
broadens our understanding of GR in both practice and theory.
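As a rough illustration of the finite-difference computation described in the abstract, the sketch below uses a toy quadratic loss and hypothetical step sizes (it is not the authors' exact scheme): the Hessian-vector product appearing in the gradient of the GR objective L(θ) + (γ/2)‖∇L(θ)‖² is replaced by one extra gradient evaluation at an ascent point, so each update combines a gradient ascent step and a descent step.

```python
import numpy as np

# Toy quadratic loss L(theta) = 0.5 * theta^T A theta, chosen so the
# update can be checked by hand; A, lr, gamma, eps are illustrative.
A = np.diag([3.0, 0.5])

def loss_grad(theta):
    return A @ theta

def fd_gr_update(theta, lr=0.1, gamma=0.05, eps=1e-2):
    """One finite-difference gradient-regularization step (a sketch).

    The exact gradient of L(theta) + (gamma/2)*||grad L(theta)||^2
    needs a Hessian-vector product H @ g; a forward finite difference
    approximates it with one extra gradient at the ascent point
    theta + eps * g.
    """
    g = loss_grad(theta)                          # gradient at theta
    g_ascent = loss_grad(theta + eps * g)         # gradient after ascent step
    hvp_approx = (g_ascent - g) / eps             # ~ H @ g
    return theta - lr * (g + gamma * hvp_approx)  # descent on GR objective

def flooding_grad(theta, b=0.1):
    """Flooding sketch: descend when L > b, ascend when L < b.

    The alternating descent/ascent steps around the flood level b are
    what the abstract says amount to finite-difference GR implicitly.
    """
    L = 0.5 * theta @ A @ theta
    return loss_grad(theta) if L > b else -loss_grad(theta)

theta = np.array([1.0, 1.0])
for _ in range(100):
    theta = fd_gr_update(theta)
```

On this quadratic the finite difference is exact (the Hessian is constant), so the iteration contracts toward the origin with per-coordinate factor 1 − lr·(a + γa²) for each eigenvalue a of A.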
Related papers
- On the Generalization Capability of Temporal Graph Learning Algorithms:
Theoretical Insights and a Simpler Method [59.52204415829695]
Temporal Graph Learning (TGL) has become a prevalent technique across diverse real-world applications.
This paper investigates the generalization ability of different TGL algorithms.
We propose a simplified TGL network, which enjoys a small generalization error, improved overall performance, and lower model complexity.
arXiv Detail & Related papers (2024-02-26T08:22:22Z) - Model-Based Reparameterization Policy Gradient Methods: Theory and
Practical Algorithms [88.74308282658133]
Reparameterization (RP) Policy Gradient Methods (PGMs) have been widely adopted for continuous control tasks in robotics and computer graphics.
Recent studies have revealed that, when applied to long-term reinforcement learning problems, model-based RP PGMs may experience chaotic and non-smooth optimization landscapes.
We propose a spectral normalization method to mitigate the exploding variance issue caused by long model unrolls.
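The spectral-normalization idea in this summary can be sketched minimally as follows (a hypothetical power-iteration version, not the paper's implementation): rescaling a weight matrix so its largest singular value is at most 1 bounds the layer's Lipschitz constant, which is how it tames variance that compounds over long model unrolls.

```python
import numpy as np

def spectral_normalize(W, n_power_iters=20):
    """Rescale W so its top singular value is at most 1 (a sketch).

    The top singular value is estimated by power iteration, the usual
    cheap alternative to a full SVD.
    """
    u = np.ones(W.shape[0]) / np.sqrt(W.shape[0])
    for _ in range(n_power_iters):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    sigma = u @ W @ v              # estimated top singular value
    return W / max(sigma, 1.0)     # shrink only when sigma > 1

W = np.array([[3.0, 0.0],
              [0.0, 0.5]])
W_sn = spectral_normalize(W)       # top singular value becomes ~1
```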
arXiv Detail & Related papers (2023-10-30T18:43:21Z) - Edge-set reduction to efficiently solve the graph partitioning problem
with the genetic algorithm [0.0]
We investigate the impact of modifying the chromosome size on edge-based genetic algorithms (GAs).
We show that for big dense instances, the size of the encoding representation becomes too large and hurts the GA's efficiency.
arXiv Detail & Related papers (2023-07-19T18:39:15Z) - Implicit regularization in AI meets generalized hardness of
approximation in optimization -- Sharp results for diagonal linear networks [0.0]
We show sharp results for the implicit regularization imposed by the gradient flow of Diagonal Linear Networks.
We link this to the phenomenon of phase transitions in generalized hardness of approximation.
Non-sharpness of our results would imply that the GHA phenomenon would not occur for the basis pursuit optimization problem.
arXiv Detail & Related papers (2023-07-13T13:27:51Z) - Unifying gradient regularization for Heterogeneous Graph Neural Networks [6.3093033645568015]
We propose a novel gradient regularization method called Grug, which iteratively applies regularization to the gradients generated by both propagated messages and the node features during the message-passing process.
Grug provides a unified framework integrating graph topology and node features, based on which we conduct a detailed theoretical analysis of their effectiveness.
arXiv Detail & Related papers (2023-05-25T07:47:42Z) - Convergence of ease-controlled Random Reshuffling gradient Algorithms under Lipschitz smoothness [0.0]
We consider minimizing the average of a very large number of smooth, possibly nonconvex functions, and we use two widely used minibatch frameworks, Incremental Gradient (IG) and Random Reshuffling (RR), to tackle this problem.
We define ease-controlled modifications of IG/RR schemes, which require a light additional computational effort.
We compare our implementation with both a full-batch gradient method (i.e., L-BFGS) and a standard implementation of IG/RR methods, showing that the algorithms require a similar computational effort.
arXiv Detail & Related papers (2022-12-04T15:26:36Z) - Comprehensive Graph Gradual Pruning for Sparse Training in Graph Neural
Networks [52.566735716983956]
We propose a graph gradual pruning framework termed CGP to dynamically prune GNNs.
Unlike LTH-based methods, the proposed CGP approach requires no re-training, which significantly reduces the computation costs.
Our proposed strategy greatly improves both training and inference efficiency while matching or even exceeding the accuracy of existing methods.
arXiv Detail & Related papers (2022-07-18T14:23:31Z) - RawlsGCN: Towards Rawlsian Difference Principle on Graph Convolutional
Network [102.27090022283208]
Graph Convolutional Network (GCN) plays pivotal roles in many real-world applications.
GCN often exhibits performance disparity with respect to node degrees, resulting in worse predictive accuracy for low-degree nodes.
We formulate the problem of mitigating the degree-related performance disparity in GCN from the perspective of the Rawlsian difference principle.
arXiv Detail & Related papers (2022-02-28T05:07:57Z) - Leveraging Non-uniformity in First-order Non-convex Optimization [93.6817946818977]
Non-uniform refinement of objective functions leads to Non-uniform Smoothness (NS) and the Non-uniform Łojasiewicz inequality (NŁ).
New definitions inspire new geometry-aware first-order methods that converge to global optimality faster than the classical $\Omega(1/t^2)$ lower bounds.
arXiv Detail & Related papers (2021-05-13T04:23:07Z) - Direction Matters: On the Implicit Bias of Stochastic Gradient Descent
with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.