Understanding Gradient Regularization in Deep Learning: Efficient
Finite-Difference Computation and Implicit Bias
- URL: http://arxiv.org/abs/2210.02720v1
- Date: Thu, 6 Oct 2022 07:12:54 GMT
- Title: Understanding Gradient Regularization in Deep Learning: Efficient
Finite-Difference Computation and Implicit Bias
- Authors: Ryo Karakida, Tomoumi Takase, Tomohiro Hayase, Kazuki Osawa
- Abstract summary: Gradient regularization (GR) is a method that penalizes the gradient norm of the training loss during training.
We show that a specific finite-difference computation, composed of both gradient ascent and descent steps, reduces the computational cost for GR.
We demonstrate that finite-difference GR is closely related to some other algorithms based on iterative ascent and descent steps for exploring flat minima.
- Score: 15.739122088062793
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Gradient regularization (GR) is a method that penalizes the gradient norm of
the training loss during training. Although some studies have reported that GR
improves generalization performance in deep learning, little attention has been
paid to it from the algorithmic perspective, that is, the algorithms of GR that
efficiently improve performance. In this study, we first reveal that a specific
finite-difference computation, composed of both gradient ascent and descent
steps, reduces the computational cost for GR. In addition, this computation
empirically achieves better generalization performance. Next, we theoretically
analyze a solvable model, a diagonal linear network, and clarify that GR has a
desirable implicit bias in a certain problem. In particular, learning with the
finite-difference GR chooses better minima as the ascent step size becomes
larger. Finally, we demonstrate that finite-difference GR is closely related to
some other algorithms based on iterative ascent and descent steps for exploring
flat minima: sharpness-aware minimization and the flooding method. We reveal
that flooding performs finite-difference GR in an implicit way. Thus, this work
broadens our understanding of GR in both practice and theory.
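As a rough illustration of the finite-difference computation described in the abstract, the sketch below uses a toy quadratic loss and hypothetical step sizes (it is not the authors' exact scheme): the Hessian-vector product appearing in the gradient of the GR objective L(θ) + (γ/2)‖∇L(θ)‖² is replaced by one extra gradient evaluation at an ascent point, so each update combines a gradient ascent step and a descent step.

```python
import numpy as np

# Toy quadratic loss L(theta) = 0.5 * theta^T A theta, chosen so the
# update can be checked by hand; A, lr, gamma, eps are illustrative.
A = np.diag([3.0, 0.5])

def loss_grad(theta):
    return A @ theta

def fd_gr_update(theta, lr=0.1, gamma=0.05, eps=1e-2):
    """One finite-difference gradient-regularization step (a sketch).

    The exact gradient of L(theta) + (gamma/2)*||grad L(theta)||^2
    needs a Hessian-vector product H @ g; a forward finite difference
    approximates it with one extra gradient at the ascent point
    theta + eps * g.
    """
    g = loss_grad(theta)                          # gradient at theta
    g_ascent = loss_grad(theta + eps * g)         # gradient after ascent step
    hvp_approx = (g_ascent - g) / eps             # ~ H @ g
    return theta - lr * (g + gamma * hvp_approx)  # descent on GR objective

def flooding_grad(theta, b=0.1):
    """Flooding sketch: descend when L > b, ascend when L < b.

    The alternating descent/ascent steps around the flood level b are
    what the abstract says amount to finite-difference GR implicitly.
    """
    L = 0.5 * theta @ A @ theta
    return loss_grad(theta) if L > b else -loss_grad(theta)

theta = np.array([1.0, 1.0])
for _ in range(100):
    theta = fd_gr_update(theta)
```

On this quadratic the finite difference is exact (the Hessian is constant), so the iteration contracts toward the origin with per-coordinate factor 1 − lr·(a + γa²) for each eigenvalue a of A.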
Related papers
- On the Generalization Capability of Temporal Graph Learning Algorithms:
Theoretical Insights and a Simpler Method [59.52204415829695]
Temporal Graph Learning (TGL) has become a prevalent technique across diverse real-world applications.
This paper investigates the generalization ability of different TGL algorithms.
We propose a simplified TGL network, which enjoys a small generalization error, improved overall performance, and lower model complexity.
arXiv Detail & Related papers (2024-02-26T08:22:22Z) - Model-Based Reparameterization Policy Gradient Methods: Theory and
Practical Algorithms [88.74308282658133]
Reparameterization (RP) Policy Gradient Methods (PGMs) have been widely adopted for continuous control tasks in robotics and computer graphics.
Recent studies have revealed that, when applied to long-term reinforcement learning problems, model-based RP PGMs may experience chaotic and non-smooth optimization landscapes.
We propose a spectral normalization method to mitigate the exploding variance issue caused by long model unrolls.
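The spectral-normalization idea in this summary can be sketched minimally as follows (a hypothetical power-iteration version, not the paper's implementation): rescaling a weight matrix so its largest singular value is at most 1 bounds the layer's Lipschitz constant, which is how it tames variance that compounds over long model unrolls.

```python
import numpy as np

def spectral_normalize(W, n_power_iters=20):
    """Rescale W so its top singular value is at most 1 (a sketch).

    The top singular value is estimated by power iteration, the usual
    cheap alternative to a full SVD.
    """
    u = np.ones(W.shape[0]) / np.sqrt(W.shape[0])
    for _ in range(n_power_iters):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    sigma = u @ W @ v              # estimated top singular value
    return W / max(sigma, 1.0)     # shrink only when sigma > 1

W = np.array([[3.0, 0.0],
              [0.0, 0.5]])
W_sn = spectral_normalize(W)       # top singular value becomes ~1
```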
arXiv Detail & Related papers (2023-10-30T18:43:21Z) - Edge-set reduction to efficiently solve the graph partitioning problem
with the genetic algorithm [0.0]
We investigate the impact of modifying the chromosome size on edge-based genetic algorithms (GAs).
We show that for big dense instances, the size of the encoding representation becomes too large and hurts the GA's efficiency.
arXiv Detail & Related papers (2023-07-19T18:39:15Z) - Implicit regularization in AI meets generalized hardness of
approximation in optimization -- Sharp results for diagonal linear networks [0.0]
We show sharp results for the implicit regularization imposed by the gradient flow of Diagonal Linear Networks.
We link this to the phenomenon of phase transitions in generalized hardness of approximation.
Non-sharpness of our results would imply that the GHA phenomenon would not occur for the basis pursuit optimization problem.
arXiv Detail & Related papers (2023-07-13T13:27:51Z) - Unifying gradient regularization for Heterogeneous Graph Neural Networks [6.3093033645568015]
We propose a novel gradient regularization method called Grug, which iteratively applies regularization to the gradients generated by both propagated messages and the node features during the message-passing process.
Grug provides a unified framework integrating graph topology and node features, based on which we conduct a detailed theoretical analysis of their effectiveness.
arXiv Detail & Related papers (2023-05-25T07:47:42Z) - Convergence of ease-controlled Random Reshuffling gradient Algorithms under Lipschitz smoothness [0.0]
We consider minimizing the average of a very large number of smooth, possibly nonconvex functions, and we use two widely used minibatch frameworks, Incremental Gradient (IG) and Random Reshuffling (RR), to tackle this problem.
We define ease-controlled modifications of IG/RR schemes, which require a light additional computational effort.
We compare our implementation with both a full-batch gradient method (i.e., L-BFGS) and a standard implementation of IG/RR methods, showing that the algorithms require a similar computational effort.
arXiv Detail & Related papers (2022-12-04T15:26:36Z) - Comprehensive Graph Gradual Pruning for Sparse Training in Graph Neural
Networks [52.566735716983956]
We propose a graph gradual pruning framework termed CGP to dynamically prune GNNs.
Unlike LTH-based methods, the proposed CGP approach requires no re-training, which significantly reduces the computation costs.
Our proposed strategy greatly improves both training and inference efficiency while matching or even exceeding the accuracy of existing methods.
arXiv Detail & Related papers (2022-07-18T14:23:31Z) - RawlsGCN: Towards Rawlsian Difference Principle on Graph Convolutional
Network [102.27090022283208]
Graph Convolutional Network (GCN) plays pivotal roles in many real-world applications.
GCN often exhibits performance disparity with respect to node degrees, resulting in worse predictive accuracy for low-degree nodes.
We formulate the problem of mitigating the degree-related performance disparity in GCN from the perspective of the Rawlsian difference principle.
arXiv Detail & Related papers (2022-02-28T05:07:57Z) - Leveraging Non-uniformity in First-order Non-convex Optimization [93.6817946818977]
Non-uniform refinement of objective functions leads to Non-uniform Smoothness (NS) and the Non-uniform Łojasiewicz inequality (NŁ).
New definitions inspire new geometry-aware first-order methods that converge to global optimality faster than the classical $\Omega(1/t^2)$ lower bounds.
arXiv Detail & Related papers (2021-05-13T04:23:07Z) - Direction Matters: On the Implicit Bias of Stochastic Gradient Descent
with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.