Revisiting Scalable Hessian Diagonal Approximations for Applications in Reinforcement Learning
- URL: http://arxiv.org/abs/2406.03276v2
- Date: Wed, 3 Jul 2024 21:22:00 GMT
- Title: Revisiting Scalable Hessian Diagonal Approximations for Applications in Reinforcement Learning
- Authors: Mohamed Elsayed, Homayoon Farrahi, Felix Dangel, A. Rupam Mahmood
- Abstract summary: Second-order information is valuable for many applications but challenging to compute.
We introduce HesScale, an improvement over BL89, which adds negligible extra computation.
On small networks, we find that this improvement is of higher quality than all alternatives, even those with theoretical guarantees, such as unbiasedness, while being much cheaper to compute.
- Score: 6.383513606898132
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Second-order information is valuable for many applications but challenging to compute. Several works focus on computing or approximating Hessian diagonals, but even this simplification introduces significant additional costs compared to computing a gradient. In the absence of efficient exact computation schemes for Hessian diagonals, we revisit an early approximation scheme proposed by Becker and LeCun (1989, BL89), which has a cost similar to gradients and appears to have been overlooked by the community. We introduce HesScale, an improvement over BL89, which adds negligible extra computation. On small networks, we find that this improvement is of higher quality than all alternatives, even those with theoretical guarantees, such as unbiasedness, while being much cheaper to compute. We use this insight in reinforcement learning problems where small networks are used and demonstrate HesScale in second-order optimization and scaling the step-size parameter. In our experiments, HesScale optimizes faster than existing methods and improves stability through step-size scaling. These findings are promising for scaling second-order methods in larger models in the future.
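To make the cost claim concrete, here is a minimal NumPy sketch of diagonal-curvature backpropagation in the spirit of BL89/HesScale: a second-derivative signal is propagated backwards alongside the gradient while off-diagonal Hessian terms are dropped at every layer, so the work stays comparable to a single backward pass. The two-layer ReLU network, squared-error loss, and the preconditioned update at the end are illustrative assumptions for this sketch, not the authors' implementation.

```python
import numpy as np

# Sketch of diagonal Hessian backpropagation (BL89/HesScale spirit):
# propagate d2L/da^2 backwards next to dL/da, dropping off-diagonal terms.
rng = np.random.default_rng(0)

n_in, n_hid, n_out = 4, 8, 2
W1 = rng.normal(scale=0.5, size=(n_hid, n_in))
W2 = rng.normal(scale=0.5, size=(n_out, n_hid))
x = rng.normal(size=n_in)
y = rng.normal(size=n_out)

# Forward pass: ReLU hidden layer, linear output, squared-error loss.
a1 = W1 @ x
h1 = np.maximum(a1, 0.0)
a2 = W2 @ h1
loss = 0.5 * np.sum((a2 - y) ** 2)

# Output layer: dL/da2 = a2 - y and d2L/da2^2 = 1 for squared error.
g_a2 = a2 - y
h_a2 = np.ones(n_out)
grad_W2 = np.outer(g_a2, h1)          # exact gradient
hdiag_W2 = np.outer(h_a2, h1 ** 2)    # diagonal curvature estimate

# Backpropagate to the hidden layer, keeping only diagonal terms:
# d2L/dh1_j^2 ~= sum_i W2[i, j]^2 * d2L/da2_i^2  (cross terms dropped).
g_h1 = W2.T @ g_a2
h_h1 = (W2 ** 2).T @ h_a2

relu_d = (a1 > 0).astype(float)       # ReLU'; ReLU'' is zero
g_a1 = relu_d * g_h1
h_a1 = relu_d ** 2 * h_h1             # the ReLU'' * g_h1 term vanishes here

grad_W1 = np.outer(g_a1, x)
hdiag_W1 = np.outer(h_a1, x ** 2)

# Illustrative diagonally preconditioned update (step-size scaling).
lr, eps = 0.1, 1e-4
W1 -= lr * grad_W1 / (np.abs(hdiag_W1) + eps)
W2 -= lr * grad_W2 / (np.abs(hdiag_W2) + eps)
print("loss:", loss)
```

Because off-diagonal terms are discarded layer by layer, the overhead on top of standard backpropagation is a few elementwise squares and one extra matrix-vector product per layer, which is why the per-step cost remains close to that of a gradient.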
Related papers
- Towards Practical Second-Order Optimizers in Deep Learning: Insights from Fisher Information Analysis [0.0]
We present AdaFisher, a novel adaptive second-order optimizer for deep neural networks (DNNs).
AdaFisher aims to bridge the gap between the improved convergence and generalization of second-order methods and the computational efficiency needed for training.
We demonstrate that AdaFisher outperforms state-of-the-art approximations in both accuracy and convergence speed.
arXiv Detail & Related papers (2025-04-26T05:02:21Z) - SASSHA: Sharpness-aware Adaptive Second-order Optimization with Stable Hessian Approximation [5.523554757946985]
Sassha is a novel second-order method designed to enhance generalization by explicitly reducing sharpness of the solution.
We provide a comprehensive set of analyses including convergence, robustness, stability, efficiency, and cost.
arXiv Detail & Related papers (2025-02-25T12:35:05Z) - Approximating Metric Magnitude of Point Sets [4.522729058300309]
Metric magnitude is a measure of the "size" of point clouds with many desirable geometric properties.
It has been adapted to various mathematical contexts and recent work suggests that it can enhance machine learning and optimization algorithms.
In this paper, we study the magnitude problem and show efficient ways of approximating it. We show that it can be cast as a convex optimization problem, but not as a submodular one.
The paper describes two new algorithms - an iterative approximation algorithm that converges fast and is accurate, and a subset selection method that makes the computation even faster.
arXiv Detail & Related papers (2024-09-06T17:15:28Z) - Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators called WTA-CRS for approximating matrix products with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of fine-tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z) - HesScale: Scalable Computation of Hessian Diagonals [2.398608007786179]
HesScale is a scalable approach to approximating the diagonal of the Hessian matrix.
We show that HesScale has the same computational complexity as backpropagation.
arXiv Detail & Related papers (2022-10-20T23:50:56Z) - Simple Stochastic and Online Gradient Descent Algorithms for Pairwise
Learning [65.54757265434465]
Pairwise learning refers to learning tasks where the loss function depends on a pair of instances.
Online gradient descent (OGD) is a popular approach for handling streaming data in pairwise learning.
In this paper, we propose simple stochastic and online gradient descent methods for pairwise learning.
arXiv Detail & Related papers (2021-11-23T18:10:48Z) - SHINE: SHaring the INverse Estimate from the forward pass for bi-level
optimization and implicit models [15.541264326378366]
In recent years, implicit deep learning has emerged as a method to increase the depth of deep neural networks.
The training is performed as a bi-level problem, and its computational complexity is partially driven by the iterative inversion of a huge Jacobian matrix.
We propose a novel strategy to tackle this computational bottleneck from which many bi-level problems suffer.
arXiv Detail & Related papers (2021-06-01T15:07:34Z) - Correcting Momentum with Second-order Information [50.992629498861724]
We develop a new algorithm for non-convex stochastic optimization that finds an $\epsilon$-critical point in the optimal $O(\epsilon^{-3})$ stochastic gradient and Hessian-vector product computations.
We validate our results on a variety of large-scale deep learning benchmarks and architectures.
arXiv Detail & Related papers (2021-03-04T19:01:20Z) - Large-Scale Methods for Distributionally Robust Optimization [53.98643772533416]
We prove that our algorithms require a number of gradient evaluations independent of training set size and number of parameters.
Experiments on MNIST and ImageNet confirm the theoretical scaling of our algorithms, which are 9--36 times more efficient than full-batch methods.
arXiv Detail & Related papers (2020-10-12T17:41:44Z) - Second-order Neural Network Training Using Complex-step Directional
Derivative [41.4333906662624]
We introduce a numerical algorithm for second-order neural network training.
We tackle the practical obstacle of Hessian calculation by using the complex-step finite difference.
We believe our method will inspire a wide-range of new algorithms for deep learning and numerical optimization.
arXiv Detail & Related papers (2020-09-15T13:46:57Z) - Balancing Rates and Variance via Adaptive Batch-Size for Stochastic
Optimization Problems [120.21685755278509]
In this work, we seek to balance the fact that an attenuating step-size is required for exact convergence with the fact that a constant step-size learns faster, albeit only up to an error.
Rather than fixing the minibatch size and the step-size at the outset, we propose to allow these parameters to evolve adaptively.
arXiv Detail & Related papers (2020-07-02T16:02:02Z) - Effective Dimension Adaptive Sketching Methods for Faster Regularized
Least-Squares Optimization [56.05635751529922]
We propose a new randomized algorithm for solving L2-regularized least-squares problems based on sketching.
We consider two of the most popular random embeddings, namely, Gaussian embeddings and the Subsampled Randomized Hadamard Transform (SRHT).
arXiv Detail & Related papers (2020-06-10T15:00:09Z)