Nyström Method for Accurate and Scalable Implicit Differentiation
- URL: http://arxiv.org/abs/2302.09726v1
- Date: Mon, 20 Feb 2023 02:37:26 GMT
- Title: Nyström Method for Accurate and Scalable Implicit Differentiation
- Authors: Ryuichiro Hataya and Makoto Yamada
- Abstract summary: We show that the Nyström method consistently achieves comparable or even superior performance to other approaches.
The proposed method avoids numerical instability and can be computed efficiently with matrix operations, without iterations.
- Score: 25.29277451838466
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The essential difficulty of gradient-based bilevel optimization using
implicit differentiation is to estimate the inverse Hessian vector product with
respect to neural network parameters. This paper proposes to tackle this
problem with the Nyström method and the Woodbury matrix identity, exploiting the
low-rankness of the Hessian. Compared to existing methods using iterative
approximation, such as conjugate gradient and the Neumann series approximation,
the proposed method avoids numerical instability and can be computed
efficiently with matrix operations, without iterations. As a result, the proposed
method works stably in various tasks and is faster than iterative
approximations. Through experiments including large-scale hyperparameter
optimization and meta-learning, we demonstrate that the Nyström method
consistently achieves comparable or even superior performance to other
approaches. The source code is available from
https://github.com/moskomule/hypergrad.
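To make the idea concrete: with a rank-k Nyström approximation H_k = C W^{-1} C^T built from k sampled Hessian columns C and the corresponding core matrix W, the Woodbury identity gives (rho*I + H_k)^{-1} v = (v - C (rho*W + C^T C)^{-1} C^T v) / rho, so the regularized inverse-Hessian-vector product reduces to k Hessian-vector products and a single k x k linear solve. The PyTorch sketch below is illustrative only and is not the API of the hypergrad repository; the helper name nystrom_ihvp, the single flat parameter tensor, the column-sampling scheme, and the damping rho are assumptions for the example.

```python
import torch
from torch.autograd import grad

def nystrom_ihvp(loss, params, v, rank=5, rho=1e-2):
    """Approximate (H + rho*I)^{-1} v, where H is the Hessian of `loss`
    w.r.t. `params` (assumed to be a single tensor with requires_grad=True),
    using a rank-`rank` Nystrom approximation of H and the Woodbury identity.

    Minimal sketch for illustration; not the hypergrad API.
    """
    n = params.numel()
    # Gradient with create_graph=True so Hessian-vector products can be
    # taken by differentiating through it.
    g = grad(loss, params, create_graph=True)[0].reshape(-1)

    # Sample `rank` coordinates and recover the corresponding Hessian
    # columns C = H[:, idx] with one HVP against a one-hot vector each.
    idx = torch.randperm(n)[:rank]
    cols = []
    for i in idx.tolist():
        e = torch.zeros_like(g)
        e[i] = 1.0
        hvp = grad(g, params, grad_outputs=e, retain_graph=True)[0].reshape(-1)
        cols.append(hvp)
    C = torch.stack(cols, dim=1)   # n x k sampled columns
    W = C[idx, :]                  # k x k core matrix H[idx][:, idx]

    # Woodbury identity applied to the Nystrom approximation H_k = C W^{-1} C^T:
    #   (rho*I + H_k)^{-1} v = (v - C (rho*W + C^T C)^{-1} C^T v) / rho
    inner = rho * W + C.T @ C
    return (v - C @ torch.linalg.solve(inner, C.T @ v)) / rho
```

Under these assumptions, the cost is k Hessian-vector products plus one k x k solve, with no conjugate-gradient or Neumann-series iterations, which matches the abstract's claim that the approximation is computed with matrix operations alone.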
Related papers
- A Natural Primal-Dual Hybrid Gradient Method for Adversarial Neural Network Training on Solving Partial Differential Equations [9.588717577573684]
We propose a scalable preconditioned primal-dual hybrid gradient algorithm for solving partial differential equations (PDEs).
We compare the performance of the proposed method with several commonly used deep learning algorithms.
The numerical results suggest that the proposed method is efficient and robust and converges more stably than the compared algorithms.
arXiv Detail & Related papers (2024-11-09T20:39:10Z) - Faster WIND: Accelerating Iterative Best-of-$N$ Distillation for LLM Alignment [81.84950252537618]
This paper reveals a unified game-theoretic connection between iterative BOND and self-play alignment.
We establish a novel framework, WIN rate Dominance (WIND), with a series of efficient algorithms for regularized win rate dominance optimization.
arXiv Detail & Related papers (2024-10-28T04:47:39Z) - Stochastic Optimization for Non-convex Problem with Inexact Hessian
Matrix, Gradient, and Function [99.31457740916815]
Trust-region (TR) methods and adaptive regularization using cubics (ARC) have proven to have very appealing theoretical properties.
We show that TR and ARC methods can simultaneously accommodate inexact computations of the Hessian, gradient, and function values.
arXiv Detail & Related papers (2023-10-18T10:29:58Z) - Neural incomplete factorization: learning preconditioners for the conjugate gradient method [2.899792823251184]
We develop a data-driven approach to accelerate the generation of effective preconditioners.
We replace the typically hand-engineered preconditioners with the output of graph neural networks.
Our method generates an incomplete factorization of the matrix and is, therefore, referred to as neural incomplete factorization (NeuralIF).
arXiv Detail & Related papers (2023-05-25T11:45:46Z) - Differentiable Annealed Importance Sampling and the Perils of Gradient
Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation.
Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective.
We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
arXiv Detail & Related papers (2021-07-21T17:10:14Z) - SHINE: SHaring the INverse Estimate from the forward pass for bi-level
optimization and implicit models [15.541264326378366]
In recent years, implicit deep learning has emerged as a method to increase the depth of deep neural networks.
The training is performed as a bi-level problem, and its computational complexity is partially driven by the iterative inversion of a huge Jacobian matrix.
We propose a novel strategy to tackle this computational bottleneck from which many bi-level problems suffer.
arXiv Detail & Related papers (2021-06-01T15:07:34Z) - Converting ADMM to a Proximal Gradient for Convex Optimization Problems [4.56877715768796]
In sparse estimation, such as the fused lasso and convex clustering, either the proximal gradient method or the alternating direction method of multipliers (ADMM) is applied to solve the problem.
This paper proposes a general method for converting the ADMM solution to the proximal gradient method, assuming that the constraints and objectives are strongly convex.
We show by numerical experiments that we can obtain a significant improvement in terms of efficiency.
arXiv Detail & Related papers (2021-04-22T07:41:12Z) - Zeroth-Order Hybrid Gradient Descent: Towards A Principled Black-Box
Optimization Framework [100.36569795440889]
This work concerns zeroth-order (ZO) optimization, which does not require first-order gradient information.
We show that with a graceful design of coordinate importance sampling, the proposed ZO optimization method is efficient both in terms of complexity and function query cost.
arXiv Detail & Related papers (2020-12-21T17:29:58Z) - A Hybrid-Order Distributed SGD Method for Non-Convex Optimization to
Balance Communication Overhead, Computational Complexity, and Convergence
Rate [28.167294398293297]
We propose a distributed stochastic gradient descent (SGD) method with low communication load and computational complexity that still converges quickly.
To reduce the computational complexity in each iteration, the worker nodes approximate the directional derivatives with zeroth-order gradient estimation.
arXiv Detail & Related papers (2020-03-27T14:02:15Z) - Implicit differentiation of Lasso-type models for hyperparameter
optimization [82.73138686390514]
We introduce an efficient implicit differentiation algorithm, without matrix inversion, tailored for Lasso-type problems.
Our approach scales to high-dimensional data by leveraging the sparsity of the solutions.
arXiv Detail & Related papers (2020-02-20T18:43:42Z) - Variance Reduction with Sparse Gradients [82.41780420431205]
Variance reduction methods such as SVRG and SpiderBoost use a mixture of large and small batch gradients.
We introduce a new sparsity operator: the random-top-k operator.
Our algorithm consistently outperforms SpiderBoost on various tasks including image classification, natural language processing, and sparse matrix factorization.
arXiv Detail & Related papers (2020-01-27T08:23:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.