SHINE: SHaring the INverse Estimate from the forward pass for bi-level
optimization and implicit models
- URL: http://arxiv.org/abs/2106.00553v1
- Date: Tue, 1 Jun 2021 15:07:34 GMT
- Title: SHINE: SHaring the INverse Estimate from the forward pass for bi-level
optimization and implicit models
- Authors: Zaccharie Ramzi, Florian Mannel, Shaojie Bai, Jean-Luc Starck,
Philippe Ciuciu, Thomas Moreau
- Abstract summary: In recent years, implicit deep learning has emerged as a method to increase the depth of deep neural networks.
The training is performed as a bi-level problem, and its computational complexity is partially driven by the iterative inversion of a huge Jacobian matrix.
We propose a novel strategy to tackle this computational bottleneck from which many bi-level problems suffer.
- Score: 15.541264326378366
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, implicit deep learning has emerged as a method to increase
the depth of deep neural networks. While their training is memory-efficient,
they are still significantly slower to train than their explicit counterparts.
In Deep Equilibrium Models (DEQs), the training is performed as a bi-level
problem, and its computational complexity is partially driven by the iterative
inversion of a huge Jacobian matrix. In this paper, we propose a novel strategy
to tackle this computational bottleneck from which many bi-level problems
suffer. The main idea is to use the quasi-Newton matrices from the forward pass
to efficiently approximate the inverse Jacobian matrix in the direction needed
for the gradient computation. We provide a theorem that motivates using our
method with the original forward algorithms. In addition, by modifying these
forward algorithms, we further provide theoretical guarantees that our method
asymptotically estimates the true implicit gradient. We empirically study this
approach in many settings, ranging from hyperparameter optimization to large
Multiscale DEQs applied to CIFAR and ImageNet. We show that it reduces the
computational cost of the backward pass by up to two orders of magnitude. All
this is achieved while retaining the excellent performance of the original
models in hyperparameter optimization and on CIFAR, and giving encouraging and
competitive results on ImageNet.
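A minimal NumPy sketch of the main idea as stated in the abstract (not the authors' implementation): run Broyden's method on g(z) = f(z, x) - z in the forward pass, keep its inverse-Jacobian estimate H, and reuse H in the backward pass instead of launching a fresh iterative solver. Function names and the dense form of H are illustrative assumptions; the actual method keeps the quasi-Newton matrices in low-rank form.

```python
import numpy as np

def broyden_fixed_point(f, z0, x, n_steps=50, tol=1e-6):
    """Find z* = f(z*, x) with 'good' Broyden updates; return z* and H ~ (df/dz - I)^{-1}."""
    z = z0.copy()
    H = -np.eye(z.size)              # initial inverse-Jacobian estimate for g(z) = f(z, x) - z
    g = f(z, x) - z
    for _ in range(n_steps):
        dz = -H @ g                  # quasi-Newton step
        z_new = z + dz
        g_new = f(z_new, x) - z_new
        dg = g_new - g
        z, g = z_new, g_new
        if np.linalg.norm(g) < tol:
            break
        Hdg = H @ dg
        # rank-one (Sherman-Morrison style) update of the inverse estimate
        H += np.outer(dz - Hdg, dz @ H) / (dz @ Hdg)
    return z, H

def shine_vjp(H, grad_z):
    """Backward pass: approximate a = (I - df/dz)^{-T} grad_z by reusing H from the forward pass."""
    # H ~ (df/dz - I)^{-1}, so (I - df/dz)^{-T} grad_z ~ -H.T @ grad_z
    return -H.T @ grad_z
```

The vector returned by shine_vjp is then contracted with the partial derivative of f with respect to the parameters (a standard vector-Jacobian product) to form the implicit gradient; reusing and keeping H in low-rank form is what makes this backward pass cheap compared with re-solving the linear system.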
Related papers
- MARS: Unleashing the Power of Variance Reduction for Training Large Models [56.47014540413659]
Adaptive gradient algorithms like Adam and its variants have been central to the training of large models.
We propose a framework that reconciles preconditioned gradient optimization methods with variance reduction via a scaled momentum technique.
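A generic sketch of the two ingredients named above: a variance-reduced ("scaled momentum") gradient estimate fed into an Adam-style preconditioned update. The scaling and names are illustrative assumptions, not the paper's exact update rule.

```python
import numpy as np

def vr_preconditioned_step(w, grad, prev_grad, state, lr=1e-3,
                           beta1=0.9, beta2=0.999, gamma=0.025, eps=1e-8):
    # state holds running moments, e.g. {"m": np.zeros_like(w), "v": np.zeros_like(w)}
    # Variance-reduced estimate: current gradient plus a scaled gradient difference.
    c = grad + gamma * (beta1 / (1.0 - beta1)) * (grad - prev_grad)
    state["m"] = beta1 * state["m"] + (1.0 - beta1) * c        # momentum of the estimate
    state["v"] = beta2 * state["v"] + (1.0 - beta2) * c * c    # diagonal preconditioner statistics
    return w - lr * state["m"] / (np.sqrt(state["v"]) + eps)   # preconditioned step
```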
arXiv Detail & Related papers (2024-11-15T18:57:39Z)
- Unified Gradient-Based Machine Unlearning with Remain Geometry Enhancement [29.675650285351768]
Machine unlearning (MU) has emerged to enhance the privacy and trustworthiness of deep neural networks.
Approximate MU is a practical method for large-scale models.
We propose a fast-slow parameter update strategy to implicitly approximate the up-to-date salient unlearning direction.
arXiv Detail & Related papers (2024-09-29T15:17:33Z)
- ELRA: Exponential learning rate adaption gradient descent optimization method [83.88591755871734]
We present a novel, fast (exponential-rate), ab initio (hyperparameter-free) gradient-based adaptation method.
The main idea of the method is to adapt the learning rate $\alpha$ by situational awareness.
It can be applied to problems of any dimension $n$ and scales only linearly.
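The summary gives only the high-level idea, so the following is a generic, hedged illustration of "situational" step-size adaptation rather than ELRA's actual rule: grow $\alpha$ when consecutive gradients agree and shrink it when they oppose. The per-step cost is a few inner products, i.e. linear in $n$.

```python
import numpy as np

def adaptive_lr_step(w, grad, prev_grad, alpha, grow=1.3, shrink=0.5):
    # Cosine of the angle between consecutive gradients drives the adaptation.
    cos = grad @ prev_grad / (np.linalg.norm(grad) * np.linalg.norm(prev_grad) + 1e-12)
    alpha = alpha * (grow if cos > 0.0 else shrink)   # multiplicative adaptation of alpha
    return w - alpha * grad, alpha
```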
arXiv Detail & Related papers (2023-09-12T14:36:13Z)
- Efficient Training of Deep Equilibrium Models [6.744714965617125]
Deep equilibrium models (DEQs) have proven to be very powerful for learning data representations.
The idea is to replace traditional (explicit) feedforward neural networks with an implicit fixed-point equation.
Backpropagation through DEQ layers still requires solving an expensive Jacobian-based equation.
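The "expensive Jacobian-based equation" mentioned here is the linear system given by the implicit function theorem. For a fixed point $z^\star = f_\theta(z^\star, x)$ and loss $\mathcal{L}$,

$$
\frac{\partial \mathcal{L}}{\partial \theta}
= \frac{\partial \mathcal{L}}{\partial z^\star}
\left(I - \left.\frac{\partial f_\theta}{\partial z}\right|_{z^\star}\right)^{-1}
\left.\frac{\partial f_\theta}{\partial \theta}\right|_{z^\star},
$$

so the backward pass must solve $a^\top\left(I - \partial f_\theta/\partial z\right) = \partial \mathcal{L}/\partial z^\star$ for $a$, usually with an iterative solver; SHINE instead approximates this solve with the quasi-Newton matrices already built during the forward pass.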
arXiv Detail & Related papers (2023-04-23T14:20:09Z)
- Nystrom Method for Accurate and Scalable Implicit Differentiation [25.29277451838466]
We show that the Nystrom method consistently achieves comparable or even superior performance to other approaches.
The proposed method avoids numerical instability and can be computed efficiently with matrix operations, without iterative solvers.
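A minimal NumPy sketch of the ingredients described above (illustrative, not the paper's code): form a Nystrom low-rank approximation of the Hessian from a few sampled Hessian columns, then apply its regularized inverse to a vector in closed form via the Woodbury identity, using plain matrix operations only.

```python
import numpy as np

def nystrom_inverse_hvp(hvp, dim, v, k=20, rho=1e-3, rng=np.random.default_rng(0)):
    """Approximate (H + rho*I)^{-1} v from k Hessian-vector products hvp(.)."""
    idx = rng.choice(dim, size=k, replace=False)
    cols = []
    for j in idx:                       # sampled Hessian columns C = H[:, idx]
        e = np.zeros(dim); e[j] = 1.0
        cols.append(hvp(e))
    C = np.stack(cols, axis=1)
    W = C[idx, :]                       # k x k core block H[idx][:, idx]
    # Nystrom approximation H ~ C W^{-1} C^T; Woodbury gives
    # (rho*I + C W^{-1} C^T)^{-1} v = (v - C (rho*W + C^T C)^{-1} C^T v) / rho
    inner = np.linalg.solve(rho * W + C.T @ C, C.T @ v)
    return (v - C @ inner) / rho
```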
arXiv Detail & Related papers (2023-02-20T02:37:26Z)
- Learning to Optimize Quasi-Newton Methods [22.504971951262004]
This paper introduces a novel machine learning optimizer called LODO, which meta-learns the best preconditioner online during optimization.
Unlike other L2O methods, LODO does not require any meta-training on a training task distribution.
We show that the learned preconditioner approximates the inverse Hessian in noisy loss landscapes and can represent a wide range of inverse Hessians.
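A heavily simplified sketch of online meta-learning a preconditioner. LODO's actual preconditioner is a richer learned parameterization; the diagonal form exp(p) and the one-step meta-update below are illustrative assumptions.

```python
import numpy as np

def meta_preconditioned_step(w, p, grad, grad_fn, lr_meta=1e-2):
    w_new = w - np.exp(p) * grad             # preconditioned parameter update
    g_new = grad_fn(w_new)                   # gradient at the new point
    # dL(w_new)/dp_i = g_new_i * dw_new_i/dp_i = -g_new_i * exp(p_i) * grad_i,
    # so one gradient-descent step on the preconditioner parameters p is:
    p_new = p + lr_meta * g_new * np.exp(p) * grad
    return w_new, p_new
```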
arXiv Detail & Related papers (2022-10-11T03:47:14Z)
- Gradient Descent, Stochastic Optimization, and Other Tales [8.034728173797953]
This tutorial doesn't shy away from addressing both the formal and informal aspects of gradient descent and optimization methods.
Gradient descent is one of the most popular algorithms to perform optimization and by far the most common way to optimize machine learning tasks.
In deep neural networks, the gradient evaluated on a single sample or a batch of samples is employed to save computational resources and to escape saddle points.
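A minimal sketch of the mini-batch stochastic gradient step described above, with grad_fn as a placeholder for the model's gradient on a batch.

```python
import numpy as np

def sgd_step(w, X, y, grad_fn, lr=0.1, batch_size=32, rng=np.random.default_rng(0)):
    idx = rng.choice(len(X), size=batch_size, replace=False)  # sample a random mini-batch
    return w - lr * grad_fn(w, X[idx], y[idx])                # noisy gradient step
```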
arXiv Detail & Related papers (2022-05-02T12:06:53Z)
- Joint inference and input optimization in equilibrium networks [68.63726855991052]
Deep equilibrium models are a class of models that forgo traditional network depth and instead compute the output of a network by finding the fixed point of a single nonlinear layer.
We show that there is a natural synergy between computing this equilibrium and optimizing over the network's inputs.
We demonstrate this strategy on various tasks such as training generative models while optimizing over latent codes, training models for inverse problems like denoising and inpainting, adversarial training and gradient based meta-learning.
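An illustrative sketch (an assumed form, not the paper's exact algorithm) of the synergy described above: interleave one fixed-point update of the equilibrium state z with one gradient update of the input or latent code x in a single loop, rather than fully re-solving the equilibrium after every input update.

```python
def joint_inference_and_input_opt(f, loss_grad_x, z0, x0, n_steps=200, lr=1e-2):
    # z0, x0 are NumPy arrays; f(z, x) is one application of the equilibrium layer.
    z, x = z0.copy(), x0.copy()
    for _ in range(n_steps):
        z = f(z, x)                        # one equilibrium (inference) update
        x = x - lr * loss_grad_x(z, x)     # one optimization step on the input / latent code
    return z, x
```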
arXiv Detail & Related papers (2021-11-25T19:59:33Z)
- ES-Based Jacobian Enables Faster Bilevel Optimization [53.675623215542515]
Bilevel optimization (BO) has arisen as a powerful tool for solving many modern machine learning problems.
Existing gradient-based methods require second-order derivative approximations via Jacobian- and/or Hessian-vector computations.
We propose a novel BO algorithm, which adopts an Evolution Strategies (ES) based method to approximate the response Jacobian matrix in the hypergradient of BO.
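A generic antithetic-sampling sketch of using Evolution Strategies to approximate the response-Jacobian term of a bilevel hypergradient without second-order derivatives. The routine inner_solver(lam), returning the lower-level solution w*(lam), and the estimator's exact form are assumptions, not the paper's algorithm.

```python
import numpy as np

def es_response_jacobian_vp(inner_solver, lam, g, n_samples=32, sigma=1e-2,
                            rng=np.random.default_rng(0)):
    """Estimate (dw*/dlam)^T g with antithetic Gaussian perturbations of lam."""
    est = np.zeros_like(lam)
    for _ in range(n_samples):
        eps = rng.standard_normal(lam.shape)
        w_plus = inner_solver(lam + sigma * eps)              # perturbed lower-level solutions
        w_minus = inner_solver(lam - sigma * eps)
        directional = (w_plus - w_minus) @ g / (2.0 * sigma)  # ~ eps^T (dw*/dlam)^T g
        est += directional * eps                              # E[eps eps^T] = I recovers the product
    return est / n_samples
```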
arXiv Detail & Related papers (2021-10-13T19:36:50Z)
- Zeroth-Order Hybrid Gradient Descent: Towards A Principled Black-Box Optimization Framework [100.36569795440889]
This work studies zeroth-order (ZO) optimization, which does not require first-order gradient information.
We show that with a graceful design in coordinate importance sampling, the proposed ZO optimization method is efficient both in terms of complexity and function query cost.
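A minimal sketch of a zeroth-order gradient estimate built from function values only, with importance sampling over coordinates. This is an assumed, simplified form of the idea summarized above, not the paper's exact scheme.

```python
import numpy as np

def zo_gradient(f, x, probs, n_samples=8, mu=1e-4, rng=np.random.default_rng(0)):
    """Estimate the gradient of f at x using only function queries; probs weights the coordinates."""
    grad = np.zeros_like(x)
    for _ in range(n_samples):
        j = rng.choice(len(x), p=probs)              # importance-sample a coordinate
        e = np.zeros_like(x); e[j] = 1.0
        d_j = (f(x + mu * e) - f(x - mu * e)) / (2.0 * mu)   # two function queries per coordinate
        grad[j] += d_j / (probs[j] * n_samples)      # reweight to keep the estimate unbiased
    return grad
```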
arXiv Detail & Related papers (2020-12-21T17:29:58Z)
- Communication-Efficient Distributed Stochastic AUC Maximization with Deep Neural Networks [50.42141893913188]
We study distributed algorithms for large-scale AUC maximization with a deep neural network as the predictive model.
Our method requires far fewer communication rounds while maintaining its theoretical guarantees.
Our experiments on several datasets show the effectiveness of our method and also confirm our theory.
arXiv Detail & Related papers (2020-05-05T18:08:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.