Joint inference and input optimization in equilibrium networks
- URL: http://arxiv.org/abs/2111.13236v1
- Date: Thu, 25 Nov 2021 19:59:33 GMT
- Title: Joint inference and input optimization in equilibrium networks
- Authors: Swaminathan Gurumurthy, Shaojie Bai, Zachary Manchester, J. Zico Kolter
- Abstract summary: The deep equilibrium (DEQ) model is a class of models that forgoes traditional network depth and instead computes the output of a network by finding the fixed point of a single nonlinear layer.
We show that there is a natural synergy between these two settings.
We demonstrate this strategy on various tasks such as training generative models while optimizing over latent codes, training models for inverse problems like denoising and inpainting, adversarial training, and gradient-based meta-learning.
- Score: 68.63726855991052
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Many tasks in deep learning involve optimizing over the \emph{inputs} to a
network to minimize or maximize some objective; examples include optimization
over latent spaces in a generative model to match a target image, or
adversarially perturbing an input to worsen classifier performance. Performing
such optimization, however, is traditionally quite costly, as it involves a
complete forward and backward pass through the network for each gradient step.
In a separate line of work, a recent thread of research has developed the deep
equilibrium (DEQ) model, a class of models that foregoes traditional network
depth and instead computes the output of a network by finding the fixed point
of a single nonlinear layer. In this paper, we show that there is a natural
synergy between these two settings. Although naively using DEQs for these
optimization problems is expensive (owing to the time needed to compute a fixed
point for each gradient step), we can leverage the fact that gradient-based
optimization can \emph{itself} be cast as a fixed point iteration to
substantially improve the overall speed. That is, we \emph{simultaneously} both
solve for the DEQ fixed point \emph{and} optimize over network inputs, all
within a single ``augmented'' DEQ model that jointly encodes both the original
network and the optimization process. Indeed, the procedure is fast enough that
it allows us to efficiently \emph{train} DEQ models for tasks traditionally
relying on an ``inner'' optimization loop. We demonstrate this strategy on
various tasks such as training generative models while optimizing over latent
codes, training models for inverse problems like denoising and inpainting,
adversarial training, and gradient-based meta-learning.
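As an illustration of this joint scheme, here is a minimal sketch of the augmented fixed-point view: the DEQ state z and the input x are updated together, with the z-block relaxing toward the layer's fixed point and the x-block taking a gradient step, which is itself a fixed-point map whose rest points are stationary inputs. The toy tanh layer, quadratic objective, and step size below are assumptions for illustration, not the authors' implementation.

```python
import torch

torch.manual_seed(0)
d_z, d_x = 8, 4
W_z = torch.randn(d_z, d_z) * 0.3   # small weights keep the toy layer contractive-ish
W_x = torch.randn(d_z, d_x) * 0.3
target = torch.randn(d_z)

def layer(z, x):
    # Single nonlinear layer; the DEQ output is the fixed point z* = layer(z*, x).
    return torch.tanh(z @ W_z.T + x @ W_x.T)

def loss_fn(z):
    # Toy input-optimization objective: match a target representation.
    return 0.5 * ((z - target) ** 2).sum()

z = torch.zeros(d_z)
x = torch.zeros(d_x, requires_grad=True)
alpha = 0.1  # step size for the input-update block (assumed)

for _ in range(500):
    # One update of the augmented state (z, x):
    # z-block: relax toward the DEQ fixed point for the current input x.
    z = layer(z.detach(), x)
    # x-block: one gradient step on the input; gradient descent is itself a
    # fixed-point iteration x <- x - alpha * grad_x L(z(x), x).
    (g,) = torch.autograd.grad(loss_fn(z), x)
    with torch.no_grad():
        x -= alpha * g

print("loss at the joint fixed point:", loss_fn(z).item())
```

In effect, a single solver loop replaces the nested "solve the DEQ to convergence, then take one input step" procedure, which is the source of the speedup the abstract describes.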
Related papers
- Neural Quantile Optimization for Edge-Cloud Networking [13.509945075582447]
We seek the best traffic allocation scheme for the edge-cloud computing network that satisfies constraints and minimizes the cost based on burstable billing.
We introduce the Gumbel-softmax sampling network to solve the optimization problems via unsupervised learning (a minimal sampling sketch follows this entry).
The trained network works as an efficient traffic allocation scheme sampler, remarkably outperforming the random strategy in feasibility and cost function value.
arXiv Detail & Related papers (2023-07-11T11:05:10Z)
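For context, the Gumbel-softmax building block that the entry above relies on can be shown in a few lines. This is a generic, hypothetical illustration, not the paper's network; the four-option logits and temperature are made up.

```python
import torch
import torch.nn.functional as F

# Assumed toy setup: unnormalized scores over four discrete allocation choices.
logits = torch.tensor([1.0, 0.5, -0.2, 0.1], requires_grad=True)
tau = 0.5  # temperature: lower -> closer to one-hot, higher -> smoother

# Differentiable approximate sample from the categorical distribution.
soft_sample = F.gumbel_softmax(logits, tau=tau, hard=False)
# Straight-through variant: one-hot on the forward pass, soft gradients backward.
hard_sample = F.gumbel_softmax(logits, tau=tau, hard=True)

print(soft_sample, hard_sample)
```

Because the samples stay differentiable, a network producing the logits can be trained end to end against a cost function without supervised labels.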
- Learning to Optimize Quasi-Newton Methods [22.504971951262004]
This paper introduces a novel machine learning optimizer, LODO, which meta-learns the best preconditioner online during optimization.
Unlike other L2O methods, LODO does not require any meta-training on a training task distribution.
We show that our optimizer approximates the inverse Hessian in noisy loss landscapes and is capable of representing a wide range of inverse Hessians.
arXiv Detail & Related papers (2022-10-11T03:47:14Z)
- Slimmable Networks for Contrastive Self-supervised Learning [69.9454691873866]
Self-supervised learning has made significant progress in pre-training large models but struggles with small models.
We introduce another one-stage solution to obtain pre-trained small models without the need for extra teachers.
A slimmable network consists of a full network and several weight-sharing sub-networks, which can be pre-trained once to obtain networks of various widths (a width-slicing sketch follows this entry).
arXiv Detail & Related papers (2022-09-30T15:15:05Z)
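A minimal sketch of the weight-sharing idea in the entry above: sub-networks at reduced width reuse a slice of the full network's weights. The SlimmableLinear class, shapes, and width ratios are illustrative assumptions, not the paper's architecture (which also uses tricks like switchable normalization).

```python
import torch
import torch.nn as nn

class SlimmableLinear(nn.Module):
    """Toy linear layer whose sub-networks reuse the leading weight slice."""

    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x, width_ratio=1.0):
        # A 0.5x sub-network uses half the output rows of the shared weights.
        out_f = max(1, int(self.weight.shape[0] * width_ratio))
        in_f = x.shape[-1]
        return x @ self.weight[:out_f, :in_f].T + self.bias[:out_f]

layer = SlimmableLinear(16, 8)
x = torch.randn(2, 16)
full = layer(x, width_ratio=1.0)   # full network output: shape (2, 8)
half = layer(x, width_ratio=0.5)   # weight-sharing sub-network: shape (2, 4)
print(full.shape, half.shape)
```

One pre-training run over the shared parameters then yields usable models at every configured width.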
- Combinatorial optimization for low bit-width neural networks [23.466606660363016]
Low bit-width neural networks have been extensively explored for deployment on edge devices to reduce computational resources.
Existing approaches have focused on gradient-based optimization in a two-stage train-and-compress setting.
We show that a combination of greedy coordinate descent and this combinatorial approach can attain competitive accuracy on binary classification tasks (a toy coordinate-descent sketch follows this entry).
arXiv Detail & Related papers (2022-06-04T15:02:36Z)
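To make the entry above concrete, here is a toy greedy coordinate descent over binary weights: flip whichever single coordinate most reduces the loss, and stop when no flip helps. The least-squares objective and problem sizes are assumptions; the paper's actual method and tasks differ in scale.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
w_true = rng.choice([-1.0, 1.0], size=10)   # hidden binary ground truth
y = X @ w_true

def loss(w):
    return np.mean((X @ w - y) ** 2)

w = rng.choice([-1.0, 1.0], size=10)        # random binary start
improved = True
while improved:
    improved = False
    base = loss(w)
    # Greedy step: score every single-coordinate flip, keep the best one.
    gains = []
    for i in range(len(w)):
        w[i] *= -1
        gains.append(base - loss(w))
        w[i] *= -1
    best = int(np.argmax(gains))
    if gains[best] > 0:
        w[best] *= -1
        improved = True

print("recovered:", np.array_equal(w, w_true), "final loss:", loss(w))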
- Half-Inverse Gradients for Physical Deep Learning [25.013244956897832]
Integrating differentiable physics simulators into the training process can greatly improve the quality of results.
The integrated physics solvers have a profound effect on the gradient flow, as manipulating scales in magnitude and direction is an inherent property of many physical processes.
In this work, we analyze the characteristics of both physical and neural network optimizations to derive a new method that does not suffer from this phenomenon.
arXiv Detail & Related papers (2022-03-18T19:11:04Z)
- SHINE: SHaring the INverse Estimate from the forward pass for bi-level optimization and implicit models [15.541264326378366]
In recent years, implicit deep learning has emerged as a method to increase the depth of deep neural networks.
The training is performed as a bi-level problem, and its computational complexity is partially driven by the iterative inversion of a huge Jacobian matrix.
We propose a novel strategy to tackle this computational bottleneck from which many bi-level problems suffer.
arXiv Detail & Related papers (2021-06-01T15:07:34Z)
- GradInit: Learning to Initialize Neural Networks for Stable and Efficient Training [59.160154997555956]
We present GradInit, an automated and architecture-agnostic method for initializing neural networks.
It is based on a simple heuristic: the norm of each network layer is adjusted so that a single step of SGD or Adam results in the smallest possible loss value (a toy version of this heuristic follows this entry).
It also enables training the original Post-LN Transformer for machine translation without learning rate warmup.
arXiv Detail & Related papers (2021-02-16T11:45:35Z)
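A toy rendering of the heuristic in the entry above: rescale each layer by a candidate factor and keep the factor whose single SGD step yields the lowest loss. The tiny MLP, data, learning rate, and shared scale grid are all assumptions; GradInit itself optimizes a separate scale per layer rather than searching one shared grid.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
X, y = torch.randn(32, 10), torch.randn(32, 1)
loss_fn = nn.MSELoss()
base = nn.Sequential(nn.Linear(10, 16), nn.Tanh(), nn.Linear(16, 1))

def loss_after_one_sgd_step(net, lr=0.1):
    # Take one plain SGD step and report the resulting loss.
    out = loss_fn(net(X), y)
    grads = torch.autograd.grad(out, list(net.parameters()))
    with torch.no_grad():
        for p, g in zip(net.parameters(), grads):
            p -= lr * g
        return loss_fn(net(X), y).item()

best_scale, best_loss = None, float("inf")
for scale in [0.25, 0.5, 1.0, 2.0, 4.0]:
    net = copy.deepcopy(base)          # same init for a fair comparison
    with torch.no_grad():
        for p in net.parameters():
            p *= scale                 # candidate rescaling of every layer
    one_step_loss = loss_after_one_sgd_step(net)
    if one_step_loss < best_loss:
        best_scale, best_loss = scale, one_step_loss

print(f"best init scale: {best_scale}, loss after one step: {best_loss:.4f}")
```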
- A Flexible Framework for Designing Trainable Priors with Adaptive Smoothing and Game Encoding [57.1077544780653]
We introduce a general framework for designing and training neural network layers whose forward passes can be interpreted as solving non-smooth convex optimization problems.
We focus on convex games, solved by local agents represented by the nodes of a graph and interacting through regularization functions.
This approach is appealing for solving imaging problems, as it allows the use of classical image priors within deep models that are trainable end to end.
arXiv Detail & Related papers (2020-06-26T08:34:54Z)
- Communication-Efficient Distributed Stochastic AUC Maximization with Deep Neural Networks [50.42141893913188]
We study distributed algorithms for large-scale AUC maximization with a deep neural network as the predictive model.
Our method requires far fewer communication rounds than naive parallel approaches while retaining theoretical guarantees.
Our experiments on several benchmark datasets show the effectiveness of our method and confirm our theory.
arXiv Detail & Related papers (2020-05-05T18:08:23Z)
- Dynamic Hierarchical Mimicking Towards Consistent Optimization Objectives [73.15276998621582]
We propose a generic feature learning mechanism to advance CNN training with enhanced generalization ability.
Partially inspired by deeply-supervised nets (DSN), we fork delicately designed side branches from the intermediate layers of a given neural network (a minimal side-branch sketch follows this entry).
Experiments on both category and instance recognition tasks demonstrate the substantial improvements of our proposed method.
arXiv Detail & Related papers (2020-03-24T09:56:13Z)
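As a rough illustration of the side-branch idea in the entry above, the sketch below forks an auxiliary head from an intermediate layer and adds a down-weighted auxiliary loss, so intermediate features stay aligned with the final objective. The architecture and the 0.3 weighting are assumptions, not the paper's design, which shapes and weights its branches dynamically.

```python
import torch
import torch.nn as nn

class BranchedMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.block1 = nn.Sequential(nn.Linear(10, 32), nn.ReLU())
        self.block2 = nn.Sequential(nn.Linear(32, 32), nn.ReLU())
        self.head = nn.Linear(32, 3)        # main classifier
        self.side_head = nn.Linear(32, 3)   # branch forked after block1

    def forward(self, x):
        h1 = self.block1(x)
        h2 = self.block2(h1)
        return self.head(h2), self.side_head(h1)

net = BranchedMLP()
x, y = torch.randn(8, 10), torch.randint(0, 3, (8,))
main_logits, side_logits = net(x)
ce = nn.CrossEntropyLoss()
# Joint objective: main loss plus a down-weighted auxiliary loss (0.3 assumed).
loss = ce(main_logits, y) + 0.3 * ce(side_logits, y)
loss.backward()
print(loss.item())
```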
This list is automatically generated from the titles and abstracts of the papers on this site.