Nonsmooth automatic differentiation: a cheap gradient principle and
other complexity results
- URL: http://arxiv.org/abs/2206.01730v1
- Date: Wed, 1 Jun 2022 08:43:35 GMT
- Title: Nonsmooth automatic differentiation: a cheap gradient principle and
other complexity results
- Authors: Jérôme Bolte (TSE), Ryan Boustany (TSE), Edouard Pauwels (IRIT),
Béatrice Pesquet-Popescu
- Abstract summary: We provide a model to estimate the computational costs of the backward and forward modes of algorithmic differentiation for a wide class of nonsmooth programs.
Prominent examples are the famous relu and convolutional neural networks together with their standard loss functions.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We provide a simple model to estimate the computational costs of the backward
and forward modes of algorithmic differentiation for a wide class of nonsmooth
programs. Prominent examples are the famous relu and convolutional neural
networks together with their standard loss functions. Using the recent notion
of conservative gradients, we then establish a "nonsmooth cheap gradient
principle" for backpropagation encompassing most concrete applications.
Nonsmooth backpropagation's cheapness contrasts with concurrent forward
approaches, which to this day have dimension-dependent worst-case estimates.
In order to understand this class of methods, we relate the complexity of
computing a large number of directional derivatives to that of matrix
multiplication. This shows a fundamental limitation for improving forward AD
for that task. Finally, while the fastest algorithms for computing a Clarke
subgradient are linear in the dimension, it appears that computing two distinct
Clarke (resp. lexicographic) subgradients for simple neural networks is
NP-Hard.
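To make the cost contrast concrete, here is a minimal numpy sketch (our illustration, not code from the paper): on a one-hidden-layer relu network, a single reverse sweep recovers the full gradient at a small constant multiple of the cost of one function evaluation, whereas forward mode pays one pass per input coordinate; the `(a > 0)` selection below is one admissible choice of conservative gradient for relu at 0.

```python
# Reverse vs. forward mode on a relu MLP (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)
d, h = 50, 100                            # input / hidden dimensions
W1, W2 = rng.normal(size=(h, d)), rng.normal(size=(1, h))

def forward(x):
    """relu network; also returns the pre-activations needed for AD."""
    a = W1 @ x
    z = np.maximum(a, 0.0)                # relu: nonsmooth at a == 0
    return (W2 @ z).item(), a

def reverse_grad(x):
    """One backward sweep: cost is a small constant times forward()."""
    _, a = forward(x)
    da = W2.ravel() * (a > 0)             # (a > 0): one conservative choice at 0
    return W1.T @ da                      # full gradient, dimension-free cost

def forward_mode_grad(x):
    """One JVP per basis vector: d forward sweeps, dimension-dependent cost."""
    _, a = forward(x)
    g = np.empty(d)
    for i in range(d):
        v = np.zeros(d); v[i] = 1.0       # directional derivative along e_i
        g[i] = (W2 @ ((W1 @ v) * (a > 0))).item()
    return g

x = rng.normal(size=d)
print(np.allclose(reverse_grad(x), forward_mode_grad(x)))  # True
```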
Related papers
- Limited Memory Online Gradient Descent for Kernelized Pairwise Learning
with Dynamic Averaging [18.843097436906618]
We introduce a lightweight OGD algorithm that does not require the independence of examples and generalizes to kernel pairwise learning.
Our algorithm builds the gradient from a random example and a moving average representing the past data, which yields a sub-linear regret bound at a total computational cost of $O(T)$.
Several experiments with real-world datasets show that the proposed technique outperforms kernelized and linear online gradient descent in both offline and online scenarios.
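A minimal sketch of the moving-average idea, under our own toy setup (the squared pairwise loss, the stream, and the name `pair_grad` are ours, not the authors'):

```python
# Pair each incoming example with an O(1)-memory moving average of the past.
import numpy as np

rng = np.random.default_rng(1)
d, T, lr, beta = 10, 500, 0.01, 0.9
w = np.zeros(d)
x_avg, y_avg = np.zeros(d), 0.0           # moving-average summary of the past

def pair_grad(w, x, y, xp, yp):
    """Gradient of the squared pairwise loss ((w @ (x - xp)) - (y - yp))**2 / 2."""
    dx, dy = x - xp, y - yp
    return (w @ dx - dy) * dx

for t in range(T):
    x = rng.normal(size=d)
    y = float(x[0] > 0)                   # toy labelled stream
    if t > 0:                             # pair the new example with the average
        w -= lr * pair_grad(w, x, y, x_avg, y_avg)
    x_avg = beta * x_avg + (1 - beta) * x # update the bounded-memory summary
    y_avg = beta * y_avg + (1 - beta) * y
```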
arXiv Detail & Related papers (2024-02-02T05:21:50Z)
- Stochastic Gradient Descent for Gaussian Processes Done Right [86.83678041846971]
We show that when done right -- by which we mean using specific insights from the optimisation and kernel communities -- gradient descent is highly effective.
We introduce a stochastic dual descent algorithm, explain its design in an intuitive manner and illustrate the design choices.
Our method places Gaussian process regression on par with state-of-the-art graph neural networks for molecular binding affinity prediction.
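A simplified stand-in for the dual view (our illustration; the paper's stochastic dual descent makes more careful algorithmic choices): the GP regression coefficients solve the dual quadratic with minimizer alpha* = (K + s2*I)^{-1} y, which randomized coordinate descent attacks one coordinate at a time.

```python
# Randomized coordinate descent on the dual quadratic of GP regression.
import numpy as np

rng = np.random.default_rng(2)
n, s2 = 100, 0.1
X = rng.normal(size=(n, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)       # toy regression data
K = np.exp(-((X[:, None] - X[None]) ** 2).sum(-1))   # RBF Gram matrix
A = K + s2 * np.eye(n)                               # dual system matrix

alpha = np.zeros(n)
for _ in range(200 * n):                  # ~200 sweeps of random coordinates
    i = rng.integers(n)
    # exact minimisation of the dual objective in coordinate i
    alpha[i] -= (A[i] @ alpha - y[i]) / A[i, i]

print(np.linalg.norm(alpha - np.linalg.solve(A, y)))  # close to the exact solve
```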
arXiv Detail & Related papers (2023-10-31T16:15:13Z)
- Scaling Forward Gradient With Local Losses [117.22685584919756]
Forward learning is a biologically plausible alternative to backprop for learning deep neural networks.
We show that it is possible to substantially reduce the variance of the forward gradient by applying perturbations to activations rather than weights.
Our approach matches backprop on MNIST and CIFAR-10 and significantly outperforms previously proposed backprop-free algorithms on ImageNet.
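Our paraphrase of the basic estimator (the paper's contribution, perturbing activations rather than weights, is the variance-reduction step; this sketch only shows the weight-space forward gradient, with a finite difference standing in for an exact JVP):

```python
# Forward gradient: an unbiased estimate (grad f . v) v from one forward pass.
import numpy as np

rng = np.random.default_rng(3)

def f(w):
    return 0.5 * (w ** 2).sum()           # toy objective with grad f(w) = w

def forward_gradient(w, eps=1e-6):
    v = rng.normal(size=w.shape)          # random tangent direction
    jvp = (f(w + eps * v) - f(w)) / eps   # finite-difference stand-in for a JVP
    return jvp * v                        # E[(g @ v) v] = g: unbiased

w = rng.normal(size=5)
est = np.mean([forward_gradient(w) for _ in range(20000)], axis=0)
print(np.round(est - w, 2))               # near zero: the estimator is unbiased
```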
arXiv Detail & Related papers (2022-10-07T03:52:27Z)
- Simple Stochastic and Online Gradient Descent Algorithms for Pairwise Learning [65.54757265434465]
Pairwise learning refers to learning tasks where the loss function depends on a pair of instances.
Online gradient descent (OGD) is a popular approach to handle streaming data in pairwise learning.
In this paper, we propose simple stochastic and online gradient descent methods for pairwise learning.
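A minimal buffer-based variant, under our own toy setup (loss, stream, and buffer size are our assumptions): pair each incoming example with the last few seen, rather than with the whole history.

```python
# Online pairwise gradient descent with a small buffer of recent examples.
import numpy as np

rng = np.random.default_rng(4)
d, T, lr, B = 10, 500, 0.05, 8
w, buf = np.zeros(d), []

for t in range(T):
    x = rng.normal(size=d)
    y = float(x[0] + x[1] > 0)            # toy labelled stream
    for xp, yp in buf:                    # pair the new point with the buffer
        dx, dy = x - xp, y - yp
        w -= (lr / len(buf)) * (w @ dx - dy) * dx
    buf.append((x, y))
    buf = buf[-B:]                        # bounded memory: keep last B examples
```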
arXiv Detail & Related papers (2021-11-23T18:10:48Z)
- Improving quantum linear system solvers via a gradient descent perspective [3.0969191504482247]
We revisit quantum linear system solvers from the perspective of convex optimization.
This leads to a considerable constant-factor improvement in the runtime.
We show how the optimal quantum linear system solver of Childs, Kothari, and Somma is related to the gradient descent algorithm.
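The classical analogue of this perspective, as our own illustration: gradient descent on f(x) = 1/2 ||Ax - b||^2 solves the linear system Ax = b, which is the convex-optimization lens the paper applies to quantum solvers.

```python
# Gradient descent as a linear system solver.
import numpy as np

rng = np.random.default_rng(5)
n = 20
G = rng.normal(size=(n, n))
A = G @ G.T + n * np.eye(n)               # symmetric positive definite
b = rng.normal(size=n)

x = np.zeros(n)
eta = 1.0 / np.linalg.norm(A, 2) ** 2     # safe step for the gradient A.T(Ax-b)
for _ in range(5000):
    x -= eta * A.T @ (A @ x - b)          # gradient of 1/2 ||Ax - b||^2

print(np.linalg.norm(A @ x - b))          # residual shrinks toward zero
```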
arXiv Detail & Related papers (2021-09-09T13:16:28Z)
- SHINE: SHaring the INverse Estimate from the forward pass for bi-level optimization and implicit models [15.541264326378366]
In recent years, implicit deep learning has emerged as a method to increase the depth of deep neural networks.
The training is performed as a bi-level problem, and its computational complexity is partially driven by the iterative inversion of a huge Jacobian matrix.
We propose a novel strategy to tackle this computational bottleneck from which many bi-level problems suffer.
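Our illustration of the bottleneck SHINE targets: differentiating through a fixed point z*(theta) = f(z*, theta) requires a solve with the Jacobian I - df/dz; SHINE's idea is to reuse a quasi-Newton inverse estimate from the forward solve instead of inverting from scratch. The scalar example below (our construction) shows the implicit-function-theorem hypergradient that this solve produces.

```python
# Hypergradient through a fixed point via the implicit function theorem.
import numpy as np

theta = 0.3

def f(z, theta):                          # contraction in z: fixed point exists
    return np.tanh(theta * z + 1.0)

z = 0.0                                   # forward pass: fixed-point iteration
for _ in range(100):
    z = f(z, theta)

# dz*/dtheta = (1 - df/dz)^{-1} df/dtheta, evaluated at the fixed point
s = 1.0 - np.tanh(theta * z + 1.0) ** 2   # derivative of tanh at theta*z + 1
dfdz, dfdth = s * theta, s * z
dz_dtheta = dfdth / (1.0 - dfdz)          # the scalar "Jacobian solve"

eps, zp = 1e-6, 0.0                       # finite-difference check
for _ in range(100):
    zp = f(zp, theta + eps)
print(dz_dtheta, (zp - z) / eps)          # the two values agree closely
```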
arXiv Detail & Related papers (2021-06-01T15:07:34Z)
- Convergence rates for gradient descent in the training of overparameterized artificial neural networks with biases [3.198144010381572]
In recent years, artificial neural networks have developed into a powerful tool for dealing with a multitude of problems for which classical solution approaches reach their limits.
It is still unclear why randomly initialized gradient descent algorithms succeed in training such networks.
arXiv Detail & Related papers (2021-02-23T18:17:47Z)
- Activation Relaxation: A Local Dynamical Approximation to Backpropagation in the Brain [62.997667081978825]
Activation Relaxation (AR) is motivated by constructing the backpropagation gradient as the equilibrium point of a dynamical system.
Our algorithm converges rapidly and robustly to the correct backpropagation gradients, requires only a single type of computational unit, and can operate on arbitrary computation graphs.
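A paraphrase of the equilibrium idea in its simplest form (our construction, not the paper's exact AR dynamics): the backprop error signal of a hidden layer is the equilibrium of a leaky relaxation, reachable by iterating local updates instead of running an explicit backward pass.

```python
# Backprop error signal as the equilibrium of a simple relaxation.
import numpy as np

rng = np.random.default_rng(6)
d, h = 5, 7
W1, W2 = rng.normal(size=(h, d)), rng.normal(size=(1, h))
x = rng.normal(size=d)

a1 = W1 @ x
z1 = np.maximum(a1, 0.0)                  # relu hidden layer
out = (W2 @ z1).item()                    # take the loss L = out, so dL/dout = 1

e2 = np.ones(1)                           # output error signal
e1 = np.zeros(h)                          # hidden error: relaxed, not computed
for _ in range(200):                      # Euler steps of e1' = -e1 + target
    e1 += 0.1 * (-e1 + (W2.T @ e2) * (a1 > 0))

e1_backprop = (W2.T @ e2) * (a1 > 0)      # what an explicit backward pass gives
print(np.allclose(e1, e1_backprop, atol=1e-6))  # True: same equilibrium
```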
arXiv Detail & Related papers (2020-09-11T11:56:34Z)
- Cogradient Descent for Bilinear Optimization [124.45816011848096]
We introduce a Cogradient Descent algorithm (CoGD) to address the bilinear problem.
We update one variable by taking its coupling relationship with the other into account, leading to synchronized gradient descent.
Our algorithm is applied to problems in which one variable is subject to a sparsity constraint.
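A generic sketch of coupled descent on a bilinear objective (our illustration, not the paper's exact CoGD update rule): minimise 1/2 ||y - A x||^2 over both A and x, letting a shared residual couple the two updates, with soft-thresholding keeping x sparse.

```python
# Coupled gradient / proximal-gradient descent on a bilinear problem.
import numpy as np

rng = np.random.default_rng(7)
m, n = 20, 30
A_true = rng.normal(size=(m, n))
x_true = np.zeros(n); x_true[:3] = rng.normal(size=3)   # sparse signal
y = A_true @ x_true

A, x = rng.normal(size=(m, n)), np.zeros(n)
lr, lam = 1e-3, 0.1
soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

for _ in range(5000):
    r = A @ x - y                         # shared residual couples the updates
    A -= lr * np.outer(r, x)              # gradient step in A
    x = soft(x - lr * (A.T @ r), lr * lam)  # proximal gradient step in x
print(np.linalg.norm(A @ x - y))          # residual typically decreases
```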
arXiv Detail & Related papers (2020-06-16T13:41:54Z)
- Towards Better Understanding of Adaptive Gradient Algorithms in Generative Adversarial Nets [71.05306664267832]
Adaptive algorithms perform gradient updates using the history of gradients and are ubiquitous in training deep neural networks.
In this paper we analyze a variant of the Optimistic Adagrad algorithm for nonconvex-nonconcave min-max problems.
Our experiments show that the advantage of adaptive gradient algorithms over non-adaptive ones in GAN training can be observed empirically.
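Our illustration of the optimistic idea in its simplest (non-adaptive) form: on the bilinear saddle problem min_x max_y x*y, plain simultaneous gradient descent-ascent spirals outward, while the optimistic update using 2*g_t - g_{t-1} converges to the saddle point (0, 0).

```python
# Optimistic vs. plain gradient descent-ascent on f(x, y) = x * y.
lr, T = 0.1, 2000

def run(optimistic):
    x, y = 1.0, 1.0
    gx_prev, gy_prev = 0.0, 0.0
    for _ in range(T):
        gx, gy = y, x                     # grads of f(x, y) = x * y
        if optimistic:
            x -= lr * (2 * gx - gx_prev)  # descent in x with extrapolation
            y += lr * (2 * gy - gy_prev)  # ascent in y with extrapolation
        else:
            x -= lr * gx
            y += lr * gy
        gx_prev, gy_prev = gx, gy
    return abs(x) + abs(y)

print(run(False), run(True))              # large (diverges) vs. near zero
```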
arXiv Detail & Related papers (2019-12-26T22:10:10Z)