Thermodynamic Natural Gradient Descent
- URL: http://arxiv.org/abs/2405.13817v1
- Date: Wed, 22 May 2024 16:47:03 GMT
- Title: Thermodynamic Natural Gradient Descent
- Authors: Kaelan Donatella, Samuel Duffield, Maxwell Aifer, Denis Melanson, Gavin Crooks, Patrick J. Coles
- Abstract summary: We show that natural gradient descent can have a similar computational complexity per iteration to a first-order method.
We present a new hybrid digital-analog algorithm for training neural networks.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Second-order training methods have better convergence properties than gradient descent but are rarely used in practice for large-scale training due to their computational overhead. This can be viewed as a hardware limitation (imposed by digital computers). Here we show that natural gradient descent (NGD), a second-order method, can have a similar computational complexity per iteration to a first-order method, when employing appropriate hardware. We present a new hybrid digital-analog algorithm for training neural networks that is equivalent to NGD in a certain parameter regime but avoids prohibitively costly linear system solves. Our algorithm exploits the thermodynamic properties of an analog system at equilibrium, and hence requires an analog thermodynamic computer. The training occurs in a hybrid digital-analog loop, where the gradient and Fisher information matrix (or any other positive semi-definite curvature matrix) are calculated at given time intervals while the analog dynamics take place. We numerically demonstrate the superiority of this approach over state-of-the-art digital first- and second-order training methods on classification tasks and language model fine-tuning tasks.
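As a point of reference, below is a minimal NumPy sketch of the natural-gradient update the abstract describes (illustrative names such as `ngd_step`, `grad_fn`, and `fisher_fn`; not the paper's code). In the hybrid digital-analog loop, the linear solve against the Fisher (or other PSD curvature) matrix would be carried out by the analog thermodynamic computer relaxing to equilibrium; here that step is stood in for by a purely digital `np.linalg.solve`.

```python
import numpy as np

def ngd_step(theta, grad_fn, fisher_fn, lr=0.5, damping=1e-3):
    """One natural-gradient step: theta <- theta - lr * (F + damping*I)^{-1} g.
    In the paper's hybrid loop, g and F are computed digitally while an analog
    thermodynamic computer effectively performs the linear solve; np.linalg.solve
    is used here as a digital stand-in for that analog step."""
    g = grad_fn(theta)                               # gradient of the loss
    F = fisher_fn(theta)                             # Fisher / PSD curvature matrix
    nat_grad = np.linalg.solve(F + damping * np.eye(F.shape[0]), g)
    return theta - lr * nat_grad

# Toy usage: quadratic loss 0.5 * theta^T A theta, so grad = A @ theta and curvature = A.
A = np.diag([1.0, 10.0, 100.0])                      # badly conditioned curvature
theta = np.ones(3)
for _ in range(20):
    theta = ngd_step(theta, grad_fn=lambda t: A @ t, fisher_fn=lambda t: A)
print(theta)  # shrinks uniformly in every direction, regardless of conditioning
```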
Related papers
- How to Train Your Resistive Network: Generalized Equilibrium Propagation and Analytical Learning [0.0]
We develop an algorithm to calculate gradients using a graph theoretic and analytical framework for Kirchhoff's laws.
We demonstrate our algorithm using numerical simulations and show that we can train resistor networks without the need for a replica or readout over all resistors, only at the output layer.
arXiv Detail & Related papers (2026-02-03T14:00:03Z) - Gradient Descent as a Perceptron Algorithm: Understanding Dynamics and Implicit Acceleration [67.12978375116599]
We show that the steps of gradient descent (GD) reduce to those of generalized perceptron algorithms.
This helps explain the optimization dynamics and the implicit acceleration phenomenon observed in neural networks.
arXiv Detail & Related papers (2025-12-12T14:16:35Z) - First Demonstration of Second-order Training of Deep Neural Networks with In-memory Analog Matrix Computing [7.51466944063829]
We present the first demonstration of second-order training powered by in-memory analog matrix computing (AMC).
We validate the analog matrix inversion by training a two-layer convolutional neural network (CNN) for handwritten letter classification.
Our system delivers a 5.88x improvement in throughput and a 6.9x gain in energy efficiency compared to state-of-the-art digital processors.
arXiv Detail & Related papers (2025-12-05T00:52:46Z) - Provably Efficient Quantum Algorithms for Solving Nonlinear Differential Equations Using Multiple Bosonic Modes Coupled with Qubits [9.366500214140164]
We present an analog, continuous-variable algorithm based on bosonic modes with qubit-based adaptive measurements that avoids Hilbert-space digitization.
Unlike many analog schemes, the algorithm is provably efficient: advancing a first-order, $L$-grid point, $d$-dimensional, order-$K$ spatial-derivative, degree-$r$-nonline.
arXiv Detail & Related papers (2025-11-13T04:09:32Z) - Self-Supervised Coarsening of Unstructured Grid with Automatic Differentiation [55.88862563823878]
In this work, we present an original algorithm to coarsen an unstructured grid based on the concepts of differentiable physics.
We demonstrate the performance of the algorithm on two PDEs: a linear equation governing slightly compressible fluid flow in porous media, and the wave equation.
Our results show that in the considered scenarios, we reduced the number of grid points up to 10 times while preserving the modeled variable dynamics in the points of interest.
arXiv Detail & Related papers (2025-07-24T11:02:13Z) - Towards Practical Second-Order Optimizers in Deep Learning: Insights from Fisher Information Analysis [0.0]
We present AdaFisher, a novel adaptive second-order optimizer for deep neural networks (DNNs).
AdaFisher aims to bridge the gap between the improved convergence and generalization of second-order methods and the computational efficiency needed for training.
We demonstrate that AdaFisher outperforms state-of-the-art approximations in both accuracy and convergence speed.
arXiv Detail & Related papers (2025-04-26T05:02:21Z) - Enabling Automatic Differentiation with Mollified Graph Neural Operators [75.3183193262225]
We propose the mollified graph neural operator (mGNO), the first method to leverage automatic differentiation and compute exact gradients on arbitrary geometries.
For a PDE example on regular grids, mGNO paired with autograd reduced the L2 relative data error by 20x compared to finite differences.
It can also solve PDEs on unstructured point clouds seamlessly, using physics losses only, at resolutions vastly lower than those needed for finite differences to be accurate enough.
arXiv Detail & Related papers (2025-04-11T06:16:30Z) - Gradient-Free Neural Network Training on the Edge [12.472204825917629]
Training neural networks is computationally heavy and energy-intensive.
This work presents a novel technique for training neural networks without needing gradient.
We show that it is possible to train models without gradient-based optimization techniques by identifying erroneous contributions of each neuron towards the expected classification.
arXiv Detail & Related papers (2024-10-13T05:38:39Z) - Convergence Analysis of Natural Gradient Descent for Over-parameterized Physics-Informed Neural Networks [3.680127959836384]
First-order methods, such as gradient descent (GD) and stochastic gradient descent (SGD), have been proven effective in training neural networks.
However, the learning rate of GD for training two-layer neural networks exhibits poor dependence on the sample size and the Gram matrix.
In this paper, we show that for $L^2$ regression problems, the admissible learning rate can be improved, which implies that GD actually enjoys a faster convergence rate than previously established.
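For orientation, here is a hedged sketch of the two update rules contrasted here and in the parent paper, with notation assumed for illustration rather than taken from either abstract: plain GD versus natural gradient descent (NGD) on an $L^2$ regression loss.

```latex
% GD vs. NGD updates on L(\theta) = \frac{1}{2n}\sum_i (f(x_i;\theta) - y_i)^2
% (notation assumed for illustration, not taken from the paper).
\begin{align}
  \theta_{t+1} &= \theta_t - \eta\,\nabla_\theta L(\theta_t)
    && \text{(GD: the admissible } \eta \text{ depends on the Gram matrix)}\\
  \theta_{t+1} &= \theta_t - \eta\,G(\theta_t)^{-1}\nabla_\theta L(\theta_t)
    && \text{(NGD: } G \text{ is the Gram/Fisher matrix, allowing a larger } \eta\text{)}
\end{align}
```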
arXiv Detail & Related papers (2024-08-01T14:06:34Z) - A Stable, Fast, and Fully Automatic Learning Algorithm for Predictive
Coding Networks [65.34977803841007]
Predictive coding networks are neuroscience-inspired models with roots in both Bayesian statistics and neuroscience.
We show that simply changing the temporal scheduling of the update rule for the synaptic weights leads to an algorithm that is much more efficient and stable than the original one.
arXiv Detail & Related papers (2022-11-16T00:11:04Z) - Virtual Analog Modeling of Distortion Circuits Using Neural Ordinary
Differential Equations [1.8352113484137629]
Recent research in deep learning has shown that neural networks can learn differential equations governing dynamical systems.
In this paper, we adapt this concept to Virtual Analog (VA) modeling to learn the ordinary differential equations (ODEs) governing the first-order and the second-order diode clipper.
The proposed models achieve performance comparable to state-of-the-art recurrent neural networks (RNNs) albeit using fewer parameters.
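As a rough illustration of the neural-ODE idea described above (not the paper's actual architecture), the sketch below shows the structure: a small MLP `f_theta` stands in for the right-hand side dx/dt of the clipper ODE and is integrated with forward Euler; all names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny MLP f_theta(x, u) standing in for the right-hand side dx/dt of the clipper ODE;
# its weights would be fit so the simulated state matches the measured circuit output.
W1, b1 = 0.5 * rng.normal(size=(16, 2)), np.zeros(16)
W2, b2 = 0.5 * rng.normal(size=(1, 16)), np.zeros(1)

def f_theta(x, u):
    h = np.tanh(W1 @ np.array([x, u]) + b1)
    return (W2 @ h + b2).item()

def simulate(u_signal, dt=1.0 / 44100.0, x0=0.0):
    """Integrate dx/dt = f_theta(x, u) with forward Euler over an input signal u."""
    x, out = x0, []
    for u in u_signal:
        x = x + dt * f_theta(x, u)
        out.append(x)
    return np.array(out)

# Usage: push a 220 Hz sine through the (untrained) model at audio rate.
t = np.arange(0, 0.01, 1.0 / 44100.0)
y = simulate(np.sin(2 * np.pi * 220.0 * t))
```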
arXiv Detail & Related papers (2022-05-04T05:19:46Z) - Unfolding Projection-free SDP Relaxation of Binary Graph Classifier via
GDPA Linearization [59.87663954467815]
Algorithm unfolding creates an interpretable and parsimonious neural network architecture by implementing each iteration of a model-based algorithm as a neural layer.
In this paper, leveraging a recent linear algebraic theorem called Gershgorin disc perfect alignment (GDPA), we unroll a projection-free algorithm for the semi-definite programming relaxation (SDR) of a binary graph classifier.
Experimental results show that our unrolled network outperformed pure model-based graph classifiers, and achieved comparable performance to pure data-driven networks but using far fewer parameters.
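For context, algorithm unfolding in its generic form looks like the sketch below: a plain unrolled gradient-descent solver with per-layer step sizes, given only as an illustration of the unrolling pattern, not the GDPA-based unrolling of this paper.

```python
import numpy as np

def unrolled_gd_network(A, b, step_sizes):
    """Unfold K iterations of gradient descent on 0.5 * ||A x - b||^2 into K 'layers'.
    Each layer applies x <- x - eta_k * A.T @ (A @ x - b); the per-layer step sizes
    eta_k play the role of learnable parameters in an unrolled network.
    Names here are illustrative, not taken from the paper."""
    x = np.zeros(A.shape[1])
    for eta in step_sizes:              # one "neural layer" per unrolled iteration
        x = x - eta * A.T @ (A @ x - b)
    return x

# Usage: a 3-layer unrolled solver for a small least-squares problem.
rng = np.random.default_rng(0)
A = rng.normal(size=(8, 3))
b = rng.normal(size=8)
x_hat = unrolled_gd_network(A, b, step_sizes=[0.1, 0.1, 0.1])
```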
arXiv Detail & Related papers (2021-09-10T07:01:15Z) - Gradient descent in materia through homodyne gradient extraction [2.012950941269354]
We demonstrate a simple yet efficient gradient extraction method, based on the principle of homodyne detection.
By perturbing the parameters that need to be optimized we effectively obtain the gradient information in a highly robust and scalable manner.
Homodyne gradient extraction can in principle be fully implemented in materia, facilitating the development of autonomously learning material systems.
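A minimal numerical sketch of the homodyne (lock-in) idea, assuming sinusoidal parameter dithering and a scalar loss readout; function and parameter names are illustrative, not the paper's.

```python
import numpy as np

def homodyne_gradient(loss, params, freqs, amp=1e-3, T=1.0, n=4096):
    """Estimate dL/dp_i by dithering each parameter p_i with a small sinusoid at its
    own frequency f_i and demodulating the measured loss at that frequency
    (lock-in / homodyne detection). Assumes distinct integer frequencies over [0, T)."""
    t = np.linspace(0.0, T, n, endpoint=False)
    refs = np.sin(2 * np.pi * np.outer(freqs, t))             # one reference per parameter
    readout = np.array([loss(params + amp * refs[:, k]) for k in range(n)])
    # loss(p + a*sin) ~ loss(p) + a * dL/dp_i * sin(2*pi*f_i*t); correlating with each
    # reference isolates that parameter's linear response, and cross terms average out.
    return 2.0 / (amp * n) * refs @ readout

# Toy usage on L(p) = sum(p^2), whose gradient is 2p.
params = np.array([0.3, -0.7, 1.2])
freqs = np.array([5.0, 11.0, 17.0])       # distinct, non-harmonic dither frequencies
print(homodyne_gradient(lambda p: float(np.sum(p**2)), params, freqs))  # ~[0.6, -1.4, 2.4]
```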
arXiv Detail & Related papers (2021-05-15T12:18:31Z) - Convergence rates for gradient descent in the training of
overparameterized artificial neural networks with biases [3.198144010381572]
In recent years, artificial neural networks have developed into a powerful tool for dealing with a multitude of problems for which classical solution approaches reach their limits.
However, it is still unclear why randomly initialized gradient descent algorithms perform as well as they do.
arXiv Detail & Related papers (2021-02-23T18:17:47Z) - Sample Complexity Bounds for Two Timescale Value-based Reinforcement
Learning Algorithms [65.09383385484007]
Two-timescale stochastic approximation (SA) has been widely used in value-based reinforcement learning algorithms.
We study the non-asymptotic convergence rate of two timescale linear and nonlinear TDC and Greedy-GQ algorithms.
arXiv Detail & Related papers (2020-11-10T11:36:30Z) - Single-Timescale Stochastic Nonconvex-Concave Optimization for Smooth
Nonlinear TD Learning [145.54544979467872]
We propose two single-timescale single-loop algorithms that require only one data point each step.
Our results are expressed in a form of simultaneous primal and dual side convergence.
arXiv Detail & Related papers (2020-08-23T20:36:49Z) - Accelerating Feedforward Computation via Parallel Nonlinear Equation
Solving [106.63673243937492]
Feedforward computation, such as evaluating a neural network or sampling from an autoregressive model, is ubiquitous in machine learning.
We frame the task of feedforward computation as solving a system of nonlinear equations. We then propose to find the solution using a Jacobi or Gauss-Seidel fixed-point method, as well as hybrid methods of both.
Our method is guaranteed to give exactly the same values as the original feedforward computation with a reduced (or equal) number of parallelizable iterations, and hence reduced time given sufficient parallel computing power.
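The Jacobi variant of this idea can be sketched in a few lines (illustrative NumPy, assuming all layer activations share a shape; not the authors' implementation): treat the forward pass h_l = f_l(h_{l-1}) as a nonlinear system and update every layer in parallel from the previous sweep's values.

```python
import numpy as np

def sequential_forward(layers, x):
    """Ordinary feedforward evaluation: h_l = f_l(h_{l-1})."""
    h = x
    for f in layers:
        h = f(h)
    return h

def jacobi_forward(layers, x, sweeps=None):
    """Solve h_l = f_l(h_{l-1}) by Jacobi fixed-point iteration: every layer is updated
    in parallel from the previous sweep's values. After at most len(layers) sweeps the
    result equals the sequential forward pass, but each sweep's layer evaluations are
    independent and can run concurrently on parallel hardware."""
    L = len(layers)
    sweeps = L if sweeps is None else sweeps
    h = [np.zeros_like(x) for _ in range(L)]        # initial guess for every activation
    for _ in range(sweeps):
        prev = [x] + h[:-1]                         # each layer reads last sweep's input
        h = [f(p) for f, p in zip(layers, prev)]    # all layers updated "simultaneously"
    return h[-1]

# Toy usage: three tanh layers; Jacobi matches the ordinary forward pass exactly.
rng = np.random.default_rng(0)
layers = [(lambda W: (lambda h: np.tanh(W @ h)))(rng.normal(size=(4, 4))) for _ in range(3)]
x = rng.normal(size=4)
assert np.allclose(sequential_forward(layers, x), jacobi_forward(layers, x))
```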
arXiv Detail & Related papers (2020-02-10T10:11:31Z) - Towards Better Understanding of Adaptive Gradient Algorithms in
Generative Adversarial Nets [71.05306664267832]
Adaptive algorithms perform gradient updates using the history of gradients and are ubiquitous in training deep neural networks.
In this paper, we analyze a variant of the Optimistic Adagrad (OAdagrad) algorithm for nonconvex-nonconcave min-max problems.
Our experiments show that the advantage of adaptive over non-adaptive gradient algorithms in GAN training can be observed empirically.
arXiv Detail & Related papers (2019-12-26T22:10:10Z)