Learning Discretized Neural Networks under Ricci Flow
- URL: http://arxiv.org/abs/2302.03390v4
- Date: Thu, 4 Jan 2024 14:18:56 GMT
- Title: Learning Discretized Neural Networks under Ricci Flow
- Authors: Jun Chen, Hanwen Chen, Mengmeng Wang, Guang Dai, Ivor W. Tsang, Yong
Liu
- Abstract summary: We study Discretized Neural Networks (DNNs) composed of low-precision weights and activations.
DNNs suffer from either infinite or zero gradients due to the non-differentiable discrete function during training.
- Score: 51.36292559262042
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we study Discretized Neural Networks (DNNs) composed of
low-precision weights and activations, which suffer from either infinite or
zero gradients due to the non-differentiable discrete function during training.
Most training-based DNNs in such scenarios employ the standard Straight-Through
Estimator (STE) to approximate the gradient w.r.t. discrete values. However,
the use of STE introduces the problem of gradient mismatch, arising from
perturbations in the approximated gradient. To address this problem, this paper
reveals that this mismatch can be interpreted as a metric perturbation in a
Riemannian manifold, viewed through the lens of duality theory. Building on
information geometry, we construct the Linearly Nearly Euclidean (LNE) manifold
for DNNs, providing a background for addressing perturbations. By introducing a
partial differential equation on metrics, i.e., the Ricci flow, we establish
the dynamical stability and convergence of the LNE metric with the $L^2$-norm
perturbation. In contrast to previous perturbation theories with convergence
rates in fractional powers, the metric perturbation under the Ricci flow
exhibits exponential decay in the LNE manifold. Experimental results across
various datasets demonstrate that our method achieves superior and more stable
performance for DNNs compared to other representative training-based methods.
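For context, below is a minimal, generic PyTorch sketch of the standard Straight-Through Estimator that the abstract refers to (an illustration of common practice, not the paper's LNE/Ricci-flow method). The forward pass applies the non-differentiable sign function, while the backward pass substitutes a clipped identity gradient; that substitution is precisely the source of the gradient mismatch discussed above.

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    """Sign quantizer whose backward pass is the standard straight-through estimator."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)                       # non-differentiable discrete forward

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # STE: pretend d sign(x)/dx = 1 on [-1, 1] and 0 outside (clipped identity),
        # instead of the true gradient, which is zero almost everywhere.
        return grad_output * (x.abs() <= 1.0).to(grad_output.dtype)


w = torch.randn(4, requires_grad=True)
loss = BinarizeSTE.apply(w).sum()
loss.backward()
print(w.grad)                                      # approximated (mismatched) gradient
```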
Related papers
- Stable Nonconvex-Nonconcave Training via Linear Interpolation [51.668052890249726]
This paper presents a theoretical analysis of linear interpolation as a principled method for stabilizing (large-scale) neural network training.
We argue that instabilities in the optimization process are often caused by the nonmonotonicity of the loss landscape and show how linear interpolation can help by leveraging the theory of nonexpansive operators.
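As a rough illustration of the idea (a generic lookahead-style interpolation between iterates on a toy quadratic, not necessarily the exact scheme analyzed in the paper): a base optimizer proposes a new iterate, and the update linearly interpolates between the old and proposed iterates, which tends to damp oscillations.

```python
import numpy as np

def grad(theta):
    """Hypothetical loss gradient; a stand-in for a possibly nonmonotone problem."""
    return 2.0 * theta

def interpolated_step(theta, lam=0.5, lr=0.1, inner_steps=5):
    """Run a base optimizer, then linearly interpolate between old and new iterates."""
    inner = theta.copy()
    for _ in range(inner_steps):
        inner = inner - lr * grad(inner)           # base (possibly unstable) optimizer
    return (1.0 - lam) * theta + lam * inner       # linear interpolation damps the update

theta = np.array([1.0, -2.0])
for _ in range(20):
    theta = interpolated_step(theta)
print(theta)                                       # decays toward the minimizer at zero
```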
arXiv Detail & Related papers (2023-10-20T12:45:12Z)
- A Geometric Perspective on Diffusion Models [57.27857591493788]
We inspect the ODE-based sampling of a popular variance-exploding SDE.
We establish a theoretical relationship between the optimal ODE-based sampling and the classic mean-shift (mode-seeking) algorithm.
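For reference, here is a minimal sketch of the classic mean-shift (mode-seeking) iteration on a toy point cloud; it illustrates the algorithm being connected to, not the paper's ODE-based sampler, and the data and bandwidth values are made up.

```python
import numpy as np

def mean_shift_step(x, data, h=0.5):
    """Move x toward the Gaussian-kernel-weighted mean of the data (mode seeking)."""
    w = np.exp(-np.sum((data - x) ** 2, axis=1) / (2.0 * h ** 2))
    return (w @ data) / w.sum()

rng = np.random.default_rng(0)
data = rng.normal(loc=[2.0, -1.0], scale=0.3, size=(200, 2))   # toy point cloud
x = np.zeros(2)
for _ in range(30):
    x = mean_shift_step(x, data)
print(x)                                   # converges near the data mode around (2, -1)
```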
arXiv Detail & Related papers (2023-05-31T15:33:16Z)
- Implicit Stochastic Gradient Descent for Training Physics-informed Neural Networks [51.92362217307946]
Physics-informed neural networks (PINNs) have been demonstrated to be effective in solving forward and inverse differential equation problems.
However, PINNs are trapped in training failures when the target functions to be approximated exhibit high-frequency or multi-scale features.
In this paper, we propose to employ the implicit stochastic gradient descent (ISGD) method to train PINNs, improving the stability of the training process.
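To illustrate why an implicit update can stabilize training, consider a hypothetical stiff diagonal quadratic loss (a stand-in, not the paper's PINN setting). The explicit gradient step diverges once the step size exceeds the stability limit, whereas the implicit step, which is solvable in closed form in this diagonal case, remains stable for any step size.

```python
import numpy as np

# Hypothetical stiff diagonal quadratic loss L(theta) = 0.5 * sum(a * theta**2),
# a stand-in for the ill-conditioned losses behind PINN training failures.
a = np.array([1e3, 1.0])
lr = 0.01

def explicit_step(theta):
    return theta - lr * a * theta          # unstable once lr * max(a) > 2

def implicit_step(theta):
    # Solve theta_new = theta - lr * a * theta_new; closed form in the diagonal case.
    return theta / (1.0 + lr * a)          # stable for any lr > 0

theta_exp = np.array([1.0, 1.0])
theta_imp = np.array([1.0, 1.0])
for _ in range(10):
    theta_exp = explicit_step(theta_exp)   # stiff coordinate blows up
    theta_imp = implicit_step(theta_imp)   # both coordinates decay
print(theta_exp, theta_imp)
```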
arXiv Detail & Related papers (2023-03-03T08:17:47Z)
- Toward Equation of Motion for Deep Neural Networks: Continuous-time Gradient Descent and Discretization Error Analysis [5.71097144710995]
We derive and solve an "Equation of Motion" (EoM) for deep neural networks (DNNs).
The EoM is a continuous differential equation that precisely describes the discrete learning dynamics of GD.
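For intuition about the gap between continuous-time gradient flow and discrete GD, here is a toy one-dimensional comparison (an illustration under simple assumptions, not the paper's EoM or its error analysis).

```python
import numpy as np

# Toy quadratic loss L(theta) = 0.5 * a * theta**2; its gradient flow
# d(theta)/dt = -a * theta has the closed-form solution theta0 * exp(-a * t).
a, theta0, lr, steps = 2.0, 1.0, 0.1, 20

theta_gd = theta0
for _ in range(steps):
    theta_gd -= lr * a * theta_gd                 # discrete gradient descent

theta_flow = theta0 * np.exp(-a * lr * steps)     # gradient flow at time t = lr * steps
print(theta_gd, theta_flow, abs(theta_gd - theta_flow))   # gap shrinks as lr -> 0
```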
arXiv Detail & Related papers (2022-10-28T05:13:50Z)
- Designing Universal Causal Deep Learning Models: The Case of Infinite-Dimensional Dynamical Systems from Stochastic Analysis [3.5450828190071655]
Causal operators (COs) play a central role in contemporary analysis.
There is still no canonical framework for designing Deep Learning (DL) models capable of approximating COs.
This paper proposes a "geometry-aware" solution to this open problem by introducing a DL model-design framework.
arXiv Detail & Related papers (2022-10-24T14:43:03Z)
- A PDE-based Explanation of Extreme Numerical Sensitivities and Edge of Stability in Training Neural Networks [12.355137704908042]
We demonstrate restrained numerical instabilities in current training practices of deep networks with stochastic gradient descent (SGD).
We do this by presenting a theoretical framework using numerical analysis of partial differential equations (PDEs), and by analyzing the gradient-descent PDE of convolutional neural networks (CNNs).
We show this is a consequence of the non-linear PDE associated with the descent of the CNN, whose local linearization changes when over-driving the step size of the discretization, resulting in a stabilizing effect.
arXiv Detail & Related papers (2022-06-04T14:54:05Z)
- Learning via nonlinear conjugate gradients and depth-varying neural ODEs [5.565364597145568]
The inverse problem of supervised reconstruction of depth-variable parameters in a neural ordinary differential equation (NODE) is considered.
The proposed parameter reconstruction is done for a general first order differential equation by minimizing a cost functional.
The sensitivity problem can estimate changes in the network output under perturbation of the trained parameters.
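For reference, here is a minimal Fletcher-Reeves nonlinear conjugate gradient iteration on a toy quadratic cost (a generic sketch; the paper minimizes a cost functional over NODE parameters, and the matrix and step rule below are illustrative only).

```python
import numpy as np

A = np.diag([1.0, 10.0])                       # toy quadratic cost 0.5 * x^T A x

def grad(x):
    return A @ x

def nonlinear_cg(x, iters=10, tol=1e-10):
    """Fletcher-Reeves conjugate gradients with an exact line search on a quadratic."""
    g = grad(x)
    d = -g
    for _ in range(iters):
        alpha = -(g @ d) / (d @ A @ d)         # exact minimizing step along d
        x = x + alpha * d
        g_new = grad(x)
        if np.linalg.norm(g_new) < tol:
            break
        beta = (g_new @ g_new) / (g @ g)       # Fletcher-Reeves coefficient
        d = -g_new + beta * d
        g = g_new
    return x

print(nonlinear_cg(np.array([1.0, 1.0])))      # approaches the minimizer at the origin
```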
arXiv Detail & Related papers (2022-02-11T17:00:48Z)
- On Convergence of Training Loss Without Reaching Stationary Points [62.41370821014218]
We show that neural network weight variables do not converge to stationary points where the gradient of the loss function vanishes.
We propose a new perspective based on the ergodic theory of dynamical systems.
arXiv Detail & Related papers (2021-10-12T18:12:23Z)
- Stationary Density Estimation of Itô Diffusions Using Deep Learning [6.8342505943533345]
We consider the density estimation problem associated with the stationary measure of ergodic Itô diffusions from a discrete-time series.
We employ deep neural networks to approximate the drift and diffusion terms of the SDE.
We establish the convergence of the proposed scheme under appropriate mathematical assumptions.
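As a hedged sketch of the general idea, the snippet below fits a neural drift and a constant diffusion coefficient to a placeholder 1-D path by maximizing an Euler-Maruyama transition likelihood; the data, architecture, and constant-diffusion assumption are illustrative, not the paper's scheme.

```python
import torch

# Placeholder discrete-time series from a 1-D diffusion, sampled at step dt.
dt = 0.01
x = torch.cumsum(0.1 * torch.randn(2000), dim=0)

drift = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))
log_sigma = torch.nn.Parameter(torch.zeros(1))        # constant diffusion, for simplicity
opt = torch.optim.Adam(list(drift.parameters()) + [log_sigma], lr=1e-3)

x0, x1 = x[:-1, None], x[1:, None]
for _ in range(200):
    mean = x0 + drift(x0) * dt                        # Euler-Maruyama one-step mean
    var = torch.exp(2.0 * log_sigma) * dt             # one-step variance sigma^2 * dt
    nll = 0.5 * ((x1 - mean) ** 2 / var + torch.log(var)).mean()
    opt.zero_grad()
    nll.backward()
    opt.step()
```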
arXiv Detail & Related papers (2021-09-09T01:57:14Z)
- Incorporating NODE with Pre-trained Neural Differential Operator for Learning Dynamics [73.77459272878025]
We propose to enhance the supervised signal in learning dynamics by pre-training a neural differential operator (NDO).
The NDO is pre-trained on a class of symbolic functions, and it learns the mapping from trajectory samples of these functions to their derivatives.
We provide a theoretical guarantee that the output of the NDO can closely approximate the ground-truth derivatives by properly tuning the complexity of the library.
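A minimal sketch of this kind of pre-training, under illustrative assumptions (a sinusoidal function library, a small fully connected network, and made-up hyperparameters), is shown below; it is not the authors' NDO implementation.

```python
import torch

# Illustrative pre-training set: sinusoidal trajectories and their analytic derivatives.
t = torch.linspace(0.0, 1.0, 64)
freq = torch.rand(256, 1) * 10.0
traj = torch.sin(freq * t)                  # (256, 64) trajectory samples
deriv = freq * torch.cos(freq * t)          # ground truth: d/dt sin(f t) = f cos(f t)

ndo = torch.nn.Sequential(torch.nn.Linear(64, 128), torch.nn.ReLU(), torch.nn.Linear(128, 64))
opt = torch.optim.Adam(ndo.parameters(), lr=1e-3)
for _ in range(500):
    loss = torch.nn.functional.mse_loss(ndo(traj), deriv)
    opt.zero_grad()
    loss.backward()
    opt.step()
# After pre-training, ndo(trajectory) serves as an estimate of the trajectory's derivative.
```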
arXiv Detail & Related papers (2021-06-08T08:04:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.