Behind the Scenes of Gradient Descent: A Trajectory Analysis via Basis
Function Decomposition
- URL: http://arxiv.org/abs/2210.00346v2
- Date: Tue, 4 Oct 2022 01:33:28 GMT
- Title: Behind the Scenes of Gradient Descent: A Trajectory Analysis via Basis
Function Decomposition
- Authors: Jianhao Ma, Lingjun Guo, Salar Fattahi
- Abstract summary: This work analyzes the solution trajectory of gradient-based algorithms via a novel basis function decomposition.
We show that, although solution trajectories of gradient-based algorithms may vary depending on the learning task, they behave almost monotonically when projected onto an appropriate orthonormal function basis.
- Score: 4.01776052820812
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work analyzes the solution trajectory of gradient-based algorithms via a
novel basis function decomposition. We show that, although solution
trajectories of gradient-based algorithms may vary depending on the learning
task, they behave almost monotonically when projected onto an appropriate
orthonormal function basis. Such projection gives rise to a basis function
decomposition of the solution trajectory. Theoretically, we use our proposed
basis function decomposition to establish the convergence of gradient descent
(GD) on several representative learning tasks. In particular, we improve the
convergence of GD on symmetric matrix factorization and provide a completely
new convergence result for the orthogonal symmetric tensor decomposition.
Empirically, we illustrate the promise of our proposed framework on realistic
deep neural networks (DNNs) across different architectures, gradient-based
solvers, and datasets. Our key finding is that gradient-based algorithms
monotonically learn the coefficients of a particular orthonormal function basis
of DNNs defined as the eigenvectors of the conjugate kernel after training. Our
code is available at https://github.com/jianhaoma/function-basis-decomposition.
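As a rough, hedged illustration of the key finding, the NumPy sketch below (not the authors' implementation; their code is in the repository above) trains a small two-layer network with full-batch gradient descent, builds the conjugate kernel as the Gram matrix of the trained hidden features, and tracks the coefficients of the prediction trajectory in that kernel's eigenvector basis. The toy data, architecture, and hyperparameters are arbitrary choices for illustration only.

```python
# Minimal sketch of the basis-function-decomposition idea from the abstract:
# train a toy two-layer network with GD, then project the prediction
# trajectory f_t(X) onto the eigenvectors of the conjugate kernel
# (Gram matrix of the trained hidden-layer features). Illustration only.
import numpy as np

rng = np.random.default_rng(0)
n, d, width, steps, lr = 64, 5, 128, 300, 0.05

X = rng.normal(size=(n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)        # toy regression target

W1 = rng.normal(size=(d, width)) / np.sqrt(d)          # first-layer weights
w2 = rng.normal(size=width) / np.sqrt(width)           # output weights

def forward(X, W1, w2):
    h = np.tanh(X @ W1)                                 # hidden features
    return h, h @ w2                                    # features, predictions

preds_history = []
for t in range(steps):
    h, f = forward(X, W1, w2)
    preds_history.append(f)
    r = f - y                                           # residual
    # Gradients of the mean-squared error (full-batch GD).
    grad_w2 = h.T @ r / n
    grad_W1 = X.T @ ((r[:, None] * w2) * (1 - h ** 2)) / n
    W1 -= lr * grad_W1
    w2 -= lr * grad_w2

# Conjugate kernel after training: Gram matrix of the trained hidden features.
h_final, _ = forward(X, W1, w2)
K = h_final @ h_final.T
eigvals, eigvecs = np.linalg.eigh(K)                    # ascending eigenvalues
top = eigvecs[:, ::-1][:, :5]                           # top-5 eigenvectors

# Coefficients of the prediction trajectory in this basis; these are the
# quantities whose near-monotone evolution the abstract describes.
coeffs = np.array([top.T @ f for f in preds_history])   # shape (steps, 5)
print(coeffs[::50].round(3))
```

Plotting each column of `coeffs` against the step index visualizes how the projected coefficients evolve, which is the quantity the paper studies.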
Related papers
- Convergence of Implicit Gradient Descent for Training Two-Layer Physics-Informed Neural Networks [3.680127959836384]
Implicit gradient descent (IGD) outperforms the common gradient descent (GD) in handling certain multi-scale problems.
We show that IGD converges to a globally optimal solution at a linear convergence rate.
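For readers unfamiliar with the implicit update, here is a hedged, generic sketch of implicit gradient descent on a toy ill-conditioned quadratic, where the implicit equation has a closed-form solution; it only illustrates the stability benefit on stiff problems and does not reproduce the paper's physics-informed network setting.

```python
# One implicit gradient descent (IGD) step on the quadratic 0.5 * x^T A x:
# x_{k+1} = x_k - lr * grad(x_{k+1}) is implicit in x_{k+1}; for a quadratic it
# solves in closed form as x_{k+1} = (I + lr * A)^{-1} x_k. Toy example only.
import numpy as np

A = np.diag([1.0, 100.0])                      # ill-conditioned (multi-scale) quadratic
lr = 0.5                                       # far above the stable explicit-GD step 2/100
x_gd, x_igd = np.ones(2), np.ones(2)

for _ in range(20):
    x_gd = x_gd - lr * (A @ x_gd)                            # explicit GD: diverges here
    x_igd = np.linalg.solve(np.eye(2) + lr * A, x_igd)       # implicit GD: stable

print("GD :", x_gd)                            # blows up along the stiff direction
print("IGD:", x_igd)                           # contracts toward the minimizer at 0
```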
arXiv Detail & Related papers (2024-07-03T06:10:41Z)
- A Mean-Field Analysis of Neural Stochastic Gradient Descent-Ascent for Functional Minimax Optimization [90.87444114491116]
This paper studies minimax optimization problems defined over infinite-dimensional function classes of overparameterized two-layer neural networks.
We address (i) the convergence of the gradient descent-ascent algorithm and (ii) the representation learning of the neural networks.
Results show that the feature representation induced by the neural networks is allowed to deviate from the initial one by the magnitude of $O(\alpha^{-1})$, measured in terms of the Wasserstein distance.
arXiv Detail & Related papers (2024-04-18T16:46:08Z)
- A Globally Convergent Algorithm for Neural Network Parameter Optimization Based on Difference-of-Convex Functions [29.58728073957055]
We propose an algorithm for optimizing the parameters of hidden layer networks.
Specifically, we derive a blockwise difference-of-convex (DC) decomposition of the objective function.
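For context, below is a hedged sketch of the generic difference-of-convex algorithm (DCA) on a one-dimensional toy objective; the blockwise construction from the paper is not reproduced, and the objective is an arbitrary example chosen so that each surrogate minimization has a closed form.

```python
# Generic DCA on f(x) = g(x) - h(x) with g(x) = 0.5 x^2 and h(x) = 2 log cosh x
# (both convex, f nonconvex with minima near +/- 1.915). Each step linearizes h
# at the current point and minimizes the convex surrogate, which here is simply
# x_{k+1} = h'(x_k) = 2 tanh(x_k). Illustration of the DC idea only.
import numpy as np

g = lambda x: 0.5 * x ** 2
h = lambda x: 2.0 * np.log(np.cosh(x))
f = lambda x: g(x) - h(x)

x = 0.3                                   # nonzero start (x = 0 is a local max of f)
for k in range(15):
    # Convex surrogate: g(x) - [h(x_k) + h'(x_k) (x - x_k)], minimized at h'(x_k).
    x = 2.0 * np.tanh(x)

print(x, f(x))                            # x -> ~1.915, a stationary point of f
```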
arXiv Detail & Related papers (2024-01-15T19:53:35Z)
- Stochastic Gradient Descent for Gaussian Processes Done Right [86.83678041846971]
We show that when done right -- by which we mean using specific insights from the optimisation and kernel communities -- gradient descent is highly effective.
We introduce a stochastic dual descent algorithm, explain its design in an intuitive manner, and illustrate the design choices.
Our method places Gaussian process regression on par with state-of-the-art graph neural networks for molecular binding affinity prediction.
arXiv Detail & Related papers (2023-10-31T16:15:13Z)
- Stable Nonconvex-Nonconcave Training via Linear Interpolation [51.668052890249726]
This paper presents a theoretical analysis of linear interpolation as a principled method for stabilizing (large-scale) neural network training.
We argue that instabilities in the optimization process are often caused by the nonmonotonicity of the loss landscape and show how linear interpolation can help by leveraging the theory of nonexpansive operators.
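As a hedged toy illustration of why interpolation interacts well with nonexpansive operators, the sketch below applies a Krasnosel'skii-Mann-style averaged iteration to a pure rotation: the plain fixed-point iteration cycles forever, while the linearly interpolated one converges. This is not the training scheme analyzed in the paper.

```python
# T is a rotation about its fixed point (0, 0): nonexpansive but not a
# contraction. Plain iteration z <- T(z) cycles on the unit circle; the
# interpolated iteration z <- (1 - alpha) z + alpha T(z) contracts to 0.
import numpy as np

theta = np.pi / 2
T = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # 90-degree rotation

z_plain = np.array([1.0, 0.0])
z_interp = np.array([1.0, 0.0])
alpha = 0.5

for _ in range(50):
    z_plain = T @ z_plain                                        # cycles, never converges
    z_interp = (1 - alpha) * z_interp + alpha * (T @ z_interp)   # averaged (interpolated) step

print(np.linalg.norm(z_plain))    # stays at 1.0
print(np.linalg.norm(z_interp))   # ~ (sqrt(0.5))**50: converges to the fixed point
```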
arXiv Detail & Related papers (2023-10-20T12:45:12Z)
- NeuralEF: Deconstructing Kernels by Deep Neural Networks [47.54733625351363]
Traditional nonparametric solutions based on the Nyström formula suffer from scalability issues.
Recent work has resorted to a parametric approach, i.e., training neural networks to approximate the eigenfunctions.
We show that these problems can be fixed by using a new series of objective functions that generalizes to a broad range of supervised and unsupervised learning problems.
arXiv Detail & Related papers (2022-04-30T05:31:07Z)
- q-RBFNN: A Quantum Calculus-based RBF Neural Network [31.14412266444568]
A gradient descent-based learning approach for radial basis function neural networks (RBFNN) is proposed.
The proposed method is based on the q-gradient, also known as the Jackson derivative.
The proposed $q$-RBFNN is analyzed for its convergence performance in the context of the least squares algorithm.
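For reference, here is a hedged sketch of the Jackson (q-)derivative the summary refers to, used in a plain one-dimensional q-gradient descent step; the RBFNN training rule from the paper is not reproduced, and the objective and step size are arbitrary.

```python
# Jackson (q-)derivative: D_q f(x) = (f(q x) - f(x)) / ((q - 1) x) for x != 0,
# which recovers f'(x) as q -> 1. A q-gradient descent step replaces the
# ordinary derivative with D_q. Toy 1-D illustration only.
import numpy as np

def jackson_derivative(f, x, q=0.9):
    return (f(q * x) - f(x)) / ((q - 1) * x)

f = lambda x: (x - 2.0) ** 2                  # quadratic with minimizer at x = 2

x = 5.0
for _ in range(100):
    x -= 0.1 * jackson_derivative(f, x)       # q-gradient descent step

# Converges to ~2.105 = 4 / (1 + q), the zero of the q-gradient, which tends
# to the true minimizer x = 2 as q -> 1.
print(x)
```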
arXiv Detail & Related papers (2021-06-02T08:27:12Z)
- A proof of convergence for gradient descent in the training of artificial neural networks for constant target functions [3.4792548480344254]
We show that the risk function of the gradient descent method does indeed converge to zero.
A key contribution of this work is to explicitly specify a Lyapunov function for the gradient flow system of the ANN parameters.
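As background on the terminology: a Lyapunov function for the gradient flow $\dot\theta(t) = -\nabla R(\theta(t))$ is a quantity that is nonincreasing along its trajectories. The canonical example is the risk $R$ itself, since

$$\frac{d}{dt}\, R(\theta(t)) = \big\langle \nabla R(\theta(t)), \dot\theta(t) \big\rangle = -\big\| \nabla R(\theta(t)) \big\|^2 \le 0.$$

The paper goes further and explicitly constructs a problem-specific Lyapunov function, which is not reproduced here.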
arXiv Detail & Related papers (2021-02-19T13:33:03Z)
- Channel-Directed Gradients for Optimization of Convolutional Neural Networks [50.34913837546743]
We introduce optimization methods for convolutional neural networks that can be used to improve existing gradient-based optimization in terms of generalization error.
We show that defining the gradients along the output channel direction leads to a performance boost, while other directions can be detrimental.
arXiv Detail & Related papers (2020-08-25T00:44:09Z)
- Multipole Graph Neural Operator for Parametric Partial Differential Equations [57.90284928158383]
One of the main challenges in using deep learning-based methods for simulating physical systems is formulating physics-based data.
We propose a novel multi-level graph neural network framework that captures interactions at all ranges with only linear complexity.
Experiments confirm that our multi-graph network learns discretization-invariant solution operators to PDEs and can be evaluated in linear time.
arXiv Detail & Related papers (2020-06-16T21:56:22Z)
- Cogradient Descent for Bilinear Optimization [124.45816011848096]
We introduce a Cogradient Descent algorithm (CoGD) to address the bilinear problem.
We solve one variable by considering its coupling relationship with the other, leading to a synchronous gradient descent.
Our algorithm is applied to solve problems with one variable under a sparsity constraint.
arXiv Detail & Related papers (2020-06-16T13:41:54Z)
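To make the coupled-variable structure concrete, here is a hedged sketch of plain simultaneous gradient descent on a bilinear least-squares objective; it illustrates the setting CoGD targets but does not reproduce CoGD's coordination of the coupled gradients or the sparsity-constrained variant.

```python
# Bilinear, coupled-variable setting: min over (A, x) of 0.5 * ||A @ x - y||^2,
# where the gradient of each block depends on the current value of the other.
# Plain simultaneous gradient descent is shown as a baseline only.
import numpy as np

rng = np.random.default_rng(1)
m, n = 20, 8
A_true, x_true = rng.normal(size=(m, n)), rng.normal(size=n)
y = A_true @ x_true

A, x = rng.normal(size=(m, n)), rng.normal(size=n)
lr = 0.001
for _ in range(5000):
    r = A @ x - y                 # residual couples the two blocks
    grad_A = np.outer(r, x)       # d/dA of 0.5 * ||r||^2
    grad_x = A.T @ r              # d/dx of 0.5 * ||r||^2
    A -= lr * grad_A              # simultaneous (synchronous) updates
    x -= lr * grad_x

print(np.linalg.norm(A @ x - y))  # the residual shrinks toward zero
```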