GD doesn't make the cut: Three ways that non-differentiability affects neural network training
- URL: http://arxiv.org/abs/2401.08426v3
- Date: Thu, 9 May 2024 00:45:37 GMT
- Title: GD doesn't make the cut: Three ways that non-differentiability affects neural network training
- Authors: Siddharth Krishna Kumar
- Abstract summary: We investigate the distinctions between gradient methods applied to non-differentiable functions (NGDMs) and classical gradient descents (GDs).
We show that increasing the regularization penalty leads to an increase in the $L_1$ norm of optimal solutions in NGDMs.
We also show that widely adopted $L_1$ penalization-based techniques for network pruning do not yield expected results.
- Score: 5.439020425819001
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper investigates the distinctions between gradient methods applied to non-differentiable functions (NGDMs) and classical gradient descents (GDs) designed for differentiable functions. First, we demonstrate significant differences in the convergence properties of NGDMs compared to GDs, challenging the applicability of the extensive neural network convergence literature based on $L$-smoothness to non-smooth neural networks. Next, we demonstrate the paradoxical nature of NGDM solutions for $L_{1}$-regularized problems, showing that increasing the regularization penalty leads to an increase in the $L_{1}$ norm of optimal solutions in NGDMs. Consequently, we show that widely adopted $L_{1}$ penalization-based techniques for network pruning do not yield expected results. Additionally, we dispel the common belief that optimization algorithms like Adam and RMSProp perform similarly in non-differentiable contexts. Finally, we explore the Edge of Stability phenomenon, indicating its inapplicability even to Lipschitz continuous convex differentiable functions, leaving its relevance to non-convex non-differentiable neural networks inconclusive. Our analysis exposes misguided interpretations of NGDMs in widely referenced papers and texts due to an overreliance on strong smoothness assumptions, emphasizing the necessity for a nuanced understanding of foundational assumptions in the analysis of these systems.
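To make the $L_{1}$ pruning point concrete, here is a minimal NumPy sketch (not taken from the paper; the data, step size, and penalty are illustrative assumptions) contrasting plain (sub)gradient descent on an $L_{1}$-regularized least-squares problem with proximal gradient descent (ISTA). The subgradient iterates essentially never land exactly at zero, while the proximal soft-thresholding step does, which is one way NGDM-style training can fail to deliver the sparsity that $L_{1}$ penalization-based pruning expects.

```python
import numpy as np

# Minimal sketch (illustrative assumptions, not the paper's experiment):
# compare plain (sub)gradient descent with proximal gradient (ISTA) on
#   min_w 0.5 * ||X w - y||^2 + lam * ||w||_1.
# The subgradient update uses sign(0) = 0, the same convention autograd
# frameworks apply at the kink of |x|, i.e. an NGDM-style update.

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
w_true = np.zeros(10)
w_true[:3] = [2.0, -1.5, 1.0]                 # sparse ground truth
y = X @ w_true + 0.1 * rng.normal(size=50)

lam, lr, steps = 1.0, 1e-3, 20_000

def soft_threshold(v, t):
    """Proximal operator of t*||.||_1 (produces exact zeros below t)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

w_sub = np.zeros(10)    # plain subgradient descent
w_prox = np.zeros(10)   # proximal gradient descent (ISTA)
for _ in range(steps):
    grad_sub = X.T @ (X @ w_sub - y) + lam * np.sign(w_sub)
    w_sub -= lr * grad_sub
    grad_smooth = X.T @ (X @ w_prox - y)
    w_prox = soft_threshold(w_prox - lr * grad_smooth, lr * lam)

print("exact zeros, subgradient :", int(np.sum(w_sub == 0.0)))   # typically 0
print("exact zeros, proximal    :", int(np.sum(w_prox == 0.0)))  # typically several
```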
Related papers
- On the Convergence Analysis of Over-Parameterized Variational Autoencoders: A Neural Tangent Kernel Perspective [7.580900499231056]
Variational Auto-Encoders (VAEs) have emerged as powerful probabilistic models for generative tasks.
This paper provides a mathematical proof of VAE convergence under mild assumptions.
We also establish a novel connection between the optimization problem faced by over-parameterized SNNs and the Kernel Ridge Regression (KRR) problem.
arXiv Detail & Related papers (2024-09-09T06:10:31Z) - Convergence of Implicit Gradient Descent for Training Two-Layer Physics-Informed Neural Networks [3.680127959836384]
Implicit gradient descent (IGD) outperforms the common gradient descent (GD) in handling certain multi-scale problems.
We show that IGD converges to a globally optimal solution at a linear convergence rate.
arXiv Detail & Related papers (2024-07-03T06:10:41Z) - Physics-Informed Neural Networks: Minimizing Residual Loss with Wide Networks and Effective Activations [5.731640425517324]
We show that under certain conditions, the residual loss of PINNs can be globally minimized by a wide neural network.
An activation function with well-behaved high-order derivatives plays a crucial role in minimizing the residual loss.
The established theory paves the way for designing and choosing effective activation functions for PINNs.
arXiv Detail & Related papers (2024-05-02T19:08:59Z) - Benign Overfitting in Deep Neural Networks under Lazy Training [72.28294823115502]
We show that when the data distribution is well-separated, DNNs can achieve Bayes-optimal test error for classification.
Our results indicate that interpolating with smoother functions leads to better generalization.
arXiv Detail & Related papers (2023-05-30T19:37:44Z) - Globally Optimal Training of Neural Networks with Threshold Activation
Functions [63.03759813952481]
We study weight decay regularized training problems of deep neural networks with threshold activations.
We derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network.
arXiv Detail & Related papers (2023-03-06T18:59:13Z) - Implicit Stochastic Gradient Descent for Training Physics-informed
Neural Networks [51.92362217307946]
Physics-informed neural networks (PINNs) have been shown to be effective in solving forward and inverse differential equation problems.
However, PINNs can become trapped in training failures when the target functions to be approximated exhibit high-frequency or multi-scale features.
In this paper, we propose to employ the implicit stochastic gradient descent (ISGD) method to train PINNs, improving the stability of the training process.
arXiv Detail & Related papers (2023-03-03T08:17:47Z) - Stability and Generalization Analysis of Gradient Methods for Shallow
Neural Networks [59.142826407441106]
We study the generalization behavior of shallow neural networks (SNNs) by leveraging the concept of algorithmic stability.
We consider gradient descent (GD) and stochastic gradient descent (SGD) to train SNNs, for both of which we develop consistent excess risk bounds.
arXiv Detail & Related papers (2022-09-19T18:48:00Z) - Approximation Power of Deep Neural Networks: an explanatory mathematical
survey [0.0]
The goal of this survey is to present an explanatory review of the approximation properties of deep neural networks.
We aim at understanding how and why deep neural networks outperform other classical linear and nonlinear approximation methods.
arXiv Detail & Related papers (2022-07-19T18:47:44Z) - Momentum Diminishes the Effect of Spectral Bias in Physics-Informed
Neural Networks [72.09574528342732]
Physics-informed neural network (PINN) algorithms have shown promising results in solving a wide range of problems involving partial differential equations (PDEs).
They often fail to converge to desirable solutions when the target function contains high-frequency features, due to a phenomenon known as spectral bias.
In the present work, we exploit neural tangent kernels (NTKs) to investigate the training dynamics of PINNs evolving under stochastic gradient descent with momentum (SGDM).
arXiv Detail & Related papers (2022-06-29T19:03:10Z) - Fractal Structure and Generalization Properties of Stochastic
Optimization Algorithms [71.62575565990502]
We prove that the generalization error of an optimization algorithm can be bounded based on the 'complexity' of the fractal structure that underlies its generalization measure.
We further specialize our results to specific problems (e.g., linear/logistic regression, one-hidden-layer neural networks) and algorithms.
arXiv Detail & Related papers (2021-06-09T08:05:36Z) - Analytical aspects of non-differentiable neural networks [0.0]
We discuss the expressivity of quantized neural networks and approximation techniques for non-differentiable networks.
We show that QNNs have the same expressivity as DNNs in terms of approximation of Lipschitz functions in the $L_{\infty}$ norm.
We also consider networks defined by means of Heaviside-type activation functions, and prove for them a pointwise approximation result by means of smooth networks (see the sketch following this entry).
arXiv Detail & Related papers (2020-11-03T17:20:43Z)
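As a small illustration of the last entry's pointwise-approximation result, here is a minimal single-unit sketch (a generic textbook-style construction under assumed parameters, not the paper's proof): a Heaviside threshold unit $H(x) = \mathbb{1}[x \geq 0]$ is approximated pointwise, away from its jump, by a smooth sigmoid unit $\sigma(kx)$ as the steepness $k$ grows.

```python
import numpy as np

# Illustrative sketch (assumed setup, not the paper's construction):
# sigma(k * x) -> H(x) pointwise as k -> infinity for every x != 0;
# at the jump x = 0, sigma(0) = 1/2 regardless of k.

def heaviside(x):
    return (x >= 0).astype(float)

def sigmoid(z):
    z = np.clip(z, -60.0, 60.0)  # avoid overflow in exp for large |z|
    return 1.0 / (1.0 + np.exp(-z))

xs = np.array([-1.0, -0.1, -0.01, 0.01, 0.1, 1.0])  # points away from the jump
for k in (1, 10, 100, 1000):
    err = np.max(np.abs(sigmoid(k * xs) - heaviside(xs)))
    print(f"k={k:5d}  max pointwise error on xs: {err:.6f}")
```

The printed error shrinks as k grows, matching the pointwise (but not uniform) nature of the approximation.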
This list is automatically generated from the titles and abstracts of the papers on this site.