Related papers: Learning a Sparse Neural Network using IHT

Learning a Sparse Neural Network using IHT

URL: http://arxiv.org/abs/2404.18414v2
Date: Wed, 17 Jul 2024 16:51:36 GMT
Title: Learning a Sparse Neural Network using IHT
Authors: Saeed Damadi, Soroush Zolfaghari, Mahdi Rezaie, Jinglai Shen,
Abstract summary: This paper relies on results from the domain of advanced sparse optimization, particularly those addressing nonlinear differentiable functions. As computational power for training NNs increases, so does the complexity of the models in terms of a higher number of parameters. This paper aims to investigate whether the theoretical prerequisites for such convergence are applicable in the realm of neural network (NN) training.
Score: 1.124958340749622
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The core of a good model is in its ability to focus only on important information that reflects the basic patterns and consistencies, thus pulling out a clear, noise-free signal from the dataset. This necessitates using a simplified model defined by fewer parameters. The importance of theoretical foundations becomes clear in this context, as this paper relies on established results from the domain of advanced sparse optimization, particularly those addressing nonlinear differentiable functions. The need for such theoretical foundations is further highlighted by the trend that as computational power for training NNs increases, so does the complexity of the models in terms of a higher number of parameters. In practical scenarios, these large models are often simplified to more manageable versions with fewer parameters. Understanding why these simplified models with less number of parameters remain effective raises a crucial question. Understanding why these simplified models with fewer parameters remain effective raises an important question. This leads to the broader question of whether there is a theoretical framework that can clearly explain these empirical observations. Recent developments, such as establishing necessary conditions for the convergence of iterative hard thresholding (IHT) to a sparse local minimum (a sparse method analogous to gradient descent) are promising. The remarkable capacity of the IHT algorithm to accurately identify and learn the locations of nonzero parameters underscores its practical effectiveness and utility. This paper aims to investigate whether the theoretical prerequisites for such convergence are applicable in the realm of neural network (NN) training by providing justification for all the necessary conditions for convergence. Then, these conditions are validated by experiments on a single-layer NN, using the IRIS dataset as a testbed.

Related papers

Global Convergence and Rich Feature Learning in $L$-Layer Infinite-Width Neural Networks under $μ$P Parametrization [66.03821840425539]
In this paper, we investigate the training dynamics of $L$-layer neural networks using the tensor gradient program (SGD) framework. We show that SGD enables these networks to learn linearly independent features that substantially deviate from their initial values. This rich feature space captures relevant data information and ensures that any convergent point of the training process is a global minimum.
arXiv Detail & Related papers (2025-03-12T17:33:13Z)
Preconditioned Inexact Stochastic ADMM for Deep Model [35.37705488695026]
This paper develops an algorithm, PISA, which enables scalable parallel computing and supports various preconditions.<n>It converges under the sole assumption of Lipschitz continuity of the gradient on a bounded region, removing the need for other conditions commonly imposed by methods.<n>It demonstrates its superior numerical performance compared to various state-of-the-art iterations.
arXiv Detail & Related papers (2025-02-15T12:28:51Z)
Just How Flexible are Neural Networks in Practice? [89.80474583606242]
It is widely believed that a neural network can fit a training set containing at least as many samples as it has parameters. In practice, however, we only find solutions via our training procedure, including the gradient and regularizers, limiting flexibility.
arXiv Detail & Related papers (2024-06-17T12:24:45Z)
A PAC-Bayesian Perspective on the Interpolating Information Criterion [54.548058449535155]
We show how a PAC-Bayes bound is obtained for a general class of models, characterizing factors which influence performance in the interpolating regime. We quantify how the test error for overparameterized models achieving effectively zero training error depends on the quality of the implicit regularization imposed by e.g. the combination of model, parameter-initialization scheme.
arXiv Detail & Related papers (2023-11-13T01:48:08Z)
Benign Overfitting in Deep Neural Networks under Lazy Training [72.28294823115502]
We show that when the data distribution is well-separated, DNNs can achieve Bayes-optimal test error for classification. Our results indicate that interpolating with smoother functions leads to better generalization.
arXiv Detail & Related papers (2023-05-30T19:37:44Z)
Beyond the Universal Law of Robustness: Sharper Laws for Random Features and Neural Tangent Kernels [14.186776881154127]
This paper focuses on empirical risk minimization in two settings, namely, random features and the neural tangent kernel (NTK) We prove that, for random features, the model is not robust for any degree of over- parameterization, even when the necessary condition coming from the universal law of robustness is satisfied. Our results are corroborated by numerical evidence on both synthetic and standard prototypical datasets.
arXiv Detail & Related papers (2023-02-03T09:58:31Z)
Learning Low Dimensional State Spaces with Overparameterized Recurrent Neural Nets [57.06026574261203]
We provide theoretical evidence for learning low-dimensional state spaces, which can also model long-term memory. Experiments corroborate our theory, demonstrating extrapolation via learning low-dimensional state spaces with both linear and non-linear RNNs.
arXiv Detail & Related papers (2022-10-25T14:45:15Z)
A Farewell to the Bias-Variance Tradeoff? An Overview of the Theory of Overparameterized Machine Learning [37.01683478234978]
The rapid recent progress in machine learning (ML) has raised a number of scientific questions that challenge the longstanding dogma of the field. One of the most important riddles is the good empirical generalization of over parameterized models.
arXiv Detail & Related papers (2021-09-06T10:48:40Z)
Provable Benefits of Overparameterization in Model Compression: From Double Descent to Pruning Neural Networks [38.153825455980645]
Recent empirical evidence indicates that the practice of overization not only benefits training large models, but also assists - perhaps counterintuitively - building lightweight models. This paper sheds light on these empirical findings by theoretically characterizing the high-dimensional toolsets of model pruning. We analytically identify regimes in which, even if the location of the most informative features is known, we are better off fitting a large model and then pruning.
arXiv Detail & Related papers (2020-12-16T05:13:30Z)
DessiLBI: Exploring Structural Sparsity of Deep Networks via Differential Inclusion Paths [45.947140164621096]
We propose a new approach based on differential inclusions of inverse scale spaces. We show that DessiLBI unveils "winning tickets" in early epochs.
arXiv Detail & Related papers (2020-07-04T04:40:16Z)
Modeling from Features: a Mean-field Framework for Over-parameterized Deep Neural Networks [54.27962244835622]
This paper proposes a new mean-field framework for over- parameterized deep neural networks (DNNs) In this framework, a DNN is represented by probability measures and functions over its features in the continuous limit. We illustrate the framework via the standard DNN and the Residual Network (Res-Net) architectures.
arXiv Detail & Related papers (2020-07-03T01:37:16Z)
Kernel and Rich Regimes in Overparametrized Models [69.40899443842443]
We show that gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms. We also demonstrate this transition empirically for more complex matrix factorization models and multilayer non-linear networks.
arXiv Detail & Related papers (2020-02-20T15:43:02Z)
Learning CHARME models with neural networks [1.5362025549031046]
We consider a model called CHARME (Conditional Heteroscedastic Autoregressive Mixture of Experts) As an application, we develop a learning theory for the NN-based autoregressive functions of the model.
arXiv Detail & Related papers (2020-02-08T21:51:02Z)

This list is automatically generated from the titles and abstracts of the papers in this site.