An Empirical Study of the Occurrence of Heavy-Tails in Training a ReLU
Gate
- URL: http://arxiv.org/abs/2204.12554v1
- Date: Tue, 26 Apr 2022 19:28:51 GMT
- Title: An Empirical Study of the Occurrence of Heavy-Tails in Training a ReLU
Gate
- Authors: Sayar Karmakar and Anirbit Mukherjee
- Abstract summary: We conjecture that S.G.D. and a provably convergent variant of it have similar heavy-tail behaviour on any data where the latter can be proven to converge.
We demonstrate that the heavy-tail index of the late time iterates in this model scenario has strikingly different properties than either what has been proven for linear hypothesis classes or what has been previously demonstrated for large nets.
- Score: 0.7614628596146599
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A particular direction of recent advance about stochastic deep-learning
algorithms has been about uncovering a rather mysterious heavy-tailed nature of
the stationary distribution of these algorithms, even when the data
distribution is not so. Moreover, the heavy-tail index is known to show
interesting dependence on the input dimension of the net, the mini-batch size
and the step size of the algorithm. In this short note, we undertake an
experimental study of this index for S.G.D. while training a ReLU gate (in
the realizable and in the binary classification setup) and for a variant of
S.G.D. that was proven to converge in Karmakar and Mukherjee (2022) for ReLU realizable
data. From our experiments we conjecture that these two algorithms have similar
heavy-tail behaviour on any data where the latter can be proven to converge.
Secondly, we demonstrate that the heavy-tail index of the late time iterates in
this model scenario has strikingly different properties than either what has
been proven for linear hypothesis classes or what has been previously
demonstrated for large nets.
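A minimal sketch of the kind of experiment the abstract describes is given below: mini-batch SGD trains a single ReLU gate on realizable Gaussian data, and a Hill-type estimator is then applied to the late-time iterates. The Gaussian input model, squared loss, all hyperparameters, and the choice of statistic fed to the Hill estimator are illustrative assumptions; the note's actual setups and tail-index estimator may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

d, eta, batch = 10, 0.05, 4
T, burn_in = 20_000, 10_000
w_star = rng.normal(size=d) / np.sqrt(d)     # true gate (assumed realizable setup)
w = rng.normal(size=d) / np.sqrt(d)          # initial iterate

tail_samples = []
for t in range(T):
    X = rng.normal(size=(batch, d))          # assumed standard Gaussian inputs
    y = relu(X @ w_star)                     # realizable labels from the true gate
    # mini-batch SGD on the squared loss; ReLU'(z) is taken as 1{z > 0}
    err = relu(X @ w) - y
    grad = (err[:, None] * (X @ w > 0)[:, None] * X).mean(axis=0)
    w -= eta * grad
    if t >= burn_in:
        tail_samples.append(np.linalg.norm(w - w_star))

# Hill estimator of the tail index on the k largest late-time deviations
x = np.sort(tail_samples)[::-1]
k = 500
alpha_hat = 1.0 / np.mean(np.log(x[:k] / x[k]))
print("estimated heavy-tail index:", round(float(alpha_hat), 2))
```

Repeating such a run across step sizes, mini-batch sizes, and input dimensions is what allows the dependence of the estimated index on these quantities to be probed.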
Related papers
- Random features models: a way to study the success of naive imputation [0.0]
Constant (naive) imputation is still widely used in practice, as it is an easy-to-use first technique for dealing with missing data.
Recent works suggest that the bias it induces is low in the context of high-dimensional linear predictors.
This paper confirms the intuition that the bias is negligible and that, surprisingly, naive imputation also remains relevant in very low dimensions.
arXiv Detail & Related papers (2024-02-06T09:37:06Z) - Regularization-Based Methods for Ordinal Quantification [49.606912965922504]
We study the ordinal case, i.e., the case in which a total order is defined on the set of n>2 classes.
We propose a novel class of regularized OQ algorithms, which outperforms existing algorithms in our experiments.
arXiv Detail & Related papers (2023-10-13T16:04:06Z) - Escaping mediocrity: how two-layer networks learn hard generalized
linear models with SGD [29.162265194920522]
This study explores the sample complexity for two-layer neural networks to learn a generalized linear target function under Stochastic Gradient Descent (SGD).
We show that overparametrization can only enhance convergence by a constant factor within this problem class.
Yet, we demonstrate that a deterministic approximation of this process adequately represents the escape time, implying that the role of stochasticity may be minimal in this scenario.
arXiv Detail & Related papers (2023-05-29T14:40:56Z) - What learning algorithm is in-context learning? Investigations with
linear models [87.91612418166464]
We investigate the hypothesis that transformer-based in-context learners implement standard learning algorithms implicitly.
We show that trained in-context learners closely match the predictors computed by gradient descent, ridge regression, and exact least-squares regression.
We present preliminary evidence that in-context learners share algorithmic features with these predictors.
arXiv Detail & Related papers (2022-11-28T18:59:51Z) - Parametric Classification for Generalized Category Discovery: A Baseline
Study [70.73212959385387]
Generalized Category Discovery (GCD) aims to discover novel categories in unlabelled datasets using knowledge learned from labelled samples.
We investigate the failure of parametric classifiers, verify the effectiveness of previous design choices when high-quality supervision is available, and identify unreliable pseudo-labels as a key problem.
We propose a simple yet effective parametric classification method that benefits from entropy regularisation, achieves state-of-the-art performance on multiple GCD benchmarks and shows strong robustness to unknown class numbers.
arXiv Detail & Related papers (2022-11-21T18:47:11Z) - Heavy-Tail Phenomenon in Decentralized SGD [33.63000461985398]
We study the emergence of heavy tails in decentralized stochastic gradient descent (DE-SGD).
We also investigate the effect of decentralization on the tail behavior.
Our theory uncovers an interesting interplay between the tails and the network structure.
arXiv Detail & Related papers (2022-05-13T14:47:04Z) - On the Benefits of Large Learning Rates for Kernel Methods [110.03020563291788]
We show that the benefit of large learning rates can be precisely characterized in the context of kernel methods.
We consider the minimization of a quadratic objective in a separable Hilbert space, and show that with early stopping, the choice of learning rate influences the spectral decomposition of the obtained solution.
arXiv Detail & Related papers (2022-02-28T13:01:04Z) - Multi-scale Feature Learning Dynamics: Insights for Double Descent [71.91871020059857]
We study the phenomenon of "double descent" of the generalization error.
We find that double descent can be attributed to distinct features being learned at different scales.
arXiv Detail & Related papers (2021-12-06T18:17:08Z) - Theoretical Insights Into Multiclass Classification: A High-dimensional
Asymptotic View [82.80085730891126]
We provide the first precise high-dimensional asymptotic analysis of linear multiclass classification.
Our analysis reveals that the classification accuracy is highly distribution-dependent.
The insights gained may pave the way for a precise understanding of other classification algorithms.
arXiv Detail & Related papers (2020-11-16T05:17:29Z) - The Heavy-Tail Phenomenon in SGD [7.366405857677226]
We show that, depending on the structure of the Hessian of the loss at the minimum, the SGD iterates will converge to a heavy-tailed stationary distribution.
We translate our results into insights about the behavior of SGD in deep learning; a minimal one-dimensional simulation of this mechanism is sketched after this list.
arXiv Detail & Related papers (2020-06-08T16:43:56Z) - Provable Training of a ReLU Gate with an Iterative Non-Gradient
Algorithm [0.7614628596146599]
We show provable guarantees on the training of a single ReLU gate in hitherto unexplored regimes.
We show a first-of-its-kind approximate recovery of the true label generating parameters under an (online) data-poisoning attack on the true labels.
Our guarantee is shown to be nearly optimal in the worst case, and the accuracy of recovering the true weight degrades gracefully with the probability and magnitude of the attack; a sketch of this style of iterative non-gradient update appears after this list.
arXiv Detail & Related papers (2020-05-08T17:59:23Z)
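The mechanism referenced in "The Heavy-Tail Phenomenon in SGD" above can already be seen in one-dimensional Gaussian least squares, where the SGD recursion is of Kesten type and its stationary distribution acquires a power-law tail whose index depends on the step size and batch size. The simulation below is a hedged illustration of that mechanism only; the function name, the hyperparameters, and the use of a Hill estimator are choices made here, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def sgd_tail_index(eta, batch, T=60_000, burn_in=20_000, k=1000):
    """Hill estimate of the tail index of SGD's late-time iterates for
    1-D Gaussian least squares (a Kesten-type multiplicative recursion)."""
    theta, theta_star = 0.0, 1.0
    samples = []
    for t in range(T):
        a = rng.normal(size=batch)
        y = a * theta_star + rng.normal(size=batch)   # noisy linear labels
        grad = np.mean(a * (a * theta - y))           # mini-batch least-squares gradient
        theta -= eta * grad
        if t >= burn_in:
            samples.append(abs(theta - theta_star))
    x = np.sort(samples)[::-1]                        # largest deviations first
    return 1.0 / np.mean(np.log(x[:k] / x[k]))        # Hill estimator on the top k

# Larger step sizes should give heavier tails (smaller estimated index).
for eta in (0.3, 0.6, 0.9):
    print(f"eta = {eta}: alpha_hat = {sgd_tail_index(eta, batch=1):.2f}")
```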
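For "Provable Training of a ReLU Gate with an Iterative Non-Gradient Algorithm", the flavour of update studied in that line of work is GLM-Tron-like: the correction term drops the ReLU derivative that the true gradient would carry, which is what makes it a non-gradient iteration. The sketch below, including the additive label-poisoning model and every constant in it, is an assumption-laden illustration rather than the paper's exact algorithm, step-size schedule, or attack model.

```python
import numpy as np

rng = np.random.default_rng(2)

def relu(z):
    return np.maximum(z, 0.0)

d, eta, batch, T = 10, 0.1, 8, 5000
p_attack, attack_mag = 0.1, 0.5              # hypothetical poisoning parameters
w_star = rng.normal(size=d) / np.sqrt(d)     # true label-generating weight
w = np.zeros(d)

for t in range(T):
    X = rng.normal(size=(batch, d))
    y = relu(X @ w_star)                     # realizable ReLU labels
    # online data-poisoning: each label is additively perturbed with prob. p_attack
    hit = rng.random(batch) < p_attack
    y_obs = y + hit * rng.uniform(-attack_mag, attack_mag, size=batch)
    # GLM-Tron-style correction: note the absence of the ReLU derivative factor,
    # which is what makes this an iterative *non-gradient* update
    w += eta * ((y_obs - relu(X @ w))[:, None] * X).mean(axis=0)

print("distance to true weight:", np.linalg.norm(w - w_star))
```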