Learning a Single Neuron for Non-monotonic Activation Functions
- URL: http://arxiv.org/abs/2202.08064v1
- Date: Wed, 16 Feb 2022 13:44:25 GMT
- Title: Learning a Single Neuron for Non-monotonic Activation Functions
- Authors: Lei Wu
- Abstract summary: Non-monotonic activation functions outperform the traditional monotonic ones in many applications.
We show that mild conditions on $\sigma$ are sufficient to guarantee learnability in polynomial time and with polynomially many samples.
We also discuss how our positive results are related to existing negative results on training two-layer neural networks.
- Score: 3.890410443467757
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the problem of learning a single neuron $\mathbf{x}\mapsto
\sigma(\mathbf{w}^T\mathbf{x})$ with gradient descent (GD). All the existing
positive results are limited to the case where $\sigma$ is monotonic. However,
it has recently been observed that non-monotonic activation functions outperform the
traditional monotonic ones in many applications. To fill this gap, we establish
learnability without assuming monotonicity. Specifically, when the input
distribution is the standard Gaussian, we show that mild conditions on $\sigma$
(e.g., $\sigma$ has a dominating linear part) are sufficient to guarantee the
learnability in polynomial time and polynomial samples. Moreover, with a
stronger assumption on the activation function, the condition on the input
distribution can be relaxed to non-degeneracy of the marginal distribution.
We remark that our conditions on $\sigma$ are satisfied by practical
non-monotonic activation functions, such as SiLU/Swish and GELU. We also
discuss how our positive results are related to existing negative results on
training two-layer neural networks.
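To make the setting above concrete, here is a minimal Python sketch (not code from the paper) of gradient descent on the empirical squared loss for a single SiLU neuron with standard Gaussian inputs; the dimension, sample size, step size, and iteration count are arbitrary choices for illustration.

```python
import numpy as np

def silu(z):
    # SiLU/Swish: z * sigmoid(z), a non-monotonic activation with a dominating linear part
    return z / (1.0 + np.exp(-z))

def silu_grad(z):
    # derivative of SiLU: sigmoid(z) + z * sigmoid(z) * (1 - sigmoid(z))
    s = 1.0 / (1.0 + np.exp(-z))
    return s + z * s * (1.0 - s)

rng = np.random.default_rng(0)
d, n = 20, 5000                      # illustrative dimension and sample size
w_star = rng.standard_normal(d)
w_star /= np.linalg.norm(w_star)     # ground-truth neuron
X = rng.standard_normal((n, d))      # standard Gaussian inputs, as in the paper's main setting
y = silu(X @ w_star)                 # noiseless labels from the target neuron

w = 0.1 * rng.standard_normal(d)     # small random initialization
lr = 0.05
for _ in range(1000):                # plain GD on the empirical squared loss
    z = X @ w
    residual = silu(z) - y
    w -= lr * (residual * silu_grad(z)) @ X / n

print("distance to ground truth:", np.linalg.norm(w - w_star))
```

The sketch only mirrors the paper's setting (noiseless labels, Gaussian inputs, plain GD); it does not reproduce the paper's analysis or its precise conditions on $\sigma$.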
Related papers
- A Mathematical Certification for Positivity Conditions in Neural Networks with Applications to Partial Monotonicity and Ethical AI [48.623199394622546]
LipVor is an algorithm that certifies, from a finite number of evaluations, whether a black-box model such as an ANN is positive.
Our approach can mathematically certify whether an ANN is partially monotonic without requiring constrained architectures or piecewise-linear activation functions.
arXiv Detail & Related papers (2024-06-12T07:33:38Z) - Interpolation with deep neural networks with non-polynomial activations: necessary and sufficient numbers of neurons [0.0]
We prove that $\Theta(\sqrt{nd'})$ neurons are sufficient as long as the activation function is real analytic at a point and not a polynomial there.
This means that activation functions can be freely chosen in a problem-dependent manner without loss of power.
arXiv Detail & Related papers (2024-05-22T15:29:45Z) - Learning a Single Neuron with Adversarial Label Noise via Gradient
Descent [50.659479930171585]
We study a function of the form $\mathbf{x} \mapsto \sigma(\mathbf{w} \cdot \mathbf{x})$ for monotone activations.
The goal of the learner is to output a hypothesis vector $\mathbf{w}$ such that $F(\mathbf{w}) = C \cdot \mathrm{OPT} + \epsilon$ with high probability.
arXiv Detail & Related papers (2022-06-17T17:55:43Z) - Constrained Monotonic Neural Networks [0.685316573653194]
Wider adoption of neural networks in many critical domains such as finance and healthcare is being hindered by the need to explain their predictions.
A monotonicity constraint is one of the most requested properties in real-world scenarios.
We show it can approximate any continuous monotone function on a compact subset of $\mathbb{R}^n$.
arXiv Detail & Related papers (2022-05-24T04:26:10Z) - Deformed semicircle law and concentration of nonlinear random matrices
for ultra-wide neural networks [29.03095282348978]
We study the limiting spectral distributions of two empirical kernel matrices associated with $f(X)$.
We show that random feature regression induced by the empirical kernel achieves the same performance as its limiting kernel regression under the ultra-wide regime.
arXiv Detail & Related papers (2021-09-20T05:25:52Z) - Learning a Single Neuron with Bias Using Gradient Descent [53.15475693468925]
We study the fundamental problem of learning a single neuron with a bias term.
We show that this is a significantly different and more challenging problem than the bias-less case.
arXiv Detail & Related papers (2021-06-02T12:09:55Z) - Learning Over-Parametrized Two-Layer ReLU Neural Networks beyond NTK [58.5766737343951]
We consider the dynamics of gradient descent for learning a two-layer neural network.
We show that an over-parametrized two-layer neural network can provably learn a ground-truth network with gradient descent, achieving small loss with a sample complexity beyond the reach of Neural Tangent Kernel analysis.
arXiv Detail & Related papers (2020-07-09T07:09:28Z) - Agnostic Learning of a Single Neuron with Gradient Descent [92.7662890047311]
We consider the problem of learning the best-fitting single neuron as measured by the expected square loss.
For the ReLU activation, our population risk guarantee is $O(\mathsf{OPT}^{1/2}) + \epsilon$.
arXiv Detail & Related papers (2020-05-29T07:20:35Z) - Learning a Single Neuron with Gradient Methods [39.291483556116454]
We consider the fundamental problem of learning a single neuron $x \mapsto \sigma(w^\top x)$ using standard gradient methods.
We ask whether a more general result is attainable, under milder assumptions.
arXiv Detail & Related papers (2020-01-15T10:02:45Z) - Gaussian Error Linear Units (GELUs) [58.195342948092964]
We propose a neural network activation function that weights inputs by their value, rather than gating them by their sign as ReLUs do (see the note after this list).
We find performance improvements across all considered computer vision, natural language processing, and speech tasks.
arXiv Detail & Related papers (2016-06-27T19:20:40Z)
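For reference (standard definitions, not taken from the listing above): the GELU in the last entry is $\mathrm{GELU}(x) = x\,\Phi(x)$, where $\Phi$ is the standard Gaussian CDF, so the input is weighted by the probability $\Phi(x)$ rather than gated by the hard indicator $\mathbf{1}\{x \ge 0\}$ used in ReLU; SiLU/Swish is the analogous $x\,\sigma_{\mathrm{logistic}}(x)$. Both dip slightly below zero for negative inputs, hence are non-monotonic, and grow like the identity for large positive inputs, which is consistent with the "dominating linear part" condition mentioned in the abstract above.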