Learning a Single Neuron for Non-monotonic Activation Functions
- URL: http://arxiv.org/abs/2202.08064v1
- Date: Wed, 16 Feb 2022 13:44:25 GMT
- Title: Learning a Single Neuron for Non-monotonic Activation Functions
- Authors: Lei Wu
- Abstract summary: Non-monotonic activation functions outperform the traditional monotonic ones in many applications.
We show that mild conditions on $\sigma$ are sufficient to guarantee learnability in polynomial time and with polynomially many samples.
We also discuss how our positive results are related to existing negative results on training two-layer neural networks.
- Score: 3.890410443467757
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the problem of learning a single neuron $\mathbf{x}\mapsto
\sigma(\mathbf{w}^T\mathbf{x})$ with gradient descent (GD). All the existing
positive results are limited to the case where $\sigma$ is monotonic. However,
it has recently been observed that non-monotonic activation functions outperform the
traditional monotonic ones in many applications. To fill this gap, we establish
learnability without assuming monotonicity. Specifically, when the input
distribution is the standard Gaussian, we show that mild conditions on $\sigma$
(e.g., $\sigma$ has a dominating linear part) are sufficient to guarantee the
learnability in polynomial time and polynomial samples. Moreover, with a
stronger assumption on the activation function, the condition on the input
distribution can be relaxed to non-degeneracy of the marginal distribution.
We remark that our conditions on $\sigma$ are satisfied by practical
non-monotonic activation functions, such as SiLU/Swish and GELU. We also
discuss how our positive results are related to existing negative results on
training two-layer neural networks.
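To make the setting above concrete, here is a minimal Python sketch (not code from the paper) of gradient descent on the empirical squared loss for a single SiLU neuron with standard Gaussian inputs; the dimension, sample size, step size, and iteration count are arbitrary choices for illustration.

```python
import numpy as np

def silu(z):
    # SiLU/Swish: z * sigmoid(z), a non-monotonic activation with a dominating linear part
    return z / (1.0 + np.exp(-z))

def silu_grad(z):
    # derivative of SiLU: sigmoid(z) + z * sigmoid(z) * (1 - sigmoid(z))
    s = 1.0 / (1.0 + np.exp(-z))
    return s + z * s * (1.0 - s)

rng = np.random.default_rng(0)
d, n = 20, 5000                      # illustrative dimension and sample size
w_star = rng.standard_normal(d)
w_star /= np.linalg.norm(w_star)     # ground-truth neuron
X = rng.standard_normal((n, d))      # standard Gaussian inputs, as in the paper's main setting
y = silu(X @ w_star)                 # noiseless labels from the target neuron

w = 0.1 * rng.standard_normal(d)     # small random initialization
lr = 0.05
for _ in range(1000):                # plain GD on the empirical squared loss
    z = X @ w
    residual = silu(z) - y
    w -= lr * (residual * silu_grad(z)) @ X / n

print("distance to ground truth:", np.linalg.norm(w - w_star))
```

The sketch only mirrors the paper's setting (noiseless labels, Gaussian inputs, plain GD); it does not reproduce the paper's analysis or its precise conditions on $\sigma$.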
Related papers
- A Mathematical Certification for Positivity Conditions in Neural Networks with Applications to Partial Monotonicity and Ethical AI [48.623199394622546]
LipVor is an algorithm that certifies, from a finite number of evaluations, whether a black-box model such as an ANN is positive.
Our approach can mathematically certify whether an ANN is partially monotonic without requiring constrained architectures or piecewise-linear activation functions.
arXiv Detail & Related papers (2024-06-12T07:33:38Z) - Interpolation with deep neural networks with non-polynomial activations: necessary and sufficient numbers of neurons [0.0]
We prove that $\Theta(\sqrt{nd'})$ neurons are sufficient as long as the activation function is real analytic at a point and not a polynomial there.
This means that activation functions can be freely chosen in a problem-dependent manner without loss of power.
arXiv Detail & Related papers (2024-05-22T15:29:45Z) - Learning a Single Neuron with Adversarial Label Noise via Gradient
Descent [50.659479930171585]
We study a function of the form $\mathbf{x} \mapsto \sigma(\mathbf{w} \cdot \mathbf{x})$ for monotone activations.
The goal of the learner is to output a hypothesis vector $\mathbf{w}$ such that $F(\mathbf{w}) = C \cdot \mathrm{OPT} + \epsilon$ with high probability.
arXiv Detail & Related papers (2022-06-17T17:55:43Z) - Constrained Monotonic Neural Networks [0.685316573653194]
Wider adoption of neural networks in many critical domains such as finance and healthcare is being hindered by the need to explain their predictions.
A monotonicity constraint is one of the most requested properties in real-world scenarios.
We show it can approximate any continuous monotone function on a compact subset of $\mathbb{R}^n$.
arXiv Detail & Related papers (2022-05-24T04:26:10Z) - Deformed semicircle law and concentration of nonlinear random matrices
for ultra-wide neural networks [29.03095282348978]
We study the limiting spectral distributions of two empirical kernel matrices associated with $f(X)$.
We show that random feature regression induced by the empirical kernel achieves the same performance as its limiting kernel regression under the ultra-wide regime.
arXiv Detail & Related papers (2021-09-20T05:25:52Z) - Learning a Single Neuron with Bias Using Gradient Descent [53.15475693468925]
We study the fundamental problem of learning a single neuron with a bias term.
We show that this is a significantly different and more challenging problem than the bias-less case.
arXiv Detail & Related papers (2021-06-02T12:09:55Z) - Learning Over-Parametrized Two-Layer ReLU Neural Networks beyond NTK [58.5766737343951]
We consider the dynamics of gradient descent for learning a two-layer neural network.
We show that an over-parametrized two-layer neural network can provably learn a ground-truth network with gradient descent, achieving small loss with a sample complexity beyond the reach of Neural Tangent Kernel analysis.
arXiv Detail & Related papers (2020-07-09T07:09:28Z) - Agnostic Learning of a Single Neuron with Gradient Descent [92.7662890047311]
We consider the problem of learning the best-fitting single neuron as measured by the expected square loss.
For the ReLU activation, our population risk guarantee is $O(\mathsf{OPT}^{1/2}) + \epsilon$.
arXiv Detail & Related papers (2020-05-29T07:20:35Z) - Learning a Single Neuron with Gradient Methods [39.291483556116454]
We consider the fundamental problem of learning a single neuron $x \mapsto \sigma(w^\top x)$ using standard gradient methods.
We ask whether a more general result is attainable, under milder assumptions.
arXiv Detail & Related papers (2020-01-15T10:02:45Z) - Gaussian Error Linear Units (GELUs) [58.195342948092964]
We propose a neural network activation function that weights inputs by their value, rather than gating them by their sign as ReLUs do (see the note after this list).
We find performance improvements across all considered computer vision, natural language processing, and speech tasks.
arXiv Detail & Related papers (2016-06-27T19:20:40Z)
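For reference (standard definitions, not taken from the listing above): the GELU in the last entry is $\mathrm{GELU}(x) = x\,\Phi(x)$, where $\Phi$ is the standard Gaussian CDF, so the input is weighted by the probability $\Phi(x)$ rather than gated by the hard indicator $\mathbf{1}\{x \ge 0\}$ used in ReLU; SiLU/Swish is the analogous $x\,\sigma_{\mathrm{logistic}}(x)$. Both dip slightly below zero for negative inputs, hence are non-monotonic, and grow like the identity for large positive inputs, which is consistent with the "dominating linear part" condition mentioned in the abstract above.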