Gaussian Error Linear Units (GELUs)
- URL: http://arxiv.org/abs/1606.08415v5
- Date: Tue, 6 Jun 2023 01:53:32 GMT
- Title: Gaussian Error Linear Units (GELUs)
- Authors: Dan Hendrycks and Kevin Gimpel
- Abstract summary: We propose a neural network activation function that weights inputs by their value, rather than gates by their sign.
We find performance improvements across all considered computer vision, natural language processing, and speech tasks.
- Score: 58.195342948092964
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose the Gaussian Error Linear Unit (GELU), a high-performing neural
network activation function. The GELU activation function is $x\Phi(x)$, where
$\Phi(x)$ is the standard Gaussian cumulative distribution function. The GELU
nonlinearity weights inputs by their value, rather than gates inputs by their
sign as in ReLUs ($x\mathbf{1}_{x>0}$). We perform an empirical evaluation of
the GELU nonlinearity against the ReLU and ELU activations and find performance
improvements across all considered computer vision, natural language
processing, and speech tasks.
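As a quick illustration of $x\Phi(x)$, here is a minimal Python sketch of GELU using the exact erf-based form together with the tanh approximation discussed in the paper; the function names are ours.

```python
import math

def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), where Phi is the standard Gaussian CDF."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x: float) -> float:
    """Tanh approximation: 0.5*x*(1 + tanh(sqrt(2/pi)*(x + 0.044715*x^3)))."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

if __name__ == "__main__":
    for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
        print(f"x={x:+.1f}  gelu={gelu(x):+.4f}  tanh approx={gelu_tanh(x):+.4f}")
```

Unlike the ReLU, the function is smooth and non-monotonic near zero, since small negative inputs are weighted by a small but nonzero $\Phi(x)$ rather than hard-gated to zero.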
Related papers
- Gompertz Linear Units: Leveraging Asymmetry for Enhanced Learning Dynamics [39.0860823332923]
GoLU is a novel self-gated activation function defined as $\mathrm{GoLU}(x) = x \, \mathrm{Gompertz}(x)$, where $\mathrm{Gompertz}(x) = e^{-e^{-x}}$.
GoLU's superior performance relative to state-of-the-art activation functions highlights it as a robust alternative to existing activations.
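From the definition quoted above, a minimal Python sketch of GoLU could look like this; the function name is ours.

```python
import math

def golu(x: float) -> float:
    """GoLU(x) = x * Gompertz(x), with Gompertz(x) = exp(-exp(-x))."""
    return x * math.exp(-math.exp(-x))

if __name__ == "__main__":
    for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
        print(f"golu({x:+.1f}) = {golu(x):+.4f}")
```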
arXiv Detail & Related papers (2025-02-05T22:32:22Z)
- Deriving Activation Functions Using Integration
We introduce the Expanded Integral of the Exponential Linear Unit (xIELU), a trainable piecewise activation function derived by integrating trainable affine transformations.
xIELU combines two key properties for its gradient: (1) a trainable and linearly increasing gradient for positive inputs, similar to Squared ReLU (ReLU$^2$), and (2) a trainable gradient that can take negative values for negative inputs, inspired by Expanded SiLU (xSiLU).
In experiments with 1.1B and 3B parameter Llama models trained on 125B tokens of FineWeb Edu, xIELU achieves lower ...
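The exact xIELU formula is not given in this summary. As an illustration only, here is a sketch of a piecewise activation with the two gradient properties described above; the parameterization, default values, and function name are our assumptions and may differ from the paper's definition.

```python
import math

def xielu_like(x: float, alpha_p: float = 0.8, alpha_n: float = 0.8, beta: float = 0.5) -> float:
    """Illustrative piecewise activation, NOT the paper's exact xIELU formula.

    x > 0:  alpha_p*x^2 + beta*x, so the gradient 2*alpha_p*x + beta grows
            linearly with x (ReLU^2-like behaviour).
    x <= 0: alpha_n*(exp(x) - 1 - x) + beta*x, so the gradient
            alpha_n*(exp(x) - 1) + beta can turn negative for very negative x
            whenever beta < alpha_n (xSiLU-like behaviour).
    """
    if x > 0:
        return alpha_p * x * x + beta * x
    return alpha_n * (math.exp(x) - 1.0 - x) + beta * x
```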
arXiv Detail & Related papers (2024-11-20T03:24:21Z)
- ReLU$^2$ Wins: Discovering Efficient Activation Functions for Sparse LLMs [91.31204876440765]
We introduce a general method that defines neuron activation through neuron output magnitudes and a tailored magnitude threshold.
To find the most efficient activation function for sparse computation, we propose a systematic framework.
We conduct thorough experiments on LLMs utilizing different activation functions, including ReLU, SwiGLU, ReGLU, and ReLU$^2$.
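As a rough illustration of the magnitude-threshold notion of neuron activation described above, here is a sketch combining a ReLU$^2$ activation with magnitude-based sparsification; the threshold value, array shapes, and helper names are our assumptions, not values from the paper.

```python
import numpy as np

def relu_squared(x: np.ndarray) -> np.ndarray:
    """ReLU^2 activation: max(x, 0) squared."""
    return np.square(np.maximum(x, 0.0))

def sparsify_by_magnitude(h: np.ndarray, threshold: float = 1e-2) -> np.ndarray:
    """Treat neurons whose output magnitude falls below `threshold` as inactive
    (zeroed) so downstream computation could skip them; the threshold here is a
    placeholder, not a tuned value."""
    mask = np.abs(h) >= threshold
    return h * mask

# Toy usage: hidden pre-activations for a batch of 4 tokens, 8 neurons.
rng = np.random.default_rng(0)
pre = rng.normal(size=(4, 8))
post = relu_squared(pre)
sparse = sparsify_by_magnitude(post)
print("active fraction:", float((sparse != 0).mean()))
```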
arXiv Detail & Related papers (2024-02-06T08:45:51Z)
- A Unifying Generator Loss Function for Generative Adversarial Networks [5.5575224613422725]
A unifying $\alpha$-parametrized generator loss function is introduced for a dual-objective generative adversarial network (GAN).
The generator loss function is based on a symmetric class probability estimation type function, $\mathcal{L}_\alpha$, and the resulting GAN system is termed $\mathcal{L}_\alpha$-GAN.
arXiv Detail & Related papers (2023-08-14T16:16:31Z)
- Efficiently Learning One-Hidden-Layer ReLU Networks via Schur Polynomials [50.90125395570797]
We study the problem of PAC learning a linear combination of $k$ ReLU activations under the standard Gaussian distribution on $\mathbb{R}^d$ with respect to the square loss.
Our main result is an efficient algorithm for this learning task with sample and computational complexity $(dk/\epsilon)^{O(k)}$, where $\epsilon > 0$ is the target accuracy.
arXiv Detail & Related papers (2023-07-24T14:37:22Z)
- Graph-adaptive Rectified Linear Unit for Graph Neural Networks [64.92221119723048]
Graph Neural Networks (GNNs) have achieved remarkable success by extending traditional convolution to learning on non-Euclidean data.
We propose the Graph-adaptive Rectified Linear Unit (GReLU), a new parametric activation function that incorporates neighborhood information in a novel and efficient way.
We conduct comprehensive experiments to show that our plug-and-play GReLU method is efficient and effective given different GNN backbones and various downstream tasks.
arXiv Detail & Related papers (2022-02-13T10:54:59Z)
- Convergence and Sample Complexity of SGD in GANs [15.25030172685628]
We provide convergence guarantees on training Generative Adversarial Networks (GANs) via SGD.
We consider learning a target distribution modeled by a 1-layer Generator network with a non-linear activation function.
Our results apply to a broad class of non-linear activation functions $\phi$, including ReLUs, and are enabled by a connection with truncated statistics.
arXiv Detail & Related papers (2020-12-01T18:50:38Z)
- Optimal Robust Linear Regression in Nearly Linear Time [97.11565882347772]
We study the problem of high-dimensional robust linear regression where a learner is given access to $n$ samples from the generative model $Y = \langle X, w^* \rangle + \epsilon$.
We propose estimators for this problem under two settings: (i) $X$ is $L_4$-$L_2$ hypercontractive, $\mathbb{E}[XX^\top]$ has bounded condition number, and $\epsilon$ has bounded variance, and (ii) $X$ is sub-Gaussian with identity second moment and $\epsilon$ is ...
arXiv Detail & Related papers (2020-07-16T06:44:44Z)
- Overcoming Overfitting and Large Weight Update Problem in Linear Rectifiers: Thresholded Exponential Rectified Linear Units [0.0]
"Thresholded exponential rectified linear unit" (TERELU) activation function works better in alleviating in overfitting: large weight update problem.
We will show better performance on the various using neural networks, considering TERELU activation method compared to other activation datasets.
arXiv Detail & Related papers (2020-06-04T11:55:47Z) - Agnostic Learning of a Single Neuron with Gradient Descent [92.7662890047311]
We consider the problem of learning the best-fitting single neuron as measured by the expected square loss.
For the ReLU activation, our population risk guarantee is $O(\mathsf{OPT}^{1/2}) + \epsilon$.
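To make the setting concrete, here is a minimal sketch of fitting a single ReLU neuron under the square loss by gradient descent on samples; the data distribution, step size, and initialization are our assumptions, not the paper's.

```python
import numpy as np

def fit_single_relu_neuron(X: np.ndarray, y: np.ndarray,
                           lr: float = 0.05, steps: int = 500) -> np.ndarray:
    """Gradient descent on the empirical square loss of a single ReLU neuron
    x -> max(<w, x>, 0), using subgradient 0 where <w, x> <= 0."""
    n, d = X.shape
    # Small random init: w = 0 is a stationary point of this loss.
    w = np.random.default_rng(0).normal(scale=0.1, size=d)
    for _ in range(steps):
        z = X @ w
        residual = np.maximum(z, 0.0) - y
        grad = (2.0 / n) * ((residual * (z > 0))[:, None] * X).sum(axis=0)
        w -= lr * grad
    return w

# Toy agnostic setting: labels from a planted ReLU neuron plus label noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 5))
w_star = rng.normal(size=5)
y = np.maximum(X @ w_star, 0.0) + 0.1 * rng.normal(size=2000)
w_hat = fit_single_relu_neuron(X, y)
cos = w_hat @ w_star / (np.linalg.norm(w_hat) * np.linalg.norm(w_star) + 1e-12)
print("cosine similarity to planted neuron:", float(cos))
```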
arXiv Detail & Related papers (2020-05-29T07:20:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.