Deeper Learning with CoLU Activation
- URL: http://arxiv.org/abs/2112.12078v1
- Date: Sat, 18 Dec 2021 21:11:11 GMT
- Title: Deeper Learning with CoLU Activation
- Authors: Advait Vagerwal
- Abstract summary: CoLU is an activation function similar to Swish and Mish in properties.
It is observed that CoLU usually performs better than other functions on deeper neural networks.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In neural networks, non-linearity is introduced by activation functions. One
commonly used activation function is Rectified Linear Unit (ReLU). ReLU has
been a popular choice of activation but has flaws. State-of-the-art
functions like Swish and Mish are now gaining attention as better choices,
as they address many of the flaws of other activation functions. CoLU is an
activation function similar to Swish and Mish in properties. It is defined as
f(x) = x / (1 - x * e^(-(x + e^x))). It is smooth, continuously differentiable, unbounded
above, bounded below, non-saturating, and non-monotonic. Based on experiments
comparing CoLU with other activation functions, it is observed that CoLU
usually performs better than other functions on deeper neural networks. While
training neural networks on MNIST with an incrementally increasing
number of convolutional layers, CoLU retained the highest accuracy for more
layers. On a smaller network with 8 convolutional layers, CoLU had the highest
mean accuracy, closely followed by ReLU. On VGG-13 trained on Fashion-MNIST,
CoLU had a 4.20% higher accuracy than Mish and 3.31% higher accuracy than ReLU.
On ResNet-9 trained on CIFAR-10, CoLU had 0.05% higher accuracy than Swish,
0.09% higher accuracy than Mish, and 0.29% higher accuracy than ReLU. It is
observed that one activation function may outperform another depending on
factors such as the number of layers, types of layers, number of parameters,
learning rate, and optimizer. Further research on these factors and on
activation functions themselves can yield more optimal activation functions
and a better understanding of their behavior.
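As a concrete illustration of the definition above, the sketch below implements CoLU as a drop-in activation module. The paper specifies only the formula; the use of PyTorch, the class name CoLU, and the usage snippet are assumptions made here for illustration, not the authors' own implementation.
```python
# Minimal sketch of the CoLU activation as defined in the abstract,
# assuming PyTorch (the paper does not prescribe a framework).
import torch
import torch.nn as nn


class CoLU(nn.Module):
    """CoLU: f(x) = x / (1 - x * exp(-(x + exp(x)))).

    Smooth, continuously differentiable, unbounded above, bounded below,
    non-saturating, and non-monotonic, per the properties listed above.
    """

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x / (1.0 - x * torch.exp(-(x + torch.exp(x))))


# Hypothetical usage: swap CoLU in where ReLU would normally go, e.g. in a
# small convolutional stack of the kind described in the MNIST experiments.
block = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3), CoLU(),
    nn.Conv2d(32, 64, kernel_size=3), CoLU(),
)
```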
Related papers
- Sparsing Law: Towards Large Language Models with Greater Activation Sparsity [62.09617609556697]
Activation sparsity denotes the existence of substantial weakly-contributed elements within activation outputs that can be eliminated.
We propose PPL-p% sparsity, a precise and performance-aware activation sparsity metric.
We show that ReLU is more efficient as the activation function than SiLU and can leverage more training data to improve activation sparsity.
arXiv Detail & Related papers (2024-11-04T17:59:04Z)
- Activation function optimization method: Learnable series linear units (LSLUs) [12.089173508371246]
We propose a series-based learnable activation function called LSLU (Learnable Series Linear Units).
This method simplifies deep learning networks while improving accuracy.
We evaluate LSLU's performance on CIFAR10, CIFAR100, and specific task datasets (e.g., Silkworm)
arXiv Detail & Related papers (2024-08-28T11:12:27Z)
- ReLU^2 Wins: Discovering Efficient Activation Functions for Sparse LLMs [91.31204876440765]
We introduce a general method that defines neuron activation through neuron output magnitudes and a tailored magnitude threshold.
To find the most efficient activation function for sparse computation, we propose a systematic framework.
We conduct thorough experiments on LLMs utilizing different activation functions, including ReLU, SwiGLU, ReGLU, and ReLU^2.
arXiv Detail & Related papers (2024-02-06T08:45:51Z)
- TaLU: A Hybrid Activation Function Combining Tanh and Rectified Linear Unit to Enhance Neural Networks [1.3477333339913569]
TaLU is a modified activation function combining Tanh and ReLU, which mitigates the dying gradient problem of ReLU.
A deep learning model with the proposed activation function was tested on MNIST and CIFAR-10.
arXiv Detail & Related papers (2023-05-08T01:13:59Z)
- Globally Optimal Training of Neural Networks with Threshold Activation Functions [63.03759813952481]
We study weight decay regularized training problems of deep neural networks with threshold activations.
We derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network.
arXiv Detail & Related papers (2023-03-06T18:59:13Z)
- Activation Functions: Dive into an optimal activation function [1.52292571922932]
We find an optimal activation function by defining it as a weighted sum of existing activation functions.
The study uses three activation functions, ReLU, tanh, and sin, over three popular image datasets.
arXiv Detail & Related papers (2022-02-24T12:44:11Z)
- Growing Cosine Unit: A Novel Oscillatory Activation Function That Can Speedup Training and Reduce Parameters in Convolutional Neural Networks [0.1529342790344802]
Convolutional neural networks have been successful in solving many socially important and economically significant problems.
A key discovery that made training deep networks feasible was the adoption of the Rectified Linear Unit (ReLU) activation function.
The new activation function C(z) = z cos z outperforms Sigmoids, Swish, Mish, and ReLU on a variety of architectures (a sketch of C(z) appears after this list).
arXiv Detail & Related papers (2021-08-30T01:07:05Z)
- Piecewise Linear Units Improve Deep Neural Networks [0.0]
The activation function is at the heart of a deep neural network's nonlinearity.
Currently, many practitioners prefer the Rectified Linear Unit (ReLU) due to its simplicity and reliability.
We propose an adaptive piecewise linear activation function, the Piecewise Linear Unit (PiLU), which can be learned independently for each dimension of the neural network.
arXiv Detail & Related papers (2021-08-02T08:09:38Z)
- Comparisons among different stochastic selection of activation layers for convolutional neural networks for healthcare [77.99636165307996]
We classify biomedical images using ensembles of neural networks.
We select our activations among the following ones: ReLU, leaky ReLU, Parametric ReLU, ELU, Adaptive Piecewise Linear Unit, S-Shaped ReLU, Swish, Mish, Mexican Linear Unit, Parametric Deformable Linear Unit, Soft Root Sign.
arXiv Detail & Related papers (2020-11-24T01:53:39Z)
- Dynamic ReLU [74.973224160508]
We propose dynamic ReLU (DY-ReLU), a dynamic rectifier whose parameters are generated by a hyper function over all input elements.
Compared to its static counterpart, DY-ReLU has negligible extra computational cost, but significantly more representation capability.
By simply using DY-ReLU for MobileNetV2, the top-1 accuracy on ImageNet classification is boosted from 72.0% to 76.2% with only 5% additional FLOPs.
arXiv Detail & Related papers (2020-03-22T23:45:35Z)
- Gaussian Error Linear Units (GELUs) [58.195342948092964]
We propose a neural network activation function that weights inputs by their value, rather than gates by their sign.
We find performance improvements across all considered computer vision, natural language processing, and speech tasks.
arXiv Detail & Related papers (2016-06-27T19:20:40Z)
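For comparison with CoLU, here is a minimal sketch of two of the activations named in the list above: the Growing Cosine Unit, whose closed form C(z) = z cos z is stated in its summary, and GELU in its standard form x * Phi(x). The use of PyTorch and these class names are assumptions made here; details and parameterizations in the original papers may differ.
```python
# Sketches of two activations from the related papers above, assuming PyTorch.
import math

import torch
import torch.nn as nn


class GCU(nn.Module):
    """Growing Cosine Unit: C(z) = z * cos(z), the oscillatory activation
    described in the GCU entry above."""

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return z * torch.cos(z)


class GELU(nn.Module):
    """Gaussian Error Linear Unit: x * Phi(x), i.e. the input weighted by the
    standard normal CDF of its value rather than gated by its sign.
    (PyTorch also ships a built-in nn.GELU; this spells the formula out.)"""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))
```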