LiSHT: Non-Parametric Linearly Scaled Hyperbolic Tangent Activation
Function for Neural Networks
- URL: http://arxiv.org/abs/1901.05894v4
- Date: Fri, 17 Feb 2023 01:49:12 GMT
- Title: LiSHT: Non-Parametric Linearly Scaled Hyperbolic Tangent Activation
Function for Neural Networks
- Authors: Swalpa Kumar Roy, Suvojit Manna, Shiv Ram Dubey, Bidyut Baran
Chaudhuri
- Abstract summary: We propose a Linearly Scaled Hyperbolic Tangent (LiSHT) activation for Neural Networks (NNs), obtained by scaling Tanh linearly.
We observe superior performance using a Multi-Layer Perceptron (MLP), a Residual Network (ResNet) and a Long Short-Term Memory (LSTM) network on data classification, image classification and tweet classification tasks, respectively.
- Score: 14.943863837083496
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The activation function in a neural network introduces the non-linearity
required to deal with complex tasks. Several activation/non-linearity
functions have been developed for deep learning models. However, most existing
activation functions suffer from the dying gradient problem and do not
utilize large negative input values. In this paper, we propose a
Linearly Scaled Hyperbolic Tangent (LiSHT) activation for Neural Networks (NNs), obtained by scaling
Tanh linearly. The proposed LiSHT is non-parametric and tackles the dying
gradient problem. We perform experiments on benchmark datasets of different
types, such as vector data, image data and natural language data. We observe
superior performance using a Multi-Layer Perceptron (MLP), a Residual Network
(ResNet) and a Long Short-Term Memory (LSTM) network for data classification, image
classification and tweet classification tasks, respectively. The accuracy on the
CIFAR-100 dataset using a ResNet model with LiSHT is improved by 9.48, 3.40, 3.16,
4.26, and 1.17% compared to Tanh, ReLU, PReLU, LReLU, and Swish,
respectively. We also show qualitative results using the loss landscape, weight
distributions and activation maps in support of the proposed activation
function.
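For reference, LiSHT scales Tanh by its input, i.e. LiSHT(x) = x · tanh(x), so large negative inputs still yield positive activations and non-vanishing gradients. Below is a minimal, unofficial PyTorch sketch of the activation and a drop-in use in a toy MLP; the class name `LiSHT` and the layer sizes are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn


class LiSHT(nn.Module):
    """Linearly Scaled Hyperbolic Tangent: f(x) = x * tanh(x).

    Non-parametric and non-negative; its derivative
    tanh(x) + x * (1 - tanh(x)^2) does not vanish for large |x|
    the way the plain Tanh gradient does.
    """

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * torch.tanh(x)


# Illustrative drop-in use inside a small MLP (hypothetical sizes).
mlp = nn.Sequential(
    nn.Linear(784, 256),
    LiSHT(),
    nn.Linear(256, 10),
)

if __name__ == "__main__":
    x = torch.randn(4, 784)
    print(mlp(x).shape)  # torch.Size([4, 10])
```

Because the activation has no learnable parameters, it can replace ReLU or Tanh in existing architectures without changing optimizer settings or parameter counts.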
Related papers
- Just How Flexible are Neural Networks in Practice? [89.80474583606242]
It is widely believed that a neural network can fit a training set containing at least as many samples as it has parameters.
In practice, however, we only find solutions reachable via our training procedure, including the gradient-based optimizer and regularizers, which limits flexibility.
arXiv Detail & Related papers (2024-06-17T12:24:45Z)
- A Novel Explanation Against Linear Neural Networks [1.223779595809275]
Linear Regression and neural networks are widely used to model data.
We show that neural networks without activation functions, i.e. linear neural networks (LNNs), actually reduce both training and testing performance.
We prove this hypothesis through an analysis of LNN optimization and rigorous testing comparing the performance of LNNs and linear regression on noisy datasets.
arXiv Detail & Related papers (2023-12-30T09:44:51Z)
- ReLU soothes the NTK condition number and accelerates optimization for wide neural networks [9.374151703899047]
We show that ReLU leads to: (i) better separation for similar data, and (ii) better conditioning of the neural tangent kernel (NTK).
Our results imply that the ReLU activation, as well as the depth of a ReLU network, helps improve the gradient descent convergence rate.
arXiv Detail & Related papers (2023-05-15T17:22:26Z)
- Globally Optimal Training of Neural Networks with Threshold Activation Functions [63.03759813952481]
We study weight decay regularized training problems of deep neural networks with threshold activations.
We derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network.
arXiv Detail & Related papers (2023-03-06T18:59:13Z)
- Exploring Linear Feature Disentanglement For Neural Networks [63.20827189693117]
Non-linear activation functions, e.g., Sigmoid, ReLU, and Tanh, have achieved great success in neural networks (NNs).
Due to the complex non-linear characteristics of samples, the objective of those activation functions is to project samples from their original feature space to a linearly separable feature space.
This phenomenon ignites our interest in exploring whether all features need to be transformed by all non-linear functions in current typical NNs.
arXiv Detail & Related papers (2022-03-22T13:09:17Z)
- Graph-adaptive Rectified Linear Unit for Graph Neural Networks [64.92221119723048]
Graph Neural Networks (GNNs) have achieved remarkable success by extending traditional convolution to learning on non-Euclidean data.
We propose the Graph-adaptive Rectified Linear Unit (GReLU), a new parametric activation function that incorporates neighborhood information in a novel and efficient way.
We conduct comprehensive experiments to show that our plug-and-play GReLU method is efficient and effective given different GNN backbones and various downstream tasks.
arXiv Detail & Related papers (2022-02-13T10:54:59Z)
- Scaling Neural Tangent Kernels via Sketching and Random Features [53.57615759435126]
Recent works report that NTK regression can outperform finitely-wide neural networks trained on small-scale datasets.
We design a near input-sparsity time approximation algorithm for NTK, by sketching the expansions of arc-cosine kernels.
We show that a linear regressor trained on our CNTK features matches the accuracy of the exact CNTK on the CIFAR-10 dataset while achieving a 150x speedup.
arXiv Detail & Related papers (2021-06-15T04:44:52Z)
- Learning specialized activation functions with the Piecewise Linear Unit [7.820667552233989]
We propose a new activation function called the Piecewise Linear Unit (PWLU), which incorporates a carefully designed formulation and learning method.
It can learn specialized activation functions and achieves SOTA performance on large-scale datasets like ImageNet and COCO.
PWLU is also easy to implement and efficient at inference, which can be widely applied in real-world applications.
arXiv Detail & Related papers (2021-04-08T11:29:11Z)
- Comparisons among different stochastic selection of activation layers for convolutional neural networks for healthcare [77.99636165307996]
We classify biomedical images using ensembles of neural networks.
We select our activations among the following ones: ReLU, leaky ReLU, Parametric ReLU, ELU, Adaptive Piecewise Linear Unit, S-Shaped ReLU, Swish, Mish, Mexican Linear Unit, Parametric Deformable Linear Unit, Soft Root Sign.
arXiv Detail & Related papers (2020-11-24T01:53:39Z)
- Overcoming Overfitting and Large Weight Update Problem in Linear Rectifiers: Thresholded Exponential Rectified Linear Units [0.0]
"Thresholded exponential rectified linear unit" (TERELU) activation function works better in alleviating in overfitting: large weight update problem.
We will show better performance on the various using neural networks, considering TERELU activation method compared to other activation datasets.
arXiv Detail & Related papers (2020-06-04T11:55:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.