Linear Oscillation: A Novel Activation Function for Vision Transformer
- URL: http://arxiv.org/abs/2308.13670v4
- Date: Fri, 1 Dec 2023 02:58:36 GMT
- Title: Linear Oscillation: A Novel Activation Function for Vision Transformer
- Authors: Juyoung Yun
- Abstract summary: We present the Linear Oscillation (LoC) activation function, defined as $f(x) = x \times \sin(\alpha x + \beta)$.
Distinct from conventional activation functions which primarily introduce non-linearity, LoC seamlessly blends linear trajectories with oscillatory deviations.
Our empirical studies reveal that, when integrated into diverse neural architectures, the LoC activation function consistently outperforms established counterparts like ReLU and Sigmoid.
- Score: 0.0
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Activation functions are the linchpins of deep learning, profoundly
influencing both the representational capacity and training dynamics of neural
networks. They shape not only the nature of representations but also optimize
convergence rates and enhance generalization potential. Appreciating this
critical role, we present the Linear Oscillation (LoC) activation function,
defined as $f(x) = x \times \sin(\alpha x + \beta)$. Distinct from conventional
activation functions which primarily introduce non-linearity, LoC seamlessly
blends linear trajectories with oscillatory deviations. The nomenclature
"Linear Oscillation" is a nod to its unique attribute of infusing linear
activations with harmonious oscillations, capturing the essence of the
"Importance of Confusion". This concept of "controlled confusion" within
network activations is posited to foster more robust learning, particularly in
contexts that necessitate discerning subtle patterns. Our empirical studies
reveal that, when integrated into diverse neural architectures, the LoC
activation function consistently outperforms established counterparts like ReLU
and Sigmoid. The stellar performance exhibited by the avant-garde Vision
Transformer model using LoC further validates its efficacy. This study
illuminates the remarkable benefits of the LoC over other prominent activation
functions. It champions the notion that intermittently introducing deliberate
complexity or "confusion" during training can spur more profound and nuanced
learning. This accentuates the pivotal role of judiciously selected activation
functions in shaping the future of neural network training.
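To make the definition concrete, below is a minimal sketch of LoC as a drop-in PyTorch activation module, based only on the formula $f(x) = x \times \sin(\alpha x + \beta)$ given above. The abstract does not say whether $\alpha$ and $\beta$ are fixed hyperparameters or learned per layer, so learnable scalars with illustrative default values are assumed here; this is not the authors' reference implementation.

```python
import torch
import torch.nn as nn


class LinearOscillation(nn.Module):
    """Sketch of the LoC activation f(x) = x * sin(alpha * x + beta).

    Assumption: alpha and beta are treated as learnable scalars; the paper
    abstract only provides the formula, not how the parameters are handled.
    """

    def __init__(self, alpha: float = 1.0, beta: float = 0.0, learnable: bool = True):
        super().__init__()
        if learnable:
            self.alpha = nn.Parameter(torch.tensor(alpha))
            self.beta = nn.Parameter(torch.tensor(beta))
        else:
            self.register_buffer("alpha", torch.tensor(alpha))
            self.register_buffer("beta", torch.tensor(beta))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Linear trajectory x, modulated by the oscillatory term sin(alpha*x + beta)
        return x * torch.sin(self.alpha * x + self.beta)


# Illustrative usage: replacing the activation inside a Transformer-style MLP block
mlp = nn.Sequential(
    nn.Linear(256, 1024),
    LinearOscillation(),
    nn.Linear(1024, 256),
)
x = torch.randn(8, 16, 256)   # (batch, tokens, embedding dim)
print(mlp(x).shape)           # torch.Size([8, 16, 256])
```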
Related papers
- Sparsing Law: Towards Large Language Models with Greater Activation Sparsity [62.09617609556697]
Activation sparsity denotes the existence of substantial weakly-contributed elements within activation outputs that can be eliminated.
We propose PPL-$p\%$ sparsity, a precise and performance-aware activation sparsity metric.
We show that ReLU is more efficient as an activation function than SiLU and can leverage more training data to improve activation sparsity.
arXiv Detail & Related papers (2024-11-04T17:59:04Z) - TSSR: A Truncated and Signed Square Root Activation Function for Neural Networks [5.9622541907827875]
We introduce a new activation function called the Truncated and Signed Square Root (TSSR) function.
This function is distinctive because it is odd, nonlinear, monotone and differentiable.
It has the potential to improve the numerical stability of neural networks.
arXiv Detail & Related papers (2023-08-09T09:40:34Z) - ENN: A Neural Network with DCT Adaptive Activation Functions [2.2713084727838115]
We present the Expressive Neural Network (ENN), a novel model in which the non-linear activation functions are modeled using the Discrete Cosine Transform (DCT).
This parametrization keeps the number of trainable parameters low, is appropriate for gradient-based schemes, and adapts to different learning tasks.
ENN outperforms state-of-the-art benchmarks, with an accuracy gap of more than 40% in some scenarios.
arXiv Detail & Related papers (2023-07-02T21:46:30Z) - Globally Optimal Training of Neural Networks with Threshold Activation Functions [63.03759813952481]
We study weight decay regularized training problems of deep neural networks with threshold activations.
We derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network.
arXiv Detail & Related papers (2023-03-06T18:59:13Z) - Evaluating CNN with Oscillatory Activation Function [0.0]
The key to CNNs' capability to learn high-dimensional complex features from images is the non-linearity introduced by the activation function.
This paper explores the performance of the AlexNet CNN architecture on the MNIST and CIFAR10 datasets using an oscillatory activation function (GCU) and other commonly used activation functions such as ReLU, PReLU, and Mish.
arXiv Detail & Related papers (2022-11-13T11:17:13Z) - Transformers with Learnable Activation Functions [63.98696070245065]
We use Rational Activation Function (RAF) to learn optimal activation functions during training according to input data.
RAF opens a new research direction for analyzing and interpreting pre-trained models according to the learned activation functions.
arXiv Detail & Related papers (2022-08-30T09:47:31Z) - Wasserstein Flow Meets Replicator Dynamics: A Mean-Field Analysis of Representation Learning in Actor-Critic [137.04558017227583]
Actor-critic (AC) algorithms, empowered by neural networks, have had significant empirical success in recent years.
We take a mean-field perspective on the evolution and convergence of feature-based neural AC.
We prove that neural AC finds the globally optimal policy at a sublinear rate.
arXiv Detail & Related papers (2021-12-27T06:09:50Z) - Growing Cosine Unit: A Novel Oscillatory Activation Function That Can Speedup Training and Reduce Parameters in Convolutional Neural Networks [0.1529342790344802]
Convolutional neural networks have been successful in solving many socially important and economically significant problems.
A key discovery that made training deep networks feasible was the adoption of the Rectified Linear Unit (ReLU) activation function.
The new activation function $C(z) = z \cos z$ outperforms Sigmoid, Swish, Mish, and ReLU on a variety of architectures (a minimal code sketch of this function follows this list).
arXiv Detail & Related papers (2021-08-30T01:07:05Z) - Going Beyond Linear RL: Sample Efficient Neural Function Approximation [76.57464214864756]
We study function approximation with two-layer neural networks.
Our results significantly improve upon what can be attained with linear (or eluder dimension) methods.
arXiv Detail & Related papers (2021-07-14T03:03:56Z) - Activation function design for deep networks: linearity and effective initialisation [10.108857371774977]
We study how to avoid two problems at initialisation identified in prior works.
We prove that both these problems can be avoided by choosing an activation function possessing a sufficiently large linear region around the origin.
arXiv Detail & Related papers (2021-05-17T11:30:46Z) - Can Temporal-Difference and Q-Learning Learn Representation? A Mean-Field Theory [110.99247009159726]
Temporal-difference and Q-learning play a key role in deep reinforcement learning, where they are empowered by expressive nonlinear function approximators such as neural networks.
In particular, temporal-difference learning converges when the function approximator is linear in a feature representation, which is fixed throughout learning, and possibly diverges otherwise.
arXiv Detail & Related papers (2020-06-08T17:25:22Z)
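The Growing Cosine Unit (GCU) entry above shares LoC's product form: a linear term modulated by a trigonometric factor. Below is a minimal sketch of $C(z) = z \cos z$ as stated in that summary, assuming the plain element-wise form with no learnable parameters.

```python
import torch
import torch.nn as nn


class GrowingCosineUnit(nn.Module):
    """Sketch of the GCU activation C(z) = z * cos(z) from the entry above;
    no learnable parameters are assumed."""

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # Element-wise, like ReLU: linear term z scaled by the oscillatory factor cos(z)
        return z * torch.cos(z)


# Quick check on a small range of inputs
z = torch.linspace(-3.0, 3.0, steps=7)
print(GrowingCosineUnit()(z))
```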