Transformers with Learnable Activation Functions
- URL: http://arxiv.org/abs/2208.14111v2
- Date: Thu, 1 Sep 2022 07:55:10 GMT
- Title: Transformers with Learnable Activation Functions
- Authors: Haishuo Fang, Ji-Ung Lee, Nafise Sadat Moosavi, Iryna Gurevych
- Abstract summary: We use the Rational Activation Function (RAF) to learn optimal activation functions from the input data during training.
The resulting RAF-based Transformer (RAFT) opens a new research direction for analyzing and interpreting pre-trained models according to the learned activation functions.
- Score: 63.98696070245065
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Activation functions can have a significant impact on reducing the
topological complexity of input data and can thereby improve the performance of
the model. Selecting a suitable activation function is therefore an essential
step in neural model design. However, the choice of activation function is
seldom discussed or explored for Transformer-based language models. Their
activation functions are chosen beforehand and then remain fixed from
pre-training to fine-tuning. As a result, the inductive biases they impose on
models cannot be adjusted during this long life cycle. Moreover, subsequently
developed models (e.g., RoBERTa, BART, and GPT-3) often follow prior work
(e.g., BERT) in using the same activation function without justification. In
this paper, we investigate the effectiveness of using the Rational Activation
Function (RAF), a learnable activation function, in the Transformer
architecture. In contrast to conventional, predefined activation functions,
RAFs adaptively learn an optimal activation function from the input data during
training. Our experiments show that the RAF-based Transformer (RAFT) achieves a
lower validation perplexity than a vanilla BERT with the GELU function. We
further evaluate RAFT on downstream tasks in low- and full-data settings. Our
results show that RAFT outperforms the counterpart model across the majority of
tasks and settings. For instance, RAFT outperforms vanilla BERT on the GLUE
benchmark by 5.71 points on average in the low-data scenario (where 100
training examples are available) and by 2.05 points on SQuAD in the full-data
setting. Analysis of the shapes of the learned RAFs further reveals that they
vary substantially between layers of the pre-trained model and mostly look very
different from conventional activation functions. RAFT opens a new research
direction for analyzing and interpreting pre-trained models according to the
learned activation functions.
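The abstract describes RAFs only at a high level. Below is a minimal, hedged sketch of a rational (Padé-style) learnable activation, assuming the commonly used "safe" form P(x) / (1 + |Q(x)|) with learnable numerator and denominator coefficients; the polynomial degrees, the random initialization, and the layer sizes in the usage example are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class RationalActivation(nn.Module):
    """Learnable rational activation P(x) / (1 + |Q(x)|), applied element-wise.

    Sketch only: degrees and initialization are assumptions. Rational activations
    are often initialized to approximate a known function such as GELU; that
    initialization is omitted here for brevity.
    """

    def __init__(self, numerator_degree: int = 5, denominator_degree: int = 4):
        super().__init__()
        self.a = nn.Parameter(0.1 * torch.randn(numerator_degree + 1))  # a_0 .. a_m
        self.b = nn.Parameter(0.1 * torch.randn(denominator_degree))    # b_1 .. b_n

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # P(x) = a_0 + a_1 x + ... + a_m x^m
        numerator = sum(a_j * x**j for j, a_j in enumerate(self.a))
        # Q(x) = |b_1 x + ... + b_n x^n|; the added 1 keeps the denominator positive
        denominator = 1.0 + torch.abs(sum(b_k * x**(k + 1) for k, b_k in enumerate(self.b)))
        return numerator / denominator

# Usage sketch: replace the fixed GELU in a BERT-style feed-forward block with a RAF.
hidden, intermediate = 768, 3072
ffn = nn.Sequential(nn.Linear(hidden, intermediate),
                    RationalActivation(),
                    nn.Linear(intermediate, hidden))
out = ffn(torch.randn(2, 16, hidden))  # (batch, sequence length, hidden size)
```

Because the coefficients are ordinary parameters, they are updated jointly with the rest of the network during pre-training and fine-tuning, and their learned values can later be plotted per layer, which is the kind of analysis the abstract points to.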
Related papers
- On the Role of Activation Functions in EEG-To-Text Decoder [5.4141465747474475]
We aim to improve on the original performance of a first attempt at generating text from EEG signals.
We show that introducing a higher-degree activation function can enhance model performance without changing the model architecture.
We also show that a learnable 3rd-degree activation function performs better on the 1-gram evaluation than a non-learnable 3rd-degree function.
arXiv Detail & Related papers (2024-10-16T13:50:04Z) - Activation function optimization method: Learnable series linear units (LSLUs) [12.089173508371246]
We propose a series-based learnable activation function called LSLU (Learnable Series Linear Units).
This method simplifies deep learning networks while improving accuracy.
We evaluate LSLU's performance on CIFAR10, CIFAR100, and task-specific datasets (e.g., Silkworm).
arXiv Detail & Related papers (2024-08-28T11:12:27Z) - Learn from the Learnt: Source-Free Active Domain Adaptation via Contrastive Sampling and Visual Persistence [60.37934652213881]
Domain Adaptation (DA) facilitates knowledge transfer from a source domain to a related target domain.
This paper investigates a practical DA paradigm, namely Source data-Free Active Domain Adaptation (SFADA), where source data becomes inaccessible during adaptation.
We present Learn from the Learnt (LFTL), a novel paradigm for SFADA that leverages the knowledge learnt by the source pre-trained model and by actively iterated models, without extra overhead.
arXiv Detail & Related papers (2024-07-26T17:51:58Z) - A Method on Searching Better Activation Functions [15.180864683908878]
We propose the Entropy-based Activation Function Optimization (EAFO) methodology for designing static activation functions in deep neural networks.
From ReLU, we derive a novel activation function, termed Correction Regularized ReLU (CRReLU).
arXiv Detail & Related papers (2024-05-19T03:48:05Z) - Efficient Activation Function Optimization through Surrogate Modeling [15.219959721479835]
This paper aims to improve the state of the art through three steps.
First, the benchmarks Act-Bench-CNN, Act-Bench-ResNet, and Act-Bench-ViT were created by training convolutional, residual, and vision transformer architectures.
Second, a characterization of the benchmark space was developed, leading to a new surrogate-based method for optimization.
arXiv Detail & Related papers (2023-01-13T23:11:14Z) - Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning [81.3514358542452]
Few-shot in-context learning (ICL) incurs substantial computational, memory, and storage costs because it involves processing all of the training examples every time a prediction is made.
Parameter-efficient fine-tuning offers an alternative paradigm where a small set of parameters is trained to enable a model to perform the new task.
In this paper, we rigorously compare few-shot ICL and parameter-efficient fine-tuning and demonstrate that the latter offers better accuracy as well as dramatically lower computational costs.
arXiv Detail & Related papers (2022-05-11T17:10:41Z) - Hyperparameter-free Continuous Learning for Domain Classification in Natural Language Understanding [60.226644697970116]
Domain classification is the fundamental task in natural language understanding (NLU).
Most existing continual learning approaches suffer from low accuracy and performance fluctuation.
We propose a hyperparameter-free continual learning model for text data that can stably produce high performance under various environments.
arXiv Detail & Related papers (2022-01-05T02:46:16Z) - FastIF: Scalable Influence Functions for Efficient Model Interpretation and Debugging [112.19994766375231]
Influence functions approximate the influence of individual training data points on test predictions.
We present FastIF, a set of simple modifications to influence functions that significantly improve their run-time.
Our experiments demonstrate the potential of influence functions for model interpretation and for correcting model errors (a simplified influence-score sketch follows this list).
arXiv Detail & Related papers (2020-12-31T18:02:34Z) - Discovering Parametric Activation Functions [17.369163074697475]
This paper proposes a technique for customizing activation functions automatically, resulting in reliable improvements in performance.
Experiments with four different neural network architectures on the CIFAR-10 and CIFAR-100 image classification datasets show that this approach is effective.
arXiv Detail & Related papers (2020-06-05T00:25:33Z) - Parameter-Efficient Transfer from Sequential Behaviors for User Modeling
and Recommendation [111.44445634272235]
In this paper, we develop a parameter-efficient transfer learning architecture, termed PeterRec.
PeterRec allows the pre-trained parameters to remain unaltered during fine-tuning by injecting a series of re-learned neural networks.
We perform extensive experimental ablation to show the effectiveness of the learned user representation in five downstream tasks.
arXiv Detail & Related papers (2020-01-13T14:09:54Z)
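As a complement to the FastIF entry above, here is a heavily simplified sketch of the influence-function idea it builds on. The classic score is -∇L(z_test)^T H^{-1} ∇L(z_train); purely for illustration, the Hessian H is replaced by the identity, which collapses the score to a gradient dot product. FastIF's actual contributions (kNN candidate selection and a fast inverse-Hessian estimator) are not reproduced here, and the model, loss, and data are toy placeholders.

```python
import torch
import torch.nn as nn

def first_order_influence(model: nn.Module, loss_fn, x_train, y_train, x_test, y_test) -> float:
    """Score one training example against one test example.

    Classic influence: -grad L(test)^T H^{-1} grad L(train). Here H is taken to be
    the identity, so the score reduces to a negative gradient dot product; negative
    scores indicate "helpful" training points (upweighting them is predicted to
    lower the test loss).
    """
    params = [p for p in model.parameters() if p.requires_grad]

    g_test = torch.autograd.grad(loss_fn(model(x_test), y_test), params)
    g_train = torch.autograd.grad(loss_fn(model(x_train), y_train), params)

    return -sum((gt * gr).sum() for gt, gr in zip(g_test, g_train)).item()

# Toy usage with a linear regression model.
model, mse = nn.Linear(4, 1), nn.MSELoss()
score = first_order_influence(model, mse,
                              torch.randn(1, 4), torch.randn(1, 1),  # training example
                              torch.randn(1, 4), torch.randn(1, 1))  # test example
```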
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.