giMLPs: Gate with Inhibition Mechanism in MLPs
- URL: http://arxiv.org/abs/2208.00929v2
- Date: Tue, 2 Aug 2022 09:51:47 GMT
- Title: giMLPs: Gate with Inhibition Mechanism in MLPs
- Authors: Cheng Kang, Jindřich Prokop, Lei Tong, Huiyu Zhou, Yong Hu, Daniel Novak
- Abstract summary: Gate with inhibition (giMLP) produces performance on par with the original CycleMLP on the ImageNet classification task.
Gate With Inhibition achieves appealing results on most NLU tasks without any further pretraining.
Experiments on ImageNet and twelve language downstream tasks demonstrate the effectiveness of Gate With Inhibition.
- Score: 13.288519661160898
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents a new model architecture, the gate with inhibition MLP
(giMLP). The gate with inhibition on CycleMLP (gi-CycleMLP) produces performance on par
with the original CycleMLP on the ImageNet classification task, and it also improves the
BERT, RoBERTa, and DeBERTaV3 models through two novel techniques. The first is
the gating MLP, in which matrix multiplications between the MLP branch and the trunk
attention input further adjust the model's adaptation. The second is inhibition,
which inhibits or enhances the branch adjustment; as the inhibition level
increases, it imposes a stronger restriction on the model's features. We show
that gi-CycleMLP with a lower inhibition level is competitive with the
original CycleMLP in terms of ImageNet classification accuracy. In addition, we
show through a comprehensive empirical study that these techniques
significantly improve performance when fine-tuning on NLU downstream tasks. For
fine-tuning the gate with inhibition MLPs on DeBERTa (giDeBERTa), we find
that it achieves appealing results on most NLU tasks without any further
pretraining. We also find that, when using Gate With Inhibition, the
activation function should have a short and smooth negative tail, with which
unimportant or harmful features can be moderately inhibited. Experiments on
ImageNet and twelve language downstream tasks demonstrate the effectiveness of
Gate With Inhibition, both for image classification and for enhancing the
capacity of natural language fine-tuning without any further pretraining.
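The gating-with-inhibition step described above can be pictured with a small sketch. The snippet below is a minimal, hypothetical PyTorch interpretation of the abstract (an MLP branch multiplied element-wise with the trunk attention input, with the adjustment damped by an inhibition level, and an activation with a short, smooth negative tail); the names `GateWithInhibition` and `inhibition_level` and the exact scaling are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch of a gate-with-inhibition branch, assuming an element-wise
# gating of the trunk (attention) output and a scalar inhibition level.
import torch
import torch.nn as nn


class GateWithInhibition(nn.Module):
    """Gates the trunk input with an MLP branch whose contribution is
    damped by an inhibition level (hypothetical formulation)."""

    def __init__(self, dim: int, hidden_dim: int, inhibition_level: float = 0.5):
        super().__init__()
        # Gating MLP branch; SiLU has the short, smooth negative tail the
        # abstract recommends, so harmful features are moderately inhibited.
        self.gate_mlp = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, dim),
        )
        # Higher inhibition_level -> stronger restriction of the branch signal.
        self.inhibition_level = inhibition_level

    def forward(self, trunk_input: torch.Tensor) -> torch.Tensor:
        # Element-wise product between the MLP branch and the trunk input
        # adjusts the trunk features; inhibition damps that adjustment.
        gate = self.gate_mlp(trunk_input)
        adjustment = gate * trunk_input
        return trunk_input + (1.0 - self.inhibition_level) * adjustment


# Usage with illustrative shapes: wrap the output of an attention block.
x = torch.randn(2, 16, 768)  # (batch, tokens, dim)
gi = GateWithInhibition(dim=768, hidden_dim=3072, inhibition_level=0.3)
y = gi(x)                    # same shape as x
```

With this reading, setting the inhibition level close to 1 suppresses the branch adjustment almost entirely, while a level near 0 lets the gating MLP fully modulate the trunk features, which matches the abstract's claim that a lower inhibition level stays competitive with the original CycleMLP.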
Related papers
- OP-LoRA: The Blessing of Dimensionality [93.08208871549557]
Low-rank adapters enable fine-tuning of large models with only a small number of parameters.
They often pose optimization challenges, with poor convergence.
We introduce an over-parameterized approach that accelerates training without increasing inference costs.
We achieve improvements in vision-language tasks and especially notable increases in image generation.
arXiv Detail & Related papers (2024-12-13T18:55:19Z)
- MLP Can Be A Good Transformer Learner [73.01739251050076]
The self-attention mechanism is the key component of the Transformer but is often criticized for its computational demands.
This paper introduces a novel strategy that simplifies vision transformers and reduces computational load through the selective removal of non-essential attention layers.
arXiv Detail & Related papers (2024-04-08T16:40:15Z)
- SCHEME: Scalable Channel Mixer for Vision Transformers [52.605868919281086]
Vision Transformers have achieved impressive performance in many vision tasks.
Much less research has been devoted to the channel mixer or feature mixing block (FFN or MLP).
We show that the dense connections can be replaced with a diagonal block structure that supports larger expansion ratios.
arXiv Detail & Related papers (2023-12-01T08:22:34Z)
- Model-tuning Via Prompts Makes NLP Models Adversarially Robust [97.02353907677703]
We show surprising gains in adversarial robustness enjoyed by Model-tuning Via Prompts (MVP).
MVP improves performance against adversarial substitutions by an average of 8% over standard methods.
We also conduct ablations to investigate the mechanism underlying these gains.
arXiv Detail & Related papers (2023-03-13T17:41:57Z)
- The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers [59.87030906486969]
This paper studies the curious phenomenon that the activation maps of machine learning models with Transformer architectures are sparse.
We show that sparsity is a prevalent phenomenon that occurs for both natural language processing and vision tasks.
We discuss how sparsity immediately implies a way to significantly reduce the FLOP count and improve efficiency for Transformers.
arXiv Detail & Related papers (2022-10-12T15:25:19Z)
- Efficient Language Modeling with Sparse all-MLP [53.81435968051093]
All-MLPs can match Transformers in language modeling, but still lag behind in downstream tasks.
We propose sparse all-MLPs with mixture-of-experts (MoEs) in both feature and input (token) dimensions.
We evaluate its zero-shot in-context learning performance on six downstream tasks, and find that it surpasses Transformer-based MoEs and dense Transformers.
arXiv Detail & Related papers (2022-03-14T04:32:19Z)