giMLPs: Gate with Inhibition Mechanism in MLPs
- URL: http://arxiv.org/abs/2208.00929v2
- Date: Tue, 2 Aug 2022 09:51:47 GMT
- Title: giMLPs: Gate with Inhibition Mechanism in MLPs
- Authors: Cheng Kang, Jindřich Prokop, Lei Tong, Huiyu Zhou, Yong Hu, Daniel
Novak
- Abstract summary: Gate with inhibition (giMLP) on CycleMLP matches the original model's performance on the ImageNet classification task.
Gate With Inhibition achieves appealing results on most NLU tasks without any additional pretraining.
Experiments on ImageNet and twelve language downstream tasks demonstrate the effectiveness of Gate With Inhibition.
- Score: 13.288519661160898
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents a new model architecture, gate with inhibition
MLP (giMLP). The gate with inhibition on CycleMLP (gi-CycleMLP) can match the
original CycleMLP on the ImageNet classification task, and it also improves the
BERT, RoBERTa, and DeBERTaV3 models through two novel techniques. The first is
the gating MLP, in which matrix multiplications between the MLP branch and the
trunk attention input further adjust the model's adaptation. The second is
inhibition, which inhibits or enhances the branch adjustment; as the inhibition
level increases, it imposes a stronger restriction on the features. We show
that gi-CycleMLP with a lower inhibition level can be competitive with the
original CycleMLP in terms of ImageNet classification accuracy. In addition, we
show through a comprehensive empirical study that these techniques
significantly improve performance when fine-tuning on NLU downstream tasks. For
gate-with-inhibition MLPs on DeBERTa (giDeBERTa), fine-tuning achieves
appealing results on most NLU tasks without any additional pretraining. We also
find that, with Gate With Inhibition, the activation function should have a
short and smooth negative tail, so that unimportant features, or features that
hurt the model, can be moderately inhibited. Experiments on ImageNet and twelve
language downstream tasks demonstrate the effectiveness of Gate With
Inhibition, both for image classification and for strengthening natural
language fine-tuning without any extra pretraining.
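To make the two techniques concrete, here is a minimal PyTorch sketch. It is not the authors' implementation: the class name, tensor shapes, the element-wise form of the gating, and the linear-interpolation form of the inhibition level are assumptions for illustration; only the overall idea (an MLP branch that gates the trunk attention output, an activation with a short, smooth negative tail, and an inhibition level that restricts how much of the branch adjustment is kept) follows the abstract.

```python
# Hypothetical sketch of a gate-with-inhibition branch; shapes, names, and the
# exact gating/inhibition operations are assumptions, not the paper's code.
import torch
import torch.nn as nn

class GateWithInhibition(nn.Module):
    """Gate a trunk representation with an MLP branch, then damp the
    branch adjustment by an inhibition level in [0, 1]."""

    def __init__(self, dim: int, hidden: int, inhibition: float = 0.5):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.GELU(),                 # short, smooth negative tail, as the abstract suggests
            nn.Linear(hidden, dim),
        )
        self.inhibition = inhibition   # higher value -> stronger feature restriction

    def forward(self, trunk: torch.Tensor) -> torch.Tensor:
        # Element-wise gating of the trunk (e.g., the attention output) by the MLP branch.
        gated = trunk * self.branch(trunk)
        # The inhibition level interpolates between the gated adjustment and the trunk itself.
        return (1.0 - self.inhibition) * gated + self.inhibition * trunk


x = torch.randn(2, 16, 768)                          # (batch, tokens, dim)
layer = GateWithInhibition(dim=768, hidden=3072, inhibition=0.3)
print(layer(x).shape)                                # torch.Size([2, 16, 768])
```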
Related papers
- MLP Can Be A Good Transformer Learner [73.01739251050076]
The self-attention mechanism is the key component of the Transformer but is often criticized for its computational demands.
This paper introduces a novel strategy that simplifies vision transformers and reduces computational load through the selective removal of non-essential attention layers.
arXiv Detail & Related papers (2024-04-08T16:40:15Z) - SCHEME: Scalable Channel Mixer for Vision Transformers [52.605868919281086]
Vision Transformers have achieved impressive performance in many vision tasks.
Much less research has been devoted to the channel mixer or feature mixing block (FFN or MLP).
We show that the dense connections can be replaced with a diagonal block structure that supports larger expansion ratios.
arXiv Detail & Related papers (2023-12-01T08:22:34Z) - Model-tuning Via Prompts Makes NLP Models Adversarially Robust [97.02353907677703]
We show surprising gains in adversarial robustness enjoyed by Model-tuning Via Prompts (MVP).
MVP improves performance against adversarial substitutions by an average of 8% over standard methods.
We also conduct ablations to investigate the mechanism underlying these gains.
arXiv Detail & Related papers (2023-03-13T17:41:57Z) - The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in
Transformers [59.87030906486969]
This paper studies the curious phenomenon that the activation maps of machine learning models with Transformer architectures are sparse.
We show that sparsity is a prevalent phenomenon that occurs for both natural language processing and vision tasks.
We discuss how sparsity immediately implies a way to significantly reduce the FLOP count and improve efficiency for Transformers.
arXiv Detail & Related papers (2022-10-12T15:25:19Z) - NOSMOG: Learning Noise-robust and Structure-aware MLPs on Graphs [41.85649409565574]
Graph Neural Networks (GNNs) have demonstrated their efficacy in dealing with non-Euclidean structural data.
Existing methods attempt to address GNNs' scalability issue by training multi-layer perceptrons (MLPs) exclusively on node content features.
In this paper, we propose to learn NOise-robust Structure-aware MLPs On Graphs (NOSMOG) to overcome the challenges.
arXiv Detail & Related papers (2022-08-22T01:47:07Z) - Efficient Language Modeling with Sparse all-MLP [53.81435968051093]
All-MLPs can match Transformers in language modeling, but still lag behind in downstream tasks.
We propose sparse all-MLPs with mixture-of-experts (MoEs) in both the feature and input (token) dimensions.
We evaluate its zero-shot in-context learning performance on six downstream tasks, and find that it surpasses Transformer-based MoEs and dense Transformers.
arXiv Detail & Related papers (2022-03-14T04:32:19Z) - Sparse MLP for Image Recognition: Is Self-Attention Really Necessary? [65.37917850059017]
We build an attention-free network called sMLPNet.
For 2D image tokens, sMLP applies 1D MLPs along the axial directions, and the parameters are shared among rows or columns (see the sketch after this list).
When scaling up to 66M parameters, sMLPNet achieves 83.4% top-1 accuracy, which is on par with the state-of-the-art Swin Transformer.
arXiv Detail & Related papers (2021-09-12T04:05:15Z)
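For the sMLPNet entry above, the axial token-mixing idea can be pictured with a short, hypothetical PyTorch sketch. The module name, the identity branch, and the concatenation-plus-linear fusion below are assumptions; only the core mechanism (1D linear maps along height and width, with weights shared across rows or columns) follows the summary.

```python
# Hypothetical sketch of an axial (sparse) MLP token mixer; not the sMLPNet code.
import torch
import torch.nn as nn

class AxialMLPMixer(nn.Module):
    def __init__(self, height: int, width: int, dim: int):
        super().__init__()
        self.mix_w = nn.Linear(width, width)    # one 1D map mixing along width, shared by all rows
        self.mix_h = nn.Linear(height, height)  # one 1D map mixing along height, shared by all columns
        self.fuse = nn.Linear(3 * dim, dim)     # combine row, column, and identity paths (assumed fusion)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, height, width, channels)
        row = self.mix_w(x.transpose(-1, -2)).transpose(-1, -2)        # mix tokens along the width axis
        col = self.mix_h(x.permute(0, 3, 2, 1)).permute(0, 3, 2, 1)    # mix tokens along the height axis
        return self.fuse(torch.cat([row, col, x], dim=-1))


x = torch.randn(2, 14, 14, 96)                   # (batch, height, width, channels)
print(AxialMLPMixer(14, 14, 96)(x).shape)        # torch.Size([2, 14, 14, 96])
```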