NTK-approximating MLP Fusion for Efficient Language Model Fine-tuning
- URL: http://arxiv.org/abs/2307.08941v2
- Date: Sat, 5 Aug 2023 01:10:06 GMT
- Title: NTK-approximating MLP Fusion for Efficient Language Model Fine-tuning
- Authors: Tianxin Wei, Zeming Guo, Yifan Chen, Jingrui He
- Abstract summary: Fine-tuning a pre-trained language model (PLM) has emerged as the predominant strategy in many natural language processing applications.
Some general approaches (e.g. quantization and distillation) have been widely studied to reduce the compute/memory cost of PLM fine-tuning.
We propose to construct a lightweight PLM through NTK-approximating MLP fusion.
- Score: 40.994306592119266
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Fine-tuning a pre-trained language model (PLM) has emerged as the
predominant strategy in many natural language processing applications. However,
both fine-tuning PLMs and running inference are expensive, especially on edge
devices with low computing power. Some general approaches (e.g. quantization
and distillation) have been widely studied to reduce the compute/memory cost of
PLM fine-tuning, whereas very few one-shot compression techniques have been
explored. In this paper, we investigate the neural tangent kernel (NTK)--which
reveals the gradient descent dynamics of neural networks--of the multilayer
perceptron (MLP) modules in a PLM and propose to construct a lightweight PLM
through NTK-approximating MLP fusion. To achieve this, we reconsider the MLP as
a bundle of sub-MLPs and cluster them into a given number of centroids, which
can then be restored as a compressed MLP that, perhaps surprisingly,
approximates the NTK of the original PLM well. Extensive experiments on PLM
fine-tuning across both natural language understanding (NLU) and generation
(NLG) tasks verify the effectiveness of the proposed MLP fusion method. Our
code is available at https://github.com/weitianxin/MLP_Fusion.
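
To make the clustering idea above concrete, below is a minimal, hypothetical sketch in NumPy/scikit-learn of treating each hidden neuron of a transformer MLP block as a sub-MLP, clustering the neurons into a given number of centroids, and rebuilding a smaller MLP from the centroids. The neuron featurization, the k-means call, and the cluster-size re-weighting of the output weights are assumptions made for illustration; they are not taken from the authors' released implementation (see the repository linked above for that).

```python
# Hypothetical sketch of sub-MLP clustering, not the authors' implementation.
# Assumed MLP block: h(x) = W2 @ act(W1 @ x + b1) + b2, where hidden neuron i,
# i.e. the triple (W1[i, :], b1[i], W2[:, i]), is treated as one sub-MLP.

import numpy as np
from sklearn.cluster import KMeans


def fuse_mlp(W1, b1, W2, num_centroids, seed=0):
    """Cluster the sub-MLPs (hidden neurons) into `num_centroids` groups and
    rebuild a smaller MLP whose output stands in for the original sum over neurons."""
    d_in = W1.shape[1]
    # One feature vector per sub-MLP: its input weights, bias, and output weights.
    neurons = np.concatenate([W1, b1[:, None], W2.T], axis=1)  # (d_ff, d_in + 1 + d_out)

    km = KMeans(n_clusters=num_centroids, n_init=10, random_state=seed).fit(neurons)
    labels, centroids = km.labels_, km.cluster_centers_

    W1_c = centroids[:, :d_in]        # fused input projections
    b1_c = centroids[:, d_in]         # fused biases
    W2_c = centroids[:, d_in + 1:].T  # fused output projections

    # Re-weight each centroid by its cluster size so the sum over k centroids
    # replaces the sum over the original d_ff neurons (one plausible choice).
    counts = np.bincount(labels, minlength=num_centroids).astype(W2_c.dtype)
    return W1_c, b1_c, W2_c * counts[None, :]


def mlp_forward(x, W1, b1, W2, b2):
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2  # ReLU stands in for GELU here


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, d_ff, k = 64, 256, 32
    W1, b1 = rng.normal(size=(d_ff, d)) / np.sqrt(d), rng.normal(size=d_ff) * 0.01
    W2, b2 = rng.normal(size=(d, d_ff)) / np.sqrt(d_ff), np.zeros(d)

    W1_c, b1_c, W2_c = fuse_mlp(W1, b1, W2, num_centroids=k)
    x = rng.normal(size=d)
    full = mlp_forward(x, W1, b1, W2, b2)
    fused = mlp_forward(x, W1_c, b1_c, W2_c, b2)
    print("relative error:", np.linalg.norm(full - fused) / np.linalg.norm(full))
```

The printed error on random weights is only indicative; the point of the sketch is the bookkeeping: d_ff sub-MLPs become k fused sub-MLPs while the input and output dimensions of the block are unchanged.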
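
The quantity the fused MLP is meant to preserve is the NTK. As a reminder of what that object is, the sketch below computes the empirical NTK Gram matrix K(x, x') = <grad_theta f(x), grad_theta f(x')> by hand for a one-hidden-layer, scalar-output ReLU MLP; it illustrates the kernel being matched and is not the paper's derivation or code.

```python
# Illustrative empirical NTK for f(x) = w2 . relu(W1 x + b1); not from the paper.

import numpy as np


def param_grad(x, W1, b1, w2):
    """Gradient of the scalar output w.r.t. all parameters, flattened."""
    z = W1 @ x + b1
    a = (z > 0).astype(x.dtype)            # ReLU derivative
    g_W1 = (w2 * a)[:, None] * x[None, :]  # d f / d W1[i, j]
    g_b1 = w2 * a                          # d f / d b1[i]
    g_w2 = np.maximum(z, 0.0)              # d f / d w2[i]
    return np.concatenate([g_W1.ravel(), g_b1, g_w2])


def empirical_ntk(X, W1, b1, w2):
    """Gram matrix K[i, j] = <grad f(x_i), grad f(x_j)>."""
    G = np.stack([param_grad(x, W1, b1, w2) for x in X])
    return G @ G.T


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, d_ff, n = 32, 128, 8
    W1, b1 = rng.normal(size=(d_ff, d)) / np.sqrt(d), np.zeros(d_ff)
    w2 = rng.normal(size=d_ff) / np.sqrt(d_ff)
    X = rng.normal(size=(n, d))
    K = empirical_ntk(X, W1, b1, w2)
    # A fused MLP with k < d_ff hidden units would be judged by how closely its
    # Gram matrix on the same inputs matches K (e.g. in relative Frobenius norm).
    print(K.shape, np.linalg.norm(K))
```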
Related papers
- MLP Can Be A Good Transformer Learner [73.01739251050076]
The self-attention mechanism is the key component of the Transformer but is often criticized for its computational demands.
This paper introduces a novel strategy that simplifies vision transformers and reduces computational load through the selective removal of non-essential attention layers.
arXiv Detail & Related papers (2024-04-08T16:40:15Z)
- Improved Implicit Neural Representation with Fourier Reparameterized Training [21.93903328906775]
Implicit Neural Representation (INR), a powerful representation paradigm, has recently achieved success in various computer vision tasks.
Existing methods have investigated advanced techniques, such as positional encoding and periodic activation function, to improve the accuracy of INR.
arXiv Detail & Related papers (2024-01-15T00:40:41Z)
- SiT-MLP: A Simple MLP with Point-wise Topology Feature Learning for Skeleton-based Action Recognition [9.673505408890435]
Graph convolutional networks (GCNs) have achieved remarkable performance in skeleton-based action recognition.
Previous GCN-based methods rely excessively on elaborate human priors and construct complex feature aggregation mechanisms.
We propose a novel model, SiT-MLP, for skeleton-based action recognition in this work.
arXiv Detail & Related papers (2023-08-30T13:20:54Z)
- Model-tuning Via Prompts Makes NLP Models Adversarially Robust [97.02353907677703]
We show surprising gains in adversarial robustness enjoyed by Model-tuning Via Prompts (MVP).
MVP improves performance against adversarial substitutions by an average of 8% over standard methods.
We also conduct ablations to investigate the mechanism underlying these gains.
arXiv Detail & Related papers (2023-03-13T17:41:57Z)
- Towards Robust Low-Resource Fine-Tuning with Multi-View Compressed Representations [51.75960511842552]
Fine-tuning of pretrained language models (PLMs) is prone to overfitting in low-resource scenarios.
We present a novel method that operates on the hidden representations of a PLM to reduce overfitting.
arXiv Detail & Related papers (2022-11-16T09:39:29Z)
- Efficient Language Modeling with Sparse all-MLP [53.81435968051093]
All-MLPs can match Transformers in language modeling, but still lag behind in downstream tasks.
We propose sparse all-MLPs with mixture-of-experts (MoEs) in both the feature and input (token) dimensions.
We evaluate its zero-shot in-context learning performance on six downstream tasks, and find that it surpasses Transformer-based MoEs and dense Transformers.
arXiv Detail & Related papers (2022-03-14T04:32:19Z)
- Using Fitness Dependent Optimizer for Training Multi-layer Perceptron [13.280383503879158]
This study presents a novel training algorithm based on the recently proposed Fitness Dependent Optimizer (FDO).
The stability of this algorithm has been verified and its performance demonstrated in both the exploration and exploitation stages.
The proposed approach using FDO as a trainer outperforms approaches that use other trainers on the same dataset.
arXiv Detail & Related papers (2022-01-03T10:23:17Z)
- RepMLPNet: Hierarchical Vision MLP with Re-parameterized Locality [113.1414517605892]
We propose a methodology, Locality Injection, to incorporate local priors into a fully-connected (FC) layer.
RepMLPNet is the first MLP model that seamlessly transfers to Cityscapes semantic segmentation.
arXiv Detail & Related papers (2021-12-21T10:28:17Z)
- Rethinking Token-Mixing MLP for MLP-based Vision Backbone [34.47616917228978]
We propose an improved structure, termed the Circulant Channel-Specific (CCS) token-mixing MLP, which is spatial-invariant and channel-specific.
It takes fewer parameters but achieves higher classification accuracy on ImageNet1K.
arXiv Detail & Related papers (2021-06-28T17:59:57Z)