MLP Fusion: Towards Efficient Fine-tuning of Dense and Mixture-of-Experts Language Models
- URL: http://arxiv.org/abs/2307.08941v3
- Date: Mon, 06 Jan 2025 05:08:53 GMT
- Title: MLP Fusion: Towards Efficient Fine-tuning of Dense and Mixture-of-Experts Language Models
- Authors: Mengting Ai, Tianxin Wei, Yifan Chen, Zeming Guo, Jingrui He
- Abstract summary: Fine-tuning a pre-trained language model (PLM) emerges as the predominant strategy in many natural language processing applications.
General approaches (e.g. quantization and distillation) have been widely studied to reduce the compute/memory of PLM fine-tuning.
We propose one-shot compression techniques specifically designed for fine-tuning.
- Score: 33.86069537521178
- Abstract: Fine-tuning a pre-trained language model (PLM) emerges as the predominant strategy in many natural language processing applications. However, this process is known to be expensive, especially on edge devices with low computing power. While general approaches (e.g. quantization and distillation) have been widely studied to reduce the compute/memory of PLM fine-tuning, one-shot compression techniques specifically designed for fine-tuning remain largely unexplored. In this paper, we investigate the neural tangent kernel (NTK)--which reveals the gradient descent dynamics of neural networks--of the multilayer perceptrons (MLP) modules in a PLM and propose to coin a lightweight PLM through NTK-approximating MLP fusion. By incorporating NTK into the compression process, MLP Fusion not only preserves the original model's output but also maintains its training dynamics. To achieve this, we reconsider the MLP as a bundle of sub-MLPs and cluster them into a given number of centroids, which can then be restored as a compressed MLP and surprisingly well approximate the NTK of the original PLM. Our approach is applicable to both standard MLP modules and Mixture-of-Experts (MoE) modules in PLMs, demonstrating its scalability and versatility. Additionally, we provide theoretical derivations to demonstrate how the proposed compression preserves the NTK. Extensive experiments of PLM fine-tuning on both natural language understanding and generation tasks are provided to verify the effectiveness of MLP fusion. Our code is available at https://github.com/weitianxin/MLP_Fusion.
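To make the clustering idea concrete, the following is a minimal sketch, assuming a standard two-layer MLP with weights W1 (d_in x h), bias b1, and W2 (h x d_out): each hidden unit is treated as a sub-MLP (one column of W1, one bias entry, one row of W2), the sub-MLPs are clustered, and a smaller MLP is rebuilt from the centroids. The k-means clustering, the cluster-size rescaling, and the ReLU used in the toy check are illustrative assumptions rather than the paper's exact NTK-approximating procedure; see the linked repository for the authors' implementation.
```python
# Minimal sketch (not the authors' implementation) of compressing a
# two-layer MLP by clustering its hidden units ("sub-MLPs") into k centroids.
import numpy as np
from sklearn.cluster import KMeans

def fuse_mlp(W1, b1, W2, k):
    """W1: (d_in, h), b1: (h,), W2: (h, d_out) -> a fused MLP with k hidden units."""
    d_in = W1.shape[0]
    # Hidden unit i defines a sub-MLP: (W1[:, i], b1[i], W2[i, :]).
    sub_mlps = np.concatenate([W1.T, b1[:, None], W2], axis=1)   # (h, d_in + 1 + d_out)
    km = KMeans(n_clusters=k, n_init=10).fit(sub_mlps)
    centroids = km.cluster_centers_                              # (k, d_in + 1 + d_out)
    W1_c, b1_c, W2_c = centroids[:, :d_in].T, centroids[:, d_in], centroids[:, d_in + 1:]
    # Rescale each centroid's output weights by its cluster size so that the
    # fused MLP approximates the sum over the original hidden units.
    counts = np.bincount(km.labels_, minlength=k)
    return W1_c, b1_c, W2_c * counts[:, None]

# Toy usage (real transformer MLPs are e.g. 768 -> 3072 -> 768).
rng = np.random.default_rng(0)
W1, b1, W2 = rng.normal(size=(16, 64)), rng.normal(size=64), rng.normal(size=(64, 16))
W1_c, b1_c, W2_c = fuse_mlp(W1, b1, W2, k=16)

# Compare the original and fused MLP outputs on random inputs (ReLU assumed).
relu_mlp = lambda x, W1, b1, W2: np.maximum(x @ W1 + b1, 0.0) @ W2
x = rng.normal(size=(4, 16))
print(np.abs(relu_mlp(x, W1, b1, W2) - relu_mlp(x, W1_c, b1_c, W2_c)).mean())
```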
Related papers
- FSMLP: Modelling Channel Dependencies With Simplex Theory Based Multi-Layer Perceptions In Frequency Domain [16.693117400535833]
Time series forecasting (TSF) plays a crucial role in various domains, including web data analysis, energy consumption prediction, and weather forecasting.
While Multi-Layer Perceptrons (MLPs) are lightweight and effective for capturing temporal dependencies, they are prone to overfitting when used to model inter-channel dependencies.
We introduce a novel Simplex-MLP layer, where the weights are constrained within a standard simplex. This strategy encourages the model to learn simpler patterns and thereby reduces overfitting to extreme values.
arXiv Detail & Related papers (2024-12-02T16:04:15Z) - MLP Can Be A Good Transformer Learner [73.01739251050076]
Self-attention mechanism is the key of the Transformer but often criticized for its computation demands.
This paper introduces a novel strategy that simplifies vision transformers and reduces computational load through the selective removal of non-essential attention layers.
arXiv Detail & Related papers (2024-04-08T16:40:15Z) - Training Multilayer Perceptrons by Sampling with Quantum Annealers [38.046974698940545]
Many neural networks for vision applications are feedforward structures.
Backpropagation is currently the most effective technique to train them for supervised learning.
arXiv Detail & Related papers (2023-03-22T07:40:01Z) - Model-tuning Via Prompts Makes NLP Models Adversarially Robust [97.02353907677703]
We show surprising gains in adversarial robustness enjoyed by Model-tuning Via Prompts (MVP).
MVP improves performance against adversarial substitutions by an average of 8% over standard methods.
We also conduct ablations to investigate the mechanism underlying these gains.
arXiv Detail & Related papers (2023-03-13T17:41:57Z) - Efficient Language Modeling with Sparse all-MLP [53.81435968051093]
All-MLPs can match Transformers in language modeling, but still lag behind in downstream tasks.
We propose sparse all-MLPs with mixture-of-experts (MoEs) in both feature and input (token) dimensions.
We evaluate its zero-shot in-context learning performance on six downstream tasks, and find that it surpasses Transformer-based MoEs and dense Transformers.
arXiv Detail & Related papers (2022-03-14T04:32:19Z) - Using Fitness Dependent Optimizer for Training Multi-layer Perceptron [13.280383503879158]
This study presents a novel training algorithm depending upon the recently proposed Fitness Dependent Optimizer (FDO).
The stability of this algorithm has been verified and its performance proven in both the exploration and exploitation stages.
The proposed approach using FDO as a trainer can outperform the other approaches using different trainers on the dataset.
arXiv Detail & Related papers (2022-01-03T10:23:17Z) - RepMLPNet: Hierarchical Vision MLP with Re-parameterized Locality [113.1414517605892]
We propose a methodology, Locality Injection, to incorporate local priors into an FC layer.
RepMLPNet is the first MLP model that seamlessly transfers to Cityscapes semantic segmentation.
arXiv Detail & Related papers (2021-12-21T10:28:17Z) - MLP Architectures for Vision-and-Language Modeling: An Empirical Study [91.6393550858739]
We initiate the first empirical study on the use of MLP architectures for vision-and-language (VL) fusion.
We find that, without pre-training, using MLPs for multimodal fusion has a noticeable performance gap compared to transformers.
Instead of heavy multi-head attention, adding tiny one-head attention to MLP encoders is sufficient to achieve comparable performance to transformers.
arXiv Detail & Related papers (2021-12-08T18:26:19Z) - Sparse MLP for Image Recognition: Is Self-Attention Really Necessary? [65.37917850059017]
We build an attention-free network called sMLPNet.
For 2D image tokens, sMLP applies 1D MLPs along the axial directions, and the parameters are shared among rows or columns.
When scaling up to 66M parameters, sMLPNet achieves 83.4% top-1 accuracy, which is on par with the state-of-the-art Swin Transformer.
arXiv Detail & Related papers (2021-09-12T04:05:15Z) - Rethinking Token-Mixing MLP for MLP-based Vision Backbone [34.47616917228978]
We propose an improved structure, termed Circulant Channel-Specific (CCS) token-mixing MLP, which is spatial-invariant and channel-specific.
It takes fewer parameters but achieves higher classification accuracy on ImageNet1K.
arXiv Detail & Related papers (2021-06-28T17:59:57Z) - Neural Collaborative Filtering vs. Matrix Factorization Revisited [20.237381375881228]
Embedding based models have been the state of the art in collaborative filtering for over a decade.
In recent years, it was suggested to replace the dot product with a learned similarity, e.g. using a multilayer perceptron (MLP); a minimal sketch contrasting the two scoring functions follows this list.
arXiv Detail & Related papers (2020-05-19T18:07:08Z)
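To illustrate the substitution discussed in the last item, here is a minimal sketch (assumed embedding sizes and a simple two-layer scoring MLP, not taken from either paper) contrasting the matrix-factorization dot product with an MLP-based learned similarity:
```python
# Sketch: matrix factorization (dot product) vs. an MLP-based learned similarity.
import torch
import torch.nn as nn

class DotProductMF(nn.Module):
    def __init__(self, n_users, n_items, dim=32):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, dim)
        self.item_emb = nn.Embedding(n_items, dim)

    def forward(self, u, i):
        # Classic matrix factorization: score = <p_u, q_i>.
        return (self.user_emb(u) * self.item_emb(i)).sum(-1)

class MLPSimilarity(nn.Module):
    def __init__(self, n_users, n_items, dim=32, hidden=64):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, dim)
        self.item_emb = nn.Embedding(n_items, dim)
        # The dot product is replaced by an MLP over the concatenated embeddings.
        self.mlp = nn.Sequential(nn.Linear(2 * dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, u, i):
        x = torch.cat([self.user_emb(u), self.item_emb(i)], dim=-1)
        return self.mlp(x).squeeze(-1)

# Toy usage: score two (user, item) pairs with untrained models.
u, i = torch.tensor([0, 1]), torch.tensor([5, 7])
print(DotProductMF(100, 200)(u, i), MLPSimilarity(100, 200)(u, i))
```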