MiniALBERT: Model Distillation via Parameter-Efficient Recursive
Transformers
- URL: http://arxiv.org/abs/2210.06425v2
- Date: Sun, 30 Apr 2023 13:00:24 GMT
- Title: MiniALBERT: Model Distillation via Parameter-Efficient Recursive
Transformers
- Authors: Mohammadmahdi Nouriborji, Omid Rohanian, Samaneh Kouchaki, David A.
Clifton
- Abstract summary: MiniALBERT is a technique for converting the knowledge of fully parameterised LMs (such as BERT) into a compact recursive student.
We test our proposed models on a number of general and biomedical NLP tasks to demonstrate their viability and compare them with the state-of-the-art and other existing compact models.
- Score: 12.432191400869002
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained Language Models (LMs) have become an integral part of Natural
Language Processing (NLP) in recent years, due to their superior performance in
downstream applications. In spite of this resounding success, the usability of
LMs is constrained by computational and time complexity, along with their
increasing size; an issue that has been referred to as 'overparameterisation'.
Different strategies have been proposed in the literature to alleviate these
problems, with the aim of creating effective compact models that approach the
performance of their bloated counterparts with negligible losses.
One of the most popular techniques in this area of research is model
distillation. Another potent but underutilised technique is cross-layer
parameter sharing. In this work, we combine these two strategies and present
MiniALBERT, a technique for converting the knowledge of fully parameterised LMs
(such as BERT) into a compact recursive student. In addition, we investigate
the application of bottleneck adapters for layer-wise adaptation of our
recursive student, and also explore the efficacy of adapter tuning for
fine-tuning of compact models. We test our proposed models on a number of
general and biomedical NLP tasks to demonstrate their viability and compare
them with the state-of-the-art and other existing compact models. All the code
used in the experiments is available at
https://github.com/nlpie-research/MiniALBERT. Our pre-trained compact models
can be accessed from https://huggingface.co/nlpie.
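The abstract combines two ideas: ALBERT-style cross-layer parameter sharing, where a single transformer block is applied recursively in place of a stack of distinct layers, and bottleneck adapters inserted at each recursion so the otherwise tied computation can still specialise layer by layer, with adapter-only tuning at fine-tuning time. Below is a minimal PyTorch sketch of that combination; the class names, hidden size, adapter bottleneck, and number of recursions are illustrative assumptions rather than the released MiniALBERT configuration, and the distillation objective against a BERT teacher is omitted.

```python
# Minimal sketch of a recursive (cross-layer parameter-shared) student with
# per-recursion bottleneck adapters. Hyper-parameters and class names are
# placeholders, not the released MiniALBERT configuration.
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    """Down-project -> nonlinearity -> up-project, added residually."""

    def __init__(self, hidden: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden, bottleneck)
        self.up = nn.Linear(bottleneck, hidden)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))


class RecursiveStudent(nn.Module):
    """One transformer block reused for every 'layer', plus a small adapter
    per recursion so the tied layers can still specialise."""

    def __init__(self, vocab: int = 30522, hidden: int = 312,
                 heads: int = 12, recursions: int = 6):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        # A single block shared across all recursions (ALBERT-style sharing).
        self.shared_block = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=heads, dim_feedforward=4 * hidden,
            batch_first=True)
        # Layer-wise adaptation: one lightweight adapter per recursion.
        self.adapters = nn.ModuleList(
            BottleneckAdapter(hidden) for _ in range(recursions))

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        h = self.embed(input_ids)
        for adapter in self.adapters:
            h = adapter(self.shared_block(h))
        return h


model = RecursiveStudent()
hidden_states = model(torch.randint(0, 30522, (2, 16)))
print(hidden_states.shape)  # torch.Size([2, 16, 312])

# Adapter tuning for fine-tuning: freeze everything except the adapters.
for name, p in model.named_parameters():
    p.requires_grad = "adapters" in name
```

Because the block is shared, the parameter count stays close to that of a single layer regardless of how many recursions are applied, while the adapters contribute only a small number of trainable weights per recursion.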
Related papers
- Hyper Compressed Fine-Tuning of Large Foundation Models with Quantum Inspired Adapters [0.0]
We propose Quantum-Inspired Adapters, a PEFT approach inspired by Hamming-weight quantum circuits from the quantum machine learning literature.
We test our proposed adapters by adapting large language models and large vision transformers on benchmark datasets.
arXiv Detail & Related papers (2025-02-10T13:06:56Z) - FineGates: LLMs Finetuning with Compression using Stochastic Gates [7.093692674858257]
Large Language Models (LLMs) present significant challenges for full finetuning due to the high computational demands.
Lightweight finetuning techniques have been proposed, like learning low-rank adapter layers.
We propose an adapter model based on stochastic gates that simultaneously sparsifies the frozen base model and provides task-specific adaptation.
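A minimal sketch of the gate idea described above: learnable per-neuron gates scale the outputs of a frozen layer and are pushed toward zero by a sparsity penalty. The plain sigmoid gate and L1-style penalty here are assumptions for illustration; FineGates' actual stochastic-gate parameterisation may differ.

```python
# Illustrative gate-based adapter over a frozen layer (assumed form, not the
# exact FineGates parameterisation): sigmoid gates scale frozen outputs and a
# penalty on the gates encourages sparsity in the base model.
import torch
import torch.nn as nn


class GatedFrozenLinear(nn.Module):
    def __init__(self, frozen: nn.Linear):
        super().__init__()
        self.frozen = frozen
        for p in self.frozen.parameters():      # base weights stay frozen
            p.requires_grad = False
        # One learnable gate logit per output neuron.
        self.gate_logits = nn.Parameter(torch.full((frozen.out_features,), 2.0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gates = torch.sigmoid(self.gate_logits)     # values in (0, 1)
        return self.frozen(x) * gates

    def sparsity_penalty(self) -> torch.Tensor:
        # Added to the task loss to push gates (and thus neurons) toward zero.
        return torch.sigmoid(self.gate_logits).sum()


layer = GatedFrozenLinear(nn.Linear(768, 768))
out = layer(torch.randn(4, 768))
loss = out.pow(2).mean() + 1e-3 * layer.sparsity_penalty()
loss.backward()   # gradients flow only into the gate logits
```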
arXiv Detail & Related papers (2024-12-17T14:33:05Z) - ALoRE: Efficient Visual Adaptation via Aggregating Low Rank Experts [71.91042186338163]
ALoRE is a novel PETL method that reuses the hypercomplex parameterized space constructed by the Kronecker product to Aggregate Low Rank Experts.
Thanks to the artful design, ALoRE maintains negligible extra parameters and can be effortlessly merged into the frozen backbone.
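For intuition, the sketch below builds a weight update as a sum of Kronecker products between small "hypercomplex" factors and low-rank expert matrices, the general PHM-style construction the summary alludes to; the shapes, expert count, and aggregation are assumptions, not ALoRE's exact design.

```python
# Sketch of aggregating low-rank experts through Kronecker products
# (PHM-style construction; shapes and expert count are illustrative,
# not ALoRE's exact parameterisation).
import torch
import torch.nn as nn


class KroneckerLowRankDelta(nn.Module):
    def __init__(self, d_in: int = 768, d_out: int = 768,
                 n: int = 4, rank: int = 8):
        super().__init__()
        assert d_in % n == 0 and d_out % n == 0
        self.s = nn.Parameter(torch.randn(n, n, n) * 0.02)        # small factors
        self.u = nn.Parameter(torch.randn(n, d_out // n, rank) * 0.02)
        self.v = nn.Parameter(torch.zeros(n, rank, d_in // n))    # zero init => no update at start

    def delta_weight(self) -> torch.Tensor:
        # Sum over experts of kron(S_i, U_i @ V_i) -> full (d_out, d_in) update.
        experts = [torch.kron(self.s[i], self.u[i] @ self.v[i])
                   for i in range(self.s.shape[0])]
        return torch.stack(experts).sum(0)

    def forward(self, x: torch.Tensor, frozen_weight: torch.Tensor) -> torch.Tensor:
        return x @ (frozen_weight + self.delta_weight()).t()


frozen = torch.randn(768, 768)                 # stands in for a pretrained, frozen weight
delta = KroneckerLowRankDelta()
print(delta(torch.randn(2, 768), frozen).shape)   # torch.Size([2, 768])
```

Because the update can be folded into the frozen weight after training, an adapter of this form adds no inference-time parameters, which is consistent with the "effortlessly merged into the frozen backbone" claim above.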
arXiv Detail & Related papers (2024-12-11T12:31:30Z) - Transformer Layer Injection: A Novel Approach for Efficient Upscaling of Large Language Models [0.0]
Transformer Layer Injection (TLI) is a novel method for efficiently upscaling large language models (LLMs).
Our approach improves upon the conventional Depth Up-Scaling (DUS) technique by injecting new layers into every set of K layers.
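A rough sketch of the layer-injection idea: starting from a pretrained stack, a new block is inserted after every K existing blocks to deepen the model before continued training. Here each injected block is a copy of its predecessor; whether TLI copies, interpolates, or identity-initialises the new layers is not stated in the summary, so treat that choice as an assumption.

```python
# Sketch of depth up-scaling by injecting a new block after every K blocks.
# Copy-initialisation of the injected blocks is an assumption for illustration.
import copy
import torch.nn as nn


def inject_layers(layers: nn.ModuleList, k: int = 4) -> nn.ModuleList:
    """Return a deeper stack with one extra block after every k blocks."""
    upscaled = []
    for i, layer in enumerate(layers):
        upscaled.append(layer)
        if (i + 1) % k == 0:
            upscaled.append(copy.deepcopy(layer))   # injected block
    return nn.ModuleList(upscaled)


base = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True)
    for _ in range(12))
deeper = inject_layers(base, k=4)
print(len(base), "->", len(deeper))   # 12 -> 15
```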
arXiv Detail & Related papers (2024-10-15T14:41:44Z) - Language Models as Zero-shot Lossless Gradient Compressors: Towards General Neural Parameter Prior Models [56.00251589760559]
Large language models (LLMs) can act as gradient priors in a zero-shot setting.
We introduce LM-GC, a novel method that integrates LLMs with arithmetic coding.
Experiments indicate that LM-GC surpasses existing state-of-the-art lossless compression methods.
arXiv Detail & Related papers (2024-09-26T13:38:33Z) - Efficient Learning With Sine-Activated Low-rank Matrices [25.12262017296922]
We propose a novel theoretical framework that integrates a sinusoidal function within the low-rank decomposition process.
Our method proves to be a plug-in enhancement for existing low-rank models, as evidenced by its successful application in Vision Transformers (ViT), Large Language Models (LLMs), Neural Radiance Fields (NeRF) and 3D shape modelling.
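The sketch below shows one way to read that idea: an element-wise sinusoid with a frequency hyper-parameter is applied to the low-rank product before it is used as a weight update, raising the effective rank of the matrix without adding parameters. The frequency, scaling, and zero-update initialisation are illustrative assumptions.

```python
# Sketch of a sine-activated low-rank update added to a frozen linear layer.
# Frequency omega, scaling, and initialisation are illustrative assumptions.
import math
import torch
import torch.nn as nn


class SineLowRankLinear(nn.Module):
    def __init__(self, frozen: nn.Linear, rank: int = 8, omega: float = 30.0):
        super().__init__()
        self.frozen = frozen
        for p in self.frozen.parameters():
            p.requires_grad = False
        d_out, d_in = frozen.weight.shape
        self.a = nn.Parameter(torch.randn(d_out, rank) / math.sqrt(rank))
        self.b = nn.Parameter(torch.zeros(rank, d_in))   # zero => no update at start
        self.omega = omega

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Element-wise sinusoid applied to the low-rank product raises its
        # effective rank while keeping the parameter count of a rank-r update.
        delta = torch.sin(self.omega * (self.a @ self.b)) / self.omega
        return self.frozen(x) + x @ delta.t()


layer = SineLowRankLinear(nn.Linear(768, 768))
print(layer(torch.randn(2, 768)).shape)   # torch.Size([2, 768])
```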
arXiv Detail & Related papers (2024-03-28T08:58:20Z) - Understanding Parameter Sharing in Transformers [53.75988363281843]
Previous work on Transformers has focused on sharing parameters in different layers, which can improve the performance of models with limited parameters by increasing model depth.
We show that the success of this approach can be largely attributed to better convergence, with only a small part due to the increased model complexity.
Experiments on 8 machine translation tasks show that our model achieves competitive performance with only half the model complexity of parameter sharing models.
arXiv Detail & Related papers (2023-06-15T10:48:59Z) - Adapted Multimodal BERT with Layer-wise Fusion for Sentiment Analysis [84.12658971655253]
We propose Adapted Multimodal BERT, a BERT-based architecture for multimodal tasks.
The adapter adjusts the pretrained language model for the task at hand, while the fusion layers perform task-specific, layer-wise fusion of audio-visual information with textual BERT representations.
In our ablations we see that this approach leads to efficient models that can outperform their fine-tuned counterparts and are robust to input noise.
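The sketch below illustrates the kind of layer-wise gated fusion the summary describes: at each layer, projected audio-visual features are mixed into the textual hidden states through a learned gate, while the adapters (omitted here) handle task adaptation. The gating form and feature dimensions are assumptions for illustration.

```python
# Sketch of layer-wise fusion of audio-visual features into textual hidden
# states via a learned gate (gating form and dimensions are assumptions).
import torch
import torch.nn as nn


class LayerwiseFusion(nn.Module):
    def __init__(self, text_dim: int = 768, av_dim: int = 128):
        super().__init__()
        self.proj = nn.Linear(av_dim, text_dim)
        self.gate = nn.Linear(2 * text_dim, text_dim)

    def forward(self, text_h: torch.Tensor, av: torch.Tensor) -> torch.Tensor:
        av_h = self.proj(av)                                   # align dimensions
        g = torch.sigmoid(self.gate(torch.cat([text_h, av_h], dim=-1)))
        return text_h + g * av_h                               # gated residual fusion


fusion = LayerwiseFusion()
text_states = torch.randn(2, 16, 768)    # per-layer BERT hidden states
av_features = torch.randn(2, 16, 128)    # aligned audio-visual features
print(fusion(text_states, av_features).shape)   # torch.Size([2, 16, 768])
```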
arXiv Detail & Related papers (2022-12-01T17:31:42Z) - MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided
Adaptation [68.30497162547768]
We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed.
We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks.
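For context, the sketch below shows the mixture-of-experts feed-forward structure that BERT's FFN is adapted into: a router sends each token to one expert, so capacity grows while per-token compute stays roughly constant. MoEBERT's importance-guided initialisation and its distillation objective are not shown, and the router and expert sizes are placeholders.

```python
# Minimal top-1 mixture-of-experts feed-forward block (placeholder sizes;
# MoEBERT's importance-guided adaptation and distillation are not shown).
import torch
import torch.nn as nn


class MoEFeedForward(nn.Module):
    def __init__(self, hidden: int = 768, inner: int = 768, n_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(hidden, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, inner), nn.GELU(),
                          nn.Linear(inner, hidden))
            for _ in range(n_experts))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = x.reshape(-1, x.shape[-1])
        choice = self.router(tokens).argmax(dim=-1)        # top-1 routing
        out = torch.zeros_like(tokens)
        for i, expert in enumerate(self.experts):
            mask = choice == i
            if mask.any():
                out[mask] = expert(tokens[mask])           # only chosen tokens
        return out.reshape(x.shape)


ffn = MoEFeedForward()
print(ffn(torch.randn(2, 16, 768)).shape)   # torch.Size([2, 16, 768])
```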
arXiv Detail & Related papers (2022-04-15T23:19:37Z) - MoEfication: Conditional Computation of Transformer Models for Efficient
Inference [66.56994436947441]
Transformer-based pre-trained language models can achieve superior performance on most NLP tasks due to large parameter capacity, but also lead to huge computation cost.
We explore accelerating large-model inference through conditional computation based on the sparse activation phenomenon.
We propose to transform a large model into its mixture-of-experts (MoE) version with equal model size, namely MoEfication.
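As a rough illustration of that transformation, the sketch below splits the intermediate neurons of an already-trained dense FFN into groups ("experts") and evaluates only the groups a router scores highest, exploiting the fact that most ReLU activations are zero. The contiguous split and the simple linear router are assumptions; the paper's actual expert construction (e.g., neuron clustering) may differ.

```python
# Sketch of MoEfication-style conditional computation: the intermediate
# neurons of a trained dense FFN are split into groups ("experts") and only
# the top-k groups contribute per token. Contiguous split and linear router
# are illustrative assumptions.
import torch
import torch.nn as nn


class MoEfiedFFN(nn.Module):
    def __init__(self, dense_in: nn.Linear, dense_out: nn.Linear,
                 n_experts: int = 8, top_k: int = 2):
        super().__init__()
        inner = dense_in.out_features
        assert inner % n_experts == 0
        self.chunk = inner // n_experts
        self.dense_in, self.dense_out = dense_in, dense_out
        self.router = nn.Linear(dense_in.in_features, n_experts)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # For clarity this computes all neurons, then masks the unselected
        # groups; a real implementation would skip them to save compute.
        scores = self.router(x)
        top = scores.topk(self.top_k, dim=-1).indices
        keep = torch.zeros_like(scores).scatter_(-1, top, 1.0)
        h = torch.relu(self.dense_in(x))
        h = h * keep.repeat_interleave(self.chunk, dim=-1)
        return self.dense_out(h)


ffn = MoEfiedFFN(nn.Linear(768, 3072), nn.Linear(3072, 768))
print(ffn(torch.randn(2, 16, 768)).shape)   # torch.Size([2, 16, 768])
```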
arXiv Detail & Related papers (2021-10-05T02:14:38Z) - Compressing Large-Scale Transformer-Based Models: A Case Study on BERT [41.04066537294312]
Pre-trained Transformer-based models have achieved state-of-the-art performance for various Natural Language Processing (NLP) tasks.
These models often have billions of parameters, and, thus, are too resource-hungry and computation-intensive to suit low-capability devices or applications.
One potential remedy for this is model compression, which has attracted a lot of research attention.
arXiv Detail & Related papers (2020-02-27T09:20:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.