Mini-Model Adaptation: Efficiently Extending Pretrained Models to New
Languages via Aligned Shallow Training
- URL: http://arxiv.org/abs/2212.10503v2
- Date: Tue, 4 Jul 2023 19:06:55 GMT
- Title: Mini-Model Adaptation: Efficiently Extending Pretrained Models to New
Languages via Aligned Shallow Training
- Authors: Kelly Marchisio, Patrick Lewis, Yihong Chen, Mikel Artetxe
- Abstract summary: It is possible to expand pretrained Masked Language Models to new languages by learning a new set of embeddings, while keeping the transformer body frozen.
We propose mini-model adaptation, a compute-efficient alternative that builds a shallow mini-model from a fraction of a large model's parameters.
New language-specific embeddings can then be efficiently trained over the mini-model and plugged into the aligned large model for rapid cross-lingual transfer.
- Score: 36.5936227129021
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Prior work shows that it is possible to expand pretrained Masked Language
Models (MLMs) to new languages by learning a new set of embeddings, while
keeping the transformer body frozen. Despite learning a small subset of
parameters, this approach is not compute-efficient, as training the new
embeddings requires a full forward and backward pass over the entire model. We
propose mini-model adaptation, a compute-efficient alternative that builds a
shallow mini-model from a fraction of a large model's parameters. New
language-specific embeddings can then be efficiently trained over the
mini-model and plugged into the aligned large model for rapid cross-lingual
transfer. We explore two approaches to learn mini-models: MiniJoint, which
jointly pretrains the primary model and the mini-model using a single
transformer with a secondary MLM head at a middle layer; and MiniPost, where we
start from a regular pretrained model, build a mini-model by extracting and
freezing a few layers, and learn a small number of parameters on top.
Experiments on XNLI, MLQA and PAWS-X show that mini-model adaptation matches
the performance of the standard approach using 2.3x less compute on average.
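The adaptation recipe lends itself to a short illustration. The following PyTorch sketch follows the MiniPost description above: extract and freeze the bottom layers of a pretrained body, learn a small number of parameters on top, train new target-language embeddings over this shallow mini-model, and finally plug the embeddings into the frozen large model. Layer counts, module names, and the single alignment layer are illustrative assumptions rather than the authors' implementation.

```python
# Hedged sketch of MiniPost-style mini-model adaptation; hyperparameters, class
# names, and the single "alignment" layer are assumptions, not the paper's code.
import copy
import torch
import torch.nn as nn

D_MODEL, N_HEAD, TGT_VOCAB = 768, 12, 32000

def make_body(num_layers: int) -> nn.TransformerEncoder:
    layer = nn.TransformerEncoderLayer(D_MODEL, N_HEAD, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=num_layers)

# Stand-in for the pretrained large MLM body; it stays frozen during adaptation.
large_body = make_body(12)

# MiniPost: build a shallow mini-model by extracting and freezing a few bottom layers...
mini_body = nn.ModuleList(copy.deepcopy(large_body.layers[:4]))
for p in mini_body.parameters():
    p.requires_grad = False

# ...and learn a small number of parameters on top (here, one extra encoder layer)
# so the mini-model's outputs approximate the large model's output space.
align_layer = nn.TransformerEncoderLayer(D_MODEL, N_HEAD, batch_first=True)

# New language-specific embeddings, tied to the MLM output projection.
tgt_embeddings = nn.Embedding(TGT_VOCAB, D_MODEL)
mlm_head = nn.Linear(D_MODEL, TGT_VOCAB, bias=False)
mlm_head.weight = tgt_embeddings.weight

trainable = list(tgt_embeddings.parameters()) + list(align_layer.parameters())
optimizer = torch.optim.AdamW(trainable, lr=1e-4)

def mlm_step(token_ids: torch.Tensor, labels: torch.Tensor) -> float:
    """One masked-LM step over the mini-model: a cheap forward/backward pass."""
    hidden = tgt_embeddings(token_ids)
    for layer in mini_body:           # frozen, extracted layers
        hidden = layer(hidden)
    hidden = align_layer(hidden)      # small trainable top
    logits = mlm_head(hidden)
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, TGT_VOCAB), labels.reshape(-1), ignore_index=-100)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# After embedding training, plug tgt_embeddings into the frozen 12-layer large model
# (replacing its input embeddings) for cross-lingual transfer; the mini-model is discarded.
```

Because the forward and backward passes only traverse the shallow mini-model, each embedding-training step is far cheaper than a pass through the full network, which is where the reported compute savings come from.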
Related papers
- Cross-model Control: Improving Multiple Large Language Models in One-time Training [34.98931804630706]
Cross-model Control (CMC) is a method that improves multiple large language models in one-time training.
Based on this insight, we incorporate a tiny language model with a minimal number of parameters.
We propose a novel token mapping strategy named PM-MinED to make this tiny language model applicable to models with different vocabularies.
arXiv Detail & Related papers (2024-10-23T06:52:09Z)
- Scaling Smart: Accelerating Large Language Model Pre-training with Small Model Initialization [22.90653167145603]
We introduce HyperCloning, a method that can expand the parameters of a pre-trained language model to those of a larger model with increased hidden dimensions.
As a result, the larger model already inherits the predictive power and accuracy of the smaller model before the training starts.
arXiv Detail & Related papers (2024-09-19T16:50:26Z)
- MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies [85.57899012821211]
Small Language Models (SLMs) are a resource-efficient alternative to Large Language Models (LLMs).
We introduce MiniCPM, specifically the 1.2B and 2.4B non-embedding parameter variants.
We also introduce MiniCPM family, including MiniCPM-DPO, MiniCPM-MoE and MiniCPM-128K.
arXiv Detail & Related papers (2024-04-09T15:36:50Z)
- Initializing Models with Larger Ones [76.41561758293055]
We introduce weight selection, a method for initializing smaller models by selecting a subset of weights from a pretrained larger model.
Our experiments demonstrate that weight selection can significantly enhance the performance of small models and reduce their training time.
arXiv Detail & Related papers (2023-11-30T18:58:26Z)
- PELA: Learning Parameter-Efficient Models with Low-Rank Approximation [16.9278983497498]
We propose a novel method for increasing the parameter efficiency of pre-trained models by introducing an intermediate pre-training stage.
This allows for direct and efficient utilization of the low-rank model for downstream fine-tuning tasks.
arXiv Detail & Related papers (2023-10-16T07:17:33Z)
- Contrastive Alignment of Vision to Language Through Parameter-Efficient Transfer Learning [60.26952378997713]
Contrastive vision-language models (e.g. CLIP) are created by updating all the parameters of a vision model and language model through contrastive training.
We show that a minimal set of parameter updates (<7%) can achieve the same performance as full-model training.
We describe a series of experiments: we show that existing knowledge is conserved more strongly in parameter-efficient training.
arXiv Detail & Related papers (2023-03-21T14:12:08Z)
- eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort toward efficient adaptation of existing models by augmenting Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
arXiv Detail & Related papers (2023-03-20T19:20:34Z)
- Transfer training from smaller language model [6.982133308738434]
We present a method that saves training time and resource cost by transforming a small, well-trained model into a larger model.
We test the target model on several datasets and find it remains comparable to the source model.
arXiv Detail & Related papers (2021-04-23T02:56:02Z)
- MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers [117.67424061746247]
We present a simple and effective approach to compress large Transformer based pre-trained models.
We propose distilling the self-attention module of the teacher's last Transformer layer, which is effective and flexible for the student (sketched below after this entry).
Experimental results demonstrate that our monolingual model outperforms state-of-the-art baselines across different student model sizes.
arXiv Detail & Related papers (2020-02-25T15:21:10Z)
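For context on the MiniLM entry above, here is a hedged sketch of its deep self-attention distillation objective: the student's last Transformer layer is trained to match the teacher's last-layer attention distributions (and value relations) via KL divergence. Tensor shapes and helper names are illustrative assumptions, not the released implementation.

```python
# Hedged sketch of MiniLM-style last-layer self-attention distillation.
import torch
import torch.nn.functional as F

def attn_distribution(q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """Scaled dot-product attention probabilities; q, k: (batch, heads, seq, head_dim)."""
    scores = q @ k.transpose(-1, -2) / (q.size(-1) ** 0.5)
    return F.softmax(scores, dim=-1)

def value_relation(v: torch.Tensor) -> torch.Tensor:
    """Value-value relation distribution used alongside the attention distributions."""
    scores = v @ v.transpose(-1, -2) / (v.size(-1) ** 0.5)
    return F.softmax(scores, dim=-1)

def kl(teacher_probs: torch.Tensor, student_probs: torch.Tensor) -> torch.Tensor:
    """KL(teacher || student), averaged over the batch dimension."""
    return F.kl_div(student_probs.clamp_min(1e-9).log(), teacher_probs,
                    reduction="batchmean")

def minilm_distill_loss(teacher_qkv, student_qkv) -> torch.Tensor:
    """Distillation loss over the last layer's attention and value-relation maps."""
    tq, tk, tv = teacher_qkv    # queries/keys/values from the teacher's last layer
    sq, sk, sv = student_qkv    # queries/keys/values from the student's last layer
    attn_loss = kl(attn_distribution(tq, tk), attn_distribution(sq, sk))
    value_loss = kl(value_relation(tv), value_relation(sv))
    return attn_loss + value_loss
```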