Mini-Model Adaptation: Efficiently Extending Pretrained Models to New
Languages via Aligned Shallow Training
- URL: http://arxiv.org/abs/2212.10503v2
- Date: Tue, 4 Jul 2023 19:06:55 GMT
- Title: Mini-Model Adaptation: Efficiently Extending Pretrained Models to New
Languages via Aligned Shallow Training
- Authors: Kelly Marchisio, Patrick Lewis, Yihong Chen, Mikel Artetxe
- Abstract summary: It is possible to expand pretrained Masked Language Models to new languages by learning a new set of embeddings, while keeping the transformer body frozen.
We propose mini-model adaptation, a compute-efficient alternative that builds a shallow mini-model from a fraction of a large model's parameters.
New language-specific embeddings can then be efficiently trained over the mini-model and plugged into the aligned large model for rapid cross-lingual transfer.
- Score: 36.5936227129021
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Prior work shows that it is possible to expand pretrained Masked Language
Models (MLMs) to new languages by learning a new set of embeddings, while
keeping the transformer body frozen. Despite learning a small subset of
parameters, this approach is not compute-efficient, as training the new
embeddings requires a full forward and backward pass over the entire model. We
propose mini-model adaptation, a compute-efficient alternative that builds a
shallow mini-model from a fraction of a large model's parameters. New
language-specific embeddings can then be efficiently trained over the
mini-model and plugged into the aligned large model for rapid cross-lingual
transfer. We explore two approaches to learn mini-models: MiniJoint, which
jointly pretrains the primary model and the mini-model using a single
transformer with a secondary MLM head at a middle layer; and MiniPost, where we
start from a regular pretrained model, build a mini-model by extracting and
freezing a few layers, and learn a small number of parameters on top.
Experiments on XNLI, MLQA and PAWS-X show that mini-model adaptation matches
the performance of the standard approach using 2.3x less compute on average.
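The adaptation recipe lends itself to a short illustration. The following PyTorch sketch follows the MiniPost description above: extract and freeze the bottom layers of a pretrained body, learn a small number of parameters on top, train new target-language embeddings over this shallow mini-model, and finally plug the embeddings into the frozen large model. Layer counts, module names, and the single alignment layer are illustrative assumptions rather than the authors' implementation.

```python
# Hedged sketch of MiniPost-style mini-model adaptation; hyperparameters, class
# names, and the single "alignment" layer are assumptions, not the paper's code.
import copy
import torch
import torch.nn as nn

D_MODEL, N_HEAD, TGT_VOCAB = 768, 12, 32000

def make_body(num_layers: int) -> nn.TransformerEncoder:
    layer = nn.TransformerEncoderLayer(D_MODEL, N_HEAD, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=num_layers)

# Stand-in for the pretrained large MLM body; it stays frozen during adaptation.
large_body = make_body(12)

# MiniPost: build a shallow mini-model by extracting and freezing a few bottom layers...
mini_body = nn.ModuleList(copy.deepcopy(large_body.layers[:4]))
for p in mini_body.parameters():
    p.requires_grad = False

# ...and learn a small number of parameters on top (here, one extra encoder layer)
# so the mini-model's outputs approximate the large model's output space.
align_layer = nn.TransformerEncoderLayer(D_MODEL, N_HEAD, batch_first=True)

# New language-specific embeddings, tied to the MLM output projection.
tgt_embeddings = nn.Embedding(TGT_VOCAB, D_MODEL)
mlm_head = nn.Linear(D_MODEL, TGT_VOCAB, bias=False)
mlm_head.weight = tgt_embeddings.weight

trainable = list(tgt_embeddings.parameters()) + list(align_layer.parameters())
optimizer = torch.optim.AdamW(trainable, lr=1e-4)

def mlm_step(token_ids: torch.Tensor, labels: torch.Tensor) -> float:
    """One masked-LM step over the mini-model: a cheap forward/backward pass."""
    hidden = tgt_embeddings(token_ids)
    for layer in mini_body:           # frozen, extracted layers
        hidden = layer(hidden)
    hidden = align_layer(hidden)      # small trainable top
    logits = mlm_head(hidden)
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, TGT_VOCAB), labels.reshape(-1), ignore_index=-100)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# After embedding training, plug tgt_embeddings into the frozen 12-layer large model
# (replacing its input embeddings) for cross-lingual transfer; the mini-model is discarded.
```

Because the forward and backward passes only traverse the shallow mini-model, each embedding-training step is far cheaper than a pass through the full network, which is where the reported compute savings come from.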
Related papers
- Cross-model Control: Improving Multiple Large Language Models in One-time Training [34.98931804630706]
Cross-model Control (CMC) is a method that improves multiple large language models in one-time training.
Based on this insight, we incorporate a tiny language model with a minimal number of parameters.
We propose a novel token mapping strategy named PM-MinED to make this tiny language model applicable to models with different vocabularies.
arXiv Detail & Related papers (2024-10-23T06:52:09Z)
- Scaling Smart: Accelerating Large Language Model Pre-training with Small Model Initialization [22.90653167145603]
We introduce HyperCloning, a method that can expand the parameters of a pre-trained language model to those of a larger model with increased hidden dimensions.
As a result, the larger model already inherits the predictive power and accuracy of the smaller model before the training starts.
arXiv Detail & Related papers (2024-09-19T16:50:26Z)
- MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies [85.57899012821211]
Small Language Models (SLMs) are a resource-efficient alternative to Large Language Models (LLMs).
We introduce MiniCPM, specifically the 1.2B and 2.4B non-embedding parameter variants.
We also introduce MiniCPM family, including MiniCPM-DPO, MiniCPM-MoE and MiniCPM-128K.
arXiv Detail & Related papers (2024-04-09T15:36:50Z)
- Initializing Models with Larger Ones [76.41561758293055]
We introduce weight selection, a method for initializing smaller models by selecting a subset of weights from a pretrained larger model.
Our experiments demonstrate that weight selection can significantly enhance the performance of small models and reduce their training time.
arXiv Detail & Related papers (2023-11-30T18:58:26Z)
- PELA: Learning Parameter-Efficient Models with Low-Rank Approximation [16.9278983497498]
We propose a novel method for increasing the parameter efficiency of pre-trained models by introducing an intermediate pre-training stage.
This allows for direct and efficient utilization of the low-rank model for downstream fine-tuning tasks.
arXiv Detail & Related papers (2023-10-16T07:17:33Z)
- Contrastive Alignment of Vision to Language Through Parameter-Efficient Transfer Learning [60.26952378997713]
Contrastive vision-language models (e.g. CLIP) are created by updating all the parameters of a vision model and language model through contrastive training.
We show that a minimal set of parameter updates (<7%) can achieve the same performance as full-model training.
We describe a series of experiments: we show that existing knowledge is conserved more strongly in parameter-efficient training.
arXiv Detail & Related papers (2023-03-21T14:12:08Z)
- eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort toward efficient adaptation of existing models by augmenting Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
arXiv Detail & Related papers (2023-03-20T19:20:34Z)
- Transfer training from smaller language model [6.982133308738434]
We present a method that saves training time and resource cost by transforming a small, well-trained model into a larger model.
We test the target model on several datasets and find it remains comparable to the source model.
arXiv Detail & Related papers (2021-04-23T02:56:02Z)
- MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers [117.67424061746247]
We present a simple and effective approach to compress large Transformer based pre-trained models.
We propose distilling the self-attention module of the teacher's last Transformer layer, which is effective and flexible for the student (sketched below after this entry).
Experimental results demonstrate that our monolingual model outperforms state-of-the-art baselines across different student model sizes.
arXiv Detail & Related papers (2020-02-25T15:21:10Z)
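For context on the MiniLM entry above, here is a hedged sketch of its deep self-attention distillation objective: the student's last Transformer layer is trained to match the teacher's last-layer attention distributions (and value relations) via KL divergence. Tensor shapes and helper names are illustrative assumptions, not the released implementation.

```python
# Hedged sketch of MiniLM-style last-layer self-attention distillation.
import torch
import torch.nn.functional as F

def attn_distribution(q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """Scaled dot-product attention probabilities; q, k: (batch, heads, seq, head_dim)."""
    scores = q @ k.transpose(-1, -2) / (q.size(-1) ** 0.5)
    return F.softmax(scores, dim=-1)

def value_relation(v: torch.Tensor) -> torch.Tensor:
    """Value-value relation distribution used alongside the attention distributions."""
    scores = v @ v.transpose(-1, -2) / (v.size(-1) ** 0.5)
    return F.softmax(scores, dim=-1)

def kl(teacher_probs: torch.Tensor, student_probs: torch.Tensor) -> torch.Tensor:
    """KL(teacher || student), averaged over the batch dimension."""
    return F.kl_div(student_probs.clamp_min(1e-9).log(), teacher_probs,
                    reduction="batchmean")

def minilm_distill_loss(teacher_qkv, student_qkv) -> torch.Tensor:
    """Distillation loss over the last layer's attention and value-relation maps."""
    tq, tk, tv = teacher_qkv    # queries/keys/values from the teacher's last layer
    sq, sk, sv = student_qkv    # queries/keys/values from the student's last layer
    attn_loss = kl(attn_distribution(tq, tk), attn_distribution(sq, sk))
    value_loss = kl(value_relation(tv), value_relation(sv))
    return attn_loss + value_loss
```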