An Emulator for Fine-Tuning Large Language Models using Small Language
Models
- URL: http://arxiv.org/abs/2310.12962v1
- Date: Thu, 19 Oct 2023 17:57:16 GMT
- Title: An Emulator for Fine-Tuning Large Language Models using Small Language
Models
- Authors: Eric Mitchell, Rafael Rafailov, Archit Sharma, Chelsea Finn,
Christopher D. Manning
- Abstract summary: We introduce emulated fine-tuning (EFT), a principled and practical method for sampling from a distribution that approximates the result of pre-training and fine-tuning at different scales.
We show that EFT enables test-time adjustment of competing behavioral traits like helpfulness and harmlessness without additional training.
Finally, a special case of emulated fine-tuning, which we call LM up-scaling, avoids resource-intensive fine-tuning of large pre-trained models by ensembling them with small fine-tuned models.
- Score: 91.02498576056057
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Widely used language models (LMs) are typically built by scaling up a
two-stage training pipeline: a pre-training stage that uses a very large,
diverse dataset of text and a fine-tuning (sometimes, 'alignment') stage that
uses targeted examples or other specifications of desired behaviors. While it
has been hypothesized that knowledge and skills come from pre-training, and
fine-tuning mostly filters this knowledge and skillset, this intuition has not
been extensively tested. To aid in doing so, we introduce a novel technique for
decoupling the knowledge and skills gained in these two stages, enabling a
direct answer to the question, "What would happen if we combined the knowledge
learned by a large model during pre-training with the knowledge learned by a
small model during fine-tuning (or vice versa)?" Using an RL-based framework
derived from recent developments in learning from human preferences, we
introduce emulated fine-tuning (EFT), a principled and practical method for
sampling from a distribution that approximates (or 'emulates') the result of
pre-training and fine-tuning at different scales. Our experiments with EFT show
that scaling up fine-tuning tends to improve helpfulness, while scaling up
pre-training tends to improve factuality. Beyond decoupling scale, we show that
EFT enables test-time adjustment of competing behavioral traits like
helpfulness and harmlessness without additional training. Finally, a special
case of emulated fine-tuning, which we call LM up-scaling, avoids
resource-intensive fine-tuning of large pre-trained models by ensembling them
with small fine-tuned models, essentially emulating the result of fine-tuning
the large pre-trained model. Up-scaling consistently improves helpfulness and
factuality of instruction-following models in the Llama, Llama-2, and Falcon
families, without additional hyperparameters or training.
Related papers
- The Fine Line: Navigating Large Language Model Pretraining with Down-streaming Capability Analysis [60.52921835351632]
This paper undertakes a comprehensive comparison of model capabilities at various pretraining intermediate checkpoints.
We confirm that specific downstream metrics exhibit similar training dynamics across models of different sizes.
In addition to our core findings, we've reproduced Amber and OpenLLaMA, releasing their intermediate checkpoints.
arXiv Detail & Related papers (2024-04-01T16:00:01Z) - LEVI: Generalizable Fine-tuning via Layer-wise Ensemble of Different Views [28.081794908107604]
Fine-tuning is used to leverage the power of pre-trained foundation models in new downstream tasks.
Recent studies have observed challenges in the generalization of fine-tuned models to unseen distributions.
We propose a novel generalizable fine-tuning method LEVI, where the pre-trained model is adaptively ensembled layer-wise with a small task-specific model.
arXiv Detail & Related papers (2024-02-07T08:16:40Z) - PILOT: A Pre-Trained Model-Based Continual Learning Toolbox [71.63186089279218]
This paper introduces a pre-trained model-based continual learning toolbox known as PILOT.
On the one hand, PILOT implements some state-of-the-art class-incremental learning algorithms based on pre-trained models, such as L2P, DualPrompt, and CODA-Prompt.
On the other hand, PILOT fits typical class-incremental learning algorithms within the context of pre-trained models to evaluate their effectiveness.
arXiv Detail & Related papers (2023-09-13T17:55:11Z) - Towards Foundation Models for Scientific Machine Learning:
Characterizing Scaling and Transfer Behavior [32.74388989649232]
We study how pre-training could be used for scientific machine learning (SciML) applications.
We find that fine-tuning these models yields more performance gains as model size increases.
arXiv Detail & Related papers (2023-06-01T00:32:59Z) - TWINS: A Fine-Tuning Framework for Improved Transferability of
Adversarial Robustness and Generalization [89.54947228958494]
This paper focuses on the fine-tuning of an adversarially pre-trained model in various classification tasks.
We propose a novel statistics-based approach, Two-WIng NormliSation (TWINS) fine-tuning framework.
TWINS is shown to be effective on a wide range of image classification datasets in terms of both generalization and robustness.
arXiv Detail & Related papers (2023-03-20T14:12:55Z) - SPDF: Sparse Pre-training and Dense Fine-tuning for Large Language
Models [4.114555639014612]
We show the benefits of using unstructured weight sparsity to train only a subset of weights during pre-training.
We demonstrate that we can induce up to 75% sparsity into a 1.3B parameter GPT-3 XL model resulting in a 2.5x reduction in pre-training FLOPs.
arXiv Detail & Related papers (2023-03-18T17:56:01Z) - Revisiting Pre-training in Audio-Visual Learning [6.547660539954143]
We explore the effects of pre-trained models on two audio-visual learning scenarios.
We propose Adaptive Batchnorm Re-initialization (ABRi) to better exploit the capacity of pre-trained models for target tasks.
arXiv Detail & Related papers (2023-02-07T15:34:14Z) - Towards Inadequately Pre-trained Models in Transfer Learning [37.66278189011681]
Better ImageNet pre-trained models have been demonstrated to have better transferability to downstream tasks.
In this paper, we found that during the same pre-training process, models at middle epochs, which is inadequately pre-trained, can outperform fully trained models.
Our discoveries suggest that, during pre-training, models tend to first learn spectral components corresponding to large singular values.
arXiv Detail & Related papers (2022-03-09T12:15:55Z) - bert2BERT: Towards Reusable Pretrained Language Models [51.078081486422896]
We propose bert2BERT, which can effectively transfer the knowledge of an existing smaller pre-trained model to a large model.
bert2BERT saves about 45% and 47% computational cost of pre-training BERT_BASE and GPT_BASE by reusing the models of almost their half sizes.
arXiv Detail & Related papers (2021-10-14T04:05:25Z) - BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based
Masked Language-models [51.53936551681613]
We show that fine-tuning only the bias terms (or a subset of the bias terms) of pre-trained BERT models is competitive with (and sometimes better than) fine-tuning the entire model.
They support the hypothesis that finetuning is mainly about exposing knowledge induced by language-modeling training, rather than learning new task-specific linguistic knowledge.
arXiv Detail & Related papers (2021-06-18T16:09:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.