Data-Efficiency with a Single GPU: An Exploration of Transfer Methods
for Small Language Models
- URL: http://arxiv.org/abs/2210.03871v1
- Date: Sat, 8 Oct 2022 01:45:22 GMT
- Authors: Alon Albalak, Akshat Shrivastava, Chinnadhurai Sankar, Adithya Sagar,
Mike Ross
- Abstract summary: Multi-task learning, instruction tuning, and prompting have been shown to improve the generalizability of large language models to new tasks.
This work explores and isolates the effects of (i) model size, (ii) general purpose MTL, (iii) in-domain MTL, (iv) instruction tuning, and (v) few-shot fine-tuning for models with fewer than 500 million parameters.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-task learning (MTL), instruction tuning, and prompting have recently
been shown to improve the generalizability of large language models to new
tasks. However, the benefits of such methods are less well-documented in
smaller language models, with some studies finding contradictory results. In
this work, we explore and isolate the effects of (i) model size, (ii) general
purpose MTL, (iii) in-domain MTL, (iv) instruction tuning, and (v) few-shot
fine-tuning for models with fewer than 500 million parameters. Our experiments
in the zero-shot setting demonstrate that models gain 31% relative improvement,
on average, from general purpose MTL, with an additional 37.6% relative gain
from in-domain MTL. Contrary to prior work on large models, we find that
instruction tuning provides a modest 2% performance improvement for small
models.
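The abstract's gains are relative improvements over a baseline. As a quick illustration of the arithmetic, the accuracy values below are hypothetical, chosen only so the computation reproduces the paper's reported percentages:

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Relative gain of `improved` over `baseline`, as a percentage."""
    return (improved - baseline) / baseline * 100

# Hypothetical zero-shot accuracies for a <500M-parameter model:
base = 40.0          # no multi-task learning
general_mtl = 52.4   # after general-purpose MTL
in_domain = 72.1     # after additional in-domain MTL

print(round(relative_improvement(base, general_mtl), 1))       # 31.0 (% relative gain)
print(round(relative_improvement(general_mtl, in_domain), 1))  # 37.6 (% additional gain)
```

Note that the second figure is computed relative to the general-purpose MTL score, not the original baseline, which is why the gains compound rather than add.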
Related papers
- Fine-tuning Large Language Models for Entity Matching [3.7277730514654555]
Generative large language models (LLMs) are a promising alternative to pre-trained language models for entity matching.
This paper explores the potential of fine-tuning LLMs for entity matching.
arXiv Detail & Related papers (2024-09-12T16:20:57Z)
- Emergent Abilities in Reduced-Scale Generative Language Models [10.51168925267033]
Large language models can solve new tasks without task-specific fine-tuning.
This ability is considered an emergent ability and is primarily seen in large language models with billions of parameters.
This study investigates if such emergent properties are strictly tied to model size or can be demonstrated by smaller models trained on reduced-scale data.
arXiv Detail & Related papers (2024-04-02T18:00:28Z)
- Teaching Language Models to Self-Improve through Interactive Demonstrations [83.9421355808174]
The self-improvement ability of large language models has been shown to be absent in smaller models and difficult for them to learn.
We introduce TriPosT, a training algorithm that endows smaller models with such self-improvement ability.
We show that our approach can improve LLaMA-7b's performance on math and reasoning tasks by up to 7.13%.
arXiv Detail & Related papers (2023-10-20T14:11:04Z)
- Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models [125.91897197446379]
We find that MoE models benefit more from instruction tuning than dense models.
Our most powerful model, FLAN-MOE-32B, surpasses the performance of FLAN-PALM-62B on four benchmark tasks.
arXiv Detail & Related papers (2023-05-24T04:22:26Z)
- Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes [91.58845026796149]
We introduce Distilling step-by-step, a new mechanism that trains small models that outperform large language models.
We present three findings across four NLP benchmarks.
arXiv Detail & Related papers (2023-05-03T17:50:56Z)
- Analyzing Bagging Methods for Language Models [0.5161531917413708]
We perform an analysis of bagging language models and compare single language models to bagged ensembles that are roughly equivalent in terms of final model size.
Our ensembling methods are at best roughly equivalent to single LM baselines.
arXiv Detail & Related papers (2022-07-19T06:30:37Z)
- METRO: Efficient Denoising Pretraining of Large Scale Autoencoding Language Models with Model Generated Signals [151.3601429216877]
We present an efficient method of pretraining large-scale autoencoding language models using training signals generated by an auxiliary model.
We propose a recipe, namely the "Model generated dEnoising TRaining Objective" (METRO).
The resultant models, METRO-LM, consisting of up to 5.4 billion parameters, achieve new state-of-the-art on the GLUE, SuperGLUE, and SQuAD benchmarks.
arXiv Detail & Related papers (2022-04-13T21:39:15Z)
- Model-Agnostic Multitask Fine-tuning for Few-shot Vision-Language Transfer Learning [59.38343286807997]
We propose Model-Agnostic Multitask Fine-tuning (MAMF) for vision-language models on unseen tasks.
Compared with model-agnostic meta-learning (MAML), MAMF discards the bi-level optimization and uses only first-order gradients.
We show that MAMF consistently outperforms the classical fine-tuning method for few-shot transfer learning on five benchmark datasets.
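The key distinction from MAML is that no bi-level (inner/outer loop) optimization is needed: updates are ordinary first-order gradient steps averaged across tasks. A minimal toy sketch of that idea, using hypothetical one-dimensional quadratic losses that are not from the MAMF paper:

```python
# Sketch: first-order multitask fine-tuning on a single scalar parameter.
# The two "tasks" are toy quadratic losses with different optima.

def grad(loss_fn, w, eps=1e-6):
    """Numerical first-order gradient of a scalar loss (central difference)."""
    return (loss_fn(w + eps) - loss_fn(w - eps)) / (2 * eps)

tasks = [lambda w: (w - 1.0) ** 2, lambda w: (w - 3.0) ** 2]

w, lr = 0.0, 0.1
for _ in range(200):
    # Average first-order gradients across tasks; no inner adaptation loop
    # and no second-order terms, unlike MAML.
    g = sum(grad(t, w) for t in tasks) / len(tasks)
    w -= lr * g

print(round(w, 2))  # converges near 2.0, the joint optimum of both tasks
```

With real models, `w` would be the shared parameters of the vision-language model and each task a batch loss, but the update rule has the same single-level form.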
arXiv Detail & Related papers (2022-03-09T17:26:53Z)
- Efficient Large Scale Language Modeling with Mixtures of Experts [61.45159383372181]
Mixture of Experts layers (MoEs) enable efficient scaling of language models through conditional computation.
This paper presents a detailed empirical study of how autoregressive MoE language models scale in comparison with dense models in a wide range of settings.
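"Conditional computation" here means that a router activates only a subset of experts per input, so compute per token stays small even as total parameters grow. A toy sketch of top-1 routing (the experts and router weights below are hypothetical scalar stand-ins, not the paper's architecture):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

# Hypothetical experts: each transforms its input differently.
experts = [lambda x: 2 * x, lambda x: -x, lambda x: x + 1]
# Hypothetical router weights: one score per expert.
router_w = [0.5, -1.0, 0.1]

def moe_layer(x):
    scores = softmax([w * x for w in router_w])
    top = max(range(len(experts)), key=lambda i: scores[i])
    return experts[top](x)  # only the selected expert is evaluated

print(moe_layer(2.0))  # router picks expert 0, which doubles the input
```

In a real MoE layer the experts are feed-forward networks and the router scores come from a learned projection of the token representation, but the routing logic has this shape.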
arXiv Detail & Related papers (2021-12-20T17:05:11Z)
- Complementary Ensemble Learning [1.90365714903665]
We derive a technique to improve performance of state-of-the-art deep learning models.
Specifically, we train auxiliary models which are able to complement state-of-the-art model uncertainty.
arXiv Detail & Related papers (2021-11-09T03:23:05Z)