OPT-IML: Scaling Language Model Instruction Meta Learning through the
Lens of Generalization
- URL: http://arxiv.org/abs/2212.12017v1
- Date: Thu, 22 Dec 2022 19:56:09 GMT
- Title: OPT-IML: Scaling Language Model Instruction Meta Learning through the
Lens of Generalization
- Authors: Srinivasan Iyer and Xi Victoria Lin and Ramakanth Pasunuru and Todor
Mihaylov and Daniel Simig and Ping Yu and Kurt Shuster and Tianlu Wang and
Qing Liu and Punit Singh Koura and Xian Li and Brian O'Horo and Gabriel
Pereyra and Jeff Wang and Christopher Dewan and Asli Celikyilmaz and Luke
Zettlemoyer and Ves Stoyanov
- Abstract summary: We describe the effect of instruction-tuning decisions on downstream task performance when scaling both model and benchmark sizes.
We present insights about instruction-tuning decisions as applied to OPT-30B and further exploit these insights to train OPT-IML 30B and 175B, which are instruction-tuned versions of OPT.
- Score: 101.37439352091612
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent work has shown that fine-tuning large pre-trained language models on a
collection of tasks described via instructions, a.k.a. instruction-tuning,
improves their zero- and few-shot generalization to unseen tasks. However, there
is a limited understanding of the performance trade-offs of different decisions
made during the instruction-tuning process. These decisions include the scale
and diversity of the instruction-tuning benchmark, different task sampling
strategies, fine-tuning with and without demonstrations, training using
specialized datasets for reasoning and dialogue, and finally, the fine-tuning
objectives themselves. In this paper, we characterize the effect of
instruction-tuning decisions on downstream task performance when scaling both
model and benchmark sizes. To this end, we create OPT-IML Bench: a large
benchmark for Instruction Meta-Learning (IML) of 2000 NLP tasks consolidated
into task categories from 8 existing benchmarks, and prepare an evaluation
framework to measure three types of model generalizations: to tasks from fully
held-out categories, to held-out tasks from seen categories, and to held-out
instances from seen tasks. Through the lens of this framework, we first present
insights about instruction-tuning decisions as applied to OPT-30B and further
exploit these insights to train OPT-IML 30B and 175B, which are
instruction-tuned versions of OPT. OPT-IML demonstrates all three
generalization abilities at both scales on four different evaluation benchmarks
with diverse tasks and input formats -- PromptSource, FLAN,
Super-NaturalInstructions, and UnifiedSKG. It not only significantly
outperforms OPT on all benchmarks but is also highly competitive with existing
models fine-tuned on each specific benchmark. We release OPT-IML at both
scales, together with the OPT-IML Bench evaluation framework.
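The three generalization levels named in the abstract (fully held-out categories, held-out tasks from seen categories, and held-out instances from seen tasks) can be illustrated with a small partitioning routine. This is a minimal sketch under stated assumptions, not the OPT-IML Bench implementation: the function name, the record fields (`category`, `task`, `text`), the toy task names, and the 10% instance hold-out fraction are all hypothetical and chosen only for illustration.

```python
import random


def split_for_generalization(examples, heldout_categories, heldout_tasks,
                             instance_holdout_frac=0.1, seed=0):
    """Partition examples into a training set and three evaluation sets.

    Hypothetical helper: each example is a dict with 'category', 'task'
    and 'text' keys. The checks run from the strictest hold-out level
    to the weakest, so every example lands in exactly one split.
    """
    rng = random.Random(seed)
    splits = {
        "train": [],
        "heldout_category": [],   # tasks from categories never seen in training
        "heldout_task": [],       # unseen tasks from categories seen in training
        "heldout_instance": [],   # unseen instances of tasks seen in training
    }
    for ex in examples:
        if ex["category"] in heldout_categories:
            splits["heldout_category"].append(ex)
        elif ex["task"] in heldout_tasks:
            splits["heldout_task"].append(ex)
        elif rng.random() < instance_holdout_frac:
            splits["heldout_instance"].append(ex)
        else:
            splits["train"].append(ex)
    return splits


# Toy data (assumed names): 3 categories, 5 tasks, 20 instances per task.
tasks_by_category = {
    "sentiment": ["sent_a", "sent_b"],
    "qa": ["qa_a", "qa_b"],
    "summarization": ["sum_a"],
}
examples = [
    {"category": category, "task": task, "text": f"{task} example {i}"}
    for category, tasks in tasks_by_category.items()
    for task in tasks
    for i in range(20)
]

splits = split_for_generalization(
    examples,
    heldout_categories={"summarization"},
    heldout_tasks={"qa_b"},
)
for name, items in splits.items():
    print(name, len(items))
```

Ordering the checks from category to task to instance keeps the splits disjoint: no training example can belong to a held-out category or a held-out task, which is what makes the three evaluation sets measure progressively weaker forms of generalization.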
Related papers
- MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs [97.94579295913606]
Multimodal Large Language Models (MLLMs) have garnered increased attention from both industry and academia.
In the development process, evaluation is critical since it provides intuitive feedback and guidance on improving models.
This work aims to offer researchers an easy grasp of how to effectively evaluate MLLMs according to different needs and to inspire better evaluation methods.
arXiv Detail & Related papers (2024-11-22T18:59:54Z)
- PUB: A Pragmatics Understanding Benchmark for Assessing LLMs' Pragmatics
Capabilities [40.55743949223173]
Pragmatics Understanding Benchmark (PUB) is a dataset consisting of fourteen tasks in four pragmatics phenomena.
PUB includes a total of 28k data points, 6.1k of which have been created by us, and the rest are adapted from existing datasets.
Our study indicates that fine-tuning for instruction-following and chat significantly enhances the pragmatics capabilities of smaller language models.
arXiv Detail & Related papers (2024-01-13T13:46:14Z)
- Balancing Specialized and General Skills in LLMs: The Impact of Modern
Tuning and Data Strategy [27.365319494865165]
The paper details the design, data collection, analytical techniques, and results validating the proposed frameworks.
It aims to provide businesses and researchers with actionable insights on effectively adapting LLMs for specialized contexts.
arXiv Detail & Related papers (2023-10-07T23:29:00Z)
- Dynamic-SUPERB: Towards A Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark for Speech [107.81472531864195]
Text language models have shown remarkable zero-shot capability in generalizing to unseen tasks when provided with well-formulated instructions.
We present Dynamic-SUPERB, a benchmark for building universal speech models capable of leveraging instruction tuning to perform multiple tasks in a zero-shot fashion.
arXiv Detail & Related papers (2023-09-18T06:43:30Z)
- Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for
Large Language Models [125.91897197446379]
We find that MoE models benefit more from instruction tuning than dense models.
Our most powerful model, FLAN-MOE-32B, surpasses the performance of FLAN-PALM-62B on four benchmark tasks.
arXiv Detail & Related papers (2023-05-24T04:22:26Z)
- Benchmarking Generalization via In-Context Instructions on 1,600+
Language Tasks [95.06087720086133]
Natural-Instructions v2 is a collection of 1,600+ diverse language tasks and their expert written instructions.
The benchmark covers 70+ distinct task types, such as tagging, in-filling, and rewriting.
This benchmark enables large-scale evaluation of cross-task generalization of the models.
arXiv Detail & Related papers (2022-04-16T03:12:30Z)
- CINS: Comprehensive Instruction for Few-shot Learning in Task-oriented
Dialog Systems [56.302581679816775]
This paper proposes Comprehensive Instruction (CINS) that exploits PLMs with task-specific instructions.
We design a schema (definition, constraint, prompt) of instructions and their customized realizations for three important downstream tasks in ToD.
Experiments are conducted on these ToD tasks in realistic few-shot learning scenarios with small validation data.
arXiv Detail & Related papers (2021-09-10T03:23:06Z)
- Revisiting Unsupervised Meta-Learning: Amplifying or Compensating for
the Characteristics of Few-Shot Tasks [30.893785366366078]
We develop a practical approach towards few-shot image classification, where a visual recognition system is constructed with limited data.
We find that the base class set labels are not necessary, and discriminative embeddings could be meta-learned in an unsupervised manner.
Experiments on few-shot learning benchmarks verify our approaches outperform previous methods by a 4-10% performance gap.
arXiv Detail & Related papers (2020-11-30T10:08:35Z)