Do Models Really Learn to Follow Instructions? An Empirical Study of
Instruction Tuning
- URL: http://arxiv.org/abs/2305.11383v2
- Date: Thu, 25 May 2023 21:07:07 GMT
- Title: Do Models Really Learn to Follow Instructions? An Empirical Study of
Instruction Tuning
- Authors: Po-Nien Kung and Nanyun Peng
- Abstract summary: Recent works on instruction tuning (IT) have achieved great performance with zero-shot generalizability to unseen tasks.
We analyze how models utilize instructions during IT by comparing model training with altered vs. original instructions.
- Score: 37.01833561948585
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent works on instruction tuning (IT) have achieved great performance with
zero-shot generalizability to unseen tasks. With additional context (e.g., task
definition, examples) provided to models for fine-tuning, they achieved much
higher performance than untuned models. Despite impressive performance gains,
what models learn from IT remains understudied. In this work, we analyze how
models utilize instructions during IT by comparing model training with altered
vs. original instructions. Specifically, we create simplified task definitions
by removing all semantic components and only leaving the output space
information, and delusive examples that contain incorrect input-output mapping.
Our experiments show that models trained on simplified task definitions or
delusive examples can achieve performance comparable to models trained on the
original instructions and examples. Furthermore, we introduce a random baseline
to perform zero-shot classification tasks, and find that it achieves
performance (42.6% exact-match) similar to that of IT (43% exact-match) in the
low-resource setting, while both methods significantly outperform naive T5 (30%
exact-match). Our analysis provides evidence that the impressive performance
gain of current IT models can come from picking up superficial patterns, such
as learning the output format and guessing. Our study highlights the urgent
need for more reliable IT methods and evaluation.
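To make the probes concrete, below is a minimal, hypothetical Python sketch (not the authors' released code) of the three manipulations the abstract describes: a simplified task definition that keeps only output-space information, delusive examples with deliberately incorrect input-output mappings, and a random baseline that guesses uniformly over the label space. The task, label names, and demonstrations are invented for illustration.

```python
import random

# Hypothetical sketch (not the authors' released code); the sentiment task,
# label names, and demonstrations below are invented for illustration.

def simplified_task_definition(label_space):
    """Keep only output-space information: drop all semantic content from the
    task definition and state the set of valid labels."""
    return "Answer with one of: " + ", ".join(label_space)

def delusive_examples(examples, label_space, seed=0):
    """Build examples with an incorrect input-output mapping by swapping each
    gold label for a different label drawn from the output space."""
    rng = random.Random(seed)
    return [(text, rng.choice([l for l in label_space if l != gold]))
            for text, gold in examples]

def random_baseline(label_space, seed=0):
    """Zero-shot 'classifier' that ignores the input and guesses uniformly
    over the label space; its exact-match score reflects format plus chance."""
    rng = random.Random(seed)
    return lambda _input_text: rng.choice(list(label_space))

labels = ["positive", "negative"]
demos = [("great movie", "positive"), ("terrible plot", "negative")]
print(simplified_task_definition(labels))                # output-space-only instruction
print(delusive_examples(demos, labels))                  # corrupted demonstrations
print(random_baseline(labels)("the acting was superb"))  # input-independent guess
```

Under this reading, an instruction-tuned model that only matches such a random baseline on held-out classification tasks is consistent with the abstract's claim that gains can come from learning the output format and guessing.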
Related papers
- VQA Training Sets are Self-play Environments for Generating Few-shot Pools [2.556825820539693]
We propose a technique in which existing training sets can be directly used for constructing computational environments with task metrics as rewards.
The proposed method starts with zero-shot prompts and iteratively refines them by selecting few-shot examples that maximize the task metric on the training set.
Our experiments showcase how Gemini learns to use itself, or another smaller, specialized model such as ScreenAI, to iteratively improve performance on training sets.
arXiv Detail & Related papers (2024-05-30T07:38:58Z)
- DETAIL: Task DEmonsTration Attribution for Interpretable In-context Learning [75.68193159293425]
In-context learning (ICL) allows transformer-based language models to learn a specific task with a few "task demonstrations" without updating their parameters.
We propose an influence function-based attribution technique, DETAIL, that addresses the specific characteristics of ICL.
We experimentally demonstrate the wide applicability of DETAIL by showing that attribution scores obtained on white-box models transfer to black-box models and improve model performance.
arXiv Detail & Related papers (2024-05-22T15:52:52Z) - Mixture-of-Experts Meets Instruction Tuning:A Winning Combination for
Large Language Models [125.91897197446379]
We find that MoE models benefit more from instruction tuning than dense models.
Our most powerful model, FLAN-MOE-32B, surpasses the performance of FLAN-PALM-62B on four benchmark tasks.
arXiv Detail & Related papers (2023-05-24T04:22:26Z)
- Instruction Tuned Models are Quick Learners [20.771930945083994]
In this work, we demonstrate the sample efficiency of instruction-tuned models across various tasks.
In the STL setting, instruction-tuned models trained with only 25% of the downstream training data surpass the SOTA performance on the downstream tasks.
In the MTL setting, an instruction-tuned model trained on only 6% of the downstream training data achieves SOTA, while using 100% of the training data results in a further 3.69 percentage-point improvement.
arXiv Detail & Related papers (2023-05-17T22:30:01Z)
- On the Compositional Generalization Gap of In-Context Learning [73.09193595292233]
We look at the gap between the in-distribution (ID) and out-of-distribution (OOD) performance of such models in semantic parsing tasks with in-context learning.
We evaluate four model families, OPT, BLOOM, CodeGen, and Codex, on three semantic parsing datasets.
arXiv Detail & Related papers (2022-11-15T19:56:37Z)
- Assessing Out-of-Domain Language Model Performance from Few Examples [38.245449474937914]
We address the task of predicting out-of-domain (OOD) performance in a few-shot fashion.
We benchmark performance on this task using model accuracy on the few-shot examples.
We show that attribution-based factors can help rank relative model OOD performance.
arXiv Detail & Related papers (2022-10-13T04:45:26Z)
- How Many Data Samples is an Additional Instruction Worth? [20.66688303609522]
The recently introduced instruction paradigm empowers non-expert users to leverage NLP resources by defining a new task in natural language.
Our results indicate that an additional instruction can be equivalent to 200 data samples on average across tasks.
arXiv Detail & Related papers (2022-03-17T08:30:30Z)
- Complementary Ensemble Learning [1.90365714903665]
We derive a technique to improve performance of state-of-the-art deep learning models.
Specifically, we train auxiliary models which are able to complement state-of-the-art model uncertainty.
arXiv Detail & Related papers (2021-11-09T03:23:05Z)
- Exploring Strategies for Generalizable Commonsense Reasoning with Pre-trained Models [62.28551903638434]
We measure the impact of three different adaptation methods on the generalization and accuracy of models.
Experiments with two models show that fine-tuning performs best, by learning both the content and the structure of the task, but suffers from overfitting and limited generalization to novel answers.
We observe that alternative adaptation methods like prefix-tuning have comparable accuracy, but generalize better to unseen answers and are more robust to adversarial splits.
arXiv Detail & Related papers (2021-09-07T03:13:06Z)
- Learning to Perturb Word Embeddings for Out-of-distribution QA [55.103586220757464]
We propose a simple yet effective data augmentation (DA) method based on a noise generator, which learns to perturb the word embeddings of the input questions and context without changing their semantics.
We validate the performance of QA models trained on a single source dataset with our word-embedding perturbations, evaluating on five different target domains.
Notably, the model trained with ours outperforms the model trained with more than 240K artificially generated QA pairs.
arXiv Detail & Related papers (2021-05-06T14:12:26Z)