Benchmarking Generalization via In-Context Instructions on 1,600+
Language Tasks
- URL: http://arxiv.org/abs/2204.07705v1
- Date: Sat, 16 Apr 2022 03:12:30 GMT
- Title: Benchmarking Generalization via In-Context Instructions on 1,600+
Language Tasks
- Authors: Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi,
Amirreza Mirzaei, Anjana Arunkumar, Arjun Ashok, Arut Selvan Dhanasekaran,
Atharva Naik, David Stap, Eshaan Pathak, Giannis Karamanolakis, Haizhi Gary
Lai, Ishan Purohit, Ishani Mondal, Jacob Anderson, Kirby Kuznia, Krima Doshi,
Maitreya Patel, Kuntal Kumar Pal, Mehrad Moradshahi, Mihir Parmar, Mirali
Purohit, Neeraj Varshney, Phani Rohitha Kaza, Pulkit Verma, Ravsehaj Singh
Puri, Rushang Karia, Shailaja Keyur Sampat, Savan Doshi, Siddhartha Mishra,
Sujan Reddy, Sumanta Patro, Tanay Dixit, Xudong Shen, Chitta Baral, Yejin
Choi, Hannaneh Hajishirzi, Noah A. Smith, Daniel Khashabi
- Abstract summary: Natural-Instructions v2 is a collection of 1,600+ diverse language tasks and their expert-written instructions.
The benchmark covers 70+ distinct task types, such as tagging, in-filling, and rewriting.
This benchmark enables large-scale evaluation of models' cross-task generalization.
- Score: 95.06087720086133
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: How can we measure the generalization of models to a variety of unseen tasks
when provided with their language instructions? To facilitate progress in this
goal, we introduce Natural-Instructions v2, a collection of 1,600+ diverse
language tasks and their expert-written instructions. More importantly, the
benchmark covers 70+ distinct task types, such as tagging, in-filling, and
rewriting. This benchmark was collected through contributions from NLP
practitioners in the community and an iterative peer-review process that
ensures its quality. It enables large-scale evaluation of cross-task
generalization -- training models on a subset of tasks and evaluating them on
the remaining unseen ones. For instance, we are able to rigorously quantify
generalization as a function of various scaling parameters, such as the number
of observed tasks, the number of instances, and model sizes. As a by-product of
these experiments, we introduce Tk-Instruct, an encoder-decoder Transformer
trained to follow a variety of in-context instructions (plain-language task
definitions or k-shot examples), which outperforms existing larger models
on our benchmark. We hope this benchmark facilitates future progress toward
more general-purpose language understanding models.
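As a concrete illustration of the evaluation protocol described above, the following minimal Python sketch assembles an in-context prompt from a plain-language task definition plus k-shot demonstrations and decodes with an instruction-tuned seq2seq model. The prompt template and the checkpoint id are illustrative assumptions, not necessarily the paper's canonical format.

# A minimal sketch of the cross-task evaluation protocol: build an
# in-context prompt from a task definition plus k-shot demonstrations,
# then generate with an instruction-tuned seq2seq model. The template
# and checkpoint id below are assumptions, not the paper's exact setup.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_NAME = "allenai/tk-instruct-base-def-pos"  # assumed checkpoint id

def build_prompt(definition, demos, test_input):
    """Concatenate a plain-language task definition, k demonstrations,
    and the test instance into a single input string."""
    parts = [f"Definition: {definition}"]
    for i, (x, y) in enumerate(demos, start=1):
        parts.append(f"Positive Example {i}- Input: {x} Output: {y}")
    parts.append(f"Now complete the following example- Input: {test_input} Output:")
    return "\n\n".join(parts)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

prompt = build_prompt(
    definition="Given a sentence, label its sentiment as positive or negative.",
    demos=[("I loved this movie.", "positive")],
    test_input="The plot was dull and predictable.",
)
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=1024)
output_ids = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))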
Related papers
- SpeechVerse: A Large-scale Generalizable Audio Language Model [38.67969337605572]
SpeechVerse is a robust multi-task training and curriculum learning framework.
It combines pre-trained speech and text foundation models via a small set of learnable parameters.
Empirical experiments show that the multi-task SpeechVerse model outperforms conventional task-specific baselines on 9 of the 11 tasks.
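A minimal PyTorch sketch of this bridging idea follows; all module sizes and names are illustrative assumptions, not SpeechVerse's actual architecture. A frozen speech encoder feeds a small trainable projection that maps speech features into a (frozen) text LM's embedding space, so only the projection's parameters are learned.

# Sketch: freeze a speech encoder and train only a small bottleneck
# projection into the text LM's embedding dimension. Sizes are made up.
import torch
import torch.nn as nn

class SpeechToLMAdapter(nn.Module):
    def __init__(self, speech_dim=512, lm_dim=1024):
        super().__init__()
        # The only trainable parameters: a small bottleneck projection.
        self.proj = nn.Sequential(
            nn.Linear(speech_dim, 256), nn.GELU(), nn.Linear(256, lm_dim)
        )

    def forward(self, speech_features):
        # (batch, frames, speech_dim) -> (batch, frames, lm_dim)
        return self.proj(speech_features)

speech_encoder = nn.GRU(80, 512, batch_first=True)  # stand-in encoder, frozen
for p in speech_encoder.parameters():
    p.requires_grad = False

adapter = SpeechToLMAdapter()
feats, _ = speech_encoder(torch.randn(2, 100, 80))  # fake log-mel input
lm_inputs = adapter(feats)                          # would be fed to a frozen LM
print(lm_inputs.shape)  # torch.Size([2, 100, 1024])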
arXiv Detail & Related papers (2024-05-14T03:33:31Z)
- In-context Learning Generalizes, But Not Always Robustly: The Case of Syntax [36.98247762224868]
In-context learning (ICL) is now a common method for teaching large language models (LLMs) new tasks.
Do models infer the underlying structure of the task defined by the context, or do they rely on superficial heuristics that apply only to identically distributed examples?
In experiments with models from the GPT, PaLM, and Llama 2 families, we find large variance across LMs.
The variance is explained more by the composition of the pre-training corpus and supervision methods than by model size.
arXiv Detail & Related papers (2023-11-13T23:52:43Z)
- Dynamic-SUPERB: Towards A Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark for Speech [107.81472531864195]
Text language models have shown remarkable zero-shot capability in generalizing to unseen tasks when provided with well-formulated instructions.
We present Dynamic-SUPERB, a benchmark for building universal speech models capable of leveraging instruction tuning to perform multiple tasks in a zero-shot fashion.
arXiv Detail & Related papers (2023-09-18T06:43:30Z)
- Pre-Training to Learn in Context [138.0745138788142]
Language models' in-context learning ability is not fully exploited because they are never explicitly trained to learn in context.
We propose PICL (Pre-training for In-Context Learning), a framework to enhance the language models' in-context learning ability.
Our experiments show that PICL is more effective and more task-generalizable than a range of baselines, outperforming larger language models with nearly 4x as many parameters.
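As a rough sketch of this idea (our illustration, not necessarily PICL's exact procedure), one can build pre-training sequences by concatenating several same-task examples, so earlier examples serve as in-context demonstrations for later ones under an ordinary language-modeling loss:

# Assumption: training texts are formed by concatenating same-task
# examples so the model learns to use earlier ones as context.
import random

def build_icl_pretraining_instance(task_examples, k=4, sep="\n"):
    """Sample k+1 examples of one task; the first k act as in-context
    demonstrations for the last one within a single training sequence."""
    demos = random.sample(task_examples, k + 1)
    return sep.join(f"Input: {x} Output: {y}" for x, y in demos)

pool = [("2+2", "4"), ("3+5", "8"), ("1+9", "10"), ("7+6", "13"), ("4+4", "8")]
print(build_icl_pretraining_instance(pool, k=3))
# The resulting string is then trained on with a standard LM loss.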
arXiv Detail & Related papers (2023-05-16T03:38:06Z)
- Task Ambiguity in Humans and Language Models [7.033374427612259]
We propose AmbiBench, a new benchmark of ambiguously-specified classification tasks.
We evaluate humans and models on AmbiBench by seeing how well they identify the intended task.
We show how to dramatically improve the accuracy of language models trained without large-scale human feedback.
arXiv Detail & Related papers (2022-12-20T18:35:33Z)
- Analyzing the Limits of Self-Supervision in Handling Bias in Language [52.26068057260399]
We evaluate how well language models capture the semantics of four tasks for bias: diagnosis, identification, extraction and rephrasing.
Our analyses indicate that language models are capable of performing these tasks to widely varying degrees across different bias dimensions, such as gender and political affiliation.
arXiv Detail & Related papers (2021-12-16T05:36:08Z)
- Multitask Prompted Training Enables Zero-Shot Task Generalization [70.12770442071657]
We develop a system for mapping general natural language tasks into a human-readable prompted form.
We fine-tune a pretrained encoder-decoder model on this multitask mixture covering a wide variety of tasks.
The model attains strong zero-shot performance on several standard datasets, often outperforming models 16x its size.
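A minimal sketch of this prompted-form mapping, with template wording that is our own illustration rather than one of the paper's released templates:

# Assumption: an illustrative template turning an NLI-style labeled
# example into a text-to-text (input, target) pair for multitask tuning.
def apply_template(example):
    """Map a labeled example into a human-readable prompted form."""
    prompt = (
        f"Premise: {example['premise']}\n"
        f"Hypothesis: {example['hypothesis']}\n"
        "Does the premise entail the hypothesis? Answer yes, no, or maybe."
    )
    target = {0: "yes", 1: "maybe", 2: "no"}[example["label"]]
    return prompt, target

pair = apply_template({
    "premise": "A dog is running in the park.",
    "hypothesis": "An animal is outside.",
    "label": 0,
})
print(pair[0], "->", pair[1])
# Many such prompted datasets are mixed and used to fine-tune a seq2seq model.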
arXiv Detail & Related papers (2021-10-15T17:08:57Z)
- Can Machines Read Coding Manuals Yet? -- A Benchmark for Building Better Language Models for Code Understanding [3.98345038769576]
We derive a set of benchmarks that assess code understanding based on tasks such as predicting the best answer to a question in a forum post.
We evaluate the performance of current state-of-the-art language models on these tasks and show that fine-tuning yields a significant improvement on each task.
arXiv Detail & Related papers (2021-09-15T17:42:44Z)
- XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization [128.37244072182506]
XTREME, the Cross-lingual TRansfer Evaluation of Multilingual Encoders benchmark, evaluates the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks.
We demonstrate that while models tested on English reach human performance on many tasks, there is still a sizable gap in the performance of cross-lingually transferred models.
arXiv Detail & Related papers (2020-03-24T19:09:37Z)