Task Ambiguity in Humans and Language Models
- URL: http://arxiv.org/abs/2212.10711v1
- Date: Tue, 20 Dec 2022 18:35:33 GMT
- Title: Task Ambiguity in Humans and Language Models
- Authors: Alex Tamkin, Kunal Handa, Avash Shrestha, Noah Goodman
- Abstract summary: We propose AmbiBench, a new benchmark of ambiguously-specified classification tasks.
We evaluate humans and models on AmbiBench by seeing how well they identify the intended task.
We show how to dramatically improve the accuracy of language models trained without large-scale human feedback training.
- Score: 7.033374427612259
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Language models have recently achieved strong performance across a wide range
of NLP benchmarks. However, unlike benchmarks, real world tasks are often
poorly specified, and agents must deduce the user's intended behavior from a
combination of context, instructions, and examples. We investigate how both
humans and models behave in the face of such task ambiguity by proposing
AmbiBench, a new benchmark of six ambiguously-specified classification tasks.
We evaluate humans and models on AmbiBench by seeing how well they identify the
intended task using 1) instructions with varying degrees of ambiguity, and 2)
different numbers of labeled examples. We find that the combination of model
scaling (to 175B parameters) and training with human feedback data enables
models to approach or exceed the accuracy of human participants across tasks,
but that either one alone is not sufficient. In addition, we show how to
dramatically improve the accuracy of language models trained without
large-scale human feedback training by finetuning on a small number of
ambiguous in-context examples, providing a promising direction for teaching
models to generalize well in the face of ambiguity.
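To make the evaluation setup above concrete, the sketch below shows one way an AmbiBench-style probe could be assembled: a prompt combines an instruction (ambiguous or disambiguated) with a varying number of labeled examples, and a model is scored on whether it labels held-out sentences according to the intended feature. The task, instruction wording, example sentences, and the `predict` callable are illustrative assumptions for the sketch, not the actual AmbiBench data or prompts.

```python
# Minimal sketch of an AmbiBench-style ambiguity evaluation.
# All task content below is a hypothetical stand-in, not benchmark data.
import random

AMBIGUOUS_INSTRUCTION = "Output 'X' or 'Y' depending on the sentence."
DISAMBIGUATED_INSTRUCTION = (
    "Output 'X' if the sentence mentions an animal and 'Y' otherwise."
)

# Hypothetical labeled pool: (sentence, label under the intended "animal" feature).
LABELED_POOL = [
    ("The dog slept in the kitchen.", "X"),
    ("The lawyer argued in the park.", "Y"),
    ("A sparrow landed on the bench.", "X"),
    ("The teacher wrote on the board.", "Y"),
]


def build_prompt(instruction: str, num_examples: int, query: str) -> str:
    """Compose an instruction, k labeled in-context examples, and the query sentence."""
    shots = random.sample(LABELED_POOL, k=num_examples)
    lines = [instruction]
    for sentence, label in shots:
        lines.append(f"Sentence: {sentence}\nLabel: {label}")
    lines.append(f"Sentence: {query}\nLabel:")
    return "\n\n".join(lines)


def accuracy(predict, instruction: str, num_examples: int, test_set) -> float:
    """Fraction of held-out sentences labeled as the intended task dictates."""
    correct = 0
    for sentence, gold in test_set:
        prompt = build_prompt(instruction, num_examples, sentence)
        correct += int(predict(prompt).strip() == gold)
    return correct / len(test_set)


if __name__ == "__main__":
    # Stand-in "model" that always answers "X", just to exercise the loop;
    # in practice `predict` would call whatever language model is being evaluated.
    always_x = lambda prompt: "X"
    test = [("A horse grazed outside.", "X"), ("The clerk filed a report.", "Y")]
    print(accuracy(always_x, AMBIGUOUS_INSTRUCTION, num_examples=2, test_set=test))
```

Sweeping `instruction` over the ambiguous and disambiguated variants and `num_examples` over different shot counts reproduces, in miniature, the two axes of the evaluation described in the abstract.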
Related papers
- DevBench: A multimodal developmental benchmark for language learning [0.34129029452670606]
We introduce DevBench, a benchmark for evaluating vision-language models on language-learning tasks paired with human behavioral data.
We show that DevBench provides a way of comparing models to human language development.
These comparisons highlight ways in which model and human language learning processes diverge.
arXiv Detail & Related papers (2024-06-14T17:49:41Z)
- Automatic Evaluation of Generative Models with Instruction Tuning [14.369719297698694]
A recent paradigm fine-tunes pre-trained language models to emulate human judgements for a particular task and evaluation criterion.
Inspired by the generalization ability of instruction-tuned models, we propose a learned metric based on instruction tuning.
arXiv Detail & Related papers (2023-10-30T23:00:52Z)
- ProsAudit, a prosodic benchmark for self-supervised speech models [14.198508548718676]
ProsAudit is a benchmark to assess structural prosodic knowledge in self-supervised learning (SSL) speech models.
It consists of two subtasks, their corresponding metrics, and an evaluation dataset.
arXiv Detail & Related papers (2023-02-23T14:30:23Z)
- On the Compositional Generalization Gap of In-Context Learning [73.09193595292233]
We look at the gap between the in-distribution (ID) and out-of-distribution (OOD) performance of pre-trained language models on semantic parsing tasks with in-context learning.
We evaluate four model families, OPT, BLOOM, CodeGen and Codex on three semantic parsing datasets.
arXiv Detail & Related papers (2022-11-15T19:56:37Z)
- MiQA: A Benchmark for Inference on Metaphorical Questions [5.32836690371986]
We propose a benchmark to assess the capability of large language models to reason with conventional metaphors.
We examine the performance of state-of-the-art pre-trained models on binary-choice tasks.
arXiv Detail & Related papers (2022-10-14T17:46:05Z)
- Benchmarking Generalization via In-Context Instructions on 1,600+ Language Tasks [95.06087720086133]
Natural-Instructions v2 is a collection of 1,600+ diverse language tasks and their expert-written instructions.
The benchmark covers 70+ distinct task types, such as tagging, in-filling, and rewriting.
This benchmark enables large-scale evaluation of cross-task generalization of the models.
arXiv Detail & Related papers (2022-04-16T03:12:30Z)
- A Generative Language Model for Few-shot Aspect-Based Sentiment Analysis [90.24921443175514]
We focus on aspect-based sentiment analysis, which involves extracting aspect terms and categories and predicting their corresponding polarities.
We propose to reformulate the extraction and prediction tasks as sequence generation, using a generative language model with unidirectional attention.
Our approach outperforms the previous state-of-the-art (based on BERT) in average performance by a large margin in both few-shot and full-shot settings.
arXiv Detail & Related papers (2022-04-11T18:31:53Z)
- SUPERB-SG: Enhanced Speech processing Universal PERformance Benchmark for Semantic and Generative Capabilities [76.97949110580703]
We introduce SUPERB-SG, a new benchmark to evaluate pre-trained models across various speech tasks.
We use a lightweight methodology to test the robustness of representations learned by pre-trained models under shifts in data domain.
We also show that the task diversity of SUPERB-SG coupled with limited task supervision is an effective recipe for evaluating the generalizability of model representations.
arXiv Detail & Related papers (2022-03-14T04:26:40Z)
- Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models [86.02610674750345]
Adversarial GLUE (AdvGLUE) is a new multi-task benchmark to explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks.
We apply 14 adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations.
All the language models and robust training methods we tested perform poorly on AdvGLUE, with scores lagging far behind their benign accuracy.
arXiv Detail & Related papers (2021-11-04T12:59:55Z)
- Multitask Prompted Training Enables Zero-Shot Task Generalization [70.12770442071657]
We develop a system for mapping general natural language tasks into a human-readable prompted form.
We fine-tune a pretrained encoder-decoder model on this multitask mixture covering a wide variety of tasks.
The model attains strong zero-shot performance on several standard datasets, often outperforming models up to 16x its size.
arXiv Detail & Related papers (2021-10-15T17:08:57Z)