CLUES: Few-Shot Learning Evaluation in Natural Language Understanding
- URL: http://arxiv.org/abs/2111.02570v1
- Date: Thu, 4 Nov 2021 00:43:15 GMT
- Title: CLUES: Few-Shot Learning Evaluation in Natural Language Understanding
- Authors: Subhabrata Mukherjee, Xiaodong Liu, Guoqing Zheng, Saghar Hosseini,
Hao Cheng, Greg Yang, Christopher Meek, Ahmed Hassan Awadallah, Jianfeng Gao
- Abstract summary: We introduce CLUES, a benchmark for evaluating the few-shot learning capabilities of NLU models.
We demonstrate that while recent models reach human performance when they have access to large amounts of labeled data, there is a huge gap in performance in the few-shot setting for most tasks.
- Score: 81.63968985419982
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most recent progress in natural language understanding (NLU) has been driven,
in part, by benchmarks such as GLUE, SuperGLUE, SQuAD, etc. In fact, many NLU
models have now matched or exceeded "human-level" performance on many tasks in
these benchmarks. Most of these benchmarks, however, give models access to
relatively large amounts of labeled data for training. As such, the models are
provided far more data than required by humans to achieve strong performance.
That has motivated a line of work that focuses on improving few-shot learning
performance of NLU models. However, there is a lack of standardized evaluation
benchmarks for few-shot NLU resulting in different experimental settings in
different papers. To help accelerate this line of work, we introduce CLUES
(Constrained Language Understanding Evaluation Standard), a benchmark for
evaluating the few-shot learning capabilities of NLU models. We demonstrate
that while recent models reach human performance when they have access to large
amounts of labeled data, there is a huge gap in performance in the few-shot
setting for most tasks. We also demonstrate differences between alternative
model families and adaptation techniques in the few-shot setting. Finally, we
discuss several principles and choices in designing the experimental settings
for evaluating the true few-shot learning performance and suggest a unified
standardized approach to few-shot learning evaluation. We aim to encourage
research on NLU models that can generalize to new tasks with a small number of
examples. Code and data for CLUES are available at
https://github.com/microsoft/CLUES.
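To illustrate the kind of standardized few-shot evaluation protocol the abstract argues for, the sketch below samples a fixed number of labeled examples per class under several random seeds, trains a classifier on each few-shot split, and reports the mean and standard deviation over seeds. It is a minimal illustration only: the dataset, classifier, shot count, and seeds are assumptions made for the example and do not reflect the official CLUES tasks, metrics, or the code in the linked repository.
```python
# Minimal sketch of a standardized few-shot evaluation loop: sample K labeled
# examples per class under several seeds, train, and report mean +/- std.
# The corpus and model below are illustrative stand-ins, not the CLUES setup.
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def sample_few_shot(texts, labels, k_per_class, rng):
    """Pick k labeled examples per class to form one few-shot training split."""
    idx = []
    for c in np.unique(labels):
        cls_idx = np.where(labels == c)[0]
        idx.extend(rng.choice(cls_idx, size=k_per_class, replace=False))
    idx = np.array(idx)
    return [texts[i] for i in idx], labels[idx]

def evaluate_few_shot(k_per_class=10, seeds=(13, 21, 42, 87, 100)):
    # Illustrative stand-in corpus; CLUES defines its own NLU tasks and splits.
    cats = ["sci.med", "sci.space"]
    train = fetch_20newsgroups(subset="train", categories=cats)
    test = fetch_20newsgroups(subset="test", categories=cats)
    scores = []
    for seed in seeds:  # multiple splits: few-shot results vary with the sampled examples
        rng = np.random.default_rng(seed)
        x_few, y_few = sample_few_shot(train.data, np.array(train.target), k_per_class, rng)
        vec = TfidfVectorizer().fit(x_few)  # features fit on the few-shot split only
        clf = LogisticRegression(max_iter=1000).fit(vec.transform(x_few), y_few)
        scores.append(accuracy_score(test.target, clf.predict(vec.transform(test.data))))
    # Report mean and spread across splits rather than a single number.
    print(f"{k_per_class}-shot accuracy: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")

if __name__ == "__main__":
    evaluate_few_shot()
```
The point of the protocol is reporting variance across several few-shot splits instead of a single split, since few-shot results are highly sensitive to which examples happen to be sampled.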
Related papers
- Adapting Vision-Language Models to Open Classes via Test-Time Prompt Tuning [50.26965628047682]
Adapting pre-trained models to open classes is a challenging problem in machine learning.
In this paper, we consider combining the advantages of both and come up with a test-time prompt tuning approach.
Our proposed method outperforms all comparison methods on average considering both base and new classes.
arXiv Detail & Related papers (2024-08-29T12:34:01Z)
- Revisiting Few-Shot Object Detection with Vision-Language Models [49.79495118650838]
We revisit the task of few-shot object detection (FSOD) in the context of recent foundational vision-language models (VLMs).
We propose Foundational FSOD, a new benchmark protocol that evaluates detectors pre-trained on any external data.
We discuss our recent CVPR 2024 Foundational FSOD competition and share insights from the community.
arXiv Detail & Related papers (2023-12-22T07:42:00Z)
- Robust Fine-Tuning of Vision-Language Models for Domain Generalization [6.7181844004432385]
Foundation models have impressive zero-shot inference capabilities and robustness under distribution shifts.
We present a new recipe for few-shot fine-tuning of the popular vision-language foundation model CLIP.
Our experimentation demonstrates that, while zero-shot CLIP fails to match performance of trained vision models on more complex benchmarks, few-shot CLIP fine-tuning outperforms its vision-only counterparts.
arXiv Detail & Related papers (2023-11-03T20:50:40Z)
- CoAnnotating: Uncertainty-Guided Work Allocation between Human and Large Language Models for Data Annotation [94.59630161324013]
We propose CoAnnotating, a novel paradigm for Human-LLM co-annotation of unstructured texts at scale.
Our empirical study shows that CoAnnotating is an effective means of allocating annotation work across different datasets, yielding up to a 21% performance improvement over a random baseline.
arXiv Detail & Related papers (2023-10-24T08:56:49Z)
- Scalable Performance Analysis for Vision-Language Models [26.45624201546282]
Joint vision-language models have shown great performance over a diverse set of tasks.
Our paper introduces a more scalable solution that relies on already annotated benchmarks.
We confirm previous findings that CLIP behaves like a bag of words model and performs better with nouns and verbs.
arXiv Detail & Related papers (2023-05-30T06:40:08Z)
- On the Compositional Generalization Gap of In-Context Learning [73.09193595292233]
We look at the gap between the in-distribution (ID) and out-of-distribution (OOD) performance of such models in semantic parsing tasks with in-context learning.
We evaluate four model families (OPT, BLOOM, CodeGen, and Codex) on three semantic parsing datasets.
arXiv Detail & Related papers (2022-11-15T19:56:37Z)
- Learning New Tasks from a Few Examples with Soft-Label Prototypes [18.363177410917597]
We propose a novel few-shot learning approach based on soft-label prototypes (SLPs).
We focus on learning previously unseen NLP tasks from very few examples (4, 8, 16) per class.
We experimentally demonstrate that our approach achieves superior performance on the majority of tested tasks in this data-lean setting.
arXiv Detail & Related papers (2022-10-31T16:06:48Z)
- Zero-Shot Learners for Natural Language Understanding via a Unified Multiple Choice Perspective [26.41585967095811]
Zero-shot learning aims to train a model on a given task such that it can address new learning tasks without any additional training.
Our approach converts zero-shot learning into multiple-choice tasks, avoiding problems in commonly used large-scale generative models such as FLAN.
Our approach shows state-of-the-art performance on several benchmarks and produces satisfactory results on tasks such as natural language inference and text classification.
arXiv Detail & Related papers (2022-10-16T17:24:06Z)
- Meta learning to classify intent and slot labels with noisy few shot examples [11.835266162072486]
Spoken language understanding (SLU) models are notorious for being data-hungry.
We propose a new SLU benchmarking task: few-shot robust SLU, where SLU comprises two core problems, intent classification (IC) and slot labeling (SL).
We show the model consistently outperforms the conventional fine-tuning baseline and another popular meta-learning method, Model-Agnostic Meta-Learning (MAML), in terms of achieving better IC accuracy and SL F1.
arXiv Detail & Related papers (2020-11-30T18:53:30Z)
- FewJoint: A Few-shot Learning Benchmark for Joint Language Understanding [55.38905499274026]
Few-shot learning is one of the key future steps in machine learning.
FewJoint is a novel Few-Shot Learning benchmark for NLP.
arXiv Detail & Related papers (2020-09-17T08:17:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.