FLEX: Unifying Evaluation for Few-Shot NLP
- URL: http://arxiv.org/abs/2107.07170v1
- Date: Thu, 15 Jul 2021 07:37:06 GMT
- Title: FLEX: Unifying Evaluation for Few-Shot NLP
- Authors: Jonathan Bragg, Arman Cohan, Kyle Lo, Iz Beltagy
- Abstract summary: We formulate desiderata for an ideal few-shot NLP benchmark.
We present FLEX, the first benchmark, public leaderboard, and framework.
We also present UniFew, a simple yet strong prompt-based model for few-shot learning.
- Score: 17.425495611344786
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Few-shot NLP research is highly active, yet conducted in disjoint research
threads with evaluation suites that lack challenging-yet-realistic testing
setups and fail to employ careful experimental design. Consequently, the
community does not know which techniques perform best or even if they
outperform simple baselines. We formulate desiderata for an ideal few-shot NLP
benchmark and present FLEX, the first benchmark, public leaderboard, and
framework that provides unified, comprehensive measurement for few-shot NLP
techniques. FLEX incorporates and introduces new best practices for few-shot
evaluation, including measurement of four transfer settings, textual labels for
zero-shot evaluation, and a principled approach to benchmark design that
optimizes statistical accuracy while keeping evaluation costs accessible to
researchers without large compute resources. In addition, we present UniFew, a
simple yet strong prompt-based model for few-shot learning which unifies the
pretraining and finetuning prompt formats, eschewing complex machinery of
recent prompt-based approaches in adapting downstream task formats to language
model pretraining objectives. We demonstrate that despite simplicity UniFew
achieves results competitive with both popular meta-learning and prompt-based
approaches.
Related papers
- Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph [85.51252685938564]
Uncertainty quantification (UQ) is becoming increasingly recognized as a critical component of applications that rely on machine learning (ML)
As with other ML models, large language models (LLMs) are prone to make incorrect predictions, hallucinate'' by fabricating claims, or simply generate low-quality output for a given input.
We introduce a novel benchmark that implements a collection of state-of-the-art UQ baselines, and provides an environment for controllable and consistent evaluation of novel techniques.
arXiv Detail & Related papers (2024-06-21T20:06:31Z) - DeCoOp: Robust Prompt Tuning with Out-of-Distribution Detection [52.100335904875614]
We present a novel prompt tuning approach, namely, Decomposed Context Optimization (DeCoOp), which introduces new-class detectors and sub-classifiers to further enhance the base-class and new-class discriminability.
Experimental results on 11 benchmark datasets validate the effectiveness of DePT and demonstrate that DeCoOp outperforms current state-of-the-art methods, providing a significant 2% average accuracy improvement.
arXiv Detail & Related papers (2024-06-01T07:46:42Z) - An Experimental Design Framework for Label-Efficient Supervised Finetuning of Large Language Models [55.01592097059969]
Supervised finetuning on instruction datasets has played a crucial role in achieving the remarkable zero-shot generalization capabilities.
Active learning is effective in identifying useful subsets of samples to annotate from an unlabeled pool.
We propose using experimental design to circumvent the computational bottlenecks of active learning.
arXiv Detail & Related papers (2024-01-12T16:56:54Z) - Large-scale Pre-trained Models are Surprisingly Strong in Incremental Novel Class Discovery [76.63807209414789]
We challenge the status quo in class-iNCD and propose a learning paradigm where class discovery occurs continuously and truly unsupervisedly.
We propose simple baselines, composed of a frozen PTM backbone and a learnable linear classifier, that are not only simple to implement but also resilient under longer learning scenarios.
arXiv Detail & Related papers (2023-03-28T13:47:16Z) - Robustness Gym: Unifying the NLP Evaluation Landscape [91.80175115162218]
Deep neural networks are often brittle when deployed in real-world systems.
Recent research has focused on testing the robustness of such models.
We propose a solution in the form of Robustness Gym, a simple and evaluation toolkit.
arXiv Detail & Related papers (2021-01-13T02:37:54Z) - Making Pre-trained Language Models Better Few-shot Learners [11.90626040104822]
Recent GPT-3 model achieves remarkable few-shot performance solely by leveraging a natural-language prompt and a few task demonstrations as input context.
Inspired by their findings, we study few-shot learning in a more practical scenario, where we use smaller language models for which fine-tuning is computationally efficient.
We present LM-BFF--better few-shot fine-tuning of language models--a suite of simple and complementary techniques for fine-tuning language models on a small number of annotated examples.
arXiv Detail & Related papers (2020-12-31T17:21:26Z) - SE3M: A Model for Software Effort Estimation Using Pre-trained Embedding
Models [0.8287206589886881]
This paper proposes to evaluate the effectiveness of pre-trained embeddings models.
Generic pre-trained models for both approaches went through a fine-tuning process.
Results were very promising, realizing that pre-trained models can be used to estimate software effort based only on requirements texts.
arXiv Detail & Related papers (2020-06-30T14:15:38Z) - Pre-training Is (Almost) All You Need: An Application to Commonsense
Reasoning [61.32992639292889]
Fine-tuning of pre-trained transformer models has become the standard approach for solving common NLP tasks.
We introduce a new scoring method that casts a plausibility ranking task in a full-text format.
We show that our method provides a much more stable training phase across random restarts.
arXiv Detail & Related papers (2020-04-29T10:54:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.