Related papers: FLEX: Unifying Evaluation for Few-Shot NLP

FLEX: Unifying Evaluation for Few-Shot NLP

URL: http://arxiv.org/abs/2107.07170v1
Date: Thu, 15 Jul 2021 07:37:06 GMT
Title: FLEX: Unifying Evaluation for Few-Shot NLP
Authors: Jonathan Bragg, Arman Cohan, Kyle Lo, Iz Beltagy
Abstract summary: We formulate desiderata for an ideal few-shot NLP benchmark. We present FLEX, the first benchmark, public leaderboard, and framework. We also present UniFew, a simple yet strong prompt-based model for few-shot learning.
Score: 17.425495611344786
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Few-shot NLP research is highly active, yet conducted in disjoint research threads with evaluation suites that lack challenging-yet-realistic testing setups and fail to employ careful experimental design. Consequently, the community does not know which techniques perform best or even if they outperform simple baselines. We formulate desiderata for an ideal few-shot NLP benchmark and present FLEX, the first benchmark, public leaderboard, and framework that provides unified, comprehensive measurement for few-shot NLP techniques. FLEX incorporates and introduces new best practices for few-shot evaluation, including measurement of four transfer settings, textual labels for zero-shot evaluation, and a principled approach to benchmark design that optimizes statistical accuracy while keeping evaluation costs accessible to researchers without large compute resources. In addition, we present UniFew, a simple yet strong prompt-based model for few-shot learning which unifies the pretraining and finetuning prompt formats, eschewing complex machinery of recent prompt-based approaches in adapting downstream task formats to language model pretraining objectives. We demonstrate that despite simplicity UniFew achieves results competitive with both popular meta-learning and prompt-based approaches.

Related papers

W-PCA Based Gradient-Free Proxy for Efficient Search of Lightweight Language Models [2.0033725235099986]
We introduce weight-weighted PCA (W-PCA), a novel zero-shot NAS method specifically tailored for lightweight language models. We conduct a comparative analysis on the GLUE and SQuAD datasets to evaluate our approach.
arXiv Detail & Related papers (2025-04-22T15:33:01Z)
A Sober Look at Progress in Language Model Reasoning: Pitfalls and Paths to Reproducibility [29.437125712259046]
Reasoning has emerged as the next major frontier for language models (LMs) We conduct a comprehensive empirical study and find that current mathematical reasoning benchmarks are highly sensitive to subtle implementation choices. We propose a standardized evaluation framework with clearly defined best practices and reporting standards.
arXiv Detail & Related papers (2025-04-09T17:58:17Z)
On Speeding Up Language Model Evaluation [48.51924035873411]
Development of prompt-based methods with Large Language Models (LLMs) requires making numerous decisions. We propose a novel method to address this challenge. We show that it can identify the top-performing method using only 5-15% of the typically needed resources.
arXiv Detail & Related papers (2024-07-08T17:48:42Z)
DeCoOp: Robust Prompt Tuning with Out-of-Distribution Detection [52.100335904875614]
We present a novel prompt tuning approach, namely, Decomposed Context Optimization (DeCoOp), which introduces new-class detectors and sub-classifiers to further enhance the base-class and new-class discriminability. Experimental results on 11 benchmark datasets validate the effectiveness of DePT and demonstrate that DeCoOp outperforms current state-of-the-art methods, providing a significant 2% average accuracy improvement.
arXiv Detail & Related papers (2024-06-01T07:46:42Z)
Unsupervised Pre-training with Language-Vision Prompts for Low-Data Instance Segmentation [105.23631749213729]
We propose a novel method for unsupervised pre-training in low-data regimes. Inspired by the recently successful prompting technique, we introduce a new method, Unsupervised Pre-training with Language-Vision Prompts. We show that our method can converge faster and perform better than CNN-based models in low-data regimes.
arXiv Detail & Related papers (2024-05-22T06:48:43Z)
An Experimental Design Framework for Label-Efficient Supervised Finetuning of Large Language Models [55.01592097059969]
Supervised finetuning on instruction datasets has played a crucial role in achieving the remarkable zero-shot generalization capabilities. Active learning is effective in identifying useful subsets of samples to annotate from an unlabeled pool. We propose using experimental design to circumvent the computational bottlenecks of active learning.
arXiv Detail & Related papers (2024-01-12T16:56:54Z)
Large-scale Pre-trained Models are Surprisingly Strong in Incremental Novel Class Discovery [76.63807209414789]
We challenge the status quo in class-iNCD and propose a learning paradigm where class discovery occurs continuously and truly unsupervisedly. We propose simple baselines, composed of a frozen PTM backbone and a learnable linear classifier, that are not only simple to implement but also resilient under longer learning scenarios.
arXiv Detail & Related papers (2023-03-28T13:47:16Z)
Learning New Tasks from a Few Examples with Soft-Label Prototypes [18.363177410917597]
We propose a novel few-shot learning approach based on soft-label prototypes (SLPs) We focus on learning previously unseen NLP tasks from very few examples (4, 8, 16) per class. We experimentally demonstrate that our approach achieves superior performance on the majority of tested tasks in this data-lean setting.
arXiv Detail & Related papers (2022-10-31T16:06:48Z)
Robustness Gym: Unifying the NLP Evaluation Landscape [91.80175115162218]
Deep neural networks are often brittle when deployed in real-world systems. Recent research has focused on testing the robustness of such models. We propose a solution in the form of Robustness Gym, a simple and evaluation toolkit.
arXiv Detail & Related papers (2021-01-13T02:37:54Z)
Making Pre-trained Language Models Better Few-shot Learners [11.90626040104822]
Recent GPT-3 model achieves remarkable few-shot performance solely by leveraging a natural-language prompt and a few task demonstrations as input context. Inspired by their findings, we study few-shot learning in a more practical scenario, where we use smaller language models for which fine-tuning is computationally efficient. We present LM-BFF--better few-shot fine-tuning of language models--a suite of simple and complementary techniques for fine-tuning language models on a small number of annotated examples.
arXiv Detail & Related papers (2020-12-31T17:21:26Z)
SE3M: A Model for Software Effort Estimation Using Pre-trained Embedding Models [0.8287206589886881]
This paper proposes to evaluate the effectiveness of pre-trained embeddings models. Generic pre-trained models for both approaches went through a fine-tuning process. Results were very promising, realizing that pre-trained models can be used to estimate software effort based only on requirements texts.
arXiv Detail & Related papers (2020-06-30T14:15:38Z)
Pre-training Is (Almost) All You Need: An Application to Commonsense Reasoning [61.32992639292889]
Fine-tuning of pre-trained transformer models has become the standard approach for solving common NLP tasks. We introduce a new scoring method that casts a plausibility ranking task in a full-text format. We show that our method provides a much more stable training phase across random restarts.
arXiv Detail & Related papers (2020-04-29T10:54:40Z)

This list is automatically generated from the titles and abstracts of the papers in this site.