Mind Your Format: Towards Consistent Evaluation of In-Context Learning Improvements
- URL: http://arxiv.org/abs/2401.06766v3
- Date: Thu, 6 Jun 2024 19:01:37 GMT
- Title: Mind Your Format: Towards Consistent Evaluation of In-Context Learning Improvements
- Authors: Anton Voronov, Lena Wolf, Max Ryabinin
- Abstract summary: Large language models demonstrate a remarkable capability for learning to solve new tasks from a few examples.
The prompt template, or the way the input examples are formatted to obtain the prompt, is an important yet often overlooked aspect of in-context learning.
We show that a poor choice of the template can reduce the performance of the strongest models and inference methods to a random guess level.
- Score: 10.687101698324897
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models demonstrate a remarkable capability for learning to solve new tasks from a few examples. The prompt template, or the way the input examples are formatted to obtain the prompt, is an important yet often overlooked aspect of in-context learning. In this work, we conduct a comprehensive study of the template format's influence on in-context learning performance. We evaluate the impact of the prompt template across 21 models (from 770M to 70B parameters) and 4 standard classification datasets. We show that a poor choice of the template can reduce the performance of the strongest models and inference methods to a random-guess level. More importantly, the best templates do not transfer between different setups, or even between models of the same family. Our findings show that the currently prevalent approach to evaluation, which ignores template selection, may give misleading results, since different works use different templates. As a first step towards mitigating this issue, we propose Template Ensembles, which aggregate model predictions across several templates. This simple test-time augmentation boosts average performance while remaining robust to the particular random set of templates chosen.
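To make the proposed method concrete, here is a minimal sketch of test-time template ensembling. It assumes a hypothetical `classify(template, example)` helper that returns per-class probabilities from the underlying model; the sampling and averaging details are illustrative, not necessarily the paper's exact procedure.

```python
import random
import numpy as np

def template_ensemble_predict(example, templates, classify, n_templates=4, seed=0):
    """Aggregate predictions over a random subset of prompt templates.

    `classify(template, example)` is an assumed helper returning a NumPy
    array of per-class probabilities from the underlying LLM.
    """
    rng = random.Random(seed)
    chosen = rng.sample(templates, k=min(n_templates, len(templates)))
    # Test-time augmentation: average the predictive distributions.
    probs = np.mean([classify(t, example) for t in chosen], axis=0)
    return int(np.argmax(probs))
```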
Related papers
- Detection and Measurement of Syntactic Templates in Generated Text [58.111650675717414]
We offer an analysis of syntactic features to characterize general repetition in model-generated text.
We find that models tend to produce templated text in downstream tasks at a higher rate than what is found in human-reference texts.
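One simple way to quantify such templated text, sketched under the assumption that generations are already POS-tagged; the n-gram length and threshold below are illustrative choices, not the paper's exact measure:

```python
from collections import Counter

def template_rate(pos_tags, n=4, min_count=2):
    """Fraction of POS n-grams that recur within a text: a rough
    proxy for how 'templated' a generation is.
    """
    ngrams = [tuple(pos_tags[i:i + n]) for i in range(len(pos_tags) - n + 1)]
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c >= min_count)
    return repeated / max(len(ngrams), 1)
```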
arXiv Detail & Related papers (2024-06-28T19:34:23Z)
- BvSP: Broad-view Soft Prompting for Few-Shot Aspect Sentiment Quad Prediction [10.313467662221319]
Aspect sentiment quad prediction (ASQP) aims to predict four aspect-based elements, including aspect term, opinion term, aspect category, and sentiment polarity.
This work formulates ASQP in the few-shot scenario, aiming for fast adaptation in real applications.
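For concreteness, the four predicted elements form a quad like the following; the field values are an invented example, not drawn from the paper:

```python
from dataclasses import dataclass

@dataclass
class SentimentQuad:
    """One ASQP target: the four aspect-based sentiment elements."""
    aspect_term: str         # e.g. "battery life"
    opinion_term: str        # e.g. "amazing"
    aspect_category: str     # e.g. "LAPTOP#BATTERY"
    sentiment_polarity: str  # "positive" / "negative" / "neutral"

# "The battery life is amazing." ->
quad = SentimentQuad("battery life", "amazing", "LAPTOP#BATTERY", "positive")
```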
arXiv Detail & Related papers (2024-06-11T15:32:32Z)
- Language Models for Text Classification: Is In-Context Learning Enough? [54.869097980761595]
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings.
An advantage of these models over more standard approaches is their ability to understand instructions written in natural language (prompts).
This makes them suitable for addressing text classification problems for domains with limited amounts of annotated instances.
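A minimal sketch of how such a prompt is typically assembled for classification; the instruction wording and the `Text:`/`Label:` layout are common illustrative conventions, not this paper's specific setup:

```python
def build_prompt(instruction, demos, query):
    """Assemble a zero-/few-shot classification prompt.

    `demos` is a list of (text, label) pairs; with an empty list this
    degenerates to the zero-shot, instruction-only setting.
    """
    lines = [instruction]
    for text, label in demos:
        lines.append(f"Text: {text}\nLabel: {label}")
    lines.append(f"Text: {query}\nLabel:")
    return "\n\n".join(lines)

prompt = build_prompt(
    "Classify the sentiment of each text as positive or negative.",
    [("Great movie!", "positive"), ("Waste of time.", "negative")],
    "I enjoyed every minute.",
)
```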
arXiv Detail & Related papers (2024-03-26T12:47:39Z)
- Label-Efficient Model Selection for Text Generation [14.61636207880449]
We introduce DiffUse, a method to make an informed decision between candidate text generation models based on preference annotations.
In a series of experiments over hundreds of model pairs, we demonstrate that DiffUse can dramatically reduce the required number of annotations.
arXiv Detail & Related papers (2024-02-12T18:54:02Z)
- Comparing Template-based and Template-free Language Model Probing [0.0]
We evaluate 16 different cloze-task language models (LMs) on 10 English probing datasets.
We find that template-free and template-based approaches often rank models differently, except for the top domain-specific models.
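Roughly, the two probing styles being compared look like this (invented examples, not taken from the paper's datasets):

```python
# Template-based probing: a hand-written cloze template wraps the fact.
template_based_probe = "The capital of France is [MASK]."

# Template-free probing: the fact is queried as it occurs in natural text,
# masking the target span inside a corpus-derived sentence instead.
template_free_probe = "[MASK], the capital of France, lies on the Seine."
```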
arXiv Detail & Related papers (2024-01-31T19:07:37Z)
- MixPro: Simple yet Effective Data Augmentation for Prompt-based Learning [53.185180119904174]
We introduce MixPro, an augmentation method designed to augment both the vanilla input text and the templates.
Experiments show that MixPro outperforms other augmentation baselines, improving model performance by an average of 5.08%.
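As background, MixPro builds on the mixup idea; a generic mixup interpolation is sketched below, assuming examples are represented as embedding vectors. The paper's actual method augments both the input text and the template and differs from this toy version.

```python
import numpy as np

def mixup(emb_a, emb_b, alpha=0.5):
    """Generic mixup: interpolate between two example embeddings.

    A hedged sketch of the mixup idea MixPro builds on; the mixing
    coefficient is drawn from a Beta distribution, as in standard mixup.
    """
    lam = np.random.beta(alpha, alpha)
    return lam * np.asarray(emb_a) + (1.0 - lam) * np.asarray(emb_b)
```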
arXiv Detail & Related papers (2023-04-19T03:38:25Z)
- Model ensemble instead of prompt fusion: a sample-specific knowledge transfer method for few-shot prompt tuning [85.55727213502402]
We focus on improving the few-shot performance of prompt tuning by transferring knowledge from soft prompts of source tasks.
We propose Sample-specific Ensemble of Source Models (SESoM).
SESoM learns to adjust the contribution of each source model for each target sample separately when ensembling source model outputs.
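The core ensembling step can be sketched as follows, assuming the per-sample weights come from a small learned network conditioned on the target sample; that network is what SESoM trains, and the softmax-weighted combination here is an illustrative reading of the description above:

```python
import numpy as np

def sample_specific_ensemble(source_logits, sample_weights):
    """Sample-specific weighted ensemble over source-model outputs.

    `source_logits`: (n_sources, n_classes) outputs for one target sample.
    `sample_weights`: (n_sources,) scores from an assumed learned network
    conditioned on that sample.
    """
    logits = np.asarray(source_logits, dtype=float)
    w = np.asarray(sample_weights, dtype=float)
    w = np.exp(w - w.max())
    w /= w.sum()  # softmax: per-sample contribution of each source model
    return (w[:, None] * logits).sum(axis=0)
```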
arXiv Detail & Related papers (2022-10-23T01:33:16Z)
- An Information-theoretic Approach to Prompt Engineering Without Ground Truth Labels [55.06990011183662]
We introduce a new method for selecting prompt templates *without* labeled examples and *without* direct access to the model.
Across 8 datasets representing 7 distinct NLP tasks, we show that when a template has high mutual information, it also has high accuracy on the task.
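A sketch of the mutual-information criterion, assuming that for each candidate template we can read off the model's class distribution P(Y|x) for a batch of unlabeled inputs; function and variable names here are illustrative:

```python
import numpy as np

def mutual_information(cond_probs):
    """Estimate I(X; Y) for one template from per-input class distributions.

    `cond_probs` has shape (n_inputs, n_classes): row i is the model's
    P(Y | x_i) under the template. I(X;Y) = H(mean P(Y|X)) - mean H(P(Y|X)).
    """
    eps = 1e-12
    marginal = cond_probs.mean(axis=0)
    h_marginal = -(marginal * np.log(marginal + eps)).sum()
    h_conditional = -(cond_probs * np.log(cond_probs + eps)).sum(axis=1).mean()
    return h_marginal - h_conditional

def select_template(templates, probs_by_template):
    # Pick the template whose induced input-output mutual information is highest.
    scores = [mutual_information(np.asarray(p)) for p in probs_by_template]
    return templates[int(np.argmax(scores))]
```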
arXiv Detail & Related papers (2022-03-21T21:51:43Z)
- Template-free Prompt Tuning for Few-shot NER [46.59447116255979]
We propose a more elegant method to reformulate NER tasks as LM problems without any templates.
Specifically, we discard the template construction process while maintaining the word prediction paradigm of pre-training models.
Experimental results demonstrate the effectiveness of the proposed method over bert-tagger and template-based methods under the few-shot setting.
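A rough sketch of the template-free idea: instead of filling a cloze template, the LM predicts class-indicative label words directly at token positions. The label-word lists and decoding rule below are hypothetical illustrations, not the paper's learned label words.

```python
# Hypothetical label-word sets mapping LM vocabulary to entity classes.
LABEL_WORDS = {"PER": {"John", "Mary"}, "LOC": {"Paris", "London"}}

def decode_tags(token_topk_predictions):
    """Map each token's top-k LM predictions to an entity tag.

    A token whose predicted words hit a class's label-word set gets that
    class; everything else is tagged "O" (outside any entity).
    """
    tags = []
    for preds in token_topk_predictions:
        tag = "O"
        for label, words in LABEL_WORDS.items():
            if any(p in words for p in preds):
                tag = label
                break
        tags.append(tag)
    return tags
```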
arXiv Detail & Related papers (2021-09-28T07:19:24Z)