The ICL Consistency Test
- URL: http://arxiv.org/abs/2312.04945v1
- Date: Fri, 8 Dec 2023 10:22:43 GMT
- Title: The ICL Consistency Test
- Authors: Lucas Weber, Elia Bruni, Dieuwke Hupkes
- Abstract summary: Large language models (LLMs) that are adapted to tasks via prompt-based methods like in-context-learning (ICL) perform well in some setups but not in others.
This lack of consistency in prompt-based learning hints at a lack of robust generalisation.
We here introduce the ICL consistency test -- a contribution to the GenBench collaborative benchmark task (CBT).
- Score: 14.569770617709073
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Just like the previous generation of task-tuned models, large language models
(LLMs) that are adapted to tasks via prompt-based methods like
in-context-learning (ICL) perform well in some setups but not in others. This
lack of consistency in prompt-based learning hints at a lack of robust
generalisation. We here introduce the ICL consistency test -- a contribution to
the GenBench collaborative benchmark task (CBT) -- which evaluates how
consistently a model makes predictions across many different setups while using
the same data. The test is based on different established natural language
inference tasks. We provide preprocessed data constituting 96 different
'setups' and a metric that estimates model consistency across these setups. The
metric is provided on a fine-grained level to understand what properties of a
setup render predictions unstable and on an aggregated level to compare overall
model consistency. We conduct an empirical analysis of eight state-of-the-art
models, and our consistency metric reveals how all tested LLMs lack robust
generalisation.
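
The abstract does not say how the consistency metric is computed. As an illustration only, the sketch below assumes that agreement between two setups is measured with Cohen's kappa over their predictions on the same data, and that an aggregate score is the mean kappa over all setup pairs. The function names and setup labels are hypothetical and are not taken from the paper or from GenBench.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa between two equally long sequences of predicted labels."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(a)
    # Fraction of items on which the two setups give the same prediction.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance, from each setup's label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both setups always predict the same single label
    return (observed - expected) / (1.0 - expected)


def mean_pairwise_kappa(predictions_by_setup):
    """Average kappa over all pairs of setups (assumed aggregate score)."""
    pairs = list(combinations(sorted(predictions_by_setup), 2))
    if not pairs:
        raise ValueError("need at least two setups")
    total = sum(
        cohen_kappa(predictions_by_setup[s], predictions_by_setup[t])
        for s, t in pairs
    )
    return total / len(pairs)


# Hypothetical predictions of one model on the same four NLI items
# under three different prompt setups.
predictions = {
    "setup_a": ["e", "e", "n", "n"],
    "setup_b": ["e", "e", "n", "n"],
    "setup_c": ["e", "n", "e", "n"],
}
score = mean_pairwise_kappa(predictions)
```

Under this assumption a score near 1 means the model predicts the same labels regardless of setup, while a score near 0 means agreement between setups is no better than chance.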
Related papers
- Context is Key: A Benchmark for Forecasting with Essential Textual Information [87.3175915185287]
"Context is Key" (CiK) is a time series forecasting benchmark that pairs numerical data with diverse types of carefully crafted textual context.
We evaluate a range of approaches, including statistical models, time series foundation models, and LLM-based forecasters.
Our experiments highlight the importance of incorporating contextual information, demonstrate surprising performance when using LLM-based forecasting models, and also reveal some of their critical shortcomings.
arXiv Detail & Related papers (2024-10-24T17:56:08Z)
- cDP-MIL: Robust Multiple Instance Learning via Cascaded Dirichlet Process [23.266122629592807]
Multiple instance learning (MIL) has been extensively applied to whole slide histopathological image (WSI) analysis.
The existing aggregation strategy in MIL, which primarily relies on the first-order distance between instances, fails to accurately approximate the true feature distribution of each instance.
We propose a new Bayesian nonparametric framework for multiple instance learning, which adopts a cascade of Dirichlet processes (cDP) to incorporate the instance-to-bag characteristic of the WSIs.
arXiv Detail & Related papers (2024-07-16T07:28:39Z)
- Uncertainty Aware Learning for Language Model Alignment [97.36361196793929]
We propose uncertainty-aware learning (UAL) to improve the model alignment of different task scenarios.
We implement UAL in a simple fashion -- adaptively setting the label smoothing value of training according to the uncertainty of individual samples.
Experiments on widely used benchmarks demonstrate that our UAL significantly and consistently outperforms standard supervised fine-tuning.
arXiv Detail & Related papers (2024-06-07T11:37:45Z)
- Diversity-Aware Ensembling of Language Models Based on Topological Data Analysis [3.1734682813501514]
Existing approaches mostly rely on simple averaging of predictions by ensembles with equal weights for each model.
We propose to estimate weights for ensembles of NLP models using not only knowledge of their individual performance but also their similarity to each other.
arXiv Detail & Related papers (2024-02-22T00:04:21Z)
- Revisiting Demonstration Selection Strategies in In-Context Learning [66.11652803887284]
Large language models (LLMs) have shown an impressive ability to perform a wide range of tasks using in-context learning (ICL).
In this work, we first revisit the factors contributing to this variance from both data and model aspects, and find that the choice of demonstration is both data- and model-dependent.
We propose a data- and model-dependent demonstration selection method, TopK + ConE, based on the assumption that the performance of a demonstration positively correlates with its contribution to the model's understanding of the test samples.
arXiv Detail & Related papers (2024-01-22T16:25:27Z)
- Mind the instructions: a holistic evaluation of consistency and interactions in prompt-based learning [14.569770617709073]
We present a detailed analysis of which design choices cause instabilities and inconsistencies in task predictions.
We show how spurious correlations between input distributions and labels form only a minor problem for prompted models.
We statistically analyse the results to show which factors are the most influential, interactive or stable.
arXiv Detail & Related papers (2023-10-20T13:25:24Z)
- On the Compositional Generalization Gap of In-Context Learning [73.09193595292233]
We look at the gap between the in-distribution (ID) and out-of-distribution (OOD) performance of such models in semantic parsing tasks with in-context learning.
We evaluate four model families, OPT, BLOOM, CodeGen and Codex on three semantic parsing datasets.
arXiv Detail & Related papers (2022-11-15T19:56:37Z)
- Model ensemble instead of prompt fusion: a sample-specific knowledge transfer method for few-shot prompt tuning [85.55727213502402]
We focus on improving the few-shot performance of prompt tuning by transferring knowledge from soft prompts of source tasks.
We propose Sample-specific Ensemble of Source Models (SESoM).
SESoM learns to adjust the contribution of each source model for each target sample separately when ensembling source model outputs.
arXiv Detail & Related papers (2022-10-23T01:33:16Z)
- Meta-Learned Confidence for Few-shot Learning [60.6086305523402]
A popular transductive inference technique for few-shot metric-based approaches is to update the prototype of each class with the mean of the most confident query examples.
We propose to meta-learn the confidence for each query sample, to assign optimal weights to unlabeled queries.
We validate our few-shot learning model with meta-learned confidence on four benchmark datasets.
arXiv Detail & Related papers (2020-02-27T10:22:17Z)