Estimating Large Language Model Capabilities without Labeled Test Data
- URL: http://arxiv.org/abs/2305.14802v2
- Date: Thu, 26 Oct 2023 06:09:05 GMT
- Title: Estimating Large Language Model Capabilities without Labeled Test Data
- Authors: Harvey Yiyun Fu, Qinyuan Ye, Albert Xu, Xiang Ren, Robin Jia
- Abstract summary: Large Language Models (LLMs) have the impressive ability to perform in-context learning (ICL) from only a few examples.
We propose the task of ICL accuracy estimation, in which we predict the accuracy of an LLM when doing in-context learning on a new task.
- Score: 51.428562302037534
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have the impressive ability to perform
in-context learning (ICL) from only a few examples, but the success of ICL
varies widely from task to task. Thus, it is important to quickly determine
whether ICL is applicable to a new task, but directly evaluating ICL accuracy
can be expensive in situations where test data is expensive to annotate -- the
exact situations where ICL is most appealing. In this paper, we propose the
task of ICL accuracy estimation, in which we predict the accuracy of an LLM
when doing in-context learning on a new task given only unlabeled test data for
that task. To perform ICL accuracy estimation, we propose a method that trains
a meta-model using LLM confidence scores as features. We compare our method to
several strong accuracy estimation baselines on a new benchmark that covers 4
LLMs and 3 task collections. The meta-model improves over all baselines across
8 out of 12 settings and achieves the same estimation performance as directly
evaluating on 40 collected labeled test examples per task. At the same time, no
existing approach provides an accurate and reliable ICL accuracy estimation in
every setting, highlighting the need for better ways to measure the uncertainty
of LLM predictions.
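A minimal sketch of the meta-model idea from the abstract: summarize each task by statistics of the LLM's confidence scores on its unlabeled test examples, then fit a regressor on tasks whose ICL accuracy is already known. The quantile features and the gradient-boosting regressor are illustrative assumptions, not necessarily the paper's exact configuration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def task_features(confidences, n_bins=10):
    """Summarize a task by the distribution of its per-example LLM confidence
    scores (e.g., probability of the predicted label); quantiles are an
    illustrative feature choice."""
    return np.quantile(confidences, np.linspace(0.0, 1.0, n_bins + 1))

# Training tasks: per-example confidences plus the ICL accuracy observed on them.
# In practice both come from running the LLM; random data keeps the sketch runnable.
rng = np.random.default_rng(0)
train_conf = [rng.beta(2, 1, size=200) for _ in range(50)]
train_acc = np.clip([c.mean() + rng.normal(0, 0.03) for c in train_conf], 0, 1)

X = np.stack([task_features(c) for c in train_conf])
meta_model = GradientBoostingRegressor().fit(X, train_acc)

# Estimate ICL accuracy for a new task from its unlabeled test data alone.
new_task_conf = rng.beta(3, 2, size=200)
print("estimated accuracy:", meta_model.predict(task_features(new_task_conf)[None, :])[0])
```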
Related papers
- BenTo: Benchmark Task Reduction with In-Context Transferability [32.561978389905434]
This paper investigates how to efficiently reduce the tasks used to benchmark large language models (LLMs).
We propose a practically efficient metric for estimating the transferability between two tasks via in-context learning (ICL).
arXiv Detail & Related papers (2024-10-17T17:41:15Z)
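One plausible reading of the ICL-based transferability metric, sketched under the assumption that transferability is measured as target-task accuracy when the prompt contains only source-task demonstrations; `icl_accuracy` is a stub standing in for real LLM evaluation, and the paper's exact metric may differ.

```python
import itertools
import random

def icl_accuracy(demos, eval_examples):
    """Stub: prompt the LLM with `demos` as in-context examples and score it on
    `eval_examples`. Replace with real LLM calls; a random score keeps the
    sketch self-contained."""
    return random.random()

def transferability(source_task, target_task, k=4, n_eval=50):
    """Hypothetical pairwise score: how well the target task is solved when the
    prompt contains only source-task demonstrations."""
    demos = random.sample(source_task, k)
    evals = random.sample(target_task, min(n_eval, len(target_task)))
    return icl_accuracy(demos, evals)

tasks = {name: [(f"{name} q{i}", "a") for i in range(100)]
         for name in ("task_a", "task_b", "task_c")}
scores = {(s, t): transferability(tasks[s], tasks[t])
          for s, t in itertools.permutations(tasks, 2)}
# Tasks that other tasks already transfer to well are candidates for removal
# when shrinking the benchmark.
print(scores)
```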
- Training on the Benchmark Is Not All You Need [52.01920740114261]
We propose a simple and effective data leakage detection method based on the contents of multiple-choice options.
Our method is able to work under black-box conditions without access to model training data or weights.
We evaluate the degree of data leakage of 31 mainstream open-source LLMs on four benchmark datasets.
arXiv Detail & Related papers (2024-09-03T11:09:44Z)
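The summary only states that the detector uses the contents of multiple-choice options under black-box access. One way such a check can be built, assumed here for illustration rather than taken from the paper, is to test whether the model prefers the original option ordering over shuffled orderings far more often than chance, a signature of having memorized the benchmark.

```python
import random
from itertools import permutations

def sequence_logprob(text):
    """Stub for a black-box log-probability query to the LLM; replace with a
    real scoring call. A random value keeps the sketch runnable."""
    return random.gauss(-50.0, 5.0)

def prefers_original_order(question, options):
    """Hypothetical check: does the original option ordering get the highest
    score among all orderings? Chance level is 1 / number of orderings."""
    scores = {order: sequence_logprob(question + " " + " / ".join(order))
              for order in permutations(options)}
    return max(scores, key=scores.get) == tuple(options)

items = [(f"Question {i}?", ["opt A", "opt B", "opt C", "opt D"]) for i in range(200)]
rate = sum(prefers_original_order(q, opts) for q, opts in items) / len(items)
print(f"original-order preference rate: {rate:.3f} (chance ~ {1/24:.3f})")
```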
- Uncertainty Aware Learning for Language Model Alignment [97.36361196793929]
We propose uncertainty-aware learning (UAL) to improve model alignment across different task scenarios.
We implement UAL in a simple fashion -- adaptively setting the label smoothing value of training according to the uncertainty of individual samples.
Experiments on widely used benchmarks demonstrate that our UAL significantly and consistently outperforms standard supervised fine-tuning.
arXiv Detail & Related papers (2024-06-07T11:37:45Z)
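The mechanism in the summary is concrete enough to sketch: the label-smoothing value applied to each training sample is set from that sample's uncertainty. A minimal PyTorch version follows; how the uncertainty itself is estimated is left as an input, since the summary does not specify it, and the linear uncertainty-to-smoothing mapping is an assumption.

```python
import torch
import torch.nn.functional as F

def ual_loss(logits, targets, uncertainty, max_smoothing=0.3):
    """Cross-entropy with a per-sample label-smoothing value that grows with
    that sample's uncertainty (assumed to lie in [0, 1]); the linear mapping
    from uncertainty to smoothing is an illustrative choice."""
    n_classes = logits.size(-1)
    eps = (uncertainty * max_smoothing).unsqueeze(-1)           # shape (batch, 1)
    one_hot = F.one_hot(targets, n_classes).float()
    soft_targets = one_hot * (1 - eps) + eps / n_classes        # rows still sum to 1
    return -(soft_targets * F.log_softmax(logits, dim=-1)).sum(-1).mean()

logits = torch.randn(8, 5, requires_grad=True)
targets = torch.randint(0, 5, (8,))
uncertainty = torch.rand(8)   # e.g., from sample-level disagreement or entropy
loss = ual_loss(logits, targets, uncertainty)
loss.backward()
print(loss.item())
```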
- Feature-Adaptive and Data-Scalable In-Context Learning [36.01997148676005]
FADS-ICL is a feature-adaptive and data-scalable in-context learning framework.
It leverages task-adaptive features to improve inference on the downstream task.
FADS-ICL consistently outperforms previous state-of-the-art methods.
arXiv Detail & Related papers (2024-05-17T12:32:53Z)
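The summary gives few details, so the sketch below only illustrates one way a data-scalable ICL framework can work: extract feature vectors from the LLM for each downstream example and fit a lightweight task-adaptive classifier on them, so labeled data beyond the context window still helps. The feature extractor is a stub, and this should not be read as FADS-ICL's actual architecture.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def llm_features(texts, dim=768):
    """Stub for extracting per-example feature vectors from the LLM (e.g., the
    hidden state of the final token). Random vectors keep the sketch runnable."""
    rng = np.random.default_rng(len(texts))
    return rng.normal(size=(len(texts), dim))

# Labeled downstream data that would not fit into a single prompt.
train_texts = [f"example {i}" for i in range(256)]
train_labels = np.random.randint(0, 3, size=256)

# Fit a lightweight, task-adaptive module on top of frozen LLM features.
clf = LogisticRegression(max_iter=1000).fit(llm_features(train_texts), train_labels)
print(clf.predict(llm_features(["a new test example"])))
```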
- Self-Evaluation Improves Selective Generation in Large Language Models [54.003992911447696]
We reformulate open-ended generation tasks into token-level prediction tasks.
We instruct an LLM to self-evaluate its answers.
We benchmark a range of scoring methods based on self-evaluation.
arXiv Detail & Related papers (2023-12-14T19:09:22Z)
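A common token-level instantiation of self-evaluation, assumed here for illustration (the paper benchmarks several scoring variants): append a yes/no verification question to the model's own answer, read off the probability of "Yes", and abstain when it is low.

```python
def yes_probability(prompt):
    """Stub: the LLM's next-token probability of 'Yes' for this prompt.
    Replace with a real log-prob query; a fixed value keeps the sketch runnable."""
    return 0.62

def selective_answer(question, proposed_answer, threshold=0.7):
    """Token-level self-evaluation: ask the model whether its own answer is
    correct and abstain when the 'Yes' probability falls below the threshold."""
    prompt = (
        f"Question: {question}\n"
        f"Proposed answer: {proposed_answer}\n"
        "Is the proposed answer correct? Answer Yes or No:"
    )
    confidence = yes_probability(prompt)
    return (proposed_answer if confidence >= threshold else "[abstain]"), confidence

print(selective_answer("What is the capital of Australia?", "Sydney"))
```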
- Which Examples to Annotate for In-Context Learning? Towards Effective and Efficient Selection [35.924633625147365]
Large Language Models (LLMs) can adapt to new tasks via in-context learning (ICL).
In this work, we investigate an active learning approach for ICL, where there is a limited budget for annotating examples.
We propose a model-adaptive optimization-free algorithm, termed AdaICL, which identifies examples that the model is uncertain about.
arXiv Detail & Related papers (2023-10-30T22:03:55Z)
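The sketch below covers only the uncertainty-driven selection step described in the summary, scoring unlabeled examples by the entropy of the model's predicted label distribution and spending the annotation budget on the highest-entropy ones; AdaICL's model-adaptive details are not reproduced here.

```python
import numpy as np

def entropy(probs):
    """Shannon entropy of a predicted label distribution."""
    p = np.clip(probs, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

def select_for_annotation(pred_dists, budget):
    """Pick the `budget` unlabeled examples the model is most uncertain about.
    `pred_dists` are the LLM's label distributions for each unlabeled example."""
    scores = [entropy(p) for p in pred_dists]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:budget]

# In practice these distributions come from the LLM; random ones keep this runnable.
rng = np.random.default_rng(1)
pred_dists = rng.dirichlet(np.ones(4), size=100)
print("annotate examples:", select_for_annotation(pred_dists, budget=10))
```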
- Rapid Adaptation in Online Continual Learning: Are We Evaluating It Right? [135.71855998537347]
We revisit the common practice of evaluating adaptation of Online Continual Learning (OCL) algorithms through the metric of online accuracy.
We show that this metric is unreliable, as even vacuous blind classifiers can achieve unrealistically high online accuracy.
Existing OCL algorithms can also achieve high online accuracy, but perform poorly in retaining useful information.
arXiv Detail & Related papers (2023-05-16T08:29:33Z)
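The failure mode described above is easy to reproduce: on a label-correlated stream, a "blind" predictor that simply repeats the most recently revealed label reaches very high online accuracy while learning nothing about the inputs. A self-contained sketch:

```python
import random

# A temporally correlated stream: the label rarely changes between steps.
random.seed(0)
stream, label = [], 0
for _ in range(10_000):
    if random.random() < 0.02:          # the label switches only 2% of the time
        label = random.randrange(10)
    stream.append(label)

# "Blind" predictor: ignore the input entirely and repeat the previous label.
correct, prev = 0, None
for y in stream:
    correct += (prev == y)
    prev = y                            # the online update uses only the revealed label
print("online accuracy of the blind predictor:", correct / len(stream))
```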
- Data Curation Alone Can Stabilize In-context Learning [20.874674130060388]
In-context learning (ICL) enables large language models to perform new tasks by prompting them with a sequence of training examples.
Randomly sampling examples from a training set leads to high variance in performance.
We show that carefully curating a subset of training data greatly stabilizes ICL performance without any other changes to the ICL algorithm.
arXiv Detail & Related papers (2022-12-20T15:58:54Z)
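The instability in question is the variance induced by which demonstrations get sampled. A small sketch of how that variance is typically measured, with a stub in place of real LLM evaluation; a curated subset is chosen once and reused, so this sampling variance disappears.

```python
import random
import statistics

def icl_accuracy(demonstrations):
    """Stub: evaluate the LLM prompted with these demonstrations. Replace with
    real runs; here the score depends (noisily) on which subset was drawn."""
    rng = random.Random(hash(demonstrations))
    return 0.55 + 0.30 * rng.random()

train_pool = list(range(500))
runs = [icl_accuracy(tuple(sorted(random.sample(train_pool, 8)))) for _ in range(20)]
print(f"random demo sampling: mean={statistics.mean(runs):.3f} stdev={statistics.stdev(runs):.3f}")
# Curating one fixed subset and reusing it removes this source of variance.
```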
- Meta-Generating Deep Attentive Metric for Few-shot Classification [53.07108067253006]
We present a novel deep metric meta-generation method to generate a specific metric for a new few-shot learning task.
In this study, we structure the metric using a three-layer deep attentive network that is flexible enough to produce a discriminative metric for each task.
We achieve clear performance improvements over state-of-the-art competitors, especially in the challenging cases.
arXiv Detail & Related papers (2020-12-03T02:07:43Z)