Estimating Large Language Model Capabilities without Labeled Test Data
- URL: http://arxiv.org/abs/2305.14802v2
- Date: Thu, 26 Oct 2023 06:09:05 GMT
- Title: Estimating Large Language Model Capabilities without Labeled Test Data
- Authors: Harvey Yiyun Fu, Qinyuan Ye, Albert Xu, Xiang Ren, Robin Jia
- Abstract summary: Large Language Models (LLMs) have the impressive ability to perform in-context learning (ICL) from only a few examples.
We propose the task of ICL accuracy estimation, in which we predict the accuracy of an LLM when doing in-context learning on a new task.
- Score: 51.428562302037534
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have the impressive ability to perform
in-context learning (ICL) from only a few examples, but the success of ICL
varies widely from task to task. Thus, it is important to quickly determine
whether ICL is applicable to a new task, but directly evaluating ICL accuracy
can be expensive in situations where test data is expensive to annotate -- the
exact situations where ICL is most appealing. In this paper, we propose the
task of ICL accuracy estimation, in which we predict the accuracy of an LLM
when doing in-context learning on a new task given only unlabeled test data for
that task. To perform ICL accuracy estimation, we propose a method that trains
a meta-model using LLM confidence scores as features. We compare our method to
several strong accuracy estimation baselines on a new benchmark that covers 4
LLMs and 3 task collections. The meta-model improves over all baselines across
8 out of 12 settings and achieves the same estimation performance as directly
evaluating on 40 collected labeled test examples per task. At the same time, no
existing approach provides an accurate and reliable ICL accuracy estimation in
every setting, highlighting the need for better ways to measure the uncertainty
of LLM predictions.
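A minimal sketch of the meta-model idea from the abstract: summarize each task by statistics of the LLM's confidence scores on its unlabeled test examples, then fit a regressor on tasks whose ICL accuracy is already known. The quantile features and the gradient-boosting regressor are illustrative assumptions, not necessarily the paper's exact configuration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def task_features(confidences, n_bins=10):
    """Summarize a task by the distribution of its per-example LLM confidence
    scores (e.g., probability of the predicted label); quantiles are an
    illustrative feature choice."""
    return np.quantile(confidences, np.linspace(0.0, 1.0, n_bins + 1))

# Training tasks: per-example confidences plus the ICL accuracy observed on them.
# In practice both come from running the LLM; random data keeps the sketch runnable.
rng = np.random.default_rng(0)
train_conf = [rng.beta(2, 1, size=200) for _ in range(50)]
train_acc = np.clip([c.mean() + rng.normal(0, 0.03) for c in train_conf], 0, 1)

X = np.stack([task_features(c) for c in train_conf])
meta_model = GradientBoostingRegressor().fit(X, train_acc)

# Estimate ICL accuracy for a new task from its unlabeled test data alone.
new_task_conf = rng.beta(3, 2, size=200)
print("estimated accuracy:", meta_model.predict(task_features(new_task_conf)[None, :])[0])
```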
Related papers
- BenTo: Benchmark Task Reduction with In-Context Transferability [32.561978389905434]
This paper investigates how to efficiently reduce the tasks used to benchmark large language models (LLMs).
We propose a practically efficient metric for estimating the transferability between two tasks via in-context learning (ICL).
arXiv Detail & Related papers (2024-10-17T17:41:15Z)
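One plausible reading of the ICL-based transferability metric, sketched under the assumption that transferability is measured as target-task accuracy when the prompt contains only source-task demonstrations; `icl_accuracy` is a stub standing in for real LLM evaluation, and the paper's exact metric may differ.

```python
import itertools
import random

def icl_accuracy(demos, eval_examples):
    """Stub: prompt the LLM with `demos` as in-context examples and score it on
    `eval_examples`. Replace with real LLM calls; a random score keeps the
    sketch self-contained."""
    return random.random()

def transferability(source_task, target_task, k=4, n_eval=50):
    """Hypothetical pairwise score: how well the target task is solved when the
    prompt contains only source-task demonstrations."""
    demos = random.sample(source_task, k)
    evals = random.sample(target_task, min(n_eval, len(target_task)))
    return icl_accuracy(demos, evals)

tasks = {name: [(f"{name} q{i}", "a") for i in range(100)]
         for name in ("task_a", "task_b", "task_c")}
scores = {(s, t): transferability(tasks[s], tasks[t])
          for s, t in itertools.permutations(tasks, 2)}
# Tasks that other tasks already transfer to well are candidates for removal
# when shrinking the benchmark.
print(scores)
```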
- Training on the Benchmark Is Not All You Need [52.01920740114261]
We propose a simple and effective data leakage detection method based on the contents of multiple-choice options.
Our method is able to work under black-box conditions without access to model training data or weights.
We evaluate the degree of data leakage of 31 mainstream open-source LLMs on four benchmark datasets.
arXiv Detail & Related papers (2024-09-03T11:09:44Z)
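The summary only states that the detector uses the contents of multiple-choice options under black-box access. One way such a check can be built, assumed here for illustration rather than taken from the paper, is to test whether the model prefers the original option ordering over shuffled orderings far more often than chance, a signature of having memorized the benchmark.

```python
import random
from itertools import permutations

def sequence_logprob(text):
    """Stub for a black-box log-probability query to the LLM; replace with a
    real scoring call. A random value keeps the sketch runnable."""
    return random.gauss(-50.0, 5.0)

def prefers_original_order(question, options):
    """Hypothetical check: does the original option ordering get the highest
    score among all orderings? Chance level is 1 / number of orderings."""
    scores = {order: sequence_logprob(question + " " + " / ".join(order))
              for order in permutations(options)}
    return max(scores, key=scores.get) == tuple(options)

items = [(f"Question {i}?", ["opt A", "opt B", "opt C", "opt D"]) for i in range(200)]
rate = sum(prefers_original_order(q, opts) for q, opts in items) / len(items)
print(f"original-order preference rate: {rate:.3f} (chance ~ {1/24:.3f})")
```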
- Uncertainty Aware Learning for Language Model Alignment [97.36361196793929]
We propose uncertainty-aware learning (UAL) to improve model alignment across different task scenarios.
We implement UAL in a simple fashion -- adaptively setting the label smoothing value of training according to the uncertainty of individual samples.
Experiments on widely used benchmarks demonstrate that our UAL significantly and consistently outperforms standard supervised fine-tuning.
arXiv Detail & Related papers (2024-06-07T11:37:45Z)
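The mechanism in the summary is concrete enough to sketch: the label-smoothing value applied to each training sample is set from that sample's uncertainty. A minimal PyTorch version follows; how the uncertainty itself is estimated is left as an input, since the summary does not specify it, and the linear uncertainty-to-smoothing mapping is an assumption.

```python
import torch
import torch.nn.functional as F

def ual_loss(logits, targets, uncertainty, max_smoothing=0.3):
    """Cross-entropy with a per-sample label-smoothing value that grows with
    that sample's uncertainty (assumed to lie in [0, 1]); the linear mapping
    from uncertainty to smoothing is an illustrative choice."""
    n_classes = logits.size(-1)
    eps = (uncertainty * max_smoothing).unsqueeze(-1)           # shape (batch, 1)
    one_hot = F.one_hot(targets, n_classes).float()
    soft_targets = one_hot * (1 - eps) + eps / n_classes        # rows still sum to 1
    return -(soft_targets * F.log_softmax(logits, dim=-1)).sum(-1).mean()

logits = torch.randn(8, 5, requires_grad=True)
targets = torch.randint(0, 5, (8,))
uncertainty = torch.rand(8)   # e.g., from sample-level disagreement or entropy
loss = ual_loss(logits, targets, uncertainty)
loss.backward()
print(loss.item())
```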
- Feature-Adaptive and Data-Scalable In-Context Learning [36.01997148676005]
FADS-ICL is a feature-adaptive and data-scalable in-context learning framework.
It leverages task-adaptive features to improve inference on the downstream task.
FADS-ICL consistently outperforms previous state-of-the-art methods.
arXiv Detail & Related papers (2024-05-17T12:32:53Z)
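The summary gives few details, so the sketch below only illustrates one way a data-scalable ICL framework can work: extract feature vectors from the LLM for each downstream example and fit a lightweight task-adaptive classifier on them, so labeled data beyond the context window still helps. The feature extractor is a stub, and this should not be read as FADS-ICL's actual architecture.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def llm_features(texts, dim=768):
    """Stub for extracting per-example feature vectors from the LLM (e.g., the
    hidden state of the final token). Random vectors keep the sketch runnable."""
    rng = np.random.default_rng(len(texts))
    return rng.normal(size=(len(texts), dim))

# Labeled downstream data that would not fit into a single prompt.
train_texts = [f"example {i}" for i in range(256)]
train_labels = np.random.randint(0, 3, size=256)

# Fit a lightweight, task-adaptive module on top of frozen LLM features.
clf = LogisticRegression(max_iter=1000).fit(llm_features(train_texts), train_labels)
print(clf.predict(llm_features(["a new test example"])))
```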
- Self-Evaluation Improves Selective Generation in Large Language Models [54.003992911447696]
We reformulate open-ended generation tasks into token-level prediction tasks.
We instruct an LLM to self-evaluate its answers.
We benchmark a range of scoring methods based on self-evaluation.
arXiv Detail & Related papers (2023-12-14T19:09:22Z)
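A common token-level instantiation of self-evaluation, assumed here for illustration (the paper benchmarks several scoring variants): append a yes/no verification question to the model's own answer, read off the probability of "Yes", and abstain when it is low.

```python
def yes_probability(prompt):
    """Stub: the LLM's next-token probability of 'Yes' for this prompt.
    Replace with a real log-prob query; a fixed value keeps the sketch runnable."""
    return 0.62

def selective_answer(question, proposed_answer, threshold=0.7):
    """Token-level self-evaluation: ask the model whether its own answer is
    correct and abstain when the 'Yes' probability falls below the threshold."""
    prompt = (
        f"Question: {question}\n"
        f"Proposed answer: {proposed_answer}\n"
        "Is the proposed answer correct? Answer Yes or No:"
    )
    confidence = yes_probability(prompt)
    return (proposed_answer if confidence >= threshold else "[abstain]"), confidence

print(selective_answer("What is the capital of Australia?", "Sydney"))
```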
- Which Examples to Annotate for In-Context Learning? Towards Effective and Efficient Selection [35.924633625147365]
Large Language Models (LLMs) can adapt to new tasks via in-context learning (ICL).
In this work, we investigate an active learning approach for ICL, where there is a limited budget for annotating examples.
We propose a model-adaptive optimization-free algorithm, termed AdaICL, which identifies examples that the model is uncertain about.
arXiv Detail & Related papers (2023-10-30T22:03:55Z)
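The sketch below covers only the uncertainty-driven selection step described in the summary, scoring unlabeled examples by the entropy of the model's predicted label distribution and spending the annotation budget on the highest-entropy ones; AdaICL's model-adaptive details are not reproduced here.

```python
import numpy as np

def entropy(probs):
    """Shannon entropy of a predicted label distribution."""
    p = np.clip(probs, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

def select_for_annotation(pred_dists, budget):
    """Pick the `budget` unlabeled examples the model is most uncertain about.
    `pred_dists` are the LLM's label distributions for each unlabeled example."""
    scores = [entropy(p) for p in pred_dists]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:budget]

# In practice these distributions come from the LLM; random ones keep this runnable.
rng = np.random.default_rng(1)
pred_dists = rng.dirichlet(np.ones(4), size=100)
print("annotate examples:", select_for_annotation(pred_dists, budget=10))
```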
- Rapid Adaptation in Online Continual Learning: Are We Evaluating It Right? [135.71855998537347]
We revisit the common practice of evaluating adaptation of Online Continual Learning (OCL) algorithms through the metric of online accuracy.
We show that this metric is unreliable, as even vacuous blind classifiers can achieve unrealistically high online accuracy.
Existing OCL algorithms can also achieve high online accuracy, but perform poorly in retaining useful information.
arXiv Detail & Related papers (2023-05-16T08:29:33Z)
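The failure mode described above is easy to reproduce: on a label-correlated stream, a "blind" predictor that simply repeats the most recently revealed label reaches very high online accuracy while learning nothing about the inputs. A self-contained sketch:

```python
import random

# A temporally correlated stream: the label rarely changes between steps.
random.seed(0)
stream, label = [], 0
for _ in range(10_000):
    if random.random() < 0.02:          # the label switches only 2% of the time
        label = random.randrange(10)
    stream.append(label)

# "Blind" predictor: ignore the input entirely and repeat the previous label.
correct, prev = 0, None
for y in stream:
    correct += (prev == y)
    prev = y                            # the online update uses only the revealed label
print("online accuracy of the blind predictor:", correct / len(stream))
```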
- Data Curation Alone Can Stabilize In-context Learning [20.874674130060388]
In-context learning (ICL) enables large language models to perform new tasks by prompting them with a sequence of training examples.
Randomly sampling examples from a training set leads to high variance in performance.
We show that carefully curating a subset of training data greatly stabilizes ICL performance without any other changes to the ICL algorithm.
arXiv Detail & Related papers (2022-12-20T15:58:54Z)
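The instability in question is the variance induced by which demonstrations get sampled. A small sketch of how that variance is typically measured, with a stub in place of real LLM evaluation; a curated subset is chosen once and reused, so this sampling variance disappears.

```python
import random
import statistics

def icl_accuracy(demonstrations):
    """Stub: evaluate the LLM prompted with these demonstrations. Replace with
    real runs; here the score depends (noisily) on which subset was drawn."""
    rng = random.Random(hash(demonstrations))
    return 0.55 + 0.30 * rng.random()

train_pool = list(range(500))
runs = [icl_accuracy(tuple(sorted(random.sample(train_pool, 8)))) for _ in range(20)]
print(f"random demo sampling: mean={statistics.mean(runs):.3f} stdev={statistics.stdev(runs):.3f}")
# Curating one fixed subset and reusing it removes this source of variance.
```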
- Meta-Generating Deep Attentive Metric for Few-shot Classification [53.07108067253006]
We present a novel deep metric meta-generation method to generate a specific metric for a new few-shot learning task.
In this study, we structure the metric using a three-layer deep attentive network that is flexible enough to produce a discriminative metric for each task.
We achieve clear performance improvements over state-of-the-art competitors, especially in the challenging cases.
arXiv Detail & Related papers (2020-12-03T02:07:43Z)