How Many Validation Labels Do You Need? Exploring the Design Space of
Label-Efficient Model Ranking
- URL: http://arxiv.org/abs/2312.01619v3
- Date: Sat, 17 Feb 2024 13:42:03 GMT
- Title: How Many Validation Labels Do You Need? Exploring the Design Space of
Label-Efficient Model Ranking
- Authors: Zhengyu Hu, Jieyu Zhang, Yue Yu, Yuchen Zhuang, Hui Xiong
- Abstract summary: This paper presents LEMR (Label-Efficient Model Ranking) and introduces the MoraBench Benchmark.
LEMR is a novel framework that minimizes the need for costly annotations in model selection by strategically annotating instances from an unlabeled validation set.
- Score: 40.39898960460575
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents LEMR (Label-Efficient Model Ranking) and introduces the
MoraBench Benchmark. LEMR is a novel framework that minimizes the need for
costly annotations in model selection by strategically annotating instances
from an unlabeled validation set. To evaluate LEMR, we leverage the MoraBench
Benchmark, a comprehensive collection of model outputs across diverse
scenarios. Our extensive evaluation across 23 different NLP tasks spanning
semi-supervised learning, weak supervision, and prompt selection
demonstrates LEMR's effectiveness in significantly reducing labeling costs. Key
findings highlight the impact of suitable ensemble methods, uncertainty
sampling strategies, and model committee selection in enhancing model ranking
accuracy. LEMR, supported by the insights from MoraBench, provides a
cost-effective and accurate solution for model selection, especially valuable
in resource-constrained environments.
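To make the general recipe concrete, here is a minimal sketch of label-efficient ranking in the spirit of the abstract (not the authors' exact LEMR implementation; all names are illustrative): a pool of candidate models is ranked after annotating only the validation instances on which an ensemble of the candidates is most uncertain.

```python
import numpy as np

def rank_models_label_efficiently(model_probs, oracle_labels, budget):
    """Rank candidate models using only `budget` validation labels.

    model_probs   : (n_models, n_instances, n_classes) predicted class
                    probabilities of every candidate on an unlabeled
                    validation set.
    oracle_labels : ground-truth labels, consulted only for the
                    instances we decide to annotate.
    budget        : number of annotations we are willing to pay for.
    """
    # Treat the candidate pool itself as the committee and ensemble
    # its predictions by averaging class probabilities.
    ensemble = model_probs.mean(axis=0)                   # (n_inst, n_cls)

    # Uncertainty sampling: annotate the instances where the
    # ensemble's predictive entropy is highest.
    entropy = -(ensemble * np.log(ensemble + 1e-12)).sum(axis=1)
    queried = np.argsort(-entropy)[:budget]

    # Score every model on the small annotated subset only.
    labels = oracle_labels[queried]
    preds = model_probs[:, queried, :].argmax(axis=2)     # (n_models, budget)
    accuracy = (preds == labels).mean(axis=1)

    return np.argsort(-accuracy)                          # best model first
```

The sketch keeps only the uncertainty-sampling core; LEMR's specific choices of ensemble method and model committee selection, which the abstract identifies as important, are not modeled here.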
Related papers
- STEER-ME: Assessing the Microeconomic Reasoning of Large Language Models [8.60556939977361]
We develop a benchmark for evaluating large language models (LLMs) on microeconomic reasoning.
We focus on the logic of supply and demand, each grounded in up to 10 domains, 5 perspectives, and 3 types.
We demonstrate the usefulness of our benchmark via a case study on 27 LLMs, ranging from small open-source models to the current state of the art.
arXiv Detail & Related papers (2025-02-18T18:42:09Z) - Ranked from Within: Ranking Large Multimodal Models for Visual Question Answering Without Labels [64.94853276821992]
Large multimodal models (LMMs) are increasingly deployed across diverse applications.
Traditional evaluation methods are largely dataset-centric, relying on fixed, labeled datasets and supervised metrics.
We explore unsupervised model ranking for LMMs by leveraging their uncertainty signals, such as softmax probabilities.
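As an illustration of this idea (a generic sketch, not the paper's specific method), models can be ranked on unlabeled data by their average maximum softmax probability:

```python
import numpy as np

def rank_by_confidence(model_probs):
    """Rank models on an unlabeled set by mean max-softmax confidence.

    model_probs : (n_models, n_instances, n_classes) softmax outputs.
    Returns model indices ordered from most to least confident.
    """
    confidence = model_probs.max(axis=2).mean(axis=1)   # (n_models,)
    return np.argsort(-confidence)
```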
arXiv Detail & Related papers (2024-12-09T13:05:43Z) - LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models [71.8065384742686]
LMMS-EVAL is a unified and standardized multimodal benchmark framework with over 50 tasks and more than 10 models.
LMMS-EVAL LITE is a pruned evaluation toolkit that emphasizes both coverage and efficiency.
Multimodal LIVEBENCH utilizes continuously updating news and online forums to assess models' generalization abilities in the wild.
arXiv Detail & Related papers (2024-07-17T17:51:53Z) - Diversified Batch Selection for Training Acceleration [68.67164304377732]
A prevalent research line, known as online batch selection, explores selecting informative subsets during the training process.
Vanilla reference-model-free methods independently score and select data in a sample-wise manner.
We propose Diversified Batch Selection (DivBS), which is reference-model-free and can efficiently select diverse and representative samples.
arXiv Detail & Related papers (2024-06-07T12:12:20Z) - Large Language Model-guided Document Selection [23.673690115025913]
Large Language Model (LLM) pre-training exhausts an ever-growing compute budget.
Recent research has demonstrated that careful document selection enables comparable model quality with only a fraction of the FLOPs.
We explore a promising direction for scalable general-domain document selection.
arXiv Detail & Related papers (2024-06-07T04:52:46Z) - Which LLM to Play? Convergence-Aware Online Model Selection with
Time-Increasing Bandits [43.65904435249823]
We propose a time-increasing bandit algorithm TI-UCB, which effectively predicts the increase of model performances.
Our results highlight the importance of exploiting the increasing-then-converging pattern for more efficient and economical model selection.
arXiv Detail & Related papers (2024-03-11T23:52:46Z) - Revisiting Out-of-distribution Robustness in NLP: Benchmark, Analysis,
and LLMs Evaluations [111.88727295707454]
This paper reexamines the research on out-of-distribution (OOD) robustness in the field of NLP.
We propose a benchmark construction protocol that ensures clear differentiation and challenging distribution shifts.
We conduct experiments on pre-trained language models for analysis and evaluation of OOD robustness.
arXiv Detail & Related papers (2023-06-07T17:47:03Z) - Evaluating Representations with Readout Model Switching [19.907607374144167]
In this paper, we propose to use the Minimum Description Length (MDL) principle to devise an evaluation metric.
We design a hybrid discrete and continuous-valued model space for the readout models and employ a switching strategy to combine their predictions.
The proposed metric can be efficiently computed with an online method and we present results for pre-trained vision encoders of various architectures.
arXiv Detail & Related papers (2023-02-19T14:08:01Z) - Contextual Active Model Selection [10.925932167673764]
We present an approach to actively select pre-trained models while minimizing labeling costs.
The objective is to adaptively select the best model to make a prediction while limiting label requests.
We propose CAMS, a contextual active model selection algorithm that relies on two novel components.
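For intuition only, a heavily simplified online loop in this spirit (not the CAMS algorithm itself; the query rule and names are invented for illustration):

```python
import numpy as np

def online_model_selection(stream, models, query_rate=0.1, seed=0):
    """Serve predictions from the best-estimated model so far and
    query the true label for only a small fraction of the stream.

    stream : iterable of (x, y) pairs; y is read only when queried.
    models : list of callables mapping x -> predicted label.
    """
    rng = np.random.default_rng(seed)
    hits = np.ones(len(models))          # smoothed correct-prediction counts
    tries = np.full(len(models), 2.0)    # smoothed evaluation counts
    for x, y in stream:
        best = int(np.argmax(hits / tries))
        _ = models[best](x)              # prediction served for this instance
        if rng.random() < query_rate:    # occasionally pay for a label ...
            for i, model in enumerate(models):
                hits[i] += (model(x) == y)   # ... and update every model
                tries[i] += 1
    return int(np.argmax(hits / tries))      # index of the best model found
```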
arXiv Detail & Related papers (2022-07-13T08:22:22Z) - Characterizing Fairness Over the Set of Good Models Under Selective
Labels [69.64662540443162]
We develop a framework for characterizing predictive fairness properties over the set of models that deliver similar overall performance.
We provide tractable algorithms to compute the range of attainable group-level predictive disparities.
We extend our framework to address the empirically relevant challenge of selectively labelled data.
arXiv Detail & Related papers (2021-01-02T02:11:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.