How Many Validation Labels Do You Need? Exploring the Design Space of
Label-Efficient Model Ranking
- URL: http://arxiv.org/abs/2312.01619v3
- Date: Sat, 17 Feb 2024 13:42:03 GMT
- Title: How Many Validation Labels Do You Need? Exploring the Design Space of
Label-Efficient Model Ranking
- Authors: Zhengyu Hu, Jieyu Zhang, Yue Yu, Yuchen Zhuang, Hui Xiong
- Abstract summary: This paper presents LEMR (Label-Efficient Model Ranking) and introduces the MoraBench Benchmark.
LEMR is a novel framework that minimizes the need for costly annotations in model selection by strategically annotating instances from an unlabeled validation set.
- Score: 40.39898960460575
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents LEMR (Label-Efficient Model Ranking) and introduces the
MoraBench Benchmark. LEMR is a novel framework that minimizes the need for
costly annotations in model selection by strategically annotating instances
from an unlabeled validation set. To evaluate LEMR, we leverage the MoraBench
Benchmark, a comprehensive collection of model outputs across diverse
scenarios. Our extensive evaluation across 23 different NLP tasks in
semi-supervised learning, weak supervision, and prompt selection tasks
demonstrates LEMR's effectiveness in significantly reducing labeling costs. Key
findings highlight the impact of suitable ensemble methods, uncertainty
sampling strategies, and model committee selection in enhancing model ranking
accuracy. LEMR, supported by the insights from MoraBench, provides a
cost-effective and accurate solution for model selection, especially valuable
in resource-constrained environments.
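The abstract describes the framework only at a high level, so here is a minimal sketch of that general recipe, not the authors' implementation: pseudo-label an unlabeled validation set with an ensemble of the candidate models, spend the labeling budget on the instances the ensemble is most uncertain about, and rank the models on the resulting mix of gold and pseudo labels. All names (rank_models, oracle_labels, ...) are illustrative.

```python
import numpy as np

def rank_models(probs, budget, oracle_labels):
    """Illustrative label-efficient model ranking.

    probs: (n_models, n_instances, n_classes) softmax outputs of every
        candidate model on the unlabeled validation set.
    budget: number of instances we can afford to send to an annotator.
    oracle_labels: callable mapping instance indices -> gold labels
        (stands in for the human annotator).
    """
    # Ensemble the model committee (simple mean; other choices are possible).
    ensemble = probs.mean(axis=0)                    # (n_instances, n_classes)
    pseudo = ensemble.argmax(axis=1)                 # pseudo labels for everything

    # Uncertainty sampling: spend the budget on the highest-entropy instances.
    entropy = -(ensemble * np.log(ensemble + 1e-12)).sum(axis=1)
    to_label = np.argsort(-entropy)[:budget]

    labels = pseudo.copy()
    labels[to_label] = oracle_labels(to_label)       # gold labels where we paid

    # Rank the candidates by accuracy against the mixed gold/pseudo labels.
    preds = probs.argmax(axis=2)                     # (n_models, n_instances)
    scores = (preds == labels).mean(axis=1)
    return np.argsort(-scores), scores
```

The choices the abstract calls out as key design points (which ensemble method, which uncertainty sampling strategy, which model committee) correspond to the mean-ensemble, entropy criterion, and "all models" committee hard-coded in this toy version.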
Related papers
- SCORE: Systematic COnsistency and Robustness Evaluation for Large Language Models [4.875712300661656]
We present SCORE (Systematic COnsistency and Robustness Evaluation), a comprehensive framework for non-adversarial evaluation of Large Language Models.
The SCORE framework evaluates models by repeatedly testing them on the same benchmarks in various setups to give a realistic estimate of their accuracy and consistency.
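The core measurement is easy to illustrate: run the same benchmark under several setups (e.g. different prompt phrasings or option orders) and report both the mean accuracy and how often the answers agree across setups. The sketch below is a toy illustration under those assumptions, not the SCORE implementation.

```python
def accuracy_and_consistency(answers_per_setup, gold):
    """answers_per_setup: one list of model answers per evaluation setup;
    gold: reference answers. Returns (mean accuracy, consistency rate)."""
    accs = [sum(a == g for a, g in zip(run, gold)) / len(gold)
            for run in answers_per_setup]
    # Consistency: fraction of questions answered identically in every setup.
    consistent = sum(len(set(q)) == 1 for q in zip(*answers_per_setup))
    return sum(accs) / len(accs), consistent / len(gold)
```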
arXiv Detail & Related papers (2025-02-28T19:27:29Z)
- STEER-ME: Assessing the Microeconomic Reasoning of Large Language Models [8.60556939977361]
We develop a benchmark for evaluating large language models (LLMs) on microeconomic reasoning.
We focus on the logic of supply and demand, each grounded in up to 10 domains, 5 perspectives, and 3 types.
We demonstrate the usefulness of our benchmark via a case study on 27 LLMs, ranging from small open-source models to the current state of the art.
arXiv Detail & Related papers (2025-02-18T18:42:09Z)
- Ranked from Within: Ranking Large Multimodal Models for Visual Question Answering Without Labels [64.94853276821992]
Large multimodal models (LMMs) are increasingly deployed across diverse applications.
Traditional evaluation methods are largely dataset-centric, relying on fixed, labeled datasets and supervised metrics.
We explore unsupervised model ranking for LMMs by leveraging their uncertainty signals, such as softmax probabilities.
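A hedged sketch of the underlying idea, not the paper's method: with no labels available, rank models by how confident their softmax outputs are on the target data, here measured as mean maximum class probability. The function name and input layout are assumptions for illustration.

```python
import numpy as np

def rank_by_confidence(model_probs):
    """model_probs: dict of model name -> (n_instances, n_classes) softmax
    outputs on unlabeled target data. Returns models sorted by mean
    max-probability (a simple uncertainty signal), highest first."""
    scores = {name: float(p.max(axis=1).mean()) for name, p in model_probs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```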
arXiv Detail & Related papers (2024-12-09T13:05:43Z)
- All models are wrong, some are useful: Model Selection with Limited Labels [49.62984196182567]
We introduce MODEL SELECTOR, a framework for label-efficient selection of pretrained classifiers.
We show that MODEL SELECTOR drastically reduces the need for labeled data while consistently picking the best or near-best performing model.
Our results further highlight the robustness of MODEL SELECTOR in model selection, as it reduces the labeling cost by up to 72.41% when selecting a near-best model.
arXiv Detail & Related papers (2024-10-17T14:45:56Z)
- LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models [71.8065384742686]
LMMS-EVAL is a unified and standardized multimodal benchmark framework with over 50 tasks and more than 10 models.
LMMS-EVAL LITE is a pruned evaluation toolkit that emphasizes both coverage and efficiency.
Multimodal LIVEBENCH utilizes continuously updating news and online forums to assess models' generalization abilities in the wild.
arXiv Detail & Related papers (2024-07-17T17:51:53Z)
- Grade Score: Quantifying LLM Performance in Option Selection [0.0]
"Grade Score" is a novel metric designed to evaluate the consistency and fairness of Large Language Models (LLMs)
The Grade Score combines Entropy, which measures order bias, and Mode Frequency, which assesses choice stability.
The study explores techniques such as prompt engineering and option sampling strategies to optimize the Grade Score.
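The two ingredients are easy to illustrate, although the exact way the paper combines them into the final Grade Score is not reproduced here: ask the same multiple-choice question with the options presented in several orders, then measure the entropy of the chosen option positions (order bias) and the frequency of the most common chosen option (choice stability). The helper below is purely illustrative.

```python
import math
from collections import Counter

def grade_components(position_choices, option_choices):
    """position_choices: the option *positions* picked across shuffled
    orderings (e.g. [0, 0, 1, 0]); option_choices: the option *contents*
    picked. Returns (position entropy in bits, mode frequency)."""
    n = len(position_choices)
    counts = Counter(position_choices)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    mode_freq = Counter(option_choices).most_common(1)[0][1] / n
    return entropy, mode_freq
```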
arXiv Detail & Related papers (2024-06-17T19:29:39Z)
- Diversified Batch Selection for Training Acceleration [68.67164304377732]
A prevalent research line, known as online batch selection, explores selecting informative subsets during the training process.
Vanilla reference-model-free methods involve independently scoring and selecting data in a sample-wise manner.
We propose Diversified Batch Selection (DivBS), which is reference-model-free and can efficiently select diverse and representative samples.
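DivBS's actual selection objective is more involved, but the flavour of reference-model-free, diversity-aware selection can be conveyed by a generic greedy farthest-point pass over per-sample feature (or gradient) vectors. The sketch below is that generic baseline, not DivBS itself.

```python
import numpy as np

def diverse_batch(features, k):
    """Greedy farthest-point selection of k mutually dissimilar samples.
    features: (n_samples, dim) array, e.g. per-sample gradients or embeddings."""
    first = int(np.linalg.norm(features, axis=1).argmax())   # start from the largest
    chosen = [first]
    dists = np.linalg.norm(features - features[first], axis=1)
    while len(chosen) < k:
        nxt = int(dists.argmax())                            # farthest from the chosen set
        chosen.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(features - features[nxt], axis=1))
    return chosen
```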
arXiv Detail & Related papers (2024-06-07T12:12:20Z)
- Large Language Model-guided Document Selection [23.673690115025913]
Large Language Model (LLM) pre-training exhausts an ever-growing compute budget.
Recent research has demonstrated that careful document selection enables comparable model quality with only a fraction of the FLOPs.
We explore a promising direction for scalable general-domain document selection.
arXiv Detail & Related papers (2024-06-07T04:52:46Z)
- Which LLM to Play? Convergence-Aware Online Model Selection with Time-Increasing Bandits [43.65904435249823]
We propose TI-UCB, a time-increasing bandit algorithm that effectively predicts increases in model performance.
Our results highlight the importance of utilizing the increasing-then-converging pattern for more efficient and economical model selection.
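For context, a plain UCB1 loop for online model selection looks like the sketch below; TI-UCB itself additionally fits each model's increasing-then-converging reward curve, which is not reproduced here, and all names are illustrative.

```python
import math

def ucb_model_selection(models, rounds, pull):
    """models: list of candidate model ids; pull(m) -> observed reward in [0, 1].
    Plain UCB1 baseline for online model selection (not TI-UCB itself)."""
    counts = {m: 0 for m in models}
    sums = {m: 0.0 for m in models}
    history = []
    for t in range(1, rounds + 1):
        if t <= len(models):                      # play every candidate once
            chosen = models[t - 1]
        else:                                     # exploit mean + exploration bonus
            chosen = max(models, key=lambda m: sums[m] / counts[m]
                         + math.sqrt(2 * math.log(t) / counts[m]))
        reward = pull(chosen)
        counts[chosen] += 1
        sums[chosen] += reward
        history.append((chosen, reward))
    return history
```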
arXiv Detail & Related papers (2024-03-11T23:52:46Z)
- Revisiting Out-of-distribution Robustness in NLP: Benchmark, Analysis, and LLMs Evaluations [111.88727295707454]
This paper reexamines the research on out-of-distribution (OOD) robustness in the field of NLP.
We propose a benchmark construction protocol that ensures clear differentiation and challenging distribution shifts.
We conduct experiments on pre-trained language models for analysis and evaluation of OOD robustness.
arXiv Detail & Related papers (2023-06-07T17:47:03Z)
- Evaluating Representations with Readout Model Switching [19.907607374144167]
In this paper, we propose to use the Minimum Description Length (MDL) principle to devise an evaluation metric.
We design a hybrid discrete and continuous-valued model space for the readout models and employ a switching strategy to combine their predictions.
The proposed metric can be efficiently computed with an online method and we present results for pre-trained vision encoders of various architectures.
arXiv Detail & Related papers (2023-02-19T14:08:01Z)
- Cost-Effective Online Contextual Model Selection [14.094350329970537]
We formulate this task as an online contextual active model selection problem, where at each round the learner receives an unlabeled data point along with a context.
The goal is to output the best model for any given context without obtaining an excessive amount of labels.
We propose a contextual active model selection algorithm (CAMS), which relies on a novel uncertainty sampling query criterion defined on a given policy class for adaptive model selection.
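CAMS's query criterion is defined over a policy class; the sketch below only conveys the loop structure of online active model selection with a simple disagreement-based query rule, and every name in it is an illustrative assumption rather than taken from the paper.

```python
def selection_round(models, x, stats, budget, oracle):
    """One toy round of online active model selection (not the CAMS criterion).
    models: list of callables x -> predicted label; stats: per-model
    [errors, queried] counts; oracle: callable x -> gold label (paid)."""
    preds = [m(x) for m in models]
    # Recommend the model with the lowest estimated error rate so far.
    best = min(range(len(models)), key=lambda i: stats[i][0] / max(stats[i][1], 1))
    # Query a label only when the committee disagrees and budget remains.
    if len(set(preds)) > 1 and budget > 0:
        y = oracle(x)
        budget -= 1
        for i, p in enumerate(preds):
            stats[i][0] += int(p != y)
            stats[i][1] += 1
    return preds[best], budget
```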
arXiv Detail & Related papers (2022-07-13T08:22:22Z)
- Characterizing Fairness Over the Set of Good Models Under Selective Labels [69.64662540443162]
We develop a framework for characterizing predictive fairness properties over the set of models that deliver similar overall performance.
We provide tractable algorithms to compute the range of attainable group-level predictive disparities.
We extend our framework to address the empirically relevant challenge of selectively labelled data.
arXiv Detail & Related papers (2021-01-02T02:11:37Z)