Evaluating the Evaluators: Are Current Few-Shot Learning Benchmarks Fit
for Purpose?
- URL: http://arxiv.org/abs/2307.02732v1
- Date: Thu, 6 Jul 2023 02:31:38 GMT
- Title: Evaluating the Evaluators: Are Current Few-Shot Learning Benchmarks Fit
for Purpose?
- Authors: Luísa Shimabucoro, Timothy Hospedales, Henry Gouk
- Abstract summary: This paper presents the first investigation into task-level evaluation.
We measure the accuracy of performance estimators in the few-shot setting.
We examine the reasons for the failure of evaluators usually thought of as being robust.
- Score: 11.451691772914055
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Numerous benchmarks for Few-Shot Learning have been proposed in the last
decade. However, all of these benchmarks focus on performance averaged over many
tasks, and the question of how to reliably evaluate and tune models trained for
individual tasks in this regime has not been addressed. This paper presents the
first investigation into task-level evaluation -- a fundamental step when
deploying a model. We measure the accuracy of performance estimators in the
few-shot setting, consider strategies for model selection, and examine the
reasons for the failure of evaluators usually thought of as being robust. We
conclude that cross-validation with a low number of folds is the best choice
for directly estimating the performance of a model, whereas using bootstrapping
or cross-validation with a large number of folds is better for model selection
purposes. Overall, we find that existing benchmarks for few-shot learning are
not designed in such a way that one can get a reliable picture of how
effectively methods can be used on individual tasks.
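The two estimator families the abstract compares, k-fold cross-validation and bootstrapping, can be sketched on a single few-shot task as follows. This is an illustrative sketch, not the authors' code: the nearest-centroid classifier, the toy 2-D Gaussian data, and the helper names (`fit_centroids`, `kfold_estimate`, `bootstrap_estimate`) are all assumptions made for this example, and the bootstrap variant shown scores each resample on its out-of-bag points.

```python
import random
from collections import defaultdict


def fit_centroids(xs, ys):
    """Return a dict mapping each class label to its mean feature vector."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[y].append(x)
    return {y: [sum(col) / len(pts) for col in zip(*pts)]
            for y, pts in groups.items()}


def predict(centroids, x):
    """Predict the label whose centroid is closest in squared distance."""
    def sqdist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda y: sqdist(centroids[y]))


def accuracy(train_idx, test_idx, xs, ys):
    """Fit on the training indices and score on the test indices."""
    cents = fit_centroids([xs[i] for i in train_idx],
                          [ys[i] for i in train_idx])
    hits = sum(predict(cents, xs[i]) == ys[i] for i in test_idx)
    return hits / len(test_idx)


def kfold_estimate(xs, ys, k, rng):
    """Mean held-out accuracy over k folds (k == len(xs) is leave-one-out)."""
    if not 2 <= k <= len(xs):
        raise ValueError("k must satisfy 2 <= k <= number of examples")
    idx = list(range(len(xs)))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        scores.append(accuracy(train, held_out, xs, ys))
    return sum(scores) / k


def bootstrap_estimate(xs, ys, rounds, rng):
    """Mean out-of-bag accuracy over resamples drawn with replacement."""
    n = len(xs)
    scores = []
    for _ in range(rounds):
        train = [rng.randrange(n) for _ in range(n)]
        in_bag = set(train)
        oob = [i for i in range(n) if i not in in_bag]
        if oob:  # skip the rare resample that covers every example
            scores.append(accuracy(train, oob, xs, ys))
    if not scores:
        raise ValueError("no resample left any out-of-bag examples")
    return sum(scores) / len(scores)


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy 2-way 5-shot task: two overlapping 2-D Gaussian classes (assumed data).
    xs = [[rng.gauss(2.0 * c, 1.0), rng.gauss(2.0 * c, 1.0)]
          for c in (0, 1) for _ in range(5)]
    ys = [c for c in (0, 1) for _ in range(5)]
    print("2-fold CV:     ", kfold_estimate(xs, ys, 2, rng))
    print("leave-one-out: ", kfold_estimate(xs, ys, len(xs), rng))
    print("bootstrap OOB: ", bootstrap_estimate(xs, ys, 200, rng))
```

With only ten labelled examples, the three estimates can differ noticeably from each other and from run to run, which is the kind of estimator unreliability the abstract describes.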
Related papers
- Self-Taught Evaluators [77.92610887220594]
We present an approach that aims to improve evaluators without human annotations, using synthetic training data only.
Our Self-Taught Evaluator can improve a strong LLM from 75.4 to 88.3 on RewardBench.
arXiv Detail & Related papers (2024-08-05T17:57:02Z)
- When is an Embedding Model More Promising than Another? [33.540506562970776]
Embedders play a central role in machine learning, projecting any object into numerical representations that can be leveraged to perform various downstream tasks.
The evaluation of embedding models typically depends on domain-specific empirical approaches.
We present a unified approach to evaluate embedders, drawing upon the concepts of sufficiency and informativeness.
arXiv Detail & Related papers (2024-06-11T18:13:46Z)
- Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvement in model capacity.
To assess the model performance, a typical approach is to construct evaluation benchmarks for measuring the ability level of LLMs.
We discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
arXiv Detail & Related papers (2023-11-03T14:59:54Z)
- Evaluating Representations with Readout Model Switching [19.907607374144167]
In this paper, we propose to use the Minimum Description Length (MDL) principle to devise an evaluation metric.
We design a hybrid discrete and continuous-valued model space for the readout models and employ a switching strategy to combine their predictions.
The proposed metric can be efficiently computed with an online method and we present results for pre-trained vision encoders of various architectures.
arXiv Detail & Related papers (2023-02-19T14:08:01Z)
- Effective Robustness against Natural Distribution Shifts for Models with
Different Training Data [113.21868839569]
"Effective robustness" measures the extra out-of-distribution robustness beyond what can be predicted from the in-distribution (ID) performance.
We propose a new evaluation metric to evaluate and compare the effective robustness of models trained on different data.
arXiv Detail & Related papers (2023-02-02T19:28:41Z)
- Multi-Objective Model Selection for Time Series Forecasting [9.473440847947492]
We present a benchmark, evaluating 7 classical and 6 deep learning forecasting methods on 44 datasets.
We leverage the benchmark evaluations to learn good defaults that consider multiple objectives such as accuracy and latency.
By learning a mapping from forecasting models to performance metrics, we show that our method PARETOSELECT is able to accurately select models.
arXiv Detail & Related papers (2022-02-17T07:40:15Z)
- Post-hoc Models for Performance Estimation of Machine Learning Inference [22.977047604404884]
Estimating how well a machine learning model performs during inference is critical in a variety of scenarios.
We systematically generalize performance estimation to a diverse set of metrics and scenarios.
We find that proposed post-hoc models consistently outperform the standard confidence baselines.
arXiv Detail & Related papers (2021-10-06T02:20:37Z)
- Models, Pixels, and Rewards: Evaluating Design Trade-offs in Visual
Model-Based Reinforcement Learning [109.74041512359476]
We study a number of design decisions for the predictive model in visual MBRL algorithms.
We find that a range of design decisions that are often considered crucial, such as the use of latent spaces, have little effect on task performance.
We show how this phenomenon is related to exploration and how some of the lower-scoring models on standard benchmarks will perform the same as the best-performing models when trained on the same training data.
arXiv Detail & Related papers (2020-12-08T18:03:21Z)
- Document Ranking with a Pretrained Sequence-to-Sequence Model [56.44269917346376]
We show how a sequence-to-sequence model can be trained to generate relevance labels as "target words".
Our approach significantly outperforms an encoder-only model in a data-poor regime.
arXiv Detail & Related papers (2020-03-14T22:29:50Z)
- Meta-Learned Confidence for Few-shot Learning [60.6086305523402]
A popular transductive inference technique for few-shot metric-based approaches is to update the prototype of each class with the mean of the most confident query examples.
We propose to meta-learn the confidence for each query sample, to assign optimal weights to unlabeled queries.
We validate our few-shot learning model with meta-learned confidence on four benchmark datasets.
arXiv Detail & Related papers (2020-02-27T10:22:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.