Statistical Uncertainty Quantification for Aggregate Performance Metrics in Machine Learning Benchmarks
- URL: http://arxiv.org/abs/2501.04234v1
- Date: Wed, 08 Jan 2025 02:17:34 GMT
- Title: Statistical Uncertainty Quantification for Aggregate Performance Metrics in Machine Learning Benchmarks
- Authors: Rachel Longjohn, Giri Gopalan, Emily Casleton
- Abstract summary: We show how statistical methodology can be used for quantifying uncertainty in metrics that have been aggregated across multiple tasks.
These techniques reveal insights such as the dominance of a specific model for certain types of tasks despite poor overall performance.
- Score: 0.0
- Abstract: Modern artificial intelligence is supported by machine learning models (e.g., foundation models) that are pretrained on a massive data corpus and then adapted to solve a variety of downstream tasks. To summarize performance across multiple tasks, evaluation metrics are often aggregated into a summary metric, e.g., average accuracy across 10 question-answering tasks. When aggregating evaluation metrics, it is useful to incorporate uncertainty in the aggregate metric in order to gain a more realistic understanding of model performance. Our objective in this work is to demonstrate how statistical methodology can be used for quantifying uncertainty in metrics that have been aggregated across multiple tasks. The methods we emphasize are bootstrapping, Bayesian hierarchical (i.e., multilevel) modeling, and the visualization of task weightings that consider standard errors. These techniques reveal insights such as the dominance of a specific model for certain types of tasks despite an overall poor performance. We use a popular ML benchmark, the Visual Task Adaptation Benchmark (VTAB), to demonstrate the usefulness of our approaches.
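To make the bootstrapping idea concrete, the following is a minimal sketch (not the authors' implementation) of a nonparametric bootstrap for an accuracy metric aggregated across tasks. The task names, sample sizes, and synthetic correctness indicators are hypothetical; only NumPy is assumed.
```python
# Minimal illustrative sketch (not the authors' code): a nonparametric
# bootstrap of an aggregate benchmark metric. Per-task results are
# synthetic 0/1 correctness indicators; task names and sizes are made up.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-task outcomes: 1 = correct prediction, 0 = incorrect.
task_results = {
    "task_a": rng.integers(0, 2, size=500),
    "task_b": rng.integers(0, 2, size=200),
    "task_c": rng.integers(0, 2, size=1000),
}


def aggregate_accuracy(results):
    """Unweighted (macro) average of per-task accuracies."""
    return float(np.mean([np.mean(r) for r in results.values()]))


def bootstrap_ci(results, n_boot=2000, alpha=0.05):
    """Resample examples within each task to get a CI for the aggregate metric."""
    stats = np.empty(n_boot)
    for b in range(n_boot):
        resampled = {
            name: rng.choice(r, size=len(r), replace=True)
            for name, r in results.items()
        }
        stats[b] = aggregate_accuracy(resampled)
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return aggregate_accuracy(results), (lo, hi)


point, (lo, hi) = bootstrap_ci(task_results)
print(f"aggregate accuracy = {point:.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")
```
Resampling examples within each task treats the benchmark's task set as fixed and captures per-task sampling variability in the macro average; the Bayesian hierarchical modeling discussed in the abstract would additionally pool information across tasks.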
Related papers
- Learning Task Representations from In-Context Learning [73.72066284711462]
Large language models (LLMs) have demonstrated remarkable proficiency in in-context learning.
We introduce an automated formulation for encoding task information in ICL prompts as a function of attention heads.
We show that our method's effectiveness stems from aligning the distribution of the last hidden state with that of an optimally performing in-context-learned model.
arXiv Detail & Related papers (2025-02-08T00:16:44Z)
- Towards Unified Benchmark and Models for Multi-Modal Perceptual Metrics [37.86612817818566]
General purpose vision-language models, such as CLIP and large multi-modal models (LMMs), can be applied as zero-shot perceptual metrics.
We introduce UniSim-Bench, a benchmark encompassing 7 multi-modal perceptual similarity tasks with a total of 25 datasets.
Our evaluation reveals that while general-purpose models perform reasonably well on average, they often lag behind specialized models on individual tasks.
arXiv Detail & Related papers (2024-12-13T22:38:09Z)
- FamiCom: Further Demystifying Prompts for Language Models with Task-Agnostic Performance Estimation [73.454943870226]
Language models have shown impressive in-context-learning capabilities.
We propose FamiCom, a measure that provides more comprehensive task-agnostic performance estimation.
arXiv Detail & Related papers (2024-06-17T06:14:55Z)
- Task-Distributionally Robust Data-Free Meta-Learning [99.56612787882334]
Data-Free Meta-Learning (DFML) aims to efficiently learn new tasks by leveraging multiple pre-trained models without requiring their original training data.
For the first time, we reveal two major challenges hindering their practical deployment: Task-Distribution Shift (TDS) and Task-Distribution Corruption (TDC).
arXiv Detail & Related papers (2023-11-23T15:46:54Z)
- A Control-Centric Benchmark for Video Prediction [69.22614362800692]
We propose a benchmark for action-conditioned video prediction in the form of a control benchmark.
Our benchmark includes simulated environments with 11 task categories and 310 task instance definitions.
We then leverage our benchmark to study the effects of scaling model size, quantity of training data, and model ensembling.
arXiv Detail & Related papers (2023-04-26T17:59:45Z)
- Evaluating Representations with Readout Model Switching [19.907607374144167]
In this paper, we propose to use the Minimum Description Length (MDL) principle to devise an evaluation metric.
We design a hybrid discrete and continuous-valued model space for the readout models and employ a switching strategy to combine their predictions.
The proposed metric can be efficiently computed with an online method and we present results for pre-trained vision encoders of various architectures.
arXiv Detail & Related papers (2023-02-19T14:08:01Z)
- SUPERB-SG: Enhanced Speech processing Universal PERformance Benchmark for Semantic and Generative Capabilities [76.97949110580703]
We introduce SUPERB-SG, a new benchmark to evaluate pre-trained models across various speech tasks.
We use a lightweight methodology to test the robustness of representations learned by pre-trained models under shifts in data domain.
We also show that the task diversity of SUPERB-SG coupled with limited task supervision is an effective recipe for evaluating the generalizability of model representation.
arXiv Detail & Related papers (2022-03-14T04:26:40Z)
- BAMLD: Bayesian Active Meta-Learning by Disagreement [39.59987601426039]
This paper introduces an information-theoretic active task selection mechanism to decrease the number of labeling requests for meta-training tasks.
We report its empirical performance results that compare favourably against existing acquisition mechanisms.
arXiv Detail & Related papers (2021-10-19T13:06:51Z)
- Post-hoc Models for Performance Estimation of Machine Learning Inference [22.977047604404884]
Estimating how well a machine learning model performs during inference is critical in a variety of scenarios.
We systematically generalize performance estimation to a diverse set of metrics and scenarios.
We find that proposed post-hoc models consistently outperform the standard confidence baselines.
arXiv Detail & Related papers (2021-10-06T02:20:37Z)
- Meta-Learned Confidence for Few-shot Learning [60.6086305523402]
A popular transductive inference technique for few-shot metric-based approaches is to update the prototype of each class with the mean of the most confident query examples.
We propose to meta-learn the confidence for each query sample, to assign optimal weights to unlabeled queries.
We validate our few-shot learning model with meta-learned confidence on four benchmark datasets.
arXiv Detail & Related papers (2020-02-27T10:22:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.