Post-hoc Models for Performance Estimation of Machine Learning Inference
- URL: http://arxiv.org/abs/2110.02459v1
- Date: Wed, 6 Oct 2021 02:20:37 GMT
- Title: Post-hoc Models for Performance Estimation of Machine Learning Inference
- Authors: Xuechen Zhang, Samet Oymak, Jiasi Chen
- Abstract summary: Estimating how well a machine learning model performs during inference is critical in a variety of scenarios.
We systematically generalize performance estimation to a diverse set of metrics and scenarios.
We find that the proposed post-hoc models consistently outperform standard confidence baselines.
- Score: 22.977047604404884
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Estimating how well a machine learning model performs during inference is
critical in a variety of scenarios (for example, to quantify uncertainty, or to
choose from a library of available models). However, the standard accuracy
estimate of softmax confidence is not versatile and cannot reliably predict
different performance metrics (e.g., F1-score, recall) or the performance in
different application scenarios or input domains. In this work, we
systematically generalize performance estimation to a diverse set of metrics
and scenarios and discuss generalized notions of uncertainty calibration. We
propose the use of post-hoc models to accomplish this goal and investigate
design parameters, including the model type, feature engineering, and
performance metric, to achieve the best estimation quality. Emphasis is given
to object detection problems and, unlike prior work, our approach enables the
estimation of per-image metrics such as recall and F1-score. Through extensive
experiments with computer vision models and datasets in three use cases --
mobile edge offloading, model selection, and dataset shift -- we find that
proposed post-hoc models consistently outperform the standard calibrated
confidence baselines. To the best of our knowledge, this is the first work to
develop a unified framework to address different performance estimation
problems for machine learning inference.
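The core idea above -- a lightweight post-hoc model that maps features of the base model's raw outputs to a target metric such as per-image F1-score -- can be pictured with a short sketch. The feature set, the gradient-boosted regressor, and the function names below are illustrative assumptions, not the paper's exact design.

```python
# Minimal sketch of a post-hoc performance estimator (illustrative only):
# a regressor maps features of a detector's raw outputs to a per-image metric.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def detection_features(scores):
    """Hand-crafted features from one image's detection confidence scores.
    This particular feature set is an assumption, not the paper's."""
    s = np.asarray(scores, dtype=float) if len(scores) else np.zeros(1)
    return [len(scores), s.mean(), s.max(), s.min(), s.std(), float((s > 0.5).sum())]

def fit_posthoc_estimator(train_scores, train_f1):
    """train_scores: per-image lists of detection confidences from a labeled held-out set.
    train_f1: per-image F1 computed against that set's ground truth."""
    X = np.array([detection_features(s) for s in train_scores])
    model = GradientBoostingRegressor()
    model.fit(X, train_f1)
    return model

def estimate_f1(model, scores):
    """Predict per-image F1 for a new, unlabeled image from its raw detections."""
    return float(model.predict(np.array([detection_features(scores)]))[0])
```

In a use case such as mobile edge offloading, an estimator of this kind could be queried per image to decide whether the local model's output is likely good enough or whether the input should be offloaded.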
Related papers
- Think Twice: Measuring the Efficiency of Eliminating Prediction Shortcuts of Question Answering Models [3.9052860539161918]
We propose a simple method for measuring the scale of a model's reliance on any identified spurious feature.
We assess robustness towards a large set of known and newly found prediction biases for various pre-trained models and debiasing methods in Question Answering (QA).
We find that while existing debiasing methods can mitigate reliance on a chosen spurious feature, the OOD performance gains of these methods cannot be explained by mitigated reliance on biased features.
arXiv Detail & Related papers (2023-05-11T14:35:00Z)
- Variable Importance Matching for Causal Inference [73.25504313552516]
We describe a general framework called Model-to-Match that achieves these goals.
Model-to-Match uses variable importance measurements to construct a distance metric.
We operationalize the Model-to-Match framework with LASSO.
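A minimal sketch of this idea, assuming the LASSO is fit to predict the outcome and its absolute coefficients serve as the variable importances; the function name and the nearest-neighbor matching rule are illustrative, not the paper's exact algorithm.

```python
# Sketch: LASSO coefficients as variable importances that weight a matching distance.
import numpy as np
from sklearn.linear_model import LassoCV

def lasso_weighted_matching(X_treated, X_control, X_train, y_train):
    lasso = LassoCV(cv=5).fit(X_train, y_train)
    w = np.abs(lasso.coef_)                      # variable importance as per-covariate weights

    # Weighted Euclidean distance from every treated unit to every control unit.
    diff = X_treated[:, None, :] - X_control[None, :, :]
    dists = np.sqrt((w * diff ** 2).sum(axis=-1))
    return dists.argmin(axis=1)                  # index of the nearest control per treated unit
```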
arXiv Detail & Related papers (2023-02-23T00:43:03Z)
- Evaluating Representations with Readout Model Switching [19.907607374144167]
In this paper, we propose to use the Minimum Description Length (MDL) principle to devise an evaluation metric.
We design a hybrid discrete and continuous-valued model space for the readout models and employ a switching strategy to combine their predictions.
The proposed metric can be efficiently computed with an online method and we present results for pre-trained vision encoders of various architectures.
arXiv Detail & Related papers (2023-02-19T14:08:01Z)
- Estimating Model Performance under Domain Shifts with Class-Specific Confidence Scores [25.162667593654206]
We introduce class-wise calibration within the framework of performance estimation for imbalanced datasets.
We conduct experiments on four tasks and find the proposed modifications consistently improve the estimation accuracy for imbalanced datasets.
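As a rough sketch of why class-wise aggregation helps on imbalanced data (the paper's exact calibration procedure may differ): estimate accuracy from confidences per predicted class and macro-average, so minority classes are not swamped by the majority class.

```python
# Illustrative class-wise confidence aggregation for accuracy estimation under shift.
import numpy as np

def classwise_confidence_estimate(probs):
    """probs: (n_samples, n_classes) softmax outputs on an unlabeled target set.
    Returns the macro average of per-predicted-class mean confidences."""
    preds = probs.argmax(axis=1)
    conf = probs.max(axis=1)
    per_class = [conf[preds == c].mean() for c in np.unique(preds)]
    return float(np.mean(per_class))
```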
arXiv Detail & Related papers (2022-07-20T15:04:32Z)
- Model Comparison and Calibration Assessment: User Guide for Consistent Scoring Functions in Machine Learning and Actuarial Practice [0.0]
This user guide revisits and clarifies statistical techniques to assess the calibration or adequacy of a model.
It focuses mainly on the pedagogical presentation of existing results and of best practice.
Results are accompanied and illustrated by two real data case studies on workers' compensation and customer churn.
arXiv Detail & Related papers (2022-02-25T15:52:19Z)
- ALT-MAS: A Data-Efficient Framework for Active Testing of Machine Learning Algorithms [58.684954492439424]
We propose a novel framework to efficiently test a machine learning model using only a small amount of labeled test data.
The idea is to estimate the metrics of interest for the model-under-test using a Bayesian neural network (BNN).
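As a rough illustration only, the sketch below substitutes Monte Carlo dropout for the BNN's predictive distribution and uses it as a stand-in for the unknown labels when scoring the model-under-test on unlabeled data; this substitution and all names are assumptions, not the ALT-MAS procedure itself.

```python
import torch

@torch.no_grad()
def estimate_accuracy(surrogate, model_under_test, unlabeled_x, n_samples=30):
    surrogate.train()                 # keep dropout active for MC sampling
    model_under_test.eval()

    # Monte Carlo estimate of p(y | x) from the surrogate network.
    probs = torch.stack([torch.softmax(surrogate(unlabeled_x), dim=-1)
                         for _ in range(n_samples)]).mean(dim=0)

    preds = model_under_test(unlabeled_x).argmax(dim=-1)
    # Expected accuracy: probability mass the surrogate assigns to the model's predictions.
    return probs.gather(1, preds.unsqueeze(1)).mean().item()
```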
arXiv Detail & Related papers (2021-04-11T12:14:04Z)
- How Faithful is your Synthetic Data? Sample-level Metrics for Evaluating and Auditing Generative Models [95.8037674226622]
We introduce a 3-dimensional evaluation metric that characterizes the fidelity, diversity and generalization performance of any generative model in a domain-agnostic fashion.
Our metric unifies statistical divergence measures with precision-recall analysis, enabling sample- and distribution-level diagnoses of model fidelity and diversity.
arXiv Detail & Related papers (2021-02-17T18:25:30Z)
- Characterizing Fairness Over the Set of Good Models Under Selective Labels [69.64662540443162]
We develop a framework for characterizing predictive fairness properties over the set of models that deliver similar overall performance.
We provide tractable algorithms to compute the range of attainable group-level predictive disparities.
We extend our framework to address the empirically relevant challenge of selectively labelled data.
arXiv Detail & Related papers (2021-01-02T02:11:37Z)
- Models, Pixels, and Rewards: Evaluating Design Trade-offs in Visual Model-Based Reinforcement Learning [109.74041512359476]
We study a number of design decisions for the predictive model in visual MBRL algorithms.
We find that a range of design decisions that are often considered crucial, such as the use of latent spaces, have little effect on task performance.
We show how this phenomenon is related to exploration and how some of the lower-scoring models on standard benchmarks will perform the same as the best-performing models when trained on the same training data.
arXiv Detail & Related papers (2020-12-08T18:03:21Z)
- Meta-Learned Confidence for Few-shot Learning [60.6086305523402]
A popular transductive inference technique for few-shot metric-based approaches is to update the prototype of each class with the mean of the most confident query examples.
We propose to meta-learn the confidence of each query sample in order to assign optimal weights to unlabeled queries.
We validate our few-shot learning model with meta-learned confidence on four benchmark datasets.
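A minimal sketch of confidence-weighted transductive prototype refinement; here the confidence is a fixed softmax over distances, whereas the paper meta-learns it, and all names are illustrative.

```python
import torch

def refine_prototypes(prototypes, queries, temperature=1.0):
    """prototypes: (C, d) class prototypes from the labeled support set.
    queries: (Q, d) embeddings of unlabeled query examples.
    Each prototype is updated with a confidence-weighted mean of the queries."""
    dists = torch.cdist(queries, prototypes)             # (Q, C) distances
    conf = torch.softmax(-dists / temperature, dim=1)    # soft assignment weights
    weighted_sum = prototypes + conf.t() @ queries       # (C, d)
    norm = 1.0 + conf.sum(dim=0, keepdim=True).t()       # (C, 1); original prototype has weight 1
    return weighted_sum / norm
```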
arXiv Detail & Related papers (2020-02-27T10:22:17Z)
This list is automatically generated from the titles and abstracts of the papers listed on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.