OMNIINPUT: A Model-centric Evaluation Framework through Output
Distribution
- URL: http://arxiv.org/abs/2312.03291v1
- Date: Wed, 6 Dec 2023 04:53:12 GMT
- Title: OMNIINPUT: A Model-centric Evaluation Framework through Output
Distribution
- Authors: Weitang Liu, Ying Wai Li, Tianle Wang, Yi-Zhuang You, Jingbo Shang
- Abstract summary: We propose a model-centric evaluation framework, OmniInput, to evaluate the quality of an AI/ML model's predictions on all possible inputs.
We employ an efficient sampler to obtain representative inputs and the output distribution of the trained model.
Our experiments demonstrate that OmniInput enables a more fine-grained comparison between models.
- Score: 31.00645110294068
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a novel model-centric evaluation framework, OmniInput, to evaluate
the quality of an AI/ML model's predictions on all possible inputs (including
human-unrecognizable ones), which is crucial for AI safety and reliability.
Unlike traditional data-centric evaluation based on pre-defined test sets, the
test set in OmniInput is self-constructed by the model itself and the model
quality is evaluated by investigating its output distribution. We employ an
efficient sampler to obtain representative inputs and the output distribution
of the trained model, which, after selective annotation, can be used to
estimate the model's precision and recall at different output values and to
construct a comprehensive precision-recall curve. Our experiments demonstrate that
OmniInput enables a more fine-grained comparison between models, especially
when their performance is almost the same on pre-defined datasets, leading to
new findings and insights for how to train more robust, generalizable models.
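As a rough illustration of the pipeline above, here is a minimal sketch (not the authors' implementation; the estimator, names, and toy numbers are assumptions) of how a precision-recall curve can be computed once the sampler has produced a histogram of output values and a per-bin precision has been estimated from selectively annotated samples:

```python
import numpy as np

def pr_curve_from_output_distribution(bin_scores, bin_counts, bin_precision):
    """Sketch: precision/recall swept over output-value thresholds.

    bin_scores    : sorted model output values (bin centers), shape (B,)
    bin_counts    : estimated number/density of sampled inputs per bin, shape (B,)
    bin_precision : fraction of annotated samples in each bin that are
                    true positives for the class of interest, shape (B,)
    """
    bin_scores = np.asarray(bin_scores, dtype=float)
    bin_counts = np.asarray(bin_counts, dtype=float)
    bin_precision = np.asarray(bin_precision, dtype=float)

    # Estimated number of true positives per output-value bin.
    tp_per_bin = bin_counts * bin_precision
    total_tp = tp_per_bin.sum()

    precisions, recalls = [], []
    # Sweep the decision threshold over the observed output values
    # (a high threshold accepts only the most confident inputs).
    for t in bin_scores:
        accepted = bin_scores >= t
        n_accepted = bin_counts[accepted].sum()
        tp_accepted = tp_per_bin[accepted].sum()
        precisions.append(tp_accepted / max(n_accepted, 1e-12))
        recalls.append(tp_accepted / max(total_tp, 1e-12))
    return np.array(recalls), np.array(precisions)

# Toy usage with made-up numbers (purely hypothetical).
scores = np.linspace(-10, 10, 21)            # model output (e.g., logit) bins
counts = np.exp(-0.5 * (scores / 4.0) ** 2)  # hypothetical output distribution
prec = 1.0 / (1.0 + np.exp(-scores))         # hypothetical annotated precision per bin
recall, precision = pr_curve_from_output_distribution(scores, counts, prec)
```

In OmniInput the per-bin counts come from an efficient sampler over the entire input space rather than from a fixed test set; the toy numbers above only stand in for that estimate.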
Related papers
- Verifiable evaluations of machine learning models using zkSNARKs [40.538081946945596]
This work presents a method of verifiable model evaluation using model inference through zkSNARKs.
The resulting zero-knowledge computational proofs of model outputs over datasets can be packaged into verifiable evaluation attestations.
For the first time, we demonstrate this across a sample of real-world models and highlight key challenges and design solutions.
arXiv Detail & Related papers (2024-02-05T02:21:11Z)
- Preserving Knowledge Invariance: Rethinking Robustness Evaluation of Open Information Extraction [50.62245481416744]
We present the first benchmark that simulates the evaluation of open information extraction models in the real world.
We design and annotate a large-scale testbed in which each example is a knowledge-invariant clique.
We further define a robustness metric under which a model is judged robust only if its performance is consistently accurate across the whole clique.
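A minimal sketch of that clique-level idea (the aggregation rule and names are assumptions, not the benchmark's exact metric): a model earns credit for a clique only when it is correct on every knowledge-invariant example in it.

```python
from statistics import mean

def clique_robustness(cliques_correct):
    """cliques_correct: list of cliques; each clique is a list of booleans,
    one per knowledge-invariant example, True if the model extracted the
    fact correctly on that example."""
    # Plain average accuracy ignores cliques: it rewards models that are
    # right on some surface forms of a fact and wrong on others.
    flat = [c for clique in cliques_correct for c in clique]
    avg_accuracy = mean(flat)
    # Stricter clique-level score: credit only cliques where the model is
    # consistently correct on every member.
    robust_score = mean(all(clique) for clique in cliques_correct)
    return avg_accuracy, robust_score

# Toy usage: two cliques of paraphrased sentences expressing the same fact.
print(clique_robustness([[True, True, True], [True, False, True]]))
```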
arXiv Detail & Related papers (2023-05-23T12:05:09Z)
- Provable Robustness for Streaming Models with a Sliding Window [51.85182389861261]
In deep learning applications such as online content recommendation and stock market analysis, models use historical data to make predictions.
We derive robustness certificates for models that use a fixed-size sliding window over the input stream.
Our guarantees hold for the average model performance across the entire stream and are independent of stream size, making them suitable for large data streams.
arXiv Detail & Related papers (2023-03-28T21:02:35Z)
- Evaluating Representations with Readout Model Switching [18.475866691786695]
In this paper, we propose to use the Minimum Description Length (MDL) principle to devise an evaluation metric.
We design a hybrid discrete and continuous-valued model space for the readout models and employ a switching strategy to combine their predictions.
The proposed metric can be efficiently computed with an online method and we present results for pre-trained vision encoders of various architectures.
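A minimal sketch of prequential (online) MDL evaluation with a simple fixed-share switch over candidate readouts, assuming frozen encoder features; this illustrates the principle, not the paper's exact model space or switching strategy.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression

def prequential_mdl(features, labels, readout_factories, chunks=10, alpha=0.05):
    """Online (prequential) description length of the labels given frozen
    features: smaller is better. A fixed-share mixture acts as the "switch"
    between candidate readout models as more data arrives."""
    n = len(labels)
    n_classes = len(np.unique(labels))
    k = len(readout_factories)
    log_w = np.full(k, -np.log(k))                # uniform prior over readouts
    codelength = 0.0
    boundaries = np.linspace(0, n, chunks + 1, dtype=int)
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        if start == 0:
            # Nothing observed yet: code the first chunk with a uniform model.
            codelength += (end - start) * np.log(n_classes)
            continue
        per_model_logp = np.empty(k)
        for i, make_readout in enumerate(readout_factories):
            model = make_readout().fit(features[:start], labels[:start])
            proba = model.predict_proba(features[start:end])
            col = np.searchsorted(model.classes_, labels[start:end])
            p = proba[np.arange(end - start), col]
            per_model_logp[i] = np.log(np.clip(p, 1e-12, 1.0)).sum()
        # Mixture code length for this chunk, then Bayesian weight update.
        chunk_logp = np.logaddexp.reduce(log_w + per_model_logp)
        codelength -= chunk_logp
        log_w = log_w + per_model_logp - chunk_logp
        # Fixed-share step: keep a small probability of switching readouts later.
        w = (1 - alpha) * np.exp(log_w) + alpha / k
        log_w = np.log(w)
    return codelength  # description length in nats

# Toy usage: compare a linear readout with a label-prior baseline
# (assumes every class occurs in the first chunk of `labels`).
X, y = np.random.randn(400, 8), np.random.randint(0, 2, size=400)
factories = [lambda: LogisticRegression(max_iter=200),
             lambda: DummyClassifier(strategy="prior")]
print(prequential_mdl(X, y, factories))
```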
arXiv Detail & Related papers (2023-02-19T14:08:01Z)
- PAMI: partition input and aggregate outputs for model interpretation [69.42924964776766]
In this study, a simple yet effective visualization framework called PAMI is proposed based on the observation that deep learning models often aggregate features from local regions for model predictions.
The basic idea is to mask the majority of the input and use the corresponding model output as the relative contribution of the preserved input part to the original model prediction.
Extensive experiments on multiple tasks confirm the proposed method performs better than existing visualization approaches in more precisely finding class-specific input regions.
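A minimal occlusion-style sketch of the mask-and-preserve idea (not the paper's exact PAMI procedure): keep one patch visible, mask everything else, and record the model's output for the target class as that patch's contribution.

```python
import numpy as np

def patch_attribution(model, image, target_class, patch=16, baseline=0.0):
    """model: callable mapping an (H, W, C) array to class probabilities.
    Returns an (H, W) map where each pixel's value is the model's score for
    `target_class` when only the patch containing that pixel is visible."""
    h, w, _ = image.shape
    heatmap = np.zeros((h, w))
    for y in range(0, h, patch):
        for x in range(0, w, patch):
            masked = np.full_like(image, baseline)
            # Preserve only one patch; everything else is replaced by the baseline.
            masked[y:y + patch, x:x + patch] = image[y:y + patch, x:x + patch]
            score = model(masked)[target_class]
            heatmap[y:y + patch, x:x + patch] = score
    # Normalize so the map sums to 1 and reads as relative contributions.
    return heatmap / max(heatmap.sum(), 1e-12)

# Toy usage with a hypothetical model that just averages the red channel.
toy_model = lambda img: np.array([img[..., 0].mean(), 1 - img[..., 0].mean()])
attributions = patch_attribution(toy_model, np.random.rand(64, 64, 3), target_class=0)
```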
arXiv Detail & Related papers (2023-02-07T08:48:34Z)
- Post-Selection Confidence Bounds for Prediction Performance [2.28438857884398]
In machine learning, the selection of a promising model from a potentially large number of competing models and the assessment of its generalization performance are critical tasks.
We propose an algorithm to compute valid lower confidence bounds for multiple models that have been selected based on their prediction performance on the evaluation set.
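As a simplified stand-in for such bounds (not the paper's algorithm), the sketch below computes Bonferroni-corrected Clopper-Pearson lower confidence bounds for the accuracies of several models compared on the same evaluation set:

```python
from scipy.stats import beta

def simultaneous_lower_bounds(correct_counts, n_eval, n_selected, alpha=0.05):
    """Bonferroni-corrected Clopper-Pearson lower confidence bounds on the
    accuracies of several models that were compared on one evaluation set.

    correct_counts : list of #correct predictions, one per model
    n_eval         : size of the evaluation set
    n_selected     : number of models whose bounds are reported jointly
    """
    adj_alpha = alpha / n_selected  # union bound over the compared models
    bounds = []
    for k in correct_counts:
        if k == 0:
            bounds.append(0.0)
        else:
            # Exact one-sided binomial (Clopper-Pearson) lower bound.
            bounds.append(float(beta.ppf(adj_alpha, k, n_eval - k + 1)))
    return bounds

# Toy usage: three candidate models evaluated on 1000 examples; the
# correction covers all three because all were compared before selection.
print(simultaneous_lower_bounds([912, 905, 880], n_eval=1000, n_selected=3))
```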
arXiv Detail & Related papers (2022-10-24T13:28:43Z)
- Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence and predicts target accuracy as the fraction of unlabeled target examples whose confidence exceeds that threshold.
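A minimal sketch of the ATC recipe as summarized above (function and variable names are mine): calibrate the threshold on labeled source data so that the fraction of source points above it matches source accuracy, then report the fraction of unlabeled target points above that threshold.

```python
import numpy as np

def atc_predict_accuracy(source_conf, source_correct, target_conf):
    """source_conf    : max softmax confidence on labeled source examples
    source_correct : booleans, whether each source prediction was correct
    target_conf    : confidences on unlabeled target examples"""
    source_conf = np.asarray(source_conf, dtype=float)
    source_acc = float(np.mean(source_correct))
    # Pick the threshold so that P(conf > t) on the source equals source accuracy.
    t = np.quantile(source_conf, 1.0 - source_acc)
    # Predicted target accuracy: fraction of target examples above the threshold.
    return float(np.mean(np.asarray(target_conf) > t))

# Toy usage with hypothetical confidence values.
print(atc_predict_accuracy([0.9, 0.8, 0.95, 0.6], [True, True, True, False],
                           [0.85, 0.55, 0.92, 0.7, 0.4]))
```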
arXiv Detail & Related papers (2022-01-11T23:01:12Z)
- How Faithful is your Synthetic Data? Sample-level Metrics for Evaluating and Auditing Generative Models [95.8037674226622]
We introduce a 3-dimensional evaluation metric that characterizes the fidelity, diversity and generalization performance of any generative model in a domain-agnostic fashion.
Our metric unifies statistical divergence measures with precision-recall analysis, enabling sample- and distribution-level diagnoses of model fidelity and diversity.
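As a much-simplified illustration of sample-level precision-recall analysis (a nearest-neighbour support estimate, not the paper's metric), one can score fidelity and diversity of synthetic features as follows:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_precision_recall(real, synthetic, k=5):
    """Simplified sample-level precision/recall between two feature sets:
    precision ~ fidelity (do synthetic samples land near real data?),
    recall    ~ diversity (is the real data covered by synthetic samples?)."""
    def knn_radii(points):
        # Distance from each point to its k-th nearest neighbour within `points`.
        nn = NearestNeighbors(n_neighbors=k + 1).fit(points)
        dists, _ = nn.kneighbors(points)
        return nn, dists[:, -1]

    real_nn, real_radii = knn_radii(real)
    syn_nn, syn_radii = knn_radii(synthetic)

    # Precision: a synthetic point counts if it lies within the k-NN radius
    # of its nearest real point.
    d_sr, i_sr = real_nn.kneighbors(synthetic, n_neighbors=1)
    precision = float(np.mean(d_sr[:, 0] <= real_radii[i_sr[:, 0]]))

    # Recall: a real point counts if it lies within the k-NN radius of its
    # nearest synthetic point.
    d_rs, i_rs = syn_nn.kneighbors(real, n_neighbors=1)
    recall = float(np.mean(d_rs[:, 0] <= syn_radii[i_rs[:, 0]]))
    return precision, recall

# Toy usage on 2-D Gaussian samples.
rng = np.random.default_rng(0)
print(knn_precision_recall(rng.normal(size=(500, 2)), rng.normal(size=(500, 2))))
```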
arXiv Detail & Related papers (2021-02-17T18:25:30Z)
- PSD2 Explainable AI Model for Credit Scoring [0.0]
The aim of this project is to develop and test advanced analytical methods to improve the prediction accuracy of Credit Risk Models.
The project focuses on applying an explainable machine learning model to bank-related databases.
arXiv Detail & Related papers (2020-11-20T12:12:38Z)
- VAE-LIME: Deep Generative Model Based Approach for Local Data-Driven Model Interpretability Applied to the Ironmaking Industry [70.10343492784465]
It is necessary to expose to the process engineer not only the model's predictions but also their interpretation.
Model-agnostic local interpretability solutions based on LIME have recently emerged to improve the original method.
We present in this paper a novel approach, VAE-LIME, for local interpretability of data-driven models forecasting the temperature of the hot metal produced by a blast furnace.
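A minimal sketch in the spirit of VAE-LIME (the encode/decode callables stand in for a hypothetical pre-trained VAE; this is not the paper's implementation): sample neighbours on the data manifold via the VAE, then fit a locally weighted linear surrogate whose coefficients act as the explanation.

```python
import numpy as np
from sklearn.linear_model import Ridge

def vae_lime_explain(black_box, encode, decode, x, n_samples=500,
                     latent_sigma=0.5, kernel_width=1.0, seed=0):
    """black_box : callable, feature vector -> scalar prediction (e.g. hot-metal temperature)
    encode     : callable, feature vector -> latent mean (hypothetical pre-trained VAE)
    decode     : callable, latent vector -> feature vector
    x          : the instance to explain
    Returns one linear coefficient per input feature (local importances)."""
    rng = np.random.default_rng(seed)
    z = encode(x)
    # Sample neighbours in latent space and decode them, so perturbations stay
    # on the data manifold instead of being axis-aligned Gaussian noise.
    neighbours = np.array([decode(z + latent_sigma * rng.normal(size=z.shape))
                           for _ in range(n_samples)])
    preds = np.array([black_box(v) for v in neighbours])
    # Weight neighbours by proximity to x in input space (exponential kernel).
    dists = np.linalg.norm(neighbours - x, axis=1)
    weights = np.exp(-(dists ** 2) / (kernel_width ** 2))
    # Weighted linear surrogate: its coefficients are the local explanation.
    surrogate = Ridge(alpha=1.0).fit(neighbours, preds, sample_weight=weights)
    return surrogate.coef_

# Toy usage with identity "VAE" stand-ins (purely illustrative).
d = 5
coefs = vae_lime_explain(black_box=lambda v: v @ np.arange(1.0, d + 1),
                         encode=lambda v: v, decode=lambda z: z, x=np.ones(d))
print(coefs)  # should roughly recover the weights 1..d
```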
arXiv Detail & Related papers (2020-07-15T07:07:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the accuracy of this information and is not responsible for any consequences of its use.