Can You Rely on Your Model Evaluation? Improving Model Evaluation with
Synthetic Test Data
- URL: http://arxiv.org/abs/2310.16524v1
- Date: Wed, 25 Oct 2023 10:18:44 GMT
- Title: Can You Rely on Your Model Evaluation? Improving Model Evaluation with
Synthetic Test Data
- Authors: Boris van Breugel, Nabeel Seedat, Fergus Imrie, Mihaela van der Schaar
- Abstract summary: We introduce 3S Testing, a deep generative modeling framework to facilitate model evaluation.
Our experiments demonstrate that 3S Testing outperforms traditional baselines.
These results raise the question of whether we need a paradigm shift away from limited real test data towards synthetic test data.
- Score: 75.20035991513564
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Evaluating the performance of machine learning models on diverse and
underrepresented subgroups is essential for ensuring fairness and reliability
in real-world applications. However, accurately assessing model performance
becomes challenging due to two main issues: (1) a scarcity of test data,
especially for small subgroups, and (2) possible distributional shifts in the
model's deployment setting, which may not align with the available test data.
In this work, we introduce 3S Testing, a deep generative modeling framework to
facilitate model evaluation by generating synthetic test sets for small
subgroups and simulating distributional shifts. Our experiments demonstrate
that 3S Testing outperforms traditional baselines -- including real test data
alone -- in estimating model performance on minority subgroups and under
plausible distributional shifts. In addition, 3S offers intervals around its
performance estimates, exhibiting superior coverage of the ground truth
compared to existing approaches. Overall, these results raise the question of
whether we need a paradigm shift away from limited real test data towards
synthetic test data.
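The core idea can be illustrated with a toy numpy sketch (all names here are illustrative stand-ins, not the authors' 3S implementation: the per-class Gaussian "generator" is a placeholder for a deep generative model, and the decision rule stands in for a trained classifier):

```python
# Toy sketch: evaluate a model on a tiny minority subgroup by fitting
# a simple per-class generator to the few real points, sampling many
# synthetic ones, and scoring the model on the larger synthetic set.
import numpy as np

rng = np.random.default_rng(0)

def model(X):
    # stand-in for the trained classifier under evaluation
    return (X[:, 0] + X[:, 1] > 0).astype(int)

# Tiny real minority test set: 20 points, labels depend on the features.
X_min = rng.normal(0.0, 1.0, size=(20, 2))
y_min = (X_min.sum(axis=1) + rng.normal(0, 0.5, 20) > 0).astype(int)

def sample_synthetic(X, y, n, rng):
    # per-class Gaussian stand-in for a deep generative model
    Xs, ys = [], []
    classes = np.unique(y)
    for c in classes:
        Xc = X[y == c]
        mu, sd = Xc.mean(axis=0), Xc.std(axis=0) + 1e-6
        k = n // len(classes)
        Xs.append(rng.normal(mu, sd, size=(k, X.shape[1])))
        ys.append(np.full(k, c))
    return np.vstack(Xs), np.concatenate(ys)

X_syn, y_syn = sample_synthetic(X_min, y_min, 2000, rng)
acc_real = float(np.mean(model(X_min) == y_min))  # noisy: only 20 points
acc_syn = float(np.mean(model(X_syn) == y_syn))   # lower-variance estimate
print(round(acc_real, 2), round(acc_syn, 2))
```

The synthetic estimate trades the variance of a 20-point sample for the bias of the fitted generator; 3S's contribution is making that trade favorable with a strong generative model.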
Related papers
- Deep anytime-valid hypothesis testing [29.273915933729057]
We propose a general framework for constructing powerful, sequential hypothesis tests for nonparametric testing problems.
We develop a principled approach of leveraging the representation capability of machine learning models within the testing-by-betting framework.
Empirical results on synthetic and real-world datasets demonstrate that tests instantiated using our general framework are competitive against specialized baselines.
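The testing-by-betting principle behind such tests can be shown with a toy numpy coin example (a fixed simple bet, not the paper's learned deep variant; the bias 0.7 and bet size are illustrative):

```python
# Toy testing-by-betting sketch: bet against the null "the coin is
# fair" and reject once accumulated wealth exceeds 1/alpha. By Ville's
# inequality this controls the type-I error at level alpha at any
# stopping time, which is what makes the test anytime-valid.
import numpy as np

rng = np.random.default_rng(1)
alpha, wealth, lam = 0.05, 1.0, 0.5
rejected_at = None
for t in range(1, 2001):
    x = 1.0 if rng.random() < 0.7 else 0.0  # truth: coin is biased (p = 0.7)
    wealth *= 1.0 + lam * (x - 0.5)         # wealth is a nonneg. martingale under H0
    if wealth >= 1.0 / alpha:               # anytime-valid rejection threshold
        rejected_at = t
        break
print(rejected_at)
```

The paper's contribution is replacing the fixed bet `lam` with a learned, model-based betting strategy for nonparametric two-sample problems.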
arXiv Detail & Related papers (2023-10-30T09:46:19Z)
- A Comprehensive Evaluation and Analysis Study for Chinese Spelling Check [53.152011258252315]
We show that making reasonable use of phonetic and graphic information is effective for Chinese Spelling Check.
Models are sensitive to the error distribution of the test set, which exposes their shortcomings.
The commonly used benchmark, SIGHAN, cannot reliably evaluate models' performance.
arXiv Detail & Related papers (2023-07-25T17:02:38Z)
- Evaluation of Categorical Generative Models -- Bridging the Gap Between Real and Synthetic Data [18.142397311464343]
We introduce an appropriately scalable evaluation method for generative models.
We consider increasingly large probability spaces, which correspond to increasingly difficult modeling tasks.
We validate our evaluation procedure with synthetic experiments on both synthetic generative models and current state-of-the-art categorical generative models.
arXiv Detail & Related papers (2022-10-28T21:05:25Z)
- A Simple Unified Approach to Testing High-Dimensional Conditional Independences for Categorical and Ordinal Data [0.26651200086513094]
Conditional independence (CI) tests underlie many approaches to model testing and structure learning in causal inference.
Most existing CI tests for categorical and ordinal data stratify the sample by the conditioning variables, perform simple independence tests in each stratum, and combine the results.
Here we propose a simple unified CI test for ordinal and categorical data that maintains reasonable calibration and power in high dimensions.
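For contrast, the stratify-and-combine baseline the abstract describes can be sketched in numpy (a toy binary version calibrated by permutation, not the paper's unified test):

```python
# Stratified CI test sketch: to test X independent of Y given Z for
# binary data, sum per-stratum chi-square statistics and calibrate the
# sum by permuting Y within each stratum.
import numpy as np

def chi2_stat(x, y):
    # 2x2 chi-square statistic, no continuity correction
    t = np.zeros((2, 2))
    for a, b in zip(x, y):
        t[a, b] += 1
    e = np.outer(t.sum(1), t.sum(0)) / t.sum()
    return float(((t - e) ** 2 / np.maximum(e, 1e-12)).sum())

def stratified_stat(X, Y, Z):
    return sum(chi2_stat(X[Z == z], Y[Z == z]) for z in np.unique(Z))

rng = np.random.default_rng(2)
n = 2000
Z = rng.integers(0, 2, n)
X = (rng.random(n) < 0.3 + 0.4 * Z).astype(int)  # X depends on Z
Y = (rng.random(n) < 0.3 + 0.4 * Z).astype(int)  # Y depends on Z, not on X

obs = stratified_stat(X, Y, Z)
null = []
for _ in range(300):
    Yp = Y.copy()
    for z in np.unique(Z):
        idx = np.where(Z == z)[0]
        Yp[idx] = Yp[rng.permutation(idx)]        # permute Y inside the stratum
    null.append(stratified_stat(X, Yp, Z))
p = (np.sum(np.array(null) >= obs) + 1) / (len(null) + 1)
print(p)  # H0 holds here by construction
```

With many conditioning variables the strata become tiny and this approach loses power, which is the failure mode the paper's unified test addresses.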
arXiv Detail & Related papers (2022-06-09T08:56:12Z)
- Efficient Test-Time Model Adaptation without Forgetting [60.36499845014649]
Test-time adaptation seeks to tackle potential distribution shifts between training and testing data.
We propose an active sample selection criterion to identify reliable and non-redundant samples.
We also introduce a Fisher regularizer to constrain important model parameters from drastic changes.
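The anti-forgetting role of such a Fisher regularizer can be sketched with a toy quadratic example (an assumed EWC-style penalty form, not the paper's exact regularizer; all numbers are illustrative):

```python
# Toy sketch: during adaptation, penalize movement of parameters with
# high Fisher information so they stay near their source-trained values,
# while unimportant parameters adapt freely.
import numpy as np

theta0 = np.array([1.0, 1.0])   # source-trained parameters
fisher = np.array([10.0, 0.1])  # importance: theta[0] matters a lot
target = np.array([0.0, 0.0])   # where unregularized adaptation pulls

theta, lam, lr = theta0.copy(), 1.0, 0.05
for _ in range(500):
    grad_adapt = theta - target                  # toy adaptation-loss gradient
    grad_reg = lam * fisher * (theta - theta0)   # Fisher-penalty gradient
    theta -= lr * (grad_adapt + grad_reg)

# The important parameter stays near its source value (10/11 ~ 0.91);
# the unimportant one moves almost all the way to the target (1/11 ~ 0.09).
print(theta.round(2))  # prints [0.91 0.09]
```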
arXiv Detail & Related papers (2022-04-06T06:39:40Z)
- Discovering Distribution Shifts using Latent Space Representations [4.014524824655106]
It is non-trivial to assess a model's generalizability to new candidate datasets.
We use embedding space geometry to propose a non-parametric framework for detecting distribution shifts.
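A minimal non-parametric shift check in this spirit, an MMD permutation test on embedding vectors (an assumed setup and kernel choice, not the paper's exact framework), might look like:

```python
# Sketch: compare a reference embedding sample with a candidate sample
# via a (biased) MMD statistic, and calibrate it with a permutation test.
import numpy as np

def mmd(a, b):
    def k(u, v):  # RBF kernel, bandwidth fixed at 1 for simplicity
        d = ((u[:, None, :] - v[None, :, :]) ** 2).sum(-1)
        return np.exp(-d)
    return k(a, a).mean() + k(b, b).mean() - 2 * k(a, b).mean()

rng = np.random.default_rng(3)
ref = rng.normal(0.0, 1.0, size=(200, 5))   # reference embeddings
cand = rng.normal(0.8, 1.0, size=(200, 5))  # candidate embeddings, shifted mean

obs = mmd(ref, cand)
pooled = np.vstack([ref, cand])
perms = []
for _ in range(200):
    idx = rng.permutation(len(pooled))      # relabel under the no-shift null
    perms.append(mmd(pooled[idx[:200]], pooled[idx[200:]]))
p = (np.sum(np.array(perms) >= obs) + 1) / (len(perms) + 1)
print(p < 0.05)  # prints True: the shift is detected
```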
arXiv Detail & Related papers (2022-02-04T19:00:16Z)
- Understanding and Testing Generalization of Deep Networks on Out-of-Distribution Data [30.471871571256198]
Deep network models perform excellently on In-Distribution data, but can significantly fail on Out-Of-Distribution data.
This study analyzes the problems with experimental ID testing and designs an OOD test paradigm.
arXiv Detail & Related papers (2021-11-17T15:29:07Z)
- How Faithful is your Synthetic Data? Sample-level Metrics for Evaluating and Auditing Generative Models [95.8037674226622]
We introduce a 3-dimensional evaluation metric that characterizes the fidelity, diversity and generalization performance of any generative model in a domain-agnostic fashion.
Our metric unifies statistical divergence measures with precision-recall analysis, enabling sample- and distribution-level diagnoses of model fidelity and diversity.
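A rough sample-level sketch in the spirit of precision-recall analysis for generative models (a simplified k-NN-ball construction, not the paper's exact three-dimensional metric):

```python
# Sketch: a generated sample counts toward precision if it falls inside
# the k-NN ball of some real sample; swapping the roles of real and
# generated data gives the analogous recall notion.
import numpy as np

def knn_radius(X, k=3):
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    d.sort(axis=1)
    return d[:, k]  # distance to k-th neighbor (column 0 is the self-distance)

def coverage(A, B, rB):
    # fraction of points in A inside some k-NN ball of B
    d = np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))
    return float(np.mean((d <= rB[None, :]).any(axis=1)))

rng = np.random.default_rng(4)
real = rng.normal(0.0, 1.0, size=(300, 2))
fake_good = rng.normal(0.0, 1.0, size=(300, 2))  # matches the real distribution
fake_bad = rng.normal(5.0, 1.0, size=(300, 2))   # off-distribution samples

r_real = knn_radius(real)
prec_good = coverage(fake_good, real, r_real)    # high: samples look real
prec_bad = coverage(fake_bad, real, r_real)      # near zero: samples do not
print(round(prec_good, 2), round(prec_bad, 2))
```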
arXiv Detail & Related papers (2021-02-17T18:25:30Z)
- Few-Shot Named Entity Recognition: A Comprehensive Study [92.40991050806544]
We investigate three schemes to improve the model generalization ability for few-shot settings.
We perform empirical comparisons on 10 public NER datasets with various proportions of labeled data.
We create new state-of-the-art results on both few-shot and training-free settings.
arXiv Detail & Related papers (2020-12-29T23:43:16Z)
- Robust Finite Mixture Regression for Heterogeneous Targets [70.19798470463378]
We propose an FMR model that finds sample clusters and jointly models multiple incomplete mixed-type targets.
We provide non-asymptotic oracle performance bounds for our model under a high-dimensional learning framework.
The results show that our model can achieve state-of-the-art performance.
arXiv Detail & Related papers (2020-10-12T03:27:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.