How Faithful is your Synthetic Data? Sample-level Metrics for Evaluating and Auditing Generative Models
- URL: http://arxiv.org/abs/2102.08921v1
- Date: Wed, 17 Feb 2021 18:25:30 GMT
- Title: How Faithful is your Synthetic Data? Sample-level Metrics for Evaluating and Auditing Generative Models
- Authors: Ahmed M. Alaa, Boris van Breugel, Evgeny Saveliev, Mihaela van der Schaar
- Abstract summary: We introduce a 3-dimensional evaluation metric that characterizes the fidelity, diversity and generalization performance of any generative model in a domain-agnostic fashion.
Our metric unifies statistical divergence measures with precision-recall analysis, enabling sample- and distribution-level diagnoses of model fidelity and diversity.
- Score: 95.8037674226622
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Devising domain- and model-agnostic evaluation metrics for generative models
is an important and as yet unresolved problem. Most existing metrics, which
were tailored solely to the image synthesis setup, exhibit a limited capacity
for diagnosing the different modes of failure of generative models across
broader application domains. In this paper, we introduce a 3-dimensional
evaluation metric, ($\alpha$-Precision, $\beta$-Recall, Authenticity), that
characterizes the fidelity, diversity and generalization performance of any
generative model in a domain-agnostic fashion. Our metric unifies statistical
divergence measures with precision-recall analysis, enabling sample- and
distribution-level diagnoses of model fidelity and diversity. We introduce
generalization as an additional, independent dimension (to the
fidelity-diversity trade-off) that quantifies the extent to which a model
copies training data -- a crucial performance indicator when modeling sensitive
data with requirements on privacy. The three metric components correspond to
(interpretable) probabilistic quantities, and are estimated via sample-level
binary classification. The sample-level nature of our metric inspires a novel
use case which we call model auditing, wherein we judge the quality of
individual samples generated by a (black-box) model, discarding low-quality
samples and hence improving the overall model performance in a post-hoc manner.
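Below is a minimal sketch of how the three metric components could be estimated from samples, assuming real and synthetic data have already been mapped into a shared embedding space as NumPy arrays. The mean-centred spheres, Euclidean distances, quantile radii, and all function names (`alpha_precision`, `beta_recall`, `authenticity`, `audit`) are illustrative simplifications, not the paper's reference implementation.

```python
import numpy as np

def _radius(X, center, level):
    """Radius of the sphere around `center` holding a `level` fraction of X."""
    return np.quantile(np.linalg.norm(X - center, axis=1), level)

def alpha_precision(real, synth, alpha=0.9):
    """Fraction of synthetic samples inside the alpha-support of the real data."""
    c = real.mean(axis=0)                        # crude stand-in for the support's centre
    inside = np.linalg.norm(synth - c, axis=1) <= _radius(real, c, alpha)
    return inside.mean(), inside                 # distribution-level score + per-sample flags

def beta_recall(real, synth, beta=0.9):
    """Fraction of real samples inside the beta-support of the synthetic data."""
    c = synth.mean(axis=0)
    inside = np.linalg.norm(real - c, axis=1) <= _radius(synth, c, beta)
    return inside.mean(), inside

def authenticity(real, synth):
    """Flag synthetic samples that sit closer to some training point than that
    point's own nearest training neighbour -- a simple proxy for memorisation."""
    d_rs = np.linalg.norm(synth[:, None, :] - real[None, :, :], axis=2)  # (m, n)
    d_rr = np.linalg.norm(real[:, None, :] - real[None, :, :], axis=2)   # (n, n)
    np.fill_diagonal(d_rr, np.inf)
    nearest = d_rs.argmin(axis=1)                # each synthetic point's closest real point
    authentic = d_rs.min(axis=1) >= d_rr.min(axis=1)[nearest]
    return authentic.mean(), authentic

def audit(real, synth, alpha=0.9):
    """Post-hoc auditing: keep only samples that look both faithful and authentic."""
    _, faithful = alpha_precision(real, synth, alpha)
    _, authentic = authenticity(real, synth)
    return synth[faithful & authentic]
```

Calling the hypothetical `audit(real_emb, synth_emb)` then discards generated samples that fall outside the real data's α-support or resemble near-copies of training points, mirroring the post-hoc model-auditing use case described in the abstract.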
Related papers
- Area under the ROC Curve has the Most Consistent Evaluation for Binary Classification [3.1850615666574806]
This study investigates how consistent different metrics are at evaluating models across data of different prevalence.
I find that evaluation metrics that are less influenced by prevalence offer more consistent evaluation of individual models and more consistent ranking of a set of models (a toy illustration of this prevalence effect follows this list).
arXiv Detail & Related papers (2024-08-19T17:52:38Z)
- QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate that leveraging its insights, for example, improves the performance of the Llama 2 model by up to 15 percentage points.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
- GMValuator: Similarity-based Data Valuation for Generative Models [41.76259565672285]
We introduce Generative Model Valuator (GMValuator), the first training-free and model-agnostic approach to provide data valuation for generation tasks.
GMValuator is extensively evaluated on various datasets and generative architectures to demonstrate its effectiveness.
arXiv Detail & Related papers (2023-04-21T02:02:02Z)
- Feature Likelihood Divergence: Evaluating the Generalization of Generative Models Using Samples [25.657798631897908]
Feature Likelihood Divergence (FLD) provides a comprehensive trichotomic evaluation of generative models.
We empirically demonstrate the ability of FLD to identify overfitting problem cases, even when previously proposed metrics fail.
arXiv Detail & Related papers (2023-02-09T04:57:27Z)
- Statistical Model Criticism of Variational Auto-Encoders [15.005894753472894]
We propose a framework for the statistical evaluation of variational auto-encoders (VAEs).
We test two instances of this framework in the context of modelling images of handwritten digits and a corpus of English text.
arXiv Detail & Related papers (2022-04-06T18:19:29Z)
- A Unified Statistical Learning Model for Rankings and Scores with Application to Grant Panel Review [1.240096657086732]
Rankings and scores are two common data types used by judges to express preferences and/or perceptions of quality in a collection of objects.
Numerous models exist to study data of each type separately, but no unified statistical model captures both data types simultaneously.
We propose the Mallows-Binomial model to close this gap, which combines a Mallows' $\phi$ ranking model with Binomial score models.
arXiv Detail & Related papers (2022-01-07T16:56:52Z)
- Selecting the suitable resampling strategy for imbalanced data classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class.
This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples.
Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class.
arXiv Detail & Related papers (2021-12-15T18:56:39Z)
- Characterizing Fairness Over the Set of Good Models Under Selective Labels [69.64662540443162]
We develop a framework for characterizing predictive fairness properties over the set of models that deliver similar overall performance.
We provide tractable algorithms to compute the range of attainable group-level predictive disparities.
We extend our framework to address the empirically relevant challenge of selectively labelled data.
arXiv Detail & Related papers (2021-01-02T02:11:37Z)
- Improving the Reconstruction of Disentangled Representation Learners via Multi-Stage Modeling [54.94763543386523]
Current autoencoder-based disentangled representation learning methods achieve disentanglement by penalizing the (aggregate) posterior to encourage statistical independence of the latent factors.
We present a novel multi-stage modeling approach where the disentangled factors are first learned using a penalty-based disentangled representation learning method.
Then, the low-quality reconstruction is improved with another deep generative model that is trained to model the missing correlated latent variables.
arXiv Detail & Related papers (2020-10-25T18:51:15Z)
- Robust Finite Mixture Regression for Heterogeneous Targets [70.19798470463378]
We propose an FMR model that finds sample clusters and jointly models multiple incomplete mixed-type targets.
We provide non-asymptotic oracle performance bounds for our model under a high-dimensional learning framework.
The results show that our model can achieve state-of-the-art performance.
arXiv Detail & Related papers (2020-10-12T03:27:07Z)
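As flagged in the AUC entry above, the prevalence claim is easy to illustrate with a toy experiment: score one fixed rule on test sets whose positive rate varies, and only the thresholded accuracy drifts. The Gaussian score model, the fixed threshold, and the class asymmetry below are illustrative assumptions, not material from the cited paper.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, accuracy_score

rng = np.random.default_rng(0)

def sample(prevalence, n=20_000):
    """Class labels at a given prevalence, with noisier scores for the negatives
    so that a fixed decision threshold favours one class over the other."""
    y = (rng.random(n) < prevalence).astype(int)
    scores = rng.normal(loc=np.where(y == 1, 1.0, -1.0),
                        scale=np.where(y == 1, 1.0, 2.0))
    return y, scores

for p in (0.05, 0.25, 0.50):
    y, s = sample(p)
    auc = roc_auc_score(y, s)                       # threshold-free, prevalence-insensitive
    acc = accuracy_score(y, (s > 0.0).astype(int))  # fixed threshold at zero
    print(f"prevalence={p:.2f}  AUC={auc:.3f}  accuracy={acc:.3f}")
```

Up to sampling noise, AUC stays near 0.81 across all three prevalences, while accuracy moves with the class mix because the two classes are not equally easy to classify at the fixed threshold.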