Discriminative Estimation of Total Variation Distance: A Fidelity Auditor for Generative Data
- URL: http://arxiv.org/abs/2405.15337v1
- Date: Fri, 24 May 2024 08:18:09 GMT
- Title: Discriminative Estimation of Total Variation Distance: A Fidelity Auditor for Generative Data
- Authors: Lan Tao, Shirong Xu, Chi-Hua Wang, Namjoon Suh, Guang Cheng
- Abstract summary: We propose a discriminative approach to estimate the total variation (TV) distance between two distributions.
Our method quantitatively characterizes the relation between the Bayes risk in classifying two distributions and their TV distance.
We demonstrate that, with a specific choice of hypothesis class in classification, a fast convergence rate in estimating the TV distance can be achieved.
- Score: 10.678533056953784
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the proliferation of generative AI and the increasing volume of generative data (also called synthetic data), assessing the fidelity of generative data has become a critical concern. In this paper, we propose a discriminative approach to estimate the total variation (TV) distance between two distributions as an effective measure of generative data fidelity. Our method quantitatively characterizes the relation between the Bayes risk in classifying two distributions and their TV distance. Therefore, the estimation of the TV distance reduces to that of the Bayes risk. In particular, this paper establishes theoretical results regarding the convergence rate of the estimation error of the TV distance between two Gaussian distributions. We demonstrate that, with a specific choice of hypothesis class in classification, a fast convergence rate in estimating the TV distance can be achieved. Specifically, the estimation accuracy of the TV distance is proven to inherently depend on the separation of the two Gaussian distributions: smaller estimation errors are achieved when the two Gaussian distributions are farther apart. This phenomenon is also validated empirically through extensive simulations. Finally, we apply this discriminative estimation method to rank the fidelity of synthetic image data using the MNIST dataset.
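The reduction from TV estimation to Bayes risk estimation mentioned in the abstract can be illustrated with a small sketch. For two distributions P and Q with equal class priors, the Bayes risk R* of telling them apart satisfies TV(P, Q) = 1 - 2 R*, so the held-out error of a good classifier gives a plug-in estimate of the TV distance. The code below is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the nearest-mean linear classifier, and the identity-covariance Gaussian setup are all choices made here for the example. Because any classifier's risk is at least the Bayes risk, the estimate tends to sit slightly below the true value.

```python
import math
import numpy as np


def true_tv_gaussian_shift(delta_norm):
    """Closed-form TV between N(mu0, I) and N(mu1, I), with delta_norm = ||mu1 - mu0||.

    TV = 2 * Phi(delta / 2) - 1, which equals erf(delta / (2 * sqrt(2))).
    """
    return math.erf(delta_norm / (2.0 * math.sqrt(2.0)))


def estimate_tv(x_p, x_q):
    """Plug-in TV estimate: 1 - 2 * (held-out balanced error of a linear classifier).

    Hypothetical helper for illustration. The classifier is a nearest-mean rule,
    which matches the Bayes rule only for equal-covariance isotropic Gaussians.
    """
    n = min(len(x_p), len(x_q)) // 2
    # First half of each sample fits the classifier, second half evaluates it.
    p_train, p_test = x_p[:n], x_p[n:2 * n]
    q_train, q_test = x_q[:n], x_q[n:2 * n]

    mean_p, mean_q = p_train.mean(axis=0), q_train.mean(axis=0)
    w = mean_p - mean_q                  # direction separating the two means
    b = -0.5 * (mean_p + mean_q) @ w     # threshold at the midpoint

    err_p = np.mean(p_test @ w + b <= 0)  # P samples labelled as Q
    err_q = np.mean(q_test @ w + b > 0)   # Q samples labelled as P
    risk = 0.5 * (err_p + err_q)          # balanced risk under equal priors

    return max(0.0, 1.0 - 2.0 * risk)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim, n_samples = 5, 20000
    shift = np.zeros(dim)
    shift[0] = 2.0                        # ||mu1 - mu0|| = 2
    x_p = rng.normal(size=(n_samples, dim))
    x_q = rng.normal(size=(n_samples, dim)) + shift
    print("estimated TV:", estimate_tv(x_p, x_q))
    print("true TV:     ", true_tv_gaussian_shift(2.0))
```

Swapping in a different classifier only changes how `risk` is computed; the conversion `1 - 2 * risk` stays the same as long as the two classes are weighted equally.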
Related papers
- Towards Self-Supervised Covariance Estimation in Deep Heteroscedastic Regression [102.24287051757469]
We study self-supervised covariance estimation in deep heteroscedastic regression.
We derive an upper bound on the 2-Wasserstein distance between normal distributions.
Experiments over a wide range of synthetic and real datasets demonstrate that the proposed 2-Wasserstein bound coupled with pseudo label annotations results in a computationally cheaper yet accurate deep heteroscedastic regression.
arXiv Detail & Related papers (2025-02-14T22:37:11Z) - A Uniform Concentration Inequality for Kernel-Based Two-Sample Statistics [4.757470449749877]
We show that these metrics can be unified under a general framework of kernel-based two-sample statistics.
This paper establishes a novel uniform concentration inequality for the aforementioned kernel-based statistics.
As illustrative applications, we demonstrate how these bounds facilitate the derivation of error bounds for procedures such as distance covariance-based dimension reduction.
arXiv Detail & Related papers (2024-05-22T22:41:56Z) - Synthetic Tabular Data Validation: A Divergence-Based Approach [8.062368743143388]
Divergences quantify discrepancies between data distributions.
Traditional approaches calculate divergences independently for each feature.
We propose a novel approach that uses divergence estimation to overcome the limitations of marginal comparisons.
arXiv Detail & Related papers (2024-05-13T15:07:52Z) - Collaborative Heterogeneous Causal Inference Beyond Meta-analysis [68.4474531911361]
We propose a collaborative inverse propensity score estimator for causal inference with heterogeneous data.
Our method shows significant improvements over the methods based on meta-analysis when heterogeneity increases.
arXiv Detail & Related papers (2024-04-24T09:04:36Z) - Evaluating Perceptual Distance Models by Fitting Binomial Distributions to Two-Alternative Forced Choice Data [47.18802526899955]
Crowd-sourced perceptual datasets have emerged, with no images shared between triplets, making ranking infeasible.
We statistically model the underlying decision-making process during 2AFC experiments using a binomial distribution.
We calculate meaningful and well-founded metrics for the distance model, beyond the mere prediction accuracy as percentage agreement.
arXiv Detail & Related papers (2024-03-15T15:21:04Z) - Uncertainty Quantification via Stable Distribution Propagation [60.065272548502]
We propose a new approach for propagating stable probability distributions through neural networks.
Our method is based on local linearization, which we show to be an optimal approximation in terms of total variation distance for the ReLU non-linearity.
arXiv Detail & Related papers (2024-02-13T09:40:19Z) - TIC-TAC: A Framework for Improved Covariance Estimation in Deep Heteroscedastic Regression [109.69084997173196]
Deep heteroscedastic regression involves jointly optimizing the mean and covariance of the predicted distribution using the negative log-likelihood.
Recent works show that this may result in sub-optimal convergence due to the challenges associated with covariance estimation.
We study two questions: (1) Does the predicted covariance truly capture the randomness of the predicted mean?
Our results show that not only does TIC accurately learn the covariance, it additionally facilitates an improved convergence of the negative log-likelihood.
arXiv Detail & Related papers (2023-10-29T09:54:03Z) - Communication-Efficient Distributed Estimation and Inference for Cox's Model [4.731404257629232]
We develop communication-efficient iterative distributed algorithms for estimation and inference in the high-dimensional sparse Cox proportional hazards model.
To construct confidence intervals for linear combinations of high-dimensional hazard regression coefficients, we introduce a novel debiased method.
We provide valid and powerful distributed hypothesis tests for any coordinate element based on a decorrelated score test.
arXiv Detail & Related papers (2023-02-23T15:50:17Z) - Score Approximation, Estimation and Distribution Recovery of Diffusion Models on Low-Dimensional Data [68.62134204367668]
This paper studies score approximation, estimation, and distribution recovery of diffusion models, when data are supported on an unknown low-dimensional linear subspace.
We show that with a properly chosen neural network architecture, the score function can be both accurately approximated and efficiently estimated.
The generated distribution based on the estimated score function captures the data geometric structures and converges to a close vicinity of the data distribution.
arXiv Detail & Related papers (2023-02-14T17:02:35Z) - DEMI: Discriminative Estimator of Mutual Information [5.248805627195347]
Estimating mutual information between continuous random variables is often intractable and challenging for high-dimensional data.
Recent progress has leveraged neural networks to optimize variational lower bounds on mutual information.
Our approach is based on training a classifier that provides the probability that a data sample pair is drawn from the joint distribution.
arXiv Detail & Related papers (2020-10-05T04:19:27Z) - Unlabelled Data Improves Bayesian Uncertainty Calibration under Covariate Shift [100.52588638477862]
We develop an approximate Bayesian inference scheme based on posterior regularisation.
We demonstrate the utility of our method in the context of transferring prognostic models of prostate cancer across globally diverse populations.
arXiv Detail & Related papers (2020-06-26T13:50:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.