An optimal transport approach for selecting a representative subsample
with application in efficient kernel density estimation
- URL: http://arxiv.org/abs/2206.01182v1
- Date: Tue, 31 May 2022 05:19:29 GMT
- Title: An optimal transport approach for selecting a representative subsample
with application in efficient kernel density estimation
- Authors: Jingyi Zhang, Cheng Meng, Jun Yu, Mengrui Zhang, Wenxuan Zhong and
Ping Ma
- Abstract summary: Subsampling methods aim to select a subsample as a surrogate for the observed sample.
Existing model-free subsampling methods are usually built upon clustering techniques or kernel tricks.
We propose a novel model-free subsampling method by utilizing optimal transport techniques.
- Score: 21.632131776088084
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Subsampling methods aim to select a subsample as a surrogate for the observed
sample. Such methods have been used pervasively in large-scale data analytics,
active learning, and privacy-preserving analysis in recent decades. Instead of
model-based methods, in this paper, we study model-free subsampling methods,
which aim to identify a subsample that is not confined by model assumptions.
Existing model-free subsampling methods are usually built upon clustering
techniques or kernel tricks. Most of these methods suffer from either a large
computational burden or a theoretical weakness. In particular, the theoretical
weakness is that the empirical distribution of the selected subsample may not
necessarily converge to the population distribution. Such computational and
theoretical limitations hinder the broad applicability of model-free
subsampling methods in practice. We propose a novel model-free subsampling
method by utilizing optimal transport techniques. Moreover, we develop an
efficient subsampling algorithm that is adaptive to the unknown probability
density function. Theoretically, we show the selected subsample can be used for
efficient density estimation by deriving the convergence rate for the proposed
subsample kernel density estimator. We also provide the optimal bandwidth for
the proposed estimator. Numerical studies on synthetic and real-world datasets
demonstrate the superior performance of the proposed method.
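The abstract describes a two-step pipeline: select a subsample whose empirical distribution tracks the population distribution (via optimal transport), then build a kernel density estimator on that subsample. The sketch below is only an illustration of this idea, not the authors' algorithm: in one dimension, taking the points nearest the mid-quantiles (2i-1)/(2m) of the sorted sample approximately minimizes the Wasserstein-1 distance between the subsample's and the full sample's empirical CDFs; the subsample then feeds a Gaussian KDE with a Silverman-type bandwidth. The function names are hypothetical.

```python
import numpy as np

def ot_subsample_1d(x, m):
    """Select the m order statistics of x at mid-quantile positions
    (2i-1)/(2m).  In 1D this approximately minimizes the Wasserstein-1
    distance between the subsample and the full-sample empirical CDFs."""
    xs = np.sort(x)
    q = (2 * np.arange(1, m + 1) - 1) / (2 * m)   # mid-quantile levels
    idx = np.clip((q * len(xs)).astype(int), 0, len(xs) - 1)
    return xs[idx]

def gaussian_kde(points, grid, h):
    """Gaussian kernel density estimate evaluated on a grid."""
    z = (grid[:, None] - points[None, :]) / h
    return np.exp(-0.5 * z**2).sum(axis=1) / (len(points) * h * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)                        # full sample
sub = ot_subsample_1d(x, 200)                      # representative subsample
h = 1.06 * sub.std() * len(sub) ** (-1 / 5)        # Silverman's rule of thumb
grid = np.linspace(-4.0, 4.0, 81)
fhat = gaussian_kde(sub, grid, h)                  # subsample KDE
```

With 200 of 10,000 points retained, the estimate stays close to the standard normal density; the paper's contribution is an OT-based selection that works in general dimension, with an adaptive algorithm and a derived optimal bandwidth, which this toy 1D quantile rule does not attempt.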
Related papers
- Total Uncertainty Quantification in Inverse PDE Solutions Obtained with Reduced-Order Deep Learning Surrogate Models [50.90868087591973]
We propose an approximate Bayesian method for quantifying the total uncertainty in inverse PDE solutions obtained with machine learning surrogate models.
We test the proposed framework by comparing it with the iterative ensemble smoother and deep ensembling methods for a non-linear diffusion equation.
arXiv Detail & Related papers (2024-08-20T19:06:02Z)
- Dynamical Measure Transport and Neural PDE Solvers for Sampling [77.38204731939273]
We tackle the task of sampling from a probability density as transporting a tractable density function to the target.
We employ physics-informed neural networks (PINNs) to approximate the respective partial differential equations (PDEs) solutions.
PINNs allow for simulation- and discretization-free optimization and can be trained very efficiently.
arXiv Detail & Related papers (2024-07-10T17:39:50Z)
- PQMass: Probabilistic Assessment of the Quality of Generative Models using Probability Mass Estimation [8.527898482146103]
We propose a comprehensive sample-based method for assessing the quality of generative models.
The proposed approach enables the estimation of the probability that two sets of samples are drawn from the same distribution.
arXiv Detail & Related papers (2024-02-06T19:39:26Z)
- Distributed Markov Chain Monte Carlo Sampling based on the Alternating Direction Method of Multipliers [143.6249073384419]
In this paper, we propose a distributed sampling scheme based on the alternating direction method of multipliers.
We provide both theoretical guarantees of our algorithm's convergence and experimental evidence of its superiority to the state-of-the-art.
In simulation, we deploy our algorithm on linear and logistic regression tasks and illustrate its fast convergence compared to existing gradient-based methods.
arXiv Detail & Related papers (2024-01-29T02:08:40Z)
- Sobolev Space Regularised Pre Density Models [51.558848491038916]
We propose a new approach to non-parametric density estimation that is based on regularizing a Sobolev norm of the density.
This method is statistically consistent and makes the inductive bias of the model clear and interpretable.
arXiv Detail & Related papers (2023-07-25T18:47:53Z)
- Plug-and-Play split Gibbs sampler: embedding deep generative priors in Bayesian inference [12.91637880428221]
This paper introduces a plug-and-play sampling algorithm that leverages variable splitting to efficiently sample from a posterior distribution.
It divides the challenging task of posterior sampling into two simpler sampling problems.
Its performance is compared to recent state-of-the-art optimization and sampling methods.
arXiv Detail & Related papers (2023-04-21T17:17:51Z)
- Model-free Subsampling Method Based on Uniform Designs [5.661822729320697]
We develop a low-GEFD data-driven subsampling method based on the existing uniform designs.
Our method remains robust under diverse model specifications, while other popular subsampling methods underperform.
arXiv Detail & Related papers (2022-09-08T07:47:56Z)
- How Much is Enough? A Study on Diffusion Times in Score-based Generative Models [76.76860707897413]
Current best practice advocates for a large T to ensure that the forward dynamics brings the diffusion sufficiently close to a known and simple noise distribution.
We show how an auxiliary model can be used to bridge the gap between the ideal and the simulated forward dynamics, followed by a standard reverse diffusion process.
arXiv Detail & Related papers (2022-06-10T15:09:46Z)
- Sampling from Arbitrary Functions via PSD Models [55.41644538483948]
We take a two-step approach by first modeling the probability distribution and then sampling from that model.
We show that these models can approximate a large class of densities concisely using few evaluations, and present a simple algorithm to effectively sample from these models.
arXiv Detail & Related papers (2021-10-20T12:25:22Z)
- Maximum sampled conditional likelihood for informative subsampling [4.708378681950648]
Subsampling is a computationally effective approach to extract information from massive data sets when computing resources are limited.
We propose to use the maximum sampled conditional likelihood estimator (MSCLE) based on the sampled data.
arXiv Detail & Related papers (2020-11-11T16:01:17Z)
- Detangling robustness in high dimensions: composite versus model-averaged estimation [11.658462692891355]
Robust methods, though ubiquitous in practice, are yet to be fully understood in the context of regularized estimation and high dimensions.
This paper provides a toolbox to further study robustness in these settings and focuses on prediction.
arXiv Detail & Related papers (2020-06-12T20:40:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.