Smooth densities and generative modeling with unsupervised random
forests
- URL: http://arxiv.org/abs/2205.09435v1
- Date: Thu, 19 May 2022 09:50:25 GMT
- Title: Smooth densities and generative modeling with unsupervised random
forests
- Authors: David S. Watson, Kristin Blesch, Jan Kapar, Marvin N. Wright
- Abstract summary: An important application for density estimators is synthetic data generation.
We propose a new method based on unsupervised random forests for estimating smooth densities in arbitrary dimensions without parametric constraints.
We prove the consistency of our approach and demonstrate its advantages over existing tree-based density estimators.
- Score: 1.433758865948252
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Density estimation is a fundamental problem in statistics, and any attempt to
do so in high dimensions typically requires strong assumptions or complex deep
learning architectures. An important application for density estimators is
synthetic data generation, an area currently dominated by neural networks that
often demand enormous training datasets and extensive tuning. We propose a new
method based on unsupervised random forests for estimating smooth densities in
arbitrary dimensions without parametric constraints, as well as generating
realistic synthetic data. We prove the consistency of our approach and
demonstrate its advantages over existing tree-based density estimators, which
generally rely on ill-chosen split criteria and do not scale well with data
dimensionality. Experiments illustrate that our algorithm compares favorably to
state-of-the-art deep learning generative models, achieving superior
performance in a range of benchmark trials while executing about two orders of
magnitude faster on average. Our method is implemented in easy-to-use
$\texttt{R}$ and Python packages.
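As background, the "unsupervised random forest" construction that such density estimators build on is easy to sketch: label the real observations as one class, generate a second class by independently permuting each column (same marginals, dependencies destroyed), and train a forest to distinguish the two. The sketch below shows only this generic construction (in the style of Shi and Horvath's unsupervised forests), not the authors' full smooth-density algorithm; all data and parameter choices are illustrative.

```python
# Generic "unsupervised random forest" construction: real data is class 1;
# class 0 is a copy with each column independently permuted, which keeps
# the marginals but destroys dependencies. A forest trained to separate
# the two learns where the joint density differs from the product of
# marginals. Background sketch only, not the paper's full algorithm.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Toy data: two correlated features.
n = 2000
x1 = rng.normal(size=n)
X_real = np.column_stack([x1, x1 + 0.3 * rng.normal(size=n)])

# Class 0: shuffle each column independently to break correlations.
X_fake = np.column_stack([rng.permutation(X_real[:, j])
                          for j in range(X_real.shape[1])])

X = np.vstack([X_real, X_fake])
y = np.concatenate([np.ones(n), np.zeros(n)])

forest = RandomForestClassifier(n_estimators=100, min_samples_leaf=20,
                                random_state=0).fit(X, y)

# Each tree partitions the space into leaves; a density estimator can
# place (piecewise or smoothed) mass on leaves where real data concentrates.
leaf_ids = forest.apply(X_real)   # shape (n_samples, n_estimators)
print(leaf_ids.shape)
```

The trees' leaves induce a data-adaptive partition of the feature space; the paper's contribution is turning such partitions into smooth, provably consistent density estimates and a sampler for synthetic data.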
Related papers
- Towards a framework on tabular synthetic data generation: a minimalist approach: theory, use cases, and limitations [0.7227323884094953]
The framework is applied to high-dimensional simulated credit scoring data that parallels real-life financial applications.
We show that the method is simple, guarantees interpretability throughout, does not require extra tuning, and provides unique benefits.
arXiv Detail & Related papers (2024-11-17T06:37:54Z) - Latent Semantic Consensus For Deterministic Geometric Model Fitting [109.44565542031384]
We propose an effective method called Latent Semantic Consensus (LSC).
LSC formulates the model fitting problem into two latent semantic spaces based on data points and model hypotheses.
LSC is able to provide consistent and reliable solutions within only a few milliseconds for general multi-structural model fitting.
arXiv Detail & Related papers (2024-03-11T05:35:38Z) - Modular Learning of Deep Causal Generative Models for High-dimensional Causal Inference [5.522612010562183]
Modular-DCM is the first algorithm that, given the causal structure, uses adversarial training to learn the network weights.
We show our algorithm's convergence on the COVIDx dataset and its utility with a causal invariant prediction problem on CelebA-HQ.
arXiv Detail & Related papers (2024-01-02T20:31:15Z) - Generative Forests [23.554594285885273]
We focus on generative AI for a type of data that still represents one of the most prevalent forms of data: tabular data.
Our paper introduces a new powerful class of forest-based models fit for such tasks and a simple training algorithm with strong convergence guarantees.
Additional experiments on these tasks reveal that our models are strong contenders against diverse state-of-the-art methods.
arXiv Detail & Related papers (2023-08-07T14:58:53Z) - Large-scale Fully-Unsupervised Re-Identification [78.47108158030213]
We propose two strategies to learn from large-scale unlabeled data.
The first strategy performs local neighborhood sampling to reduce the dataset size in each iteration without violating neighborhood relationships.
A second strategy leverages a novel Re-Ranking technique, which has a lower upper-bound time complexity and reduces the memory complexity from $O(n^2)$ to $O(kn)$ with $k \ll n$.
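The claimed $O(n^2)$ to $O(kn)$ memory reduction is easy to picture even without the paper's specific re-ranking algorithm: store only each point's $k$ nearest neighbors rather than the full pairwise distance matrix. A minimal sketch of that storage argument (not the paper's method; sizes are illustrative):

```python
# Illustrates the O(n^2) -> O(kn) memory argument only, not the paper's
# re-ranking technique: keep each point's k nearest neighbors instead
# of the full pairwise distance matrix.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
n, d, k = 10_000, 64, 30
X = rng.normal(size=(n, d)).astype(np.float32)

nn = NearestNeighbors(n_neighbors=k).fit(X)
dist, idx = nn.kneighbors(X)             # each of shape (n, k)

full_matrix_bytes = 4 * n * n            # float32 pairwise matrix: ~400 MB
topk_bytes = dist.nbytes + idx.nbytes    # O(kn): a few MB
print(full_matrix_bytes, topk_bytes)
```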
arXiv Detail & Related papers (2023-07-26T16:19:19Z) - DeepBayes -- an estimator for parameter estimation in stochastic
nonlinear dynamical models [11.917949887615567]
We propose DeepBayes estimators, which leverage the power of deep recurrent neural networks to learn an estimator.
The deep recurrent neural network architectures can be trained offline and ensure significant time savings during inference.
We demonstrate the applicability of our proposed method on different example models and perform detailed comparisons with state-of-the-art approaches.
arXiv Detail & Related papers (2022-05-04T18:12:17Z) - Distributed Dynamic Safe Screening Algorithms for Sparse Regularization [73.85961005970222]
We propose a new distributed dynamic safe screening (DDSS) method for sparsity-regularized models and apply it to shared-memory and distributed-memory architectures, respectively.
We prove that the proposed method achieves a linear convergence rate with lower overall complexity and can eliminate almost all inactive features in a finite number of iterations almost surely.
arXiv Detail & Related papers (2022-04-23T02:45:55Z) - A Local Similarity-Preserving Framework for Nonlinear Dimensionality
Reduction with Neural Networks [56.068488417457935]
We propose a novel local nonlinear approach named Vec2vec for general-purpose dimensionality reduction.
To train the neural network, we build the neighborhood similarity graph of a matrix and define the context of data points.
Experiments on data classification and clustering across eight real datasets show that Vec2vec outperforms several classical dimensionality reduction methods under statistical hypothesis testing.
arXiv Detail & Related papers (2021-03-10T23:10:47Z) - Neural Approximate Sufficient Statistics for Implicit Models [34.44047460667847]
We frame the task of constructing sufficient statistics as learning mutual information maximizing representations of the data with the help of deep neural networks.
We apply our approach to both traditional approximate Bayesian computation and recent neural likelihood methods, boosting their performance on a range of tasks.
arXiv Detail & Related papers (2020-10-20T07:11:40Z) - Variable Skipping for Autoregressive Range Density Estimation [84.60428050170687]
We present variable skipping, a technique for accelerating range density estimation over deep autoregressive models.
We show that variable skipping provides 10-100$\times$ efficiency improvements when targeting challenging high-quantile error metrics.
arXiv Detail & Related papers (2020-07-10T19:01:40Z) - Diversity inducing Information Bottleneck in Model Ensembles [73.80615604822435]
In this paper, we target the problem of generating effective ensembles of neural networks by encouraging diversity in prediction.
We explicitly optimize a diversity inducing adversarial loss for learning latent variables and thereby obtain diversity in the output predictions necessary for modeling multi-modal data.
Compared to the most competitive baselines, we show significant improvements in classification accuracy under a shift in the data distribution.
arXiv Detail & Related papers (2020-03-10T03:10:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.