Approximating Persistent Homology for Large Datasets
- URL: http://arxiv.org/abs/2204.09155v1
- Date: Tue, 19 Apr 2022 23:07:27 GMT
- Title: Approximating Persistent Homology for Large Datasets
- Authors: Yueqi Cao, Anthea Monod
- Abstract summary: Persistent homology produces a statistical summary in the form of a persistence diagram.
Despite its widespread use, persistent homology becomes computationally infeasible when a dataset is very large.
We show that the mean of the persistence diagrams of subsamples is a valid approximation of the true persistent homology of the larger dataset.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Persistent homology is an important methodology from topological data
analysis which adapts theory from algebraic topology to data settings and has
been successfully implemented in many applications. It produces a statistical
summary in the form of a persistence diagram, which captures the shape and size
of the data. Despite its widespread use, persistent homology becomes
computationally infeasible when a dataset is very large. In this paper we address
the problem of finding a representative persistence diagram for prohibitively
large datasets. We adapt the classical statistical method of bootstrapping,
namely, drawing and studying smaller multiple subsamples from the large
dataset. We show that the mean of the persistence diagrams of subsamples --
taken as a mean persistence measure computed from the subsamples -- is a valid
approximation of the true persistent homology of the larger dataset. We give
the rate of convergence of the mean persistence diagram to the true persistence
diagram in terms of the number of subsamples and size of each subsample. Given
the complex algebraic and geometric nature of persistent homology, we adapt the
convexity and stability properties in the space of persistence diagrams
together with random set theory to achieve our theoretical results for the
general setting of point cloud data. We demonstrate our approach on simulated
and real data, including an application of shape clustering on complex
large-scale point cloud data.
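The subsampling scheme described in the abstract can be sketched in a few lines. The following is a minimal, illustrative NumPy sketch, restricted to degree-0 homology (connected components), whose persistence diagram is recoverable from minimum-spanning-tree edge weights; the function names are hypothetical, and a real workflow would compute full diagrams with a library such as GUDHI or Ripser. Viewing each subsample diagram as a discrete measure, their mean persistence measure is simply the union of all diagram points, each carrying mass 1/K.

```python
import numpy as np


def h0_diagram(points):
    """Degree-0 persistence diagram of the Vietoris-Rips filtration.

    Every point is born at scale 0; components merge at the edge weights
    of the Euclidean minimum spanning tree (computed here with Prim's
    algorithm), which give the death times.
    """
    diffs = points[:, None, :] - points[None, :, :]
    d = np.sqrt((diffs ** 2).sum(-1))
    n = len(points)
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    dist = d[0].copy()
    deaths = []
    for _ in range(n - 1):
        dist[in_tree] = np.inf          # only grow toward new vertices
        j = int(np.argmin(dist))
        deaths.append(dist[j])
        in_tree[j] = True
        dist = np.minimum(dist, d[j])
    return np.column_stack([np.zeros(n - 1), np.sort(deaths)])


def mean_persistence_measure(X, n_sub, K, rng):
    """Mean persistence measure of K subsample diagrams.

    Each diagram, viewed as a discrete measure, contributes its points
    with mass 1/K; the mean measure is the weighted union of them all.
    """
    diagrams = [
        h0_diagram(X[rng.choice(len(X), size=n_sub, replace=False)])
        for _ in range(K)
    ]
    pooled = np.vstack(diagrams)
    weights = np.full(len(pooled), 1.0 / K)
    return pooled, weights


# Demo: two well-separated Gaussian clusters; the gap between them
# shows up as one long-lived H0 feature in every subsample diagram.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (100, 2)),
               rng.normal(5.0, 0.1, (100, 2))])
pooled, weights = mean_persistence_measure(X, n_sub=30, K=10, rng=rng)
```

Because the mean persistence measure is just a weighted point set in the plane, downstream statistics (distances between datasets, clustering of shapes) can be computed on the pooled points directly, which is what makes the bootstrap approximation practical at scale.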
Related papers
- Discovering symbolic expressions with parallelized tree search [59.92040079807524]
Symbolic regression plays a crucial role in scientific research thanks to its capability of discovering concise and interpretable mathematical expressions from data.
For over a decade, existing algorithms have faced a critical accuracy-and-efficiency bottleneck when handling complex problems.
We introduce a parallelized tree search (PTS) model to efficiently distill generic mathematical expressions from limited data.
arXiv Detail & Related papers (2024-07-05T10:41:15Z)
- Diffusion posterior sampling for simulation-based inference in tall data settings [53.17563688225137]
Simulation-based inference (SBI) is capable of approximating the posterior distribution that relates input parameters to a given observation.
In this work, we consider a tall data extension in which multiple observations are available to better infer the parameters of the model.
We compare our method to recently proposed competing approaches on various numerical experiments and demonstrate its superiority in terms of numerical stability and computational cost.
arXiv Detail & Related papers (2024-04-11T09:23:36Z)
- Discrete transforms of quantized persistence diagrams [0.5249805590164902]
We introduce Qupid, a novel and simple method for vectorizing persistence diagrams.
A key feature is the use of log-scaled grids that emphasize information contained near the diagonal in persistence diagrams.
We conduct an in-depth experimental analysis of Qupid, showing that the simplicity of our method results in very low computational costs.
arXiv Detail & Related papers (2023-12-28T16:11:11Z)
- Improving embedding of graphs with missing data by soft manifolds [51.425411400683565]
The reliability of graph embeddings depends on how much the geometry of the continuous space matches the graph structure.
We introduce a new class of manifolds, named soft manifolds, that can address this issue.
Using soft manifold for graph embedding, we can provide continuous spaces to pursue any task in data analysis over complex datasets.
arXiv Detail & Related papers (2023-11-29T12:48:33Z)
- Sparse Training of Discrete Diffusion Models for Graph Generation [45.103518022696996]
We introduce SparseDiff, a novel diffusion model based on the observation that almost all large graphs are sparse.
By selecting a subset of edges, SparseDiff effectively leverages sparse graph representations both during the noising process and within the denoising network.
Our model demonstrates state-of-the-art performance across multiple metrics on both small and large datasets.
arXiv Detail & Related papers (2023-11-03T16:50:26Z)
- RGM: A Robust Generalizable Matching Model [49.60975442871967]
We propose a deep model for sparse and dense matching, termed RGM (Robust Generalist Matching).
To narrow the gap between synthetic training samples and real-world scenarios, we build a new, large-scale dataset with sparse correspondence ground truth.
We are able to mix up various dense and sparse matching datasets, significantly improving the training diversity.
arXiv Detail & Related papers (2023-10-18T07:30:08Z)
- Conformal inference for regression on Riemannian Manifolds [49.7719149179179]
We investigate prediction sets for regression scenarios where the response variable, denoted by $Y$, resides in a manifold, and the covariate, denoted by $X$, lies in Euclidean space.
We prove the almost sure convergence of the empirical version of these regions on the manifold to their population counterparts.
arXiv Detail & Related papers (2023-10-12T10:56:25Z)
- Manifold Learning with Sparse Regularised Optimal Transport [0.17205106391379024]
Real-world datasets are subject to noisy observations and sampling, so that distilling information about the underlying manifold is a major challenge.
We propose a method for manifold learning that utilises a symmetric version of optimal transport with a quadratic regularisation.
We prove that the resulting kernel is consistent with a Laplace-type operator in the continuous limit, establish robustness to heteroskedastic noise and exhibit these results in simulations.
arXiv Detail & Related papers (2023-07-19T08:05:46Z)
- Topological Quality of Subsets via Persistence Matching Diagrams [0.196629787330046]
We measure the quality of a subset concerning the dataset it represents using topological data analysis techniques.
In particular, this approach enables us to explain why the chosen subset is likely to result in poor performance of a supervised learning model.
arXiv Detail & Related papers (2023-06-04T17:08:41Z)
- DIFFormer: Scalable (Graph) Transformers Induced by Energy Constrained Diffusion [66.21290235237808]
We introduce an energy constrained diffusion model which encodes a batch of instances from a dataset into evolutionary states.
We provide rigorous theory that implies closed-form optimal estimates for the pairwise diffusion strength among arbitrary instance pairs.
Experiments highlight the wide applicability of our model as a general-purpose encoder backbone with superior performance in various tasks.
arXiv Detail & Related papers (2023-01-23T15:18:54Z)
- $k$-Means Clustering for Persistent Homology [0.0]
We prove convergence of the $k$-means clustering algorithm on persistence diagram space.
We also establish theoretical properties of the solution to the optimization problem in the Karush--Kuhn--Tucker framework.
arXiv Detail & Related papers (2022-10-18T17:18:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.