Topological data quality via 0-dimensional persistence matching
- URL: http://arxiv.org/abs/2306.02411v2
- Date: Wed, 26 Jun 2024 13:37:58 GMT
- Title: Topological data quality via 0-dimensional persistence matching
- Authors: Álvaro Torras-Casas, Eduardo Paluzo-Hidalgo, Rocio Gonzalez-Diaz
- Abstract summary: We propose to measure data quality for supervised learning using topological data analysis techniques.
We provide a novel topological invariant based on persistence matchings induced by inclusions and using $0$-dimensional persistent homology.
This approach enables us to explain why the chosen dataset will lead to poor performance.
- Score: 0.196629787330046
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data quality is crucial for the successful training, generalization and performance of artificial intelligence models. We propose to measure data quality for supervised learning using topological data analysis techniques. Specifically, we provide a novel topological invariant based on persistence matchings induced by inclusions and using $0$-dimensional persistent homology. We show that such an invariant is stable. We provide an algorithm and relate it to images, kernels, and cokernels of the induced morphisms. Also, we show that the invariant allows us to understand whether the subset "represents well" the clusters from the larger dataset or not, and we also use it to estimate bounds for the Hausdorff distance between the subset and the complete dataset. This approach enables us to explain why the chosen dataset will lead to poor performance.
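The abstract's ingredients can be illustrated with a short, self-contained sketch. This is not the paper's matching algorithm: it only uses the standard fact that the $0$-dimensional persistence diagram of a Vietoris-Rips filtration has all bars born at $0$ with deaths equal to the edge lengths of a Euclidean minimum spanning tree, computes those diagrams for a subset and the full dataset, and measures the (directed) Hausdorff distance that the paper's invariant bounds. The dataset and cluster layout below are illustrative assumptions.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import cdist

def zero_dim_deaths(points):
    """Death times of the 0-dimensional persistence diagram of the
    Vietoris-Rips filtration: every component is born at 0 and dies at
    an edge length of a Euclidean minimum spanning tree."""
    dists = cdist(points, points)
    mst = minimum_spanning_tree(dists).toarray()
    return np.sort(mst[mst > 0])

def directed_hausdorff(subset, full):
    """sup over z in `full` of inf over x in `subset` of ||z - x||.
    For a subset of a larger cloud the other direction is 0."""
    return cdist(full, subset).min(axis=1).max()

rng = np.random.default_rng(0)
# Two well-separated clusters; the subset samples only one of them,
# so it does NOT "represent well" the clusters of the full dataset.
full = np.vstack([rng.normal(0.0, 0.1, (20, 2)),
                  rng.normal(5.0, 0.1, (20, 2))])
subset = full[:20]  # misses the second cluster entirely

deaths_full = zero_dim_deaths(full)
deaths_sub = zero_dim_deaths(subset)
dH = directed_hausdorff(subset, full)

# The full dataset has one long-lived bar (the gap between clusters);
# the subset has no such bar, and the Hausdorff distance is large.
print(deaths_full.max(), deaths_sub.max(), dH)
```

A long-lived bar of the full dataset with no counterpart coming from the subset is exactly the kind of mismatch the paper's invariant is designed to detect.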
Related papers
- Spectral Self-supervised Feature Selection [7.052728135831165]
We propose a self-supervised graph-based approach for unsupervised feature selection.
Our method's core involves computing robust pseudo-labels by applying simple processing steps to the graph Laplacian's eigenvectors.
Our approach is shown to be robust to challenging scenarios, such as the presence of outliers and complex substructures.
arXiv Detail & Related papers (2024-07-12T07:29:08Z)
- Improving embedding of graphs with missing data by soft manifolds [51.425411400683565]
The reliability of graph embeddings depends on how much the geometry of the continuous space matches the graph structure.
We introduce a new class of manifold, named soft manifold, that can solve this situation.
Using soft manifold for graph embedding, we can provide continuous spaces to pursue any task in data analysis over complex datasets.
arXiv Detail & Related papers (2023-11-29T12:48:33Z)
- Manifold Learning with Sparse Regularised Optimal Transport [0.17205106391379024]
Real-world datasets are subject to noisy observations and sampling, so that distilling information about the underlying manifold is a major challenge.
We propose a method for manifold learning that utilises a symmetric version of optimal transport with a quadratic regularisation.
We prove that the resulting kernel is consistent with a Laplace-type operator in the continuous limit, establish robustness to heteroskedastic noise and exhibit these results in simulations.
arXiv Detail & Related papers (2023-07-19T08:05:46Z)
- VTAE: Variational Transformer Autoencoder with Manifolds Learning [144.0546653941249]
Deep generative models have demonstrated successful applications in learning non-linear data distributions through a number of latent variables.
The nonlinearity of the generator means the latent space provides an unsatisfactory projection of the data space, which results in poor representation learning.
We show that geodesics and accurate computation can substantially improve the performance of deep generative models.
arXiv Detail & Related papers (2023-04-03T13:13:19Z)
- RandomSCM: interpretable ensembles of sparse classifiers tailored for omics data [59.4141628321618]
We propose an ensemble learning algorithm based on conjunctions or disjunctions of decision rules.
The interpretability of the models makes them useful for biomarker discovery and patterns discovery in high dimensional data.
arXiv Detail & Related papers (2022-08-11T13:55:04Z)
- Amortized Inference for Causal Structure Learning [72.84105256353801]
Learning causal structure poses a search problem that typically involves evaluating structures using a score or independence test.
We train a variational inference model to predict the causal structure from observational/interventional data.
Our models exhibit robust generalization capabilities under substantial distribution shift.
arXiv Detail & Related papers (2022-05-25T17:37:08Z)
- Approximating Persistent Homology for Large Datasets [0.0]
Persistent homology produces a statistical summary in the form of a persistence diagram.
Despite its widespread use, persistent homology is computationally infeasible for very large datasets.
We show that the mean of the persistence diagrams of subsamples is a valid approximation of the true persistent homology of the larger dataset.
arXiv Detail & Related papers (2022-04-19T23:07:27Z)
- Data-heterogeneity-aware Mixing for Decentralized Learning [63.83913592085953]
We characterize the dependence of convergence on the relationship between the mixing weights of the graph and the data heterogeneity across nodes.
We propose a metric that quantifies the ability of a graph to mix the current gradients.
Motivated by our analysis, we propose an approach that periodically and efficiently optimizes this metric.
arXiv Detail & Related papers (2022-04-13T15:54:35Z)
- Data efficiency in graph networks through equivariance [1.713291434132985]
We introduce a novel architecture for graph networks which is equivariant to any transformation in the coordinate embeddings.
We show that the proposed architecture, trained on a minimal amount of data, generalises perfectly to unseen data in a synthetic problem.
arXiv Detail & Related papers (2021-06-25T17:42:34Z)
- Fuzzy c-Means Clustering for Persistence Diagrams [42.1666496315913]
We extend the ubiquitous Fuzzy c-Means (FCM) clustering algorithm to the space of persistence diagrams.
We show that our algorithm captures the topological structure of data without prior topological knowledge.
In materials science, we classify transformed lattice structure datasets for the first time.
arXiv Detail & Related papers (2020-06-04T11:45:20Z)
- Asymptotic Analysis of an Ensemble of Randomly Projected Linear Discriminants [94.46276668068327]
In [1], an ensemble of randomly projected linear discriminants is used to classify datasets.
We develop a consistent estimator of the misclassification probability as an alternative to the computationally-costly cross-validation estimator.
We also demonstrate the use of our estimator for tuning the projection dimension on both real and synthetic data.
arXiv Detail & Related papers (2020-04-17T12:47:04Z)
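The ensemble described in the last entry can be sketched in a few lines. This is a minimal illustration, not the paper's estimator: it assumes a two-class problem, uses plain Fisher LDA with a small ridge term inside each randomly projected subspace, and aggregates members by majority vote; all function names and parameter values below are illustrative.

```python
import numpy as np

def fit_lda(X, y):
    """Two-class Fisher LDA: w solves S_w w = mu1 - mu0, with the
    decision threshold at the projected midpoint of the class means."""
    mu0, mu1 = X[y == 0].mean(0), X[y == 1].mean(0)
    Sw = (np.cov(X[y == 0].T) + np.cov(X[y == 1].T)
          + 1e-6 * np.eye(X.shape[1]))  # ridge for numerical stability
    w = np.linalg.solve(Sw, mu1 - mu0)
    b = -w @ (mu0 + mu1) / 2
    return w, b

def rp_lda_ensemble_predict(X_train, y_train, X_test,
                            n_members=25, k=5, seed=0):
    """Fit one discriminant per random Gaussian projection to a
    k-dimensional subspace, then majority-vote the signed outputs."""
    rng = np.random.default_rng(seed)
    votes = np.zeros(len(X_test))
    for _ in range(n_members):
        R = rng.normal(size=(X_train.shape[1], k))
        w, b = fit_lda(X_train @ R, y_train)
        votes += np.sign(X_test @ R @ w + b)
    return (votes > 0).astype(int)

rng = np.random.default_rng(1)
d = 30
X = np.vstack([rng.normal(0.0, 1.0, (100, d)),
               rng.normal(0.8, 1.0, (100, d))])
y = np.r_[np.zeros(100), np.ones(100)]
pred = rp_lda_ensemble_predict(X, y, X)
print((pred == y).mean())  # training accuracy of the ensemble
```

Each member sees only a k-dimensional random view of the data, so fitting is cheap even when d is large; the paper's contribution is a consistent estimator of this ensemble's misclassification probability that avoids cross-validation, which the sketch above does not implement.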
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the accuracy of the information presented and is not responsible for any consequences of its use.