Tight basis cycle representatives for persistent homology of large data
sets
- URL: http://arxiv.org/abs/2206.02925v1
- Date: Mon, 6 Jun 2022 22:00:42 GMT
- Title: Tight basis cycle representatives for persistent homology of large data
sets
- Authors: Manu Aggarwal, Vipul Periwal
- Abstract summary: Persistent homology (PH) is a popular tool for topological data analysis that has found applications across diverse areas of research.
Although powerful in theory, PH suffers from high computation cost that precludes its application to large data sets.
We provide a strategy and algorithms to compute tight representative boundaries around nontrivial robust features in large data sets.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Persistent homology (PH) is a popular tool for topological data analysis that
has found applications across diverse areas of research. It provides a rigorous
method to compute robust topological features in discrete experimental
observations that often contain various sources of uncertainties. Although
powerful in theory, PH suffers from high computation cost that precludes its
application to large data sets. Additionally, most analyses using PH are
limited to computing the existence of nontrivial features. Precise localization
of these features is not generally attempted because, by definition, localized
representations are not unique and because of even higher computation cost. For
scientific applications, such a precise location is a sine qua non for
determining functional significance. Here, we provide a strategy and algorithms
to compute tight representative boundaries around nontrivial robust features in
large data sets. To showcase the efficiency of our algorithms and the precision
of computed boundaries, we analyze three data sets from different scientific
fields. In the human genome, we found an unexpected effect on loops through
chromosome 13 and the sex chromosomes, upon impairment of chromatin loop
formation. In a distribution of galaxies in the universe, we found
statistically significant voids. In protein homologs with significantly
different topology, we found voids attributable to ligand-interaction,
mutation, and differences between species.
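As a rough illustration of what a PH computation involves, the sketch below computes only the simplest case: dimension-0 persistence (connected components) of a Vietoris-Rips filtration, using a union-find structure over edges sorted by length. It is not the authors' algorithm and does not compute tight cycle representatives. The function name `h0_persistence` and the sample points are hypothetical and chosen for illustration only.

```python
import itertools
import math


def h0_persistence(points):
    """Return (birth, death) pairs for connected components (H0).

    Every point is born at scale 0. A component dies at the length of
    the edge that first merges it into another component.
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise edges, processed in order of increasing length.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in itertools.combinations(range(n), 2)
    )

    pairs = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj          # merge the two components
            pairs.append((0.0, length))

    # One component survives at every scale.
    pairs.append((0.0, math.inf))
    return pairs


# Hypothetical sample: two well-separated clusters.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 0.0), (11.0, 0.0)]
pairs = h0_persistence(points)
finite_deaths = sorted(d for _, d in pairs if d != math.inf)
print(finite_deaths)
```

In this sample, the component that dies last among the finite pairs corresponds to the gap between the two clusters, which is the kind of robust feature that persistence is meant to separate from noise. Higher-dimensional features such as loops and voids require boundary-matrix reduction, which this sketch does not attempt.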
Related papers
- Nonparametric independence tests in high-dimensional settings, with applications to the genetics of complex disease [55.2480439325792]
We show how defining adequate premetric structures on the support spaces of the genetic data allows for novel approaches to such testing.
For each problem, we provide mathematical results, simulations and the application to real data.
arXiv Detail & Related papers (2024-07-29T01:00:53Z)
- Persistent Homology for High-dimensional Data Based on Spectral Methods [16.58218530585593]
We show that persistent homology becomes very sensitive to noise and fails to detect the correct topology.
We find that spectral distances on the k-nearest-neighbor graph of the data, such as diffusion distance and effective resistance, allow detection of the correct topology even in the presence of high-dimensional noise.
arXiv Detail & Related papers (2023-11-06T13:18:08Z)
- Non-isotropic Persistent Homology: Leveraging the Metric Dependency of PH [5.70896453969985]
We show that information on the point cloud is lost when restricting persistent homology to a single distance function.
We numerically show that non-isotropic persistent homology can extract information on orientation, orientational variance, and scaling of randomly generated point clouds.
arXiv Detail & Related papers (2023-10-25T08:03:17Z)
- Geodesic Sinkhorn for Fast and Accurate Optimal Transport on Manifolds [53.110934987571355]
We propose Geodesic Sinkhorn -- based on a heat kernel on a manifold graph.
We apply our method to the computation of barycenters of several distributions of high dimensional single cell data from patient samples undergoing chemotherapy.
arXiv Detail & Related papers (2022-11-02T00:51:35Z)
- Intrinsic dimension estimation for discrete metrics [65.5438227932088]
In this letter we introduce an algorithm to infer the intrinsic dimension (ID) of datasets embedded in discrete spaces.
We demonstrate its accuracy on benchmark datasets, and we apply it to analyze a metagenomic dataset for species fingerprinting.
This suggests that evolutionary pressure acts on a low-dimensional manifold despite the high dimensionality of the sequence space.
arXiv Detail & Related papers (2022-07-20T06:38:36Z)
- On the effectiveness of persistent homology [0.9208007322096533]
Persistent homology (PH) is one of the most popular methods in Topological Data Analysis.
The goal of this work is to identify some types of problems on which PH performs well or even better than other methods in data analysis.
arXiv Detail & Related papers (2022-06-21T17:30:27Z)
- Combining Observational and Randomized Data for Estimating Heterogeneous Treatment Effects [82.20189909620899]
Estimating heterogeneous treatment effects is an important problem across many domains.
Currently, most existing works rely exclusively on observational data.
We propose to estimate heterogeneous treatment effects by combining large amounts of observational data and small amounts of randomized data.
arXiv Detail & Related papers (2022-02-25T18:59:54Z)
- Dory: Overcoming Barriers to Computing Persistent Homology [0.0]
We present Dory, an efficient and scalable algorithm that can compute the persistent homology of large data sets.
As an application, we compute the PH of the human genome at high resolution as revealed by a genome-wide Hi-C data set.
arXiv Detail & Related papers (2021-03-09T18:28:22Z)
- Mycorrhiza: Genotype Assignment using Phylogenetic Networks [2.286041284499166]
We introduce Mycorrhiza, a machine learning approach for the genotype assignment problem.
Our algorithm makes use of phylogenetic networks to engineer features that encode the evolutionary relationships among samples.
Mycorrhiza yields particularly significant gains on datasets with a large average fixation index (FST) or deviation from the Hardy-Weinberg equilibrium.
arXiv Detail & Related papers (2020-10-14T02:36:27Z)
- Self-training Avoids Using Spurious Features Under Domain Shift [54.794607791641745]
In unsupervised domain adaptation, conditional entropy minimization and pseudo-labeling work even when the domain shifts are much larger than those analyzed by existing theory.
We identify and analyze one particular setting where the domain shift can be large, but certain spurious features correlate with the label in the source domain and are independent of the label in the target.
arXiv Detail & Related papers (2020-06-17T17:51:42Z)
- Improved guarantees and a multiple-descent curve for Column Subset Selection and the Nyström method [76.73096213472897]
We develop techniques which exploit spectral properties of the data matrix to obtain improved approximation guarantees.
Our approach leads to significantly better bounds for datasets with known rates of singular value decay.
We show that both our improved bounds and the multiple-descent curve can be observed on real datasets simply by varying the RBF parameter.
arXiv Detail & Related papers (2020-02-21T00:43:06Z)