Fixed and adaptive landmark sets for finite pseudometric spaces
- URL: http://arxiv.org/abs/2212.09826v1
- Date: Mon, 19 Dec 2022 19:53:33 GMT
- Title: Fixed and adaptive landmark sets for finite pseudometric spaces
- Authors: Jason Cory Brunson and Yara Skaf
- Abstract summary: "Lastfirst" based on ranked distances implies a cover comprising sets of uniform cardinality.
We perform benchmark tests and compare its performance to that of maxmin on feature detection and class prediction tasks.
We find that lastfirst achieves comparable performance on prediction tasks and outperforms maxmin on homology detection tasks.
- Score: 0.9137554315375919
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Topological data analysis (TDA) is an expanding field that leverages
principles and tools from algebraic topology to quantify structural features of
data sets or transform them into more manageable forms. As its theoretical
foundations have been developed, TDA has shown promise in extracting useful
information from high-dimensional, noisy, and complex data such as those used
in biomedicine. To operate efficiently, these techniques may employ landmark
samplers, either random or heuristic. The heuristic maxmin procedure obtains a
roughly even distribution of sample points by implicitly constructing a cover
comprising sets of uniform radius. However, issues arise with data that vary in
density or include points with multiplicities, as are common in biomedicine. We
propose an analogous procedure, "lastfirst" based on ranked distances, which
implies a cover comprising sets of uniform cardinality. We first rigorously
define the procedure and prove that it obtains landmarks with desired
properties. We then perform benchmark tests and compare its performance to that
of maxmin, on feature detection and class prediction tasks involving simulated
and real-world biomedical data. Lastfirst is more general than maxmin in that
it can be applied to any data on which arbitrary (and not necessarily
symmetric) pairwise distances can be computed. Lastfirst is more
computationally costly, but our implementation scales at the same rate as
maxmin. We find that lastfirst achieves comparable performance on prediction
tasks and outperforms maxmin on homology detection tasks. Where the numerical
values of similarity measures are not meaningful, as in many biomedical
contexts, lastfirst sampling may also improve interpretability.
Related papers
- Bias Detection via Maximum Subgroup Discrepancy [2.236957801565796]
We propose a new notion of distance, the Maximum Subgroup Discrepancy (MSD)
In this metric, two distributions are close if, roughly, discrepancies are low for all feature subgroups.
We provide a practical algorithm for the evaluation of the distance, based on Mixed-integer optimization (MIO)
arXiv Detail & Related papers (2025-02-04T11:01:03Z) - Symmetry Discovery for Different Data Types [52.2614860099811]
Equivariant neural networks incorporate symmetries into their architecture, achieving higher generalization performance.
We propose LieSD, a method for discovering symmetries via trained neural networks which approximate the input-output mappings of the tasks.
We validate the performance of LieSD on tasks with symmetries such as the two-body problem, the moment of inertia matrix prediction, and top quark tagging.
arXiv Detail & Related papers (2024-10-13T13:39:39Z) - Automating Data Science Pipelines with Tensor Completion [4.956678070210018]
We model data science pipelines as instances of tensor completion.
The goal is to identify all missing entries of the tensor, corresponding to all combinations of variable values.
We extensively evaluate existing and proposed methods in a number of datasets.
arXiv Detail & Related papers (2024-10-08T22:34:08Z) - Minimally Supervised Learning using Topological Projections in
Self-Organizing Maps [55.31182147885694]
We introduce a semi-supervised learning approach based on topological projections in self-organizing maps (SOMs)
Our proposed method first trains SOMs on unlabeled data and then a minimal number of available labeled data points are assigned to key best matching units (BMU)
Our results indicate that the proposed minimally supervised model significantly outperforms traditional regression techniques.
arXiv Detail & Related papers (2024-01-12T22:51:48Z) - Topological Quality of Subsets via Persistence Matching Diagrams [0.196629787330046]
We measure the quality of a subset concerning the dataset it represents using topological data analysis techniques.
In particular, this approach enables us to explain why the chosen subset is likely to result in poor performance of a supervised learning model.
arXiv Detail & Related papers (2023-06-04T17:08:41Z) - Efficient Approximate Kernel Based Spike Sequence Classification [56.2938724367661]
Machine learning models, such as SVM, require a definition of distance/similarity between pairs of sequences.
Exact methods yield better classification performance, but they pose high computational costs.
We propose a series of ways to improve the performance of the approximate kernel in order to enhance its predictive performance.
arXiv Detail & Related papers (2022-09-11T22:44:19Z) - A Fair Experimental Comparison of Neural Network Architectures for
Latent Representations of Multi-Omics for Drug Response Prediction [7.690774882108066]
We train and optimize multi-omics integration methods under equal conditions.
We devised a novel method, Omics Stacking, that combines the advantages of intermediate and late integration.
Experiments were conducted on a public drug response data set with multiple omics data.
arXiv Detail & Related papers (2022-08-31T12:46:08Z) - Riemannian classification of EEG signals with missing values [67.90148548467762]
This paper proposes two strategies to handle missing data for the classification of electroencephalograms.
The first approach estimates the covariance from imputed data with the $k$-nearest neighbors algorithm; the second relies on the observed data by leveraging the observed-data likelihood within an expectation-maximization algorithm.
As results show, the proposed strategies perform better than the classification based on observed data and allow to keep a high accuracy even when the missing data ratio increases.
arXiv Detail & Related papers (2021-10-19T14:24:50Z) - Estimating leverage scores via rank revealing methods and randomization [50.591267188664666]
We study algorithms for estimating the statistical leverage scores of rectangular dense or sparse matrices of arbitrary rank.
Our approach is based on combining rank revealing methods with compositions of dense and sparse randomized dimensionality reduction transforms.
arXiv Detail & Related papers (2021-05-23T19:21:55Z) - Asymptotic Analysis of an Ensemble of Randomly Projected Linear
Discriminants [94.46276668068327]
In [1], an ensemble of randomly projected linear discriminants is used to classify datasets.
We develop a consistent estimator of the misclassification probability as an alternative to the computationally-costly cross-validation estimator.
We also demonstrate the use of our estimator for tuning the projection dimension on both real and synthetic data.
arXiv Detail & Related papers (2020-04-17T12:47:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.