Prototype Selection Using Topological Data Analysis
- URL: http://arxiv.org/abs/2511.04873v1
- Date: Thu, 06 Nov 2025 23:21:43 GMT
- Title: Prototype Selection Using Topological Data Analysis
- Authors: Jordan Eckert, Elvan Ceyhan, Henry Schenck
- Abstract summary: Topological Prototype Selector (TPS) is a framework for selecting representative subsets (prototypes) from large datasets. TPS significantly preserves or improves classification performance while substantially reducing data size.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, there has been an explosion in statistical learning literature to represent data using topological principles to capture structure and relationships. We propose a topological data analysis (TDA)-based framework, named Topological Prototype Selector (TPS), for selecting representative subsets (prototypes) from large datasets. We demonstrate the effectiveness of TPS on simulated data under different data intrinsic characteristics, and compare TPS against other currently used prototype selection methods in real data settings. In all simulated and real data settings, TPS significantly preserves or improves classification performance while substantially reducing data size. These contributions advance both algorithmic and geometric aspects of prototype learning and offer practical tools for parallelized, interpretable, and efficient classification.
Related papers
- Feature-based morphological analysis of shape graph data [4.449113067578087]
This paper introduces and demonstrates a computational pipeline for the statistical analysis of shape graph datasets. Our purpose is not only to retrieve and distinguish variations in the connectivity structure of the data but also geometric differences of the network branches.
arXiv Detail & Related papers (2026-02-18T01:11:15Z)
- Diffusion posterior sampling for simulation-based inference in tall data settings [53.17563688225137]
Simulation-based inference (SBI) is capable of approximating the posterior distribution that relates input parameters to a given observation.
In this work, we consider a tall data extension in which multiple observations are available to better infer the parameters of the model.
We compare our method to recently proposed competing approaches on various numerical experiments and demonstrate its superiority in terms of numerical stability and computational cost.
arXiv Detail & Related papers (2024-04-11T09:23:36Z)
- Minimally Supervised Learning using Topological Projections in Self-Organizing Maps [55.31182147885694]
We introduce a semi-supervised learning approach based on topological projections in self-organizing maps (SOMs).
Our proposed method first trains SOMs on unlabeled data and then assigns a minimal number of available labeled data points to key best matching units (BMUs).
Our results indicate that the proposed minimally supervised model significantly outperforms traditional regression techniques.
arXiv Detail & Related papers (2024-01-12T22:51:48Z)
- TRIAGE: Characterizing and auditing training data for improved regression [80.11415390605215]
We introduce TRIAGE, a novel data characterization framework tailored to regression tasks and compatible with a broad class of regressors.
TRIAGE utilizes conformal predictive distributions to provide a model-agnostic scoring method, the TRIAGE score.
We show that TRIAGE's characterization is consistent and highlight its utility to improve performance via data sculpting/filtering, in multiple regression settings.
arXiv Detail & Related papers (2023-10-29T10:31:59Z)
- Natural Language-Based Synthetic Data Generation for Cluster Analysis [4.13592995550836]
Cluster analysis relies on effective benchmarks for evaluating and comparing different algorithms. We propose synthetic data generation based on direct specification of high-level scenarios. Our open-source Python package repliclust implements this workflow.
arXiv Detail & Related papers (2023-03-24T23:45:27Z)
- Detection and Evaluation of Clusters within Sequential Data [58.720142291102135]
Clustering algorithms for Block Markov Chains possess theoretical optimality guarantees.
In particular, our sequential data is derived from human DNA, written text, animal movement data and financial markets.
It is found that the Block Markov Chain model assumption can indeed produce meaningful insights in exploratory data analyses.
arXiv Detail & Related papers (2022-10-04T15:22:39Z)
- On topological data analysis for structural dynamics: an introduction to persistent homology [0.0]
Topological data analysis is a method of quantifying the shape of data over a range of length scales.
Persistent homology is a method of quantifying the shape of data over a range of length scales.
arXiv Detail & Related papers (2022-09-12T10:39:38Z)
- Capturing patterns of variation unique to a specific dataset [68.8204255655161]
We propose a tuning-free method that identifies low-dimensional representations of a target dataset relative to one or more comparison datasets.
We show in several experiments that UCA with a single background dataset achieves similar results compared to cPCA with various tuning parameters.
arXiv Detail & Related papers (2021-04-16T15:07:32Z)
- Joint Geometric and Topological Analysis of Hierarchical Datasets [7.098759778181621]
In this paper, we focus on high-dimensional data that are organized into several hierarchical datasets.
The main novelty in this work lies in the combination of two powerful data-analytic approaches: topological data analysis and geometric manifold learning.
We show that our new method gives rise to superior classification results compared to state-of-the-art methods.
arXiv Detail & Related papers (2021-04-03T13:02:00Z)
- Few-Shot Named Entity Recognition: A Comprehensive Study [92.40991050806544]
We investigate three schemes to improve the model generalization ability for few-shot settings.
We perform empirical comparisons on 10 public NER datasets with various proportions of labeled data.
We create new state-of-the-art results on both few-shot and training-free settings.
arXiv Detail & Related papers (2020-12-29T23:43:16Z)
- Data Segmentation via t-SNE, DBSCAN, and Random Forest [0.0]
This research proposes a data segmentation algorithm which separates data into natural clusters and produces a characteristic profile of each cluster based on the most important features.
We describe the algorithm and provide case studies using the Iris and MNIST data sets, as well as real social media site data from Instagram.
arXiv Detail & Related papers (2020-10-26T15:59:15Z)
- Classification Algorithm of Speech Data of Parkinsons Disease Based on Convolution Sparse Kernel Transfer Learning with Optimal Kernel and Parallel Sample Feature Selection [14.1270098940551]
A novel PD classification algorithm based on sparse kernel transfer learning is proposed.
Sparse transfer learning is used to extract structural information of PD speech features from public datasets.
The proposed algorithm achieves obvious improvements in classification accuracy.
arXiv Detail & Related papers (2020-02-10T13:20:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.