Predict Training Data Quality via Its Geometry in Metric Space
- URL: http://arxiv.org/abs/2510.15970v1
- Date: Sun, 12 Oct 2025 16:59:28 GMT
- Title: Predict Training Data Quality via Its Geometry in Metric Space
- Authors: Yang Ba, Mohammad Sadeq Abolhasani, Rong Pan
- Abstract summary: We propose that the richness of representation and the elimination of redundancy within training data critically influence learning outcomes. To investigate this, we employ persistent homology to extract topological features from data within a metric space. Our findings highlight persistent homology as a powerful tool for analyzing and enhancing the training data that drives AI systems.
- Score: 7.056460460498077
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: High-quality training data is the foundation of machine learning and artificial intelligence, shaping how models learn and perform. Although much is known about what types of data are effective for training, the impact of the data's geometric structure on model performance remains largely underexplored. We propose that both the richness of representation and the elimination of redundancy within training data critically influence learning outcomes. To investigate this, we employ persistent homology to extract topological features from data within a metric space, thereby offering a principled way to quantify diversity beyond entropy-based measures. Our findings highlight persistent homology as a powerful tool for analyzing and enhancing the training data that drives AI systems.
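The abstract's core idea, extracting persistent-homology features from data in a metric space, can be illustrated in dimension 0 (connected components): the H0 persistence diagram of a Vietoris-Rips filtration has one finite bar (0, w) for each edge weight w in a minimum spanning tree of the pairwise-distance graph. A minimal stdlib-only sketch (the point cloud and Euclidean metric below are illustrative assumptions, not taken from the paper):

```python
import math
from itertools import combinations

def h0_persistence(points):
    """H0 persistence diagram of the Vietoris-Rips filtration.

    Kruskal's algorithm with union-find: each edge that merges two
    components kills the younger one at its weight w, yielding the
    bar (0, w); one component survives forever.
    """
    n = len(points)
    edges = sorted((math.dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(n), 2))
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    bars = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:               # edge merges two components
            parent[ri] = rj        # a component dies at scale w
            bars.append((0.0, w))
    bars.append((0.0, math.inf))   # the surviving component
    return bars

# Two well-separated clusters: the largest finite death time
# reveals the gap between them, a crude diversity signal.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
diagram = h0_persistence(data)
print(max(d for _, d in diagram if d != math.inf))
```

Higher-dimensional features (loops, voids) require a full simplicial persistence computation, for which dedicated libraries are the practical choice.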
Related papers
- A Theory of the Mechanics of Information: Generalization Through Measurement of Uncertainty (Learning is Measuring) [0.0]
We introduce a model-free framework using surprisal (information-theoretic uncertainty) to analyze and perform inference from raw data. It eliminates distribution modeling, reduces bias, and enables efficient updates, including direct edits and deletion of training data. It emphasizes traceability, interpretability, and data-driven decision making, offering a unified, human-understandable framework for machine learning.
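As a toy illustration of the surprisal idea (the paper's framework is more general; the empirical-frequency estimate below is an assumption for illustration), the surprisal of an observation x is -log2 p(x), so rare observations score high without fitting any distribution:

```python
import math
from collections import Counter

def surprisal_scores(observations):
    """Empirical surprisal -log2 p(x) for each distinct observation,
    with p(x) estimated directly from the raw data (no model)."""
    counts = Counter(observations)
    total = len(observations)
    return {x: -math.log2(c / total) for x, c in counts.items()}

scores = surprisal_scores(["a"] * 7 + ["b"])
# "b" occurs once in eight observations: -log2(1/8) = 3 bits,
# far more surprising than the common "a".
print(scores["b"], scores["a"])
```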
arXiv Detail & Related papers (2025-10-26T19:45:25Z) - Data Shift of Object Detection in Autonomous Driving [0.40792653193642503]
We study the data shift problem in autonomous driving object detection tasks. We employ shift detection analysis techniques to perform dataset categorization and balancing. To validate our approach, we optimize the model by integrating CycleGAN-based data augmentation techniques with the YOLOv5 framework.
arXiv Detail & Related papers (2025-08-16T01:52:31Z) - Benchmarking Federated Machine Unlearning methods for Tabular Data [9.30408906787193]
Machine unlearning enables a model to forget specific data upon request. This paper presents a pioneering study on benchmarking machine unlearning methods within a federated setting. We explore unlearning at the feature and instance levels, employing both random forest and logistic regression models.
arXiv Detail & Related papers (2025-04-01T15:53:36Z) - Efficient Multi-Agent System Training with Data Influence-Oriented Tree Search [59.75749613951193]
We propose Data Influence-oriented Tree Search (DITS) to guide both tree search and data selection. By leveraging influence scores, we effectively identify the most impactful data for system improvement. We derive influence score estimation methods tailored for non-differentiable metrics.
arXiv Detail & Related papers (2025-02-02T23:20:16Z) - Capturing the Temporal Dependence of Training Data Influence [100.91355498124527]
We formalize the concept of trajectory-specific leave-one-out (LOO) influence, which quantifies the impact of removing a data point during training. We propose data value embedding, a novel technique enabling efficient approximation of trajectory-specific LOO. As data value embedding captures training data ordering, it offers valuable insights into model training dynamics.
arXiv Detail & Related papers (2024-12-12T18:28:55Z) - Sexism Detection on a Data Diet [14.899608305188002]
We show how we can leverage influence scores to estimate the importance of a data point while training a model.
We evaluate the performance of models trained on data pruned with different pruning strategies, testing on three out-of-domain datasets.
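The pruning step described above reduces, in its simplest form, to ranking examples by a precomputed influence score and keeping the top fraction. A minimal sketch (the scores and keep fraction are illustrative placeholders, not values from the paper):

```python
def prune_by_influence(examples, scores, keep_fraction=0.5):
    """Keep the highest-influence fraction of examples; drop the rest."""
    ranked = sorted(zip(scores, examples), reverse=True)
    k = max(1, int(len(examples) * keep_fraction))
    return [ex for _, ex in ranked[:k]]

data = ["ex1", "ex2", "ex3", "ex4"]
influence = [0.9, 0.1, 0.7, 0.3]  # hypothetical per-example scores
print(prune_by_influence(data, influence))  # → ['ex1', 'ex3']
```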
arXiv Detail & Related papers (2024-06-07T12:39:54Z) - Distilled Datamodel with Reverse Gradient Matching [74.75248610868685]
We introduce an efficient framework for assessing data impact, comprising offline training and online evaluation stages.
Our proposed method achieves comparable model behavior evaluation while significantly speeding up the process compared to the direct retraining method.
arXiv Detail & Related papers (2024-04-22T09:16:14Z) - Topological Quality of Subsets via Persistence Matching Diagrams [0.196629787330046]
We measure the quality of a subset concerning the dataset it represents using topological data analysis techniques.
In particular, this approach enables us to explain why the chosen subset is likely to result in poor performance of a supervised learning model.
arXiv Detail & Related papers (2023-06-04T17:08:41Z) - Towards Robust Dataset Learning [90.2590325441068]
We propose a principled, tri-level optimization to formulate the robust dataset learning problem.
Under an abstraction model that characterizes robust vs. non-robust features, the proposed method provably learns a robust dataset.
arXiv Detail & Related papers (2022-11-19T17:06:10Z) - Striving for data-model efficiency: Identifying data externalities on group performance [75.17591306911015]
Building trustworthy, effective, and responsible machine learning systems hinges on understanding how differences in training data and modeling decisions interact to impact predictive performance.
We focus on a particular type of data-model inefficiency, in which adding training data from some sources can actually lower performance evaluated on key sub-groups of the population.
Our results indicate that data-efficiency is a key component of both accurate and trustworthy machine learning.
arXiv Detail & Related papers (2022-11-11T16:48:27Z) - Homogenization of Existing Inertial-Based Datasets to Support Human Activity Recognition [8.076841611508486]
Several techniques have been proposed to address the problem of recognizing activities of daily living from signals.
Deep learning techniques applied to inertial signals have proven to be effective, achieving significant classification accuracy.
Research on human activity recognition models has been almost entirely model-centric.
arXiv Detail & Related papers (2022-01-17T14:29:48Z) - How Training Data Impacts Performance in Learning-based Control [67.7875109298865]
This paper derives an analytical relationship between the density of the training data and the control performance.
We formulate a quality measure for the data set, which we refer to as the $\rho$-gap.
We show how the $\rho$-gap can be applied to a feedback linearizing control law.
arXiv Detail & Related papers (2020-05-25T12:13:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.