Density-Aware Farthest Point Sampling
- URL: http://arxiv.org/abs/2509.13213v1
- Date: Tue, 16 Sep 2025 16:19:14 GMT
- Title: Density-Aware Farthest Point Sampling
- Authors: Paolo Climaco, Jochen Garcke,
- Abstract summary: We introduce "Density-Aware Farthest Point Sampling" (DA-FPS), a novel sampling method.<n>We prove that DA-FPS provides approximate minimizers for a data-driven estimation of the weighted fill distance.<n>The results demonstrate that DA-FPS significantly reduces the mean absolute prediction error compared to other sampling strategies.
- Score: 2.9434930072968584
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We focus on training machine learning regression models in scenarios where the availability of labeled training data is limited due to computational constraints or high labeling costs. Thus, selecting suitable training sets from unlabeled data is essential for balancing performance and efficiency. For the selection of the training data, we focus on passive and model-agnostic sampling methods that only consider the data feature representations. We derive an upper bound for the expected prediction error of Lipschitz continuous regression models that linearly depends on the weighted fill distance of the training set, a quantity we can estimate simply by considering the data features. We introduce "Density-Aware Farthest Point Sampling" (DA-FPS), a novel sampling method. We prove that DA-FPS provides approximate minimizers for a data-driven estimation of the weighted fill distance, thereby aiming at minimizing our derived bound. We conduct experiments using two regression models across three datasets. The results demonstrate that DA-FPS significantly reduces the mean absolute prediction error compared to other sampling strategies.
Related papers
- Can Small Training Runs Reliably Guide Data Curation? Rethinking Proxy-Model Practice [109.9635246405237]
We show that the experiment conclusions about data quality can flip with even minor adjustments to training hyper parameters.<n>We introduce a simple patch to the evaluation protocol: using reduced learning rates for proxy model training.<n> Empirically, we validate this approach across 23 data recipes covering four critical dimensions of data curation.
arXiv Detail & Related papers (2025-12-30T23:02:44Z) - Understanding Data Influence with Differential Approximation [63.817689230826595]
We introduce a new formulation to approximate a sample's influence by accumulating the differences in influence between consecutive learning steps, which we term Diff-In.<n>By employing second-order approximations, we approximate these difference terms with high accuracy while eliminating the need for model convexity required by existing methods.<n>Our theoretical analysis demonstrates that Diff-In achieves significantly lower approximation error compared to existing influence estimators.
arXiv Detail & Related papers (2025-08-20T11:59:32Z) - Effective Data Pruning through Score Extrapolation [40.61665742457229]
We introduce a novel importance score extrapolation framework that requires training on only a small subset of data.<n>We present two initial approaches in this framework to accurately predict sample importance for the entire dataset using patterns learned from this minimal subset.<n>Our results indicate that score extrapolation is a promising direction to scale expensive score calculation methods, such as pruning, data attribution, or other tasks.
arXiv Detail & Related papers (2025-06-10T17:38:49Z) - A Bayesian Approach to Data Point Selection [24.98069363998565]
Data point selection (DPS) is becoming a critical topic in deep learning.
Existing approaches to DPS are predominantly based on a bi-level optimisation (BLO) formulation.
We propose a novel Bayesian approach to DPS.
arXiv Detail & Related papers (2024-11-06T09:04:13Z) - ReAugment: Model Zoo-Guided RL for Few-Shot Time Series Augmentation and Forecasting [74.00765474305288]
We present a pilot study on using reinforcement learning (RL) for time series data augmentation.<n>Our method, ReAugment, tackles three critical questions: which parts of the training set should be augmented, how the augmentation should be performed, and what advantages RL brings to the process.
arXiv Detail & Related papers (2024-09-10T07:34:19Z) - SwiftLearn: A Data-Efficient Training Method of Deep Learning Models
using Importance Sampling [3.8330834108666667]
We present SwiftLearn, a data-efficient approach to accelerate training of deep learning models.
This subset is selected based on an importance criteria measured over the entire dataset during warm-up stages.
We show that almost 90% of the data can be dropped achieving an end-to-end average speedup of 3.36x while keeping the average accuracy drop less than 0.92%.
arXiv Detail & Related papers (2023-11-25T22:51:01Z) - Online Importance Sampling for Stochastic Gradient Optimization [33.42221341526944]
We propose a practical algorithm that efficiently computes data importance on-the-fly during training.<n>We also introduce a novel metric based on the derivative of the loss w.r.t. the network output, designed for mini-batch importance sampling.
arXiv Detail & Related papers (2023-11-24T13:21:35Z) - TRIAGE: Characterizing and auditing training data for improved
regression [80.11415390605215]
We introduce TRIAGE, a novel data characterization framework tailored to regression tasks and compatible with a broad class of regressors.
TRIAGE utilizes conformal predictive distributions to provide a model-agnostic scoring method, the TRIAGE score.
We show that TRIAGE's characterization is consistent and highlight its utility to improve performance via data sculpting/filtering, in multiple regression settings.
arXiv Detail & Related papers (2023-10-29T10:31:59Z) - On minimizing the training set fill distance in machine learning regression [0.552480439325792]
We study a data selection approach that aims to minimize the fill distance of the selected set.
We show that selecting training sets with the FPS can also increase model stability for the specific case of Gaussian kernel regression approaches.
arXiv Detail & Related papers (2023-07-20T16:18:33Z) - A Meta-Learning Approach to Predicting Performance and Data Requirements [163.4412093478316]
We propose an approach to estimate the number of samples required for a model to reach a target performance.
We find that the power law, the de facto principle to estimate model performance, leads to large error when using a small dataset.
We introduce a novel piecewise power law (PPL) that handles the two data differently.
arXiv Detail & Related papers (2023-03-02T21:48:22Z) - Imputation-Free Learning from Incomplete Observations [73.15386629370111]
We introduce the importance of guided gradient descent (IGSGD) method to train inference from inputs containing missing values without imputation.
We employ reinforcement learning (RL) to adjust the gradients used to train the models via back-propagation.
Our imputation-free predictions outperform the traditional two-step imputation-based predictions using state-of-the-art imputation methods.
arXiv Detail & Related papers (2021-07-05T12:44:39Z) - Variational Bayesian Unlearning [54.26984662139516]
We study the problem of approximately unlearning a Bayesian model from a small subset of the training data to be erased.
We show that it is equivalent to minimizing an evidence upper bound which trades off between fully unlearning from erased data vs. not entirely forgetting the posterior belief.
In model training with VI, only an approximate (instead of exact) posterior belief given the full data can be obtained, which makes unlearning even more challenging.
arXiv Detail & Related papers (2020-10-24T11:53:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.