PEAKS: Selecting Key Training Examples Incrementally via Prediction Error Anchored by Kernel Similarity
- URL: http://arxiv.org/abs/2504.05250v2
- Date: Tue, 08 Apr 2025 02:48:22 GMT
- Title: PEAKS: Selecting Key Training Examples Incrementally via Prediction Error Anchored by Kernel Similarity
- Authors: Mustafa Burak Gurbuz, Xingyu Zheng, Constantine Dovrolis,
- Abstract summary: We study the Incremental Data Selection (IDS) problem, where examples arrive as a continuous stream, and need to be selected without access to the full data source.<n>We propose PEAKS (Prediction Error Anchored by Kernel Similarity), an efficient data selection method tailored for IDS.<n>Our evaluations demonstrate that PEAKS consistently outperforms existing selection strategies.
- Score: 6.6157730528755065
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As deep learning continues to be driven by ever-larger datasets, understanding which examples are most important for generalization has become a critical question. While progress in data selection continues, emerging applications require studying this problem in dynamic contexts. To bridge this gap, we pose the Incremental Data Selection (IDS) problem, where examples arrive as a continuous stream, and need to be selected without access to the full data source. In this setting, the learner must incrementally build a training dataset of predefined size while simultaneously learning the underlying task. We find that in IDS, the impact of a new sample on the model state depends fundamentally on both its geometric relationship in the feature space and its prediction error. Leveraging this insight, we propose PEAKS (Prediction Error Anchored by Kernel Similarity), an efficient data selection method tailored for IDS. Our comprehensive evaluations demonstrate that PEAKS consistently outperforms existing selection strategies. Furthermore, PEAKS yields increasingly better performance returns than random selection as training data size grows on real-world datasets.
Related papers
- FedAWA: Adaptive Optimization of Aggregation Weights in Federated Learning Using Client Vectors [50.131271229165165]
Federated Learning (FL) has emerged as a promising framework for distributed machine learning.
Data heterogeneity resulting from differences across user behaviors, preferences, and device characteristics poses a significant challenge for federated learning.
We propose Adaptive Weight Aggregation (FedAWA), a novel method that adaptively adjusts aggregation weights based on client vectors during the learning process.
arXiv Detail & Related papers (2025-03-20T04:49:40Z) - CleanSurvival: Automated data preprocessing for time-to-event models using reinforcement learning [0.0]
Data preprocessing is a critical yet frequently neglected aspect of machine learning.<n>CleanSurvival is a reinforcement-learning-based solution for optimizing preprocessing pipelines.<n>It can handle continuous and categorical variables, using Q-learning to select which combination of data imputation, outlier detection and feature extraction techniques achieves optimal performance.
arXiv Detail & Related papers (2025-02-06T10:33:37Z) - Capturing the Temporal Dependence of Training Data Influence [100.91355498124527]
We formalize the concept of trajectory-specific leave-one-out influence, which quantifies the impact of removing a data point during training.<n>We propose data value embedding, a novel technique enabling efficient approximation of trajectory-specific LOO.<n>As data value embedding captures training data ordering, it offers valuable insights into model training dynamics.
arXiv Detail & Related papers (2024-12-12T18:28:55Z) - Stochastic Gradient Descent with Adaptive Data [4.119418481809095]
gradient descent (SGD) is a powerful optimization technique that is particularly useful in online learning scenarios.
Applying SGD to policy optimization problems in operations research involves a distinct challenge: the policy changes the environment and thereby affects the data used to update the policy.
The influence of previous decisions on the data generated introduces bias in the gradient estimate, which presents a potential source of instability for online learning not present in the iid case.
We show that the convergence speed of SGD with adaptive data is largely similar to the classical iid setting, as long as the mixing time of the policy-induced dynamics is factored in.
arXiv Detail & Related papers (2024-10-02T02:58:32Z) - LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that the necessary reasoning skills for the intended downstream application.
arXiv Detail & Related papers (2024-02-06T19:18:04Z) - Exploring Data Redundancy in Real-world Image Classification through
Data Selection [20.389636181891515]
Deep learning models often require large amounts of data for training, leading to increased costs.
We present two data valuation metrics based on Synaptic Intelligence and gradient norms, respectively, to study redundancy in real-world image data.
Online and offline data selection algorithms are then proposed via clustering and grouping based on the examined data values.
arXiv Detail & Related papers (2023-06-25T03:31:05Z) - ASPEST: Bridging the Gap Between Active Learning and Selective
Prediction [56.001808843574395]
Selective prediction aims to learn a reliable model that abstains from making predictions when uncertain.
Active learning aims to lower the overall labeling effort, and hence human dependence, by querying the most informative examples.
In this work, we introduce a new learning paradigm, active selective prediction, which aims to query more informative samples from the shifted target domain.
arXiv Detail & Related papers (2023-04-07T23:51:07Z) - Online Coreset Selection for Rehearsal-based Continual Learning [65.85595842458882]
In continual learning, we store a subset of training examples (coreset) to be replayed later to alleviate catastrophic forgetting.
We propose Online Coreset Selection (OCS), a simple yet effective method that selects the most representative and informative coreset at each iteration.
Our proposed method maximizes the model's adaptation to a target dataset while selecting high-affinity samples to past tasks, which directly inhibits catastrophic forgetting.
arXiv Detail & Related papers (2021-06-02T11:39:25Z) - Improving Multi-Turn Response Selection Models with Complementary
Last-Utterance Selection by Instance Weighting [84.9716460244444]
We consider utilizing the underlying correlation in the data resource itself to derive different kinds of supervision signals.
We conduct extensive experiments in two public datasets and obtain significant improvement in both datasets.
arXiv Detail & Related papers (2020-02-18T06:29:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.