LARP: Learner-Agnostic Robust Data Prefiltering
- URL: http://arxiv.org/abs/2506.20573v3
- Date: Thu, 10 Jul 2025 08:40:09 GMT
- Title: LARP: Learner-Agnostic Robust Data Prefiltering
- Authors: Kristian Minchev, Dimitar Iliev Dimitrov, Nikola Konstantinov
- Abstract summary: We formalize the problem of Learner-Agnostic Robust data Prefiltering (LARP). Our theoretical results indicate that performing LARP on a heterogeneous set of learners leads to some loss in model performance. We explore the resulting utility loss and its dependence on the problem parameters.
- Score: 5.530212768657544
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The widespread availability of large public datasets is a key factor behind the recent successes of statistical inference and machine learning methods. However, these datasets often contain some low-quality or contaminated data, to which many learning procedures are sensitive. Therefore, the question of whether and how public datasets should be prefiltered to facilitate accurate downstream learning arises. On a technical level this requires the construction of principled data prefiltering methods which are learner-agnostic robust, in the sense of provably protecting a set of pre-specified downstream learners from corrupted data. In this work, we formalize the problem of Learner-Agnostic Robust data Prefiltering (LARP), which aims at finding prefiltering procedures that minimize a worst-case loss over a pre-specified set of learners. We first instantiate our framework in the context of scalar mean estimation with Huber estimators under the Huber data contamination model. We provide a hardness result on a specific problem instance and analyze several natural prefiltering procedures. Our theoretical results indicate that performing LARP on a heterogeneous set of learners leads to some loss in model performance compared to the alternative of prefiltering data for each learner/use-case individually. We explore the resulting utility loss and its dependence on the problem parameters via extensive experiments on real-world image and tabular data, observing statistically significant reduction in utility. Finally, we model the trade-off between the utility drop and the cost of repeated (learner-specific) prefiltering within a game-theoretic framework and showcase benefits of LARP for large datasets.
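The setting in the abstract can be illustrated with a small simulation. The following is a minimal sketch, not the authors' procedure: it assumes a standard normal clean distribution, a hypothetical outlier cluster at 8.0, a simple median-centered trimming prefilter, and an oracle that knows the true mean when selecting the keep fraction. The names `sample_huber_contaminated`, `huber_mean`, `trim_prefilter`, and `worst_case_error`, as well as the chosen `deltas` and `candidates`, are illustrative assumptions. It compares one shared prefilter, selected to minimize the worst-case error over a set of Huber learners, with prefilters selected per learner.

```python
import random
import statistics


def sample_huber_contaminated(n, eps, rng, outlier_loc=8.0):
    """Draw n points: N(0, 1) with prob. 1 - eps, else N(outlier_loc, 1)."""
    return [
        rng.gauss(outlier_loc, 1.0) if rng.random() < eps else rng.gauss(0.0, 1.0)
        for _ in range(n)
    ]


def huber_mean(xs, delta, iters=30):
    """Huber M-estimate of location via iteratively reweighted averaging."""
    mu = statistics.median(xs)
    for _ in range(iters):
        # Weight 1 in the quadratic zone, delta / |residual| in the linear zone.
        weights = [1.0 if abs(x - mu) <= delta else delta / abs(x - mu) for x in xs]
        mu = sum(w * x for w, x in zip(weights, xs)) / sum(weights)
    return mu


def trim_prefilter(xs, keep_fraction):
    """Keep the keep_fraction of points closest to the sample median."""
    med = statistics.median(xs)
    k = max(1, int(round(keep_fraction * len(xs))))
    return sorted(xs, key=lambda x: abs(x - med))[:k]


def worst_case_error(xs, deltas, true_mean=0.0):
    """Largest absolute estimation error over the set of Huber learners."""
    return max(abs(huber_mean(xs, d) - true_mean) for d in deltas)


rng = random.Random(0)
data = sample_huber_contaminated(n=1000, eps=0.1, rng=rng)
deltas = [0.5, 1.5, 5.0]  # hypothetical heterogeneous learner set
candidates = [1.0, 0.95, 0.9, 0.85, 0.8]  # hypothetical keep fractions

# Shared prefilter: one keep fraction minimizing the worst case over all learners.
# Oracle selection (uses the true mean), for illustration only.
shared = min(candidates, key=lambda f: worst_case_error(trim_prefilter(data, f), deltas))
shared_err = worst_case_error(trim_prefilter(data, shared), deltas)

# Learner-specific prefiltering: each learner gets its own best keep fraction.
specific_errs = [
    min(worst_case_error(trim_prefilter(data, f), [d]) for f in candidates)
    for d in deltas
]

print(f"shared keep fraction: {shared}, worst-case error: {shared_err:.4f}")
print("learner-specific errors:", [round(e, 4) for e in specific_errs])
```

By construction, the worst learner-specific error can never exceed the shared worst-case error, since every learner may also pick the shared keep fraction. The gap between the two is a toy analogue of the utility loss the abstract attributes to learner-agnostic prefiltering.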
Related papers
- DUPRE: Data Utility Prediction for Efficient Data Valuation [49.60564885180563]
Cooperative game theory-based data valuation, such as Data Shapley, requires evaluating the data utility and retraining the ML model for multiple data subsets. Our framework, DUPRE, takes an alternative yet complementary approach that reduces the cost per subset evaluation by predicting data utilities instead of evaluating them by model retraining. Specifically, given the evaluated data utilities of some data subsets, DUPRE fits a Gaussian process (GP) regression model to predict the utility of every other data subset.
arXiv Detail & Related papers (2025-02-22T08:53:39Z) - On the Influence of Data Resampling for Deep Learning-Based Log Anomaly Detection: Insights and Recommendations [10.931620604044486]
This study provides an in-depth analysis of the impact of diverse data resampling methods on existing AD approaches.
We assess the performance of these AD approaches across four datasets with different levels of class imbalance.
We evaluate the effectiveness of the data resampling methods when utilizing optimal resampling ratios of normal to abnormal data.
arXiv Detail & Related papers (2024-05-06T14:01:05Z) - Negative Preference Optimization: From Catastrophic Collapse to Effective Unlearning [28.059563581973432]
Large Language Models (LLMs) often memorize sensitive, private, or copyrighted data during pre-training.
LLM unlearning aims to eliminate the influence of undesirable data from the pre-trained model.
We propose Negative Preference Optimization (NPO) as a simple alignment-inspired method that could efficiently unlearn a target dataset.
arXiv Detail & Related papers (2024-04-08T21:05:42Z) - Learned Causal Method Prediction [10.36548504177628]
We propose CAusal Method Predictor (CAMP), a framework for predicting the best method for a given dataset.
We generate datasets from a diverse set of synthetic causal models, score the candidate methods, and train a model to directly predict the highest-scoring method for that dataset.
Our strategy learns to map implicit dataset properties to the best method in a data-driven manner.
arXiv Detail & Related papers (2023-11-07T13:39:17Z) - A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity [84.6421260559093]
This study is the largest set of experiments to validate, quantify, and expose undocumented intuitions about text pretraining.
Our findings indicate there does not exist a one-size-fits-all solution to filtering training data.
arXiv Detail & Related papers (2023-05-22T15:57:53Z) - Self-Supervised Learning for Data Scarcity in a Fatigue Damage Prognostic Problem [0.0]
Self-Supervised Learning is a sub-category of unsupervised learning approaches.
This paper investigates whether pre-training DL models in a self-supervised way on unlabelled sensor data can be useful for Remaining Useful Life (RUL) estimation.
Results show that the self-supervised pre-trained models are able to significantly outperform the non-pre-trained models in the downstream RUL prediction task.
arXiv Detail & Related papers (2023-01-20T06:45:32Z) - Risk Minimization from Adaptively Collected Data: Guarantees for Supervised and Policy Learning [57.88785630755165]
Empirical risk minimization (ERM) is the workhorse of machine learning, but its model-agnostic guarantees can fail when we use adaptively collected data.
We study a generic importance sampling weighted ERM algorithm for using adaptively collected data to minimize the average of a loss function over a hypothesis class.
For policy learning, we provide rate-optimal regret guarantees that close an open gap in the existing literature whenever exploration decays to zero.
arXiv Detail & Related papers (2021-06-03T09:50:13Z) - Is Pessimism Provably Efficient for Offline RL? [104.00628430454479]
We study offline reinforcement learning (RL), which aims to learn an optimal policy based on a dataset collected a priori.
We propose a pessimistic variant of the value iteration algorithm (PEVI), which incorporates an uncertainty quantifier as the penalty function.
arXiv Detail & Related papers (2020-12-30T09:06:57Z) - Graph Embedding with Data Uncertainty [113.39838145450007]
Spectral-based subspace learning is a common data preprocessing step in many machine learning pipelines.
Most subspace learning methods do not take into consideration possible measurement inaccuracies or artifacts that can lead to data with high uncertainty.
arXiv Detail & Related papers (2020-09-01T15:08:23Z) - Provably Efficient Causal Reinforcement Learning with Confounded Observational Data [135.64775986546505]
We study how to incorporate the dataset (observational data) collected offline, which is often abundantly available in practice, to improve the sample efficiency in the online setting.
We propose the deconfounded optimistic value iteration (DOVI) algorithm, which incorporates the confounded observational data in a provably efficient manner.
arXiv Detail & Related papers (2020-06-22T14:49:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.