Collaborative Prediction: To Join or To Disjoin Datasets
- URL: http://arxiv.org/abs/2506.11271v1
- Date: Thu, 12 Jun 2025 20:25:07 GMT
- Title: Collaborative Prediction: To Join or To Disjoin Datasets
- Authors: Kyung Rok Kim, Yansong Wang, Xiaocheng Li, Guanting Chen,
- Abstract summary: We study the problem of developing practical algorithms that select appropriate dataset to minimize population loss.<n>By leveraging an oracle inequality and data-driven estimators, the algorithm reduces population loss with high probability.
- Score: 5.9697789282446605
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the recent rise of generative Artificial Intelligence (AI), the need of selecting high-quality dataset to improve machine learning models has garnered increasing attention. However, some part of this topic remains underexplored, even for simple prediction models. In this work, we study the problem of developing practical algorithms that select appropriate dataset to minimize population loss of our prediction model with high probability. Broadly speaking, we investigate when datasets from different sources can be effectively merged to enhance the predictive model's performance, and propose a practical algorithm with theoretical guarantees. By leveraging an oracle inequality and data-driven estimators, the algorithm reduces population loss with high probability. Numerical experiments demonstrate its effectiveness in both standard linear regression and broader machine learning applications. Code is available at https://github.com/kkrokii/collaborative_prediction.
Related papers
- Privacy-Preserving Methods for Bug Severity Prediction [0.0]
We investigate method-level bug severity prediction using source code metrics and Large Language Models.<n>We compare the performance of models trained using centralized learning, federated learning, and synthetic data generation.<n>Our finding highlights the potential of privacy-preserving approaches to enable effective bug severity prediction in industrial context.
arXiv Detail & Related papers (2025-06-28T04:40:51Z) - DUPRE: Data Utility Prediction for Efficient Data Valuation [49.60564885180563]
Cooperative game theory-based data valuation, such as Data Shapley, requires evaluating the data utility and retraining the ML model for multiple data subsets.<n>Our framework, textttDUPRE, takes an alternative yet complementary approach that reduces the cost per subset evaluation by predicting data utilities instead of evaluating them by model retraining.<n>Specifically, given the evaluated data utilities of some data subsets, textttDUPRE fits a emphGaussian process (GP) regression model to predict the utility of every other data subset.
arXiv Detail & Related papers (2025-02-22T08:53:39Z) - Attribute-to-Delete: Machine Unlearning via Datamodel Matching [65.13151619119782]
Machine unlearning -- efficiently removing a small "forget set" training data on a pre-divertrained machine learning model -- has recently attracted interest.
Recent research shows that machine unlearning techniques do not hold up in such a challenging setting.
arXiv Detail & Related papers (2024-10-30T17:20:10Z) - Hessian-Free Online Certified Unlearning [8.875278412741695]
We develop an online unlearning algorithm that achieves near-instantaneous data removal.<n>We prove that our proposed method outperforms the state-of-the-art methods in terms of the unlearning and generalization guarantees.
arXiv Detail & Related papers (2024-04-02T07:54:18Z) - Learning Defect Prediction from Unrealistic Data [57.53586547895278]
Pretrained models of code have become popular choices for code understanding and generation tasks.
Such models tend to be large and require commensurate volumes of training data.
It has become popular to train models with far larger but less realistic datasets, such as functions with artificially injected bugs.
Models trained on such data tend to only perform well on similar data, while underperforming on real world programs.
arXiv Detail & Related papers (2023-11-02T01:51:43Z) - Machine Learning Based Missing Values Imputation in Categorical Datasets [2.5611256859404983]
This research looked into the use of machine learning algorithms to fill in the gaps in categorical datasets.
The emphasis was on ensemble models constructed using the Error Correction Output Codes framework.
Deep learning for missing data imputation has obstacles despite these encouraging results, including the requirement for large amounts of labeled data.
arXiv Detail & Related papers (2023-06-10T03:29:48Z) - Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z) - ASPEST: Bridging the Gap Between Active Learning and Selective
Prediction [56.001808843574395]
Selective prediction aims to learn a reliable model that abstains from making predictions when uncertain.
Active learning aims to lower the overall labeling effort, and hence human dependence, by querying the most informative examples.
In this work, we introduce a new learning paradigm, active selective prediction, which aims to query more informative samples from the shifted target domain.
arXiv Detail & Related papers (2023-04-07T23:51:07Z) - Interpretable models for extrapolation in scientific machine learning [0.0]
Complex machine learning algorithms often outperform simple regressions in interpolative settings.
We examine the trade-off between model performance and interpretability across a broad range of science and engineering problems.
arXiv Detail & Related papers (2022-12-16T19:33:28Z) - X-model: Improving Data Efficiency in Deep Learning with A Minimax Model [78.55482897452417]
We aim at improving data efficiency for both classification and regression setups in deep learning.
To take the power of both worlds, we propose a novel X-model.
X-model plays a minimax game between the feature extractor and task-specific heads.
arXiv Detail & Related papers (2021-10-09T13:56:48Z) - Discrete-Valued Latent Preference Matrix Estimation with Graph Side
Information [12.836994708337144]
We develop an algorithm that matches the optimal sample complexity.
Our algorithm is robust to model errors and outperforms the existing algorithms in terms of prediction performance.
arXiv Detail & Related papers (2020-03-16T06:29:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.