Related papers: SourceSplice: Source Selection for Machine Learning Tasks

URL: http://arxiv.org/abs/2507.22186v2
Date: Thu, 31 Jul 2025 18:46:06 GMT
Title: SourceSplice: Source Selection for Machine Learning Tasks
Authors: Ambarish Singh, Romila Pradhan
Abstract summary: Data quality plays a pivotal role in the predictive performance of machine learning (ML) tasks. This paper addresses the problem of determining the best subset of data sources that must be combined to construct the underlying training dataset. We propose SourceGrasp and SourceSplice, frameworks designed to efficiently select a suitable subset of sources.
Score: 3.3916160303055563
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Data quality plays a pivotal role in the predictive performance of machine learning (ML) tasks - a challenge amplified by the deluge of data sources available in modern organizations. Prior work in data discovery largely focuses on metadata matching, semantic similarity, or identifying tables that should be joined to answer a particular query, but does not consider source quality for high performance of the downstream ML task. This paper addresses the problem of determining the best subset of data sources that must be combined to construct the underlying training dataset for a given ML task. We propose SourceGrasp and SourceSplice, frameworks designed to efficiently select a suitable subset of sources that maximizes the utility of the downstream ML model. Both algorithms rely on the core idea that sources (or their combinations) contribute differently to the task utility and must be judiciously chosen. While SourceGrasp utilizes a metaheuristic based on a greediness criterion and randomization, the SourceSplice framework presents a source selection mechanism inspired by gene splicing - a core concept used in protein synthesis. We empirically evaluate our algorithms on three real-world datasets and synthetic datasets and show that, with significantly fewer subset explorations, SourceSplice effectively identifies subsets of data sources leading to high task utility. We also conduct studies reporting the sensitivity of SourceSplice to the decision choices under several settings.

Related papers

On the Power of Source Screening for Learning Shared Feature Extractors [33.10812756558517]
It is well understood that data sources with low relevance or poor quality may hinder representation learning. We focus on the question of which data sources should be learned jointly by focusing on the traditionally deemed "good" collection of sources. We find that source screening can play a central role in statistically optimal subspace estimation.
arXiv Detail & Related papers (2026-02-18T01:32:10Z)
MSRS: Evaluating Multi-Source Retrieval-Augmented Generation [51.717139132190574]
Many real-world applications demand the ability to integrate and summarize information scattered across multiple sources. We present a scalable framework for constructing evaluation benchmarks that challenge RAG systems to integrate information across distinct sources.
arXiv Detail & Related papers (2025-08-28T14:59:55Z)
Transfer Learning for Matrix Completion [0.0]
We propose a transfer learning procedure given prior information on which source datasets are favorable. With the source matrices close enough to the target matrix, our method outperforms the traditional method using the single target data.
arXiv Detail & Related papers (2025-07-03T02:43:40Z)
A Theoretical Framework for Data Efficient Multi-Source Transfer Learning Based on Cramér-Rao Bound [16.49737340580437]
We propose a theoretical framework that answers the question: what is the optimal quantity of source samples needed from each source task to jointly train the target model? Specifically, we introduce a generalization error measure that aligns with cross-entropy loss, and minimize it based on the Cramér-Rao Bound to determine the optimal transfer quantity for each source task. We develop an architecture-agnostic and data-efficient algorithm OTQMS to implement our theoretical results for training deep multi-source transfer learning models.
arXiv Detail & Related papers (2025-02-06T17:32:49Z)
Adapt-$\infty$: Scalable Continual Multimodal Instruction Tuning via Dynamic Data Selection [89.42023974249122]
Adapt-$\infty$ is a new multi-way and adaptive data selection approach for lifelong instruction tuning. We construct pseudo-skill clusters by grouping gradient-based sample vectors. We select the best-performing data selector for each skill cluster from a pool of selector experts. This data selector samples a subset of the most important samples from each skill cluster for training.
arXiv Detail & Related papers (2024-10-14T15:48:09Z)
Interpretable Multi-Source Data Fusion Through Latent Variable Gaussian Process [8.207427766052044]
The proposed approach is demonstrated on and analyzed through two mathematical and two materials science case studies. Compared to single-source and source-unaware machine learning models, the proposed multi-source data fusion framework can provide better predictions for sparse-data problems.
arXiv Detail & Related papers (2024-02-06T16:54:59Z)
Source-Free Collaborative Domain Adaptation via Multi-Perspective Feature Enrichment for Functional MRI Analysis [55.03872260158717]
Resting-state functional MRI (rs-fMRI) is increasingly employed in multi-site research to aid neurological disorder analysis. Many methods have been proposed to reduce fMRI heterogeneity between source and target domains. But acquiring source data is challenging due to privacy concerns and/or data storage burdens in multi-site studies. We design a source-free collaborative domain adaptation framework for fMRI analysis, where only a pretrained source model and unlabeled target data are accessible.
arXiv Detail & Related papers (2023-08-24T01:30:18Z)
Creating Synthetic Datasets for Collaborative Filtering Recommender Systems using Generative Adversarial Networks [1.290382979353427]
Research and education in machine learning need diverse, representative, and open datasets to handle the necessary training, validation, and testing tasks. To feed this research variety, it is necessary and convenient to reinforce the existing datasets with synthetic ones. This paper proposes a Generative Adversarial Network (GAN)-based method to generate collaborative filtering datasets.
arXiv Detail & Related papers (2023-03-02T14:23:27Z)
Source data selection for out-of-domain generalization [0.76146285961466]
Poor selection of a source dataset can lead to poor performance on the target. We propose two source selection methods that are based on the multi-bandit theory and random search. Our proposals can be viewed as diagnostics for the existence of reweighted source subsamples that perform better than a random selection of the available samples.
arXiv Detail & Related papers (2022-02-04T14:37:31Z)
Optimal Data Selection: An Online Distributed View [61.31708750038692]
We develop algorithms for the online and distributed version of the problem. In learning tasks on ImageNet and MNIST, we show that our selection methods outperform random selection by 5-20%.
arXiv Detail & Related papers (2022-01-25T18:56:16Z)
Deep Transfer Learning for Multi-source Entity Linkage via Domain Adaptation [63.24594955429465]
Multi-source entity linkage is critical in high-impact applications such as data cleaning and user stitching. AdaMEL is a deep transfer learning framework that learns generic high-level knowledge to perform multi-source entity linkage. Our framework achieves state-of-the-art results with 8.21% improvement on average over methods based on supervised learning.
arXiv Detail & Related papers (2021-10-27T15:20:41Z)
Unsupervised Multi-source Domain Adaptation Without Access to Source Data [58.551861130011886]
Unsupervised Domain Adaptation (UDA) aims to learn a predictor model for an unlabeled domain by transferring knowledge from a separate labeled source domain. We propose a novel and efficient algorithm which automatically combines the source models with suitable weights in such a way that it performs at least as well as the best source model.
arXiv Detail & Related papers (2021-04-05T10:45:12Z)
MISO-wiLDCosts: Multi Information Source Optimization with Location Dependent Costs [0.0]
This paper addresses black-box optimization over multiple information sources whose fidelity and query cost both change over the search space, that is, they are location dependent. The approach uses: (i) an Augmented Gaussian Process, recently proposed in multi-information source optimization as a single model of the objective function over search space and sources, and (ii) a Gaussian Process to model the location-dependent cost of each source.
arXiv Detail & Related papers (2021-02-09T17:04:17Z)
Resource Allocation via Model-Free Deep Learning in Free Space Optical Communications [119.81868223344173]
The paper investigates the general problem of resource allocation for mitigating channel fading effects in Free Space Optical (FSO) communications. Under this framework, we propose two algorithms that solve FSO resource allocation problems.
arXiv Detail & Related papers (2020-07-27T17:38:51Z)
Do We Really Need to Access the Source Data? Source Hypothesis Transfer for Unsupervised Domain Adaptation [102.67010690592011]
Unsupervised domain adaptation (UDA) aims to leverage the knowledge learned from a labeled source dataset to solve similar tasks in a new unlabeled domain. Prior UDA methods typically require access to the source data when learning to adapt the model. This work tackles a practical setting where only a trained source model is available and asks how we can effectively utilize such a model without source data to solve UDA problems.
arXiv Detail & Related papers (2020-02-20T03:13:58Z)

This list is automatically generated from the titles and abstracts of the papers on this site.