On the Power of Source Screening for Learning Shared Feature Extractors
- URL: http://arxiv.org/abs/2602.16125v1
- Date: Wed, 18 Feb 2026 01:32:10 GMT
- Title: On the Power of Source Screening for Learning Shared Feature Extractors
- Authors: Leo Wang, Connor Mclaughlin, Lili Su
- Abstract summary: It is well understood that data sources with low relevance or poor quality may hinder representation learning. We examine which data sources should be learned jointly, focusing on the traditionally deemed "good" collection of sources. We find that source screening can play a central role in statistically optimal subspace estimation.
- Score: 33.10812756558517
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Learning with shared representation is widely recognized as an effective way to separate commonalities from heterogeneity across various heterogeneous sources. Most existing work includes all related data sources via simultaneously training a common feature extractor and source-specific heads. It is well understood that data sources with low relevance or poor quality may hinder representation learning. In this paper, we further dive into the question of which data sources should be learned jointly by focusing on the traditionally deemed "good" collection of sources, in which individual sources have similar relevance and qualities with respect to the true underlying common structure. Towards tractability, we focus on the linear setting where sources share a low-dimensional subspace. We find that source screening can play a central role in statistically optimal subspace estimation. We show that, for a broad class of problem instances, training on a carefully selected subset of sources suffices to achieve minimax optimality, even when a substantial portion of data is discarded. We formalize the notion of an informative subpopulation, develop algorithms and practical heuristics for identifying such subsets, and validate their effectiveness through both theoretical analysis and empirical evaluations on synthetic and real-world datasets.
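The linear setting described in the abstract can be illustrated with a minimal sketch: each source's regression vector lies in a shared low-dimensional subspace, per-source estimates are screened by a simple heuristic score, and the subspace is recovered from the retained sources via SVD. Note this is an illustrative toy, not the paper's actual algorithm; the norm-based screening score and all parameter values below are assumptions for demonstration only.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n_sources, n_per = 20, 3, 30, 50  # assumed toy dimensions

# Shared low-dimensional structure: a d x r orthonormal basis B.
B, _ = np.linalg.qr(rng.standard_normal((d, r)))

# Each source s observes y = X @ (B @ w_s) + noise.
Xs, ys = [], []
for _ in range(n_sources):
    w = rng.standard_normal(r)
    X = rng.standard_normal((n_per, d))
    y = X @ (B @ w) + 0.5 * rng.standard_normal(n_per)
    Xs.append(X)
    ys.append(y)

# Per-source OLS estimates of the regression vectors.
beta_hats = [np.linalg.lstsq(X, y, rcond=None)[0] for X, y in zip(Xs, ys)]

# Screening heuristic (an assumption, not the paper's method):
# keep the k sources with the largest estimated signal strength.
scores = [np.linalg.norm(b) for b in beta_hats]
k = 15
keep = np.argsort(scores)[-k:]

# Recover the subspace from the top-r left singular vectors of the
# stacked estimates from the retained sources only.
M = np.column_stack([beta_hats[i] for i in keep])
U, _, _ = np.linalg.svd(M, full_matrices=False)
B_hat = U[:, :r]

# Subspace estimation error: spectral norm of the projection difference.
P, P_hat = B @ B.T, B_hat @ B_hat.T
dist = np.linalg.norm(P - P_hat, 2)
print(round(dist, 3))
```

Even with half the sources discarded, the SVD of the retained estimates recovers the shared subspace to small error in this toy, which mirrors the abstract's claim that training on a well-chosen subset can suffice.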
Related papers
- Leveraging LLM Parametric Knowledge for Fact Checking without Retrieval [60.25608870901428]
Trustworthiness is a core research challenge for agentic AI systems built on Large Language Models (LLMs). We propose the task of fact-checking without retrieval, focusing on the verification of arbitrary natural language claims, independent of their source robustness.
arXiv Detail & Related papers (2026-03-05T18:42:51Z) - SourceSplice: Source Selection for Machine Learning Tasks [3.3916160303055563]
Data quality plays a pivotal role in the predictive performance of machine learning (ML) tasks. This paper addresses the problem of determining the best subset of data sources that must be combined to construct the underlying training dataset. We propose SourceGrasp and SourceSplice, frameworks designed to efficiently select a suitable subset of sources.
arXiv Detail & Related papers (2025-07-29T19:29:52Z) - Combining inherent knowledge of vision-language models with unsupervised domain adaptation through strong-weak guidance [44.1830188215271]
Unsupervised domain adaptation (UDA) tries to overcome the tedious work of labeling data by leveraging a labeled source dataset. Current vision-language models exhibit remarkable zero-shot prediction capabilities. We introduce a strong-weak guidance learning scheme that employs zero-shot predictions to help align the source and target datasets.
arXiv Detail & Related papers (2023-12-07T06:16:39Z) - An Adaptive Kernel Approach to Federated Learning of Heterogeneous Causal Effects [10.248235276871256]
We propose a new causal inference framework to learn causal effects from multiple, decentralized data sources.
We introduce an adaptive transfer algorithm that learns the similarities among the data sources.
The proposed method is empirically shown to outperform the baselines on decentralized data sources with dissimilar distributions.
arXiv Detail & Related papers (2023-01-01T04:57:48Z) - Multi-View Independent Component Analysis with Shared and Individual Sources [0.0]
Independent component analysis (ICA) is a blind source separation method for linear disentanglement of independent latent sources from observed data.
We prove that the corresponding linear structure is identifiable, and the shared sources can be recovered, provided that sufficiently many diverse views and data points are available.
We show empirically that our objective recovers the sources in high dimensional settings, also in the case when the measurements are corrupted by noise.
arXiv Detail & Related papers (2022-10-05T08:23:05Z) - Heterogeneous Target Speech Separation [52.05046029743995]
We introduce a new paradigm for single-channel target source separation where the sources of interest can be distinguished using non-mutually exclusive concepts.
Our proposed heterogeneous separation framework can seamlessly leverage datasets with large distribution shifts.
arXiv Detail & Related papers (2022-04-07T17:14:20Z) - Deep Transfer Learning for Multi-source Entity Linkage via Domain Adaptation [63.24594955429465]
Multi-source entity linkage is critical in high-impact applications such as data cleaning and user stitching.
AdaMEL is a deep transfer learning framework that learns generic high-level knowledge to perform multi-source entity linkage.
Our framework achieves state-of-the-art results with 8.21% improvement on average over methods based on supervised learning.
arXiv Detail & Related papers (2021-10-27T15:20:41Z) - Learning Bias-Invariant Representation by Cross-Sample Mutual Information Minimization [77.8735802150511]
We propose a cross-sample adversarial debiasing (CSAD) method to remove the bias information misused by the target task.
The correlation measurement plays a critical role in adversarial debiasing and is conducted by a cross-sample neural mutual information estimator.
We conduct thorough experiments on publicly available datasets to validate the advantages of the proposed method over state-of-the-art approaches.
arXiv Detail & Related papers (2021-08-11T21:17:02Z) - "Don't quote me on that": Finding Mixtures of Sources in News Articles [85.92467549469147]
We construct an ontological labeling system for sources based on each source's affiliation and role.
We build a probabilistic model to infer these attributes for named sources and to describe news articles as mixtures of these sources.
arXiv Detail & Related papers (2021-04-19T21:57:11Z) - Unsupervised Multi-source Domain Adaptation Without Access to Source Data [58.551861130011886]
Unsupervised Domain Adaptation (UDA) aims to learn a predictor model for an unlabeled domain by transferring knowledge from a separate labeled source domain.
We propose a novel and efficient algorithm which automatically combines the source models with suitable weights in such a way that it performs at least as good as the best source model.
arXiv Detail & Related papers (2021-04-05T10:45:12Z) - InSRL: A Multi-view Learning Framework Fusing Multiple Information Sources for Distantly-supervised Relation Extraction [19.176183245280267]
We introduce two widely-existing sources in knowledge bases, namely entity descriptions and multi-grained entity types.
An end-to-end multi-view learning framework is proposed for relation extraction via Intact Space Representation Learning (InSRL).
arXiv Detail & Related papers (2020-12-17T02:49:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.