CTRL Your Shift: Clustered Transfer Residual Learning for Many Small Datasets
- URL: http://arxiv.org/abs/2508.11144v1
- Date: Fri, 15 Aug 2025 01:27:17 GMT
- Title: CTRL Your Shift: Clustered Transfer Residual Learning for Many Small Datasets
- Authors: Gauri Jain, Dominik Rothenhäusler, Kirk Bansak, Elisabeth Paulson
- Abstract summary: Clustered Transfer Residual Learning (CTRL) is a meta-learning method that combines the strengths of cross-domain residual learning and adaptive pooling/clustering. We provide theoretical results that clarify how our objective navigates the trade-off between data quantity and data quality.
- Score: 1.7624347338410744
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Machine learning (ML) tasks often utilize large-scale data that is drawn from several distinct sources, such as different locations, treatment arms, or groups. In such settings, practitioners often desire predictions that not only exhibit good overall accuracy, but also remain reliable within each source and preserve the differences that matter across sources. For instance, several asylum and refugee resettlement programs now use ML-based employment predictions to guide where newly arriving families are placed within a host country, which requires generating informative and differentiated predictions for many and often small source locations. However, this task is made challenging by several common characteristics of the data in these settings: the presence of numerous distinct data sources, distributional shifts between them, and substantial variation in sample sizes across sources. This paper introduces Clustered Transfer Residual Learning (CTRL), a meta-learning method that combines the strengths of cross-domain residual learning and adaptive pooling/clustering in order to simultaneously improve overall accuracy and preserve source-level heterogeneity. We provide theoretical results that clarify how our objective navigates the trade-off between data quantity and data quality. We evaluate CTRL alongside other state-of-the-art benchmarks on 5 large-scale datasets. This includes a dataset from the national asylum program in Switzerland, where the algorithmic geographic assignment of asylum seekers is currently being piloted. CTRL consistently outperforms the benchmarks across several key metrics and when using a range of different base learners.
Related papers
- DataS^3: Dataset Subset Selection for Specialization [60.589117206895125]
We introduce DataS3, the first dataset and benchmark designed specifically for the DS3 problem. DataS3 encompasses diverse real-world application domains, each with a set of distinct deployments to specialize in. We demonstrate the existence of manually curated (deployment-specific) expert subsets that outperform training on all available data with accuracy gains up to 51.3 percent.
arXiv Detail & Related papers (2025-04-22T21:25:14Z)
- Curriculum Learning with Quality-Driven Data Selection [6.794629387975326]
OpenAI's GPT-4 has generated significant interest in the development of Multimodal Large Language Models (MLLMs). We propose a novel data selection methodology that utilizes image-text correlation and model perplexity to evaluate and select data of varying quality.
arXiv Detail & Related papers (2024-06-27T07:20:36Z)
- Think Twice Before Selection: Federated Evidential Active Learning for Medical Image Analysis with Domain Shifts [11.562953837452126]
We make the first attempt to assess the informativeness of local data derived from diverse domains.
We propose a novel methodology termed Federated Evidential Active Learning (FEAL) to calibrate the data evaluation under domain shift.
arXiv Detail & Related papers (2023-12-05T08:32:27Z)
- Harnessing Administrative Data Inventories to Create a Reliable Transnational Reference Database for Crop Type Monitoring [0.0]
We showcase EuroCrops, a reference dataset for crop type classification that aggregates and harmonizes administrative data surveyed in different countries with the goal of transnational interoperability.
arXiv Detail & Related papers (2023-10-10T07:57:00Z)
- Data Quality in Imitation Learning [15.939363481618738]
In offline learning for robotics, we simply lack internet scale data, and so high quality datasets are a necessity.
This is especially true in imitation learning (IL), a sample efficient paradigm for robot learning using expert demonstrations.
In this work, we take the first step toward formalizing data quality for imitation learning through the lens of distribution shift.
arXiv Detail & Related papers (2023-06-04T18:48:32Z)
- Deep Transfer Learning for Multi-source Entity Linkage via Domain Adaptation [63.24594955429465]
Multi-source entity linkage is critical in high-impact applications such as data cleaning and user stitching.
AdaMEL is a deep transfer learning framework that learns generic high-level knowledge to perform multi-source entity linkage.
Our framework achieves state-of-the-art results with 8.21% improvement on average over methods based on supervised learning.
arXiv Detail & Related papers (2021-10-27T15:20:41Z)
- Exploiting Shared Representations for Personalized Federated Learning [54.65133770989836]
We propose a novel federated learning framework and algorithm for learning a shared data representation across clients and unique local heads for each client.
Our algorithm harnesses the distributed computational power across clients to perform many local-updates with respect to the low-dimensional local parameters for every update of the representation.
This result is of interest beyond federated learning to a broad class of problems in which we aim to learn a shared low-dimensional representation among data distributions.
arXiv Detail & Related papers (2021-02-14T05:36:25Z)
- WILDS: A Benchmark of in-the-Wild Distribution Shifts [157.53410583509924]
Distribution shifts can substantially degrade the accuracy of machine learning systems deployed in the wild.
We present WILDS, a curated collection of 8 benchmark datasets that reflect a diverse range of distribution shifts.
We show that standard training results in substantially lower out-of-distribution than in-distribution performance.
arXiv Detail & Related papers (2020-12-14T11:14:56Z)
- Dif-MAML: Decentralized Multi-Agent Meta-Learning [54.39661018886268]
We propose a cooperative multi-agent meta-learning algorithm, referred to as Diffusion-based MAML or Dif-MAML.
We show that the proposed strategy allows a collection of agents to attain agreement at a linear rate and to converge to a stationary point of the aggregate MAML objective.
Simulation results illustrate the theoretical findings and the superior performance relative to the traditional non-cooperative setting.
arXiv Detail & Related papers (2020-10-06T16:51:09Z)
- Diversity inducing Information Bottleneck in Model Ensembles [73.80615604822435]
In this paper, we target the problem of generating effective ensembles of neural networks by encouraging diversity in prediction.
We explicitly optimize a diversity inducing adversarial loss for learning latent variables and thereby obtain diversity in the output predictions necessary for modeling multi-modal data.
Compared to the most competitive baselines, we show significant improvements in classification accuracy under a shift in the data distribution.
arXiv Detail & Related papers (2020-03-10T03:10:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.