Quantifying Dataset Similarity to Guide Transfer Learning
- URL: http://arxiv.org/abs/2510.10866v2
- Date: Sat, 25 Oct 2025 04:27:59 GMT
- Title: Quantifying Dataset Similarity to Guide Transfer Learning
- Authors: Shudong Sun, Hao Helen Zhang
- Abstract summary: Cross-Learning Score (CLS) measures dataset similarity through bidirectional generalization performance between domains. CLS can reliably predict whether transfer will improve or degrade performance. CLS is efficient and fast to compute as it bypasses the problem of expensive distribution estimation for high-dimensional problems.
- Score: 1.6328866317851185
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transfer learning has become a cornerstone of modern machine learning, as it can empower models by leveraging knowledge from related domains to improve learning effectiveness. However, transferring from poorly aligned data can harm rather than help performance, making it crucial to determine whether the transfer will be beneficial before implementation. This work aims to address this challenge by proposing an innovative metric to measure dataset similarity and provide quantitative guidance on transferability. In the literature, existing methods largely focus on feature distributions while overlooking label information and predictive relationships, potentially missing critical transferability insights. In contrast, our proposed metric, the Cross-Learning Score (CLS), measures dataset similarity through bidirectional generalization performance between domains. We provide a theoretical justification for CLS by establishing its connection to the cosine similarity between the decision boundaries for the target and source datasets. Computationally, CLS is efficient and fast to compute as it bypasses the problem of expensive distribution estimation for high-dimensional problems. We further introduce a general framework that categorizes source datasets into positive, ambiguous, or negative transfer zones based on their CLS relative to the baseline error, enabling informed decisions. Additionally, we extend this approach to encoder-head architectures in deep learning to better reflect modern transfer pipelines. Extensive experiments on diverse synthetic and real-world tasks demonstrate that CLS can reliably predict whether transfer will improve or degrade performance, offering a principled tool for guiding data selection in transfer learning.
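As a rough sketch of the idea (not the paper's exact construction): train a classifier on each domain, score it on the other, and average the two accuracies. The logistic-regression learner, the symmetric average, and the `margin` band below are illustrative assumptions, not the paper's definitions.

```python
from sklearn.linear_model import LogisticRegression

def cross_learning_score(X_src, y_src, X_tgt, y_tgt):
    """Bidirectional generalization: train on one domain, test on the other,
    and average the two accuracies. Any base learner could replace the
    logistic regression used here."""
    src_model = LogisticRegression(max_iter=1000).fit(X_src, y_src)
    tgt_model = LogisticRegression(max_iter=1000).fit(X_tgt, y_tgt)
    acc_src_to_tgt = src_model.score(X_tgt, y_tgt)  # source -> target
    acc_tgt_to_src = tgt_model.score(X_src, y_src)  # target -> source
    return 0.5 * (acc_src_to_tgt + acc_tgt_to_src)

def transfer_zone(cls_score, baseline_error, margin=0.05):
    """Categorize a source dataset relative to the target-only baseline.
    The `margin` band defining the ambiguous zone is a hypothetical choice."""
    baseline_acc = 1.0 - baseline_error
    if cls_score > baseline_acc + margin:
        return "positive"   # transfer is expected to help
    if cls_score < baseline_acc - margin:
        return "negative"   # transfer is expected to hurt
    return "ambiguous"
```

Because a score of this kind only requires fitting two classifiers, it sidesteps density estimation entirely, which is what makes it tractable in high dimensions.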
Related papers
- Task-Oriented Low-Label Semantic Communication With Self-Supervised Learning [67.06363342414397]
Task-oriented semantic communication enhances transmission efficiency by conveying semantic information rather than exact messages.
Deep learning (DL)-based semantic communication can effectively cultivate the essential semantic knowledge for semantic extraction, transmission, and interpretation.
We propose a self-supervised learning-based semantic communication framework (SLSCom) to enhance task inference performance.
arXiv Detail & Related papers (2025-05-26T13:06:18Z) - Wasserstein Transfer Learning [6.602088845993411]
We introduce a novel transfer learning framework for regression models whose outputs are probability distributions residing in the Wasserstein space.
We propose an estimator with provable convergence rates, quantifying the impact of domain similarity on transfer efficiency.
For cases where the informative subset is unknown, we develop a data-driven transfer learning procedure designed to mitigate negative transfer.
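For intuition about the geometry involved: when the output distributions are one-dimensional, the 2-Wasserstein distance has a closed form via quantile functions (a standard fact, not a formula specific to this paper):

$$W_2^2(\mu,\nu) = \int_0^1 \big(F_\mu^{-1}(t) - F_\nu^{-1}(t)\big)^2\,dt,$$

where $F_\mu^{-1}$ and $F_\nu^{-1}$ are the quantile functions of $\mu$ and $\nu$.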
arXiv Detail & Related papers (2025-05-23T02:38:03Z) - Covariate-Elaborated Robust Partial Information Transfer with Conditional Spike-and-Slab Prior [1.111488407653005]
We propose a novel Bayesian transfer learning method named "CONCERT" to allow robust partial information transfer.
A conditional spike-and-slab prior is introduced in the joint distribution of target and source parameters for information transfer.
In contrast to existing work, CONCERT is a one-step procedure that achieves variable selection and information transfer simultaneously.
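As a rough illustration of the prior family involved (a generic spike-and-slab form on the target–source parameter gap, not necessarily CONCERT's exact conditional construction):

$$\Delta_j = \beta_j^{\text{target}} - \beta_j^{\text{source}}, \qquad \Delta_j \mid \gamma_j \sim (1-\gamma_j)\,\delta_0 + \gamma_j\,\mathcal{N}(0,\tau^2), \qquad \gamma_j \in \{0,1\}.$$

A spike at zero shares the source information for that coordinate, while the slab lets the target deviate, which is what enables partial transfer and variable selection in one step.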
arXiv Detail & Related papers (2024-03-30T07:32:58Z) - Enhancing Information Maximization with Distance-Aware Contrastive Learning for Source-Free Cross-Domain Few-Shot Learning [55.715623885418815]
Cross-Domain Few-Shot Learning (CDFSL) methods require access to source domain data to train a model in the pre-training phase.
Due to increasing concerns about data privacy and the desire to reduce data transmission and training costs, it is necessary to develop a CDFSL solution without accessing source data.
This paper proposes an Enhanced Information Maximization with Distance-Aware Contrastive Learning method to address these challenges.
arXiv Detail & Related papers (2024-03-04T12:10:24Z) - On the Transferability of Learning Models for Semantic Segmentation for Remote Sensing Data [12.500746892824338]
Recent deep learning-based methods outperform traditional learning methods on remote sensing (RS) semantic segmentation/classification tasks.
Yet, there is no comprehensive analysis of their transferability, i.e., the extent to which a model trained on a source domain can be readily applied to a target domain.
This paper investigates the raw transferability of traditional and deep learning (DL) models, as well as the effectiveness of domain adaptation (DA) approaches.
arXiv Detail & Related papers (2023-10-16T15:13:36Z) - Robust Transfer Learning with Unreliable Source Data [11.813197709246289]
We introduce a novel quantity called the "ambiguity level" that measures the discrepancy between the target and source regression functions.
We propose a simple transfer learning procedure, and establish a general theorem that shows how this new quantity is related to the transferability of learning.
arXiv Detail & Related papers (2023-10-06T21:50:21Z) - Bridged-GNN: Knowledge Bridge Learning for Effective Knowledge Transfer [65.42096702428347]
Graph Neural Networks (GNNs) aggregate information from neighboring nodes.
Knowledge Bridge Learning (KBL) learns a knowledge-enhanced posterior distribution for target domains.
Bridged-GNN includes an Adaptive Knowledge Retrieval module to build Bridged-Graph and a Graph Knowledge Transfer module.
arXiv Detail & Related papers (2023-08-18T12:14:51Z) - CosSGD: Nonlinear Quantization for Communication-efficient Federated Learning [62.65937719264881]
Federated learning facilitates learning across clients without transferring their local data to a central server.
We propose a nonlinear quantization scheme for compressed gradient descent that can be easily utilized in federated learning.
Our system significantly reduces the communication cost by up to three orders of magnitude, while maintaining convergence and accuracy of the training process.
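The flavor of such a scheme can be sketched with a generic nonlinear (companded) quantizer; the power-law warp below is an illustrative stand-in, not CosSGD's actual quantization function.

```python
import numpy as np

def nonlinear_quantize(grad, bits=4, power=0.5):
    """Quantize gradient magnitudes on a nonlinear grid: small values, which
    dominate gradient distributions, get finer resolution than large ones."""
    levels = 2 ** bits - 1
    scale = float(np.abs(grad).max())
    if scale == 0.0:
        return grad
    normalized = np.abs(grad) / scale                # map magnitudes into [0, 1]
    warped = normalized ** power                     # expand the small-value region
    quantized = np.round(warped * levels) / levels   # uniform grid in warped space
    restored = (quantized ** (1.0 / power)) * scale  # undo the warp
    return np.sign(grad) * restored                  # reattach signs
```

Transmitting only signs, low-bit level indices, and one scale per tensor is the kind of compression that drives the reported communication savings.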
arXiv Detail & Related papers (2020-12-15T12:20:28Z) - Towards Accurate Knowledge Transfer via Target-awareness Representation Disentanglement [56.40587594647692]
We propose a novel transfer learning algorithm, introducing the idea of Target-awareness REpresentation Disentanglement (TRED).
TRED disentangles the knowledge relevant to the target task from the original source model and uses it as a regularizer when fine-tuning the target model.
Experiments on various real-world datasets show that our method stably improves standard fine-tuning by more than 2% on average.
arXiv Detail & Related papers (2020-10-16T17:45:08Z) - Uniform Priors for Data-Efficient Transfer [65.086680950871]
We show that the most transferable features have high uniformity in the embedding space.
We evaluate the regularization on its ability to facilitate adaptation to unseen tasks and data.
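One common way to encourage such uniformity is the pairwise Gaussian-potential loss of Wang & Isola (2020); the summary does not state whether this paper's regularizer takes exactly this form, so treat the sketch below as an illustrative stand-in.

```python
import torch

def uniformity_loss(embeddings, t=2.0):
    """Log of the mean pairwise Gaussian potential on the unit hypersphere.
    Lower values mean the embeddings are spread more uniformly."""
    z = torch.nn.functional.normalize(embeddings, dim=1)  # project onto the sphere
    sq_dists = torch.pdist(z, p=2).pow(2)                 # all pairwise ||z_i - z_j||^2
    return sq_dists.mul(-t).exp().mean().log()
```

Adding such a term to the pre-training objective pushes embeddings apart on the hypersphere, which is the property the paper links to transferability.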
arXiv Detail & Related papers (2020-06-30T04:39:36Z)