DIDS: Domain Impact-aware Data Sampling for Large Language Model Training
- URL: http://arxiv.org/abs/2504.13227v1
- Date: Thu, 17 Apr 2025 13:09:38 GMT
- Title: DIDS: Domain Impact-aware Data Sampling for Large Language Model Training
- Authors: Weijie Shi, Jipeng Zhang, Yaguang Wu, Jingzhi Fang, Ruiyuan Zhang, Jiajie Xu, Jia Zhu, Hao Chen, Yao Zhao, Sirui Han, Xiaofang Zhou
- Abstract summary: We present Domain Impact-aware Data Sampling (DIDS) to optimize domain-level sampling strategies. DIDS achieves 3.4% higher average performance while maintaining comparable training efficiency.
- Score: 41.86545248261005
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Large language models (LLMs) are commonly trained on multi-domain datasets, where domain sampling strategies significantly impact model performance due to varying domain importance across downstream tasks. Existing approaches for optimizing domain-level sampling strategies struggle with maintaining intra-domain consistency and accurately measuring domain impact. In this paper, we present Domain Impact-aware Data Sampling (DIDS). To ensure intra-domain consistency, a gradient clustering algorithm is proposed to group training data based on their learning effects, where a proxy language model and dimensionality reduction are employed to reduce computational overhead. To accurately measure domain impact, we develop a Fisher Information Matrix (FIM) guided metric that quantifies how domain-specific parameter updates affect the model's output distributions on downstream tasks, with theoretical guarantees. Furthermore, to determine optimal sampling ratios, DIDS combines both the FIM-guided domain impact assessment and loss learning trajectories that indicate domain-specific potential, while accounting for diminishing marginal returns. Extensive experiments demonstrate that DIDS achieves 3.4% higher average performance while maintaining comparable training efficiency.
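To picture the three stages described in the abstract, here is a minimal, self-contained Python sketch: proxy-model gradients are projected to a low dimension and clustered into domains, a diagonal-Fisher-weighted alignment score stands in for the FIM-guided impact metric, and sampling ratios combine that score with a loss-trajectory signal under a diminishing-returns transform. All array shapes, the random-projection featurizer, the diagonal Fisher approximation, and the synthetic inputs are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; real per-example gradients would come from backprop
# through a small proxy language model.
n_examples, grad_dim, proj_dim, n_domains = 1000, 512, 32, 8
grads = rng.normal(size=(n_examples, grad_dim))      # simulated proxy-LM gradients

# Step 1: dimensionality reduction (random projection) + k-means clustering,
# grouping examples whose parameter updates point in similar directions.
proj = rng.normal(size=(grad_dim, proj_dim)) / np.sqrt(proj_dim)
feats = grads @ proj

def kmeans(x, k, iters=25):
    centers = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = np.argmin(((x[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean(axis=0)
    return labels

domain_of = kmeans(feats, n_domains)

# Step 2: a stand-in for the FIM-guided impact metric -- weight each domain's
# mean gradient by a diagonal Fisher approximation and measure its alignment
# with a downstream-task gradient (both simulated placeholders here).
fisher_diag = rng.uniform(0.5, 1.5, size=proj_dim)
task_grad = rng.normal(size=proj_dim)
impact = np.zeros(n_domains)
for d in range(n_domains):
    members = feats[domain_of == d]
    if len(members):
        impact[d] = (members.mean(axis=0) * fisher_diag) @ task_grad

# Step 3: sampling ratios combine impact with a loss-trajectory signal,
# passed through log1p as a simple diminishing-marginal-returns transform.
loss_drop = rng.uniform(0.0, 1.0, size=n_domains)    # placeholder per-domain loss decrease
score = np.maximum(impact, 0.0) * np.log1p(loss_drop)
ratios = score / score.sum() if score.sum() > 0 else np.full(n_domains, 1.0 / n_domains)
print("sampling ratios:", np.round(ratios, 3))
```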
Related papers
- GDO: Gradual Domain Osmosis [1.62060928868899]
We propose a new method called Gradual Domain Osmosis, which aims to achieve smooth knowledge migration from the source domain to the target domain in Gradual Domain Adaptation (GDA).
Traditional Gradual Domain Adaptation methods mitigate domain bias by introducing intermediate domains and self-training strategies, but often face the challenges of inefficient knowledge migration or missing data in intermediate domains.
arXiv Detail & Related papers (2025-01-31T14:25:45Z)
- Commute Your Domains: Trajectory Optimality Criterion for Multi-Domain Learning [50.80758278865274]
In multi-domain learning, a single model is trained on diverse data domains to leverage shared knowledge and improve generalization.
The order in which the data from these domains is used for training can significantly affect the model's performance on each domain.
We investigate the influence of training order (or data mixing) in multi-domain learning using the concept of the Lie bracket of gradient vector fields (a toy illustration of this commutator follows this entry's link).
arXiv Detail & Related papers (2025-01-26T15:12:06Z)
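To make the training-order idea above concrete: if the gradient fields of two domains commute, a small step on domain A followed by a small step on domain B lands exactly where the reverse order does; the discrepancy is governed by the Lie bracket of the two fields. The following toy quadratic example is our own illustration (not the paper's experiments), where the bracket reduces to a Hessian commutator.

```python
import numpy as np

rng = np.random.default_rng(1)
dim, eps = 5, 1e-2

# Two quadratic "domain losses" L_X(theta) = 0.5 * theta^T H_X theta,
# so the gradient fields are g_X(theta) = H_X @ theta.
def random_spd(d):
    a = rng.normal(size=(d, d))
    return a @ a.T / d

H_a, H_b = random_spd(dim), random_spd(dim)
theta = rng.normal(size=dim)

def gd_step(t, H, lr):
    return t - lr * (H @ t)   # one gradient-descent step on that domain

# Train "A then B" versus "B then A" for one tiny step each.
theta_ab = gd_step(gd_step(theta, H_a, eps), H_b, eps)
theta_ba = gd_step(gd_step(theta, H_b, eps), H_a, eps)

# For these quadratics the order discrepancy equals
# eps^2 * (H_b @ H_a - H_a @ H_b) @ theta, i.e. eps^2 times the Lie bracket
# of the two negative-gradient fields; it vanishes iff the Hessians commute.
bracket = (H_b @ H_a - H_a @ H_b) @ theta
print(np.allclose(theta_ab - theta_ba, eps**2 * bracket))   # True
```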
- Stratified Domain Adaptation: A Progressive Self-Training Approach for Scene Text Recognition [1.2878987353423252]
Unsupervised domain adaptation (UDA) has become increasingly prevalent in scene text recognition (STR).
We introduce the Stratified Domain Adaptation (StrDA) approach, which examines the gradual escalation of the domain gap for the learning process.
We propose a novel method for employing domain discriminators to estimate the out-of-distribution and domain discriminative levels of data samples.
arXiv Detail & Related papers (2024-10-13T16:40:48Z)
- Style Adaptation for Domain-adaptive Semantic Segmentation [2.1365683052370046]
Domain discrepancy leads to a significant decrease in the performance of general network models trained on the source domain data when applied to the target domain.
We introduce a straightforward approach to mitigate the domain discrepancy, which necessitates no additional parameter calculations and seamlessly integrates with self-training-based UDA methods.
Our proposed method attains a UDA performance of 76.93 mIoU on the GTA->Cityscapes benchmark, an improvement of +1.03 percentage points over the previous state-of-the-art result.
arXiv Detail & Related papers (2024-04-25T02:51:55Z)
- Unsupervised Domain Adaptation Using Compact Internal Representations [23.871860648919593]
A technique for tackling unsupervised domain adaptation involves mapping data points from both the source and target domains into a shared embedding space.
We develop an additional technique which makes the internal distribution of the source domain more compact.
We demonstrate that by increasing the margins between data representations for different classes in the embedding space, we can improve the model performance for UDA.
arXiv Detail & Related papers (2024-01-14T05:53:33Z)
- DoGE: Domain Reweighting with Generalization Estimation [42.32000165235568]
We propose DOmain reweighting with Generalization Estimation (DoGE)
In our experiments, we extensively show how DoGE improves the generalization of the base model to any target data mixture.
DoGE can effectively identify inter-domain dependencies, and consistently achieves better test perplexity on the target domain.
arXiv Detail & Related papers (2023-10-23T22:51:58Z)
- IDA: Informed Domain Adaptive Semantic Segmentation [51.12107564372869]
We propose an Informed Domain Adaptation (IDA) model, a self-training framework that mixes data from different domains based on class-level segmentation performance.
In our IDA model, class-level performance is tracked by an expected confidence score (ECS), and a dynamic schedule then determines the mixing ratio for data in different domains (a rough sketch of this idea follows this entry's link).
Our proposed method outperforms the state-of-the-art UDA-SS method by a margin of 1.1 mIoU in the adaptation of GTA-V to Cityscapes and of 0.9 mIoU in the adaptation of SYNTHIA to Cityscapes.
arXiv Detail & Related papers (2023-03-05T18:16:34Z)
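As a rough illustration of the confidence-tracked mixing in the IDA entry above (our own guess at one plausible realization, not the authors' code): track a per-class expected confidence score from target-domain predictions and give low-ECS classes a larger share of cross-domain mixing.

```python
import numpy as np

rng = np.random.default_rng(2)
n_classes, n_pixels = 6, 10_000

# Simulated per-pixel class probabilities on target-domain images.
logits = rng.normal(size=(n_pixels, n_classes))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
pred = probs.argmax(axis=1)

# Expected confidence score (ECS) per class: mean confidence of the pixels
# predicted as that class (a placeholder definition of ECS).
ecs = np.array([
    probs[pred == c].max(axis=1).mean() if np.any(pred == c) else 0.0
    for c in range(n_classes)
])

# Dynamic mixing ratio: poorly learned (low-ECS) classes get more mixing.
mix_ratio = (1.0 - ecs) / (1.0 - ecs).sum()
print("per-class ECS:", np.round(ecs, 3))
print("mixing ratios:", np.round(mix_ratio, 3))
```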
- Generalized Semantic Segmentation by Self-Supervised Source Domain Projection and Multi-Level Contrastive Learning [79.0660895390689]
Deep networks trained on the source domain show degraded performance when tested on unseen target domain data.
We propose a Domain Projection and Contrastive Learning (DPCL) approach for generalized semantic segmentation.
arXiv Detail & Related papers (2023-03-03T13:07:14Z)
- MADAv2: Advanced Multi-Anchor Based Active Domain Adaptation Segmentation [98.09845149258972]
We introduce active sample selection to assist domain adaptation for the semantic segmentation task.
With only a small manual annotation workload for these samples, the distortion of the target-domain distribution can be effectively alleviated.
A powerful semi-supervised domain adaptation strategy is proposed to alleviate the long-tail distribution problem.
arXiv Detail & Related papers (2023-01-18T07:55:22Z)
- Multi-Source Domain Adaptation for Text Classification via DistanceNet-Bandits [101.68525259222164]
We present a study of various distance-based measures in the context of NLP tasks that characterize the dissimilarity between domains based on sample estimates.
We develop a DistanceNet model which uses these distance measures as an additional loss function to be minimized jointly with the task's loss function.
We extend this model to a novel DistanceNet-Bandit model, which employs a multi-armed bandit controller to dynamically switch between multiple source domains (a minimal controller sketch follows this entry's link).
arXiv Detail & Related papers (2020-01-13T15:53:41Z)
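Below is a minimal sketch of the bandit-style source-domain switching mentioned in the DistanceNet-Bandits entry above, using a standard UCB1 controller; the reward is a simulated validation-improvement signal, and all domain statistics are placeholders rather than the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(3)
n_domains, n_steps = 4, 200

# Placeholder: each source domain yields a noisy "validation improvement"
# reward when we train on a batch drawn from it.
true_gain = np.array([0.05, 0.02, 0.08, 0.01])

counts = np.zeros(n_domains)
mean_reward = np.zeros(n_domains)

for t in range(1, n_steps + 1):
    if t <= n_domains:                       # play every arm once first
        arm = t - 1
    else:                                    # UCB1 exploration bonus
        ucb = mean_reward + np.sqrt(2 * np.log(t) / counts)
        arm = int(np.argmax(ucb))
    reward = true_gain[arm] + rng.normal(scale=0.02)
    counts[arm] += 1
    mean_reward[arm] += (reward - mean_reward[arm]) / counts[arm]

print("pulls per source domain:", counts.astype(int))
```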