Data Mixing Agent: Learning to Re-weight Domains for Continual Pre-training
- URL: http://arxiv.org/abs/2507.15640v1
- Date: Mon, 21 Jul 2025 14:01:54 GMT
- Title: Data Mixing Agent: Learning to Re-weight Domains for Continual Pre-training
- Authors: Kailai Yang, Xiao Liu, Lei Ji, Hao Li, Yeyun Gong, Peng Cheng, Mao Yang
- Abstract summary: Data Mixing Agent is an end-to-end framework that learns to re-weight domains. It generalizes well across unseen source fields, target models, and domain spaces without retraining.
- Score: 30.915768238214653
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Continual pre-training on small-scale task-specific data is an effective method for improving large language models in new target fields, yet it risks catastrophic forgetting of their original capabilities. A common solution is to re-weight training data mixtures from source and target fields on a domain space to achieve balanced performance. Previous domain re-weighting strategies rely on manual designation with certain heuristics based on human intuition or empirical results. In this work, we show that more general heuristics can be parameterized, and propose Data Mixing Agent, the first model-based, end-to-end framework that learns to re-weight domains. The agent learns generalizable heuristics through reinforcement learning on large quantities of data mixing trajectories with corresponding feedback from an evaluation environment. Experiments on continual pre-training for math reasoning show that Data Mixing Agent outperforms strong baselines in achieving balanced performance across source- and target-field benchmarks. Furthermore, it generalizes well across unseen source fields, target models, and domain spaces without retraining. Direct application to the code generation field also indicates its adaptability across target domains. Further analysis shows that the agent's heuristics align well with human intuition and that it achieves superior model performance with less source-field data.
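The core loop can be pictured as a policy that maps a summary of the recent training state to a domain-weight distribution and is rewarded for balanced source- and target-field performance. The sketch below is a minimal, hypothetical illustration of such a reinforcement-learning loop, not the authors' implementation: `MixingAgent`, `train_on_mixture`, and `evaluate` are invented placeholders, and the reward is just a harmonic mean of two random stand-in benchmark scores.

```python
import torch
import torch.nn as nn

class MixingAgent(nn.Module):
    """Hypothetical policy: maps a training-state summary to domain logits."""
    def __init__(self, state_dim: int, n_domains: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, n_domains)
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)  # logits over domains; softmax gives mixing weights

def train_on_mixture(weights: torch.Tensor) -> torch.Tensor:
    """Placeholder: continually pre-train the LM for one step on a batch
    sampled according to `weights`, and return the next state summary."""
    return torch.randn(8)

def evaluate(weights: torch.Tensor) -> float:
    """Placeholder evaluation environment: harmonic mean of source- and
    target-benchmark scores, so the reward penalizes forgetting too."""
    src, tgt = torch.rand(1).item(), torch.rand(1).item()
    return 2 * src * tgt / (src + tgt + 1e-8)

agent = MixingAgent(state_dim=8, n_domains=4)
optim = torch.optim.Adam(agent.parameters(), lr=1e-3)
state = torch.zeros(8)

for step in range(100):
    logits = agent(state)
    # Sample a mixing distribution around the policy's softmax output.
    dist = torch.distributions.Dirichlet(torch.softmax(logits, dim=-1) * 10 + 1e-3)
    weights = dist.sample()                  # domain mixing weights for this step
    reward = evaluate(weights)               # feedback from the evaluation environment
    loss = -dist.log_prob(weights) * reward  # REINFORCE-style policy gradient
    optim.zero_grad()
    loss.backward()
    optim.step()
    state = train_on_mixture(weights.detach())
```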
Related papers
- Using Scaling Laws for Data Source Utility Estimation in Domain-Specific Pre-Training [4.90288999217624]
We introduce a framework for optimizing domain-specific dataset construction in foundation model training. Our approach extends the usual point-estimate approaches, a.k.a. micro-annealing, to estimating scaling laws. We validate our approach through experiments on a pre-trained model with 7 billion parameters.
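A minimal sketch of the scaling-law idea, assuming a common power-law form L(n) = a·n^(-b) + c fitted to held-out loss measured at a few token budgets for one candidate data source, so its utility can be extrapolated to larger budgets. The functional form, token counts, and loss values below are illustrative assumptions, not numbers from the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(n, a, b, c):
    # Assumed power-law form for held-out loss vs. training tokens.
    return a * np.power(n, -b) + c

tokens = np.array([1e8, 3e8, 1e9, 3e9])     # token budgets of the pilot runs (example)
loss   = np.array([2.91, 2.74, 2.60, 2.51])  # measured held-out loss (example values)

params, _ = curve_fit(scaling_law, tokens, loss, p0=(10.0, 0.1, 2.0), maxfev=10000)
predicted = scaling_law(1e10, *params)       # extrapolated loss at a 10B-token budget
print(f"a={params[0]:.3f}, b={params[1]:.3f}, c={params[2]:.3f}; "
      f"predicted loss at 1e10 tokens: {predicted:.3f}")
```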
arXiv Detail & Related papers (2025-07-29T21:56:45Z) - Bridging the Domain Gap in Equation Distillation with Reinforcement Feedback [37.06543502352577]
We propose a reinforcement learning-based finetuning framework to enhance the domain adaptability of foundation models for Data2Eqn tasks. Our method allows the model to adapt to specific and complex data distributions and generate mathematically meaningful equations.
arXiv Detail & Related papers (2025-05-21T14:25:41Z) - Similarity-Based Domain Adaptation with LLMs [13.692329347889212]
Unsupervised domain adaptation leverages abundant labeled data from various source domains to generalize onto unlabeled target data. This paper introduces a simple framework that utilizes the impressive generalization capabilities of Large Language Models (LLMs) for target data annotation. Our framework achieves strong performance, with a 2.44% accuracy improvement over the SOTA method.
arXiv Detail & Related papers (2025-03-07T09:51:07Z) - Data Shapley in One Training Run [88.59484417202454]
Data Shapley provides a principled framework for attributing data's contribution within machine learning contexts. Existing approaches require re-training models on different data subsets, which is computationally intensive. This paper introduces In-Run Data Shapley, which addresses these limitations by offering scalable data attribution for a target model of interest.
arXiv Detail & Related papers (2024-06-16T17:09:24Z) - Federated Learning with Projected Trajectory Regularization [65.6266768678291]
Federated learning enables joint training of machine learning models from distributed clients without sharing their local data.
One key challenge in federated learning is to handle non-identically distributed data across the clients.
We propose a novel federated learning framework with projected trajectory regularization (FedPTR) for tackling this data heterogeneity issue.
arXiv Detail & Related papers (2023-12-22T02:12:08Z) - Noisy Self-Training with Synthetic Queries for Dense Retrieval [49.49928764695172]
We introduce a novel noisy self-training framework combined with synthetic queries.
Experimental results show that our method improves consistently over existing methods.
Our method is data efficient and outperforms competitive baselines.
arXiv Detail & Related papers (2023-11-27T06:19:50Z) - Back to Basics: A Simple Recipe for Improving Out-of-Domain Retrieval in Dense Encoders [63.28408887247742]
We study whether training procedures can be improved to yield better generalization capabilities in the resulting models.
We recommend a simple recipe for training dense encoders: train on MS MARCO with parameter-efficient methods such as LoRA, and use in-batch negatives unless well-constructed hard negatives are available.
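A minimal sketch of the in-batch-negatives objective this recipe refers to, with random tensors standing in for encoder outputs; the LoRA adapter and MS MARCO data loading are omitted, and the function name and temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def in_batch_negatives_loss(q_emb: torch.Tensor, p_emb: torch.Tensor,
                            temperature: float = 0.05) -> torch.Tensor:
    # Each query's own passage is the positive; every other passage in the
    # batch serves as a negative (standard in-batch negatives contrastive loss).
    q = F.normalize(q_emb, dim=-1)             # (batch, dim) query embeddings
    p = F.normalize(p_emb, dim=-1)             # (batch, dim) passage embeddings
    scores = q @ p.T / temperature             # (batch, batch) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)  # diagonal entries are positives
    return F.cross_entropy(scores, labels)

# Example with random embeddings standing in for encoder outputs.
loss = in_batch_negatives_loss(torch.randn(16, 768), torch.randn(16, 768))
print(loss.item())
```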
arXiv Detail & Related papers (2023-11-16T10:42:58Z) - DoGE: Domain Reweighting with Generalization Estimation [42.32000165235568]
We propose DOmain reweighting with Generalization Estimation (DoGE).
In our experiments, we extensively show how DoGE improves the generalization of the base model to any target data mixture.
DoGE can effectively identify inter-domain dependencies, and consistently achieves better test perplexity on the target domain.
arXiv Detail & Related papers (2023-10-23T22:51:58Z) - Distributionally Robust Learning for Multi-source Unsupervised Domain Adaptation [9.359714425373616]
Empirical risk minimization often performs poorly when the distribution of the target domain differs from those of the source domains. We develop an unsupervised domain adaptation approach that leverages labeled data from multiple source domains and unlabeled data from the target domain.
arXiv Detail & Related papers (2023-09-05T13:19:40Z) - Self-balanced Learning For Domain Generalization [64.99791119112503]
Domain generalization aims to learn a prediction model on multi-domain source data such that the model can generalize to a target domain with unknown statistics.
Most existing approaches have been developed under the assumption that the source data is well-balanced in terms of both domain and class.
We propose a self-balanced domain generalization framework that adaptively learns the weights of losses to alleviate the bias caused by different distributions of the multi-domain source data.
arXiv Detail & Related papers (2021-08-31T03:17:54Z) - Gradual Domain Adaptation via Self-Training of Auxiliary Models [50.63206102072175]
Domain adaptation becomes more challenging with increasing gaps between source and target domains.
We propose self-training of auxiliary models (AuxSelfTrain) that learns models for intermediate domains.
Experiments on benchmark datasets of unsupervised and semi-supervised domain adaptation verify its efficacy.
arXiv Detail & Related papers (2021-06-18T03:15:25Z)