DUET: Optimizing Training Data Mixtures via Feedback from Unseen Evaluation Tasks
- URL: http://arxiv.org/abs/2502.00270v2
- Date: Sun, 18 May 2025 13:39:08 GMT
- Title: DUET: Optimizing Training Data Mixtures via Feedback from Unseen Evaluation Tasks
- Authors: Zhiliang Chen, Gregory Kang Ruey Lau, Chuan-Sheng Foo, Bryan Kian Hsiang Low,
- Abstract summary: Our paper presents a novel global-to-local algorithm that interleaves influence function as a data selection method with Bayesian optimization to optimize data mixture via feedback from a specific unseen evaluation task.<n>By analyzing DUET's cumulative regret, we theoretically show that DUET converges to the optimal training data mixture for an unseen task even without any data knowledge of the task.
- Score: 40.91931801667421
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The performance of an LLM depends heavily on the relevance of its training data to the downstream evaluation task. However, in practice, the data involved in an unseen evaluation task is often unknown (e.g., conversations between an LLM and a user are end-to-end encrypted). Hence, it is unclear what data are relevant for fine-tuning the LLM to maximize its performance on the specific unseen evaluation task. Instead, one can only deploy the LLM on the unseen task to gather multiple rounds of feedback on how well the model performs (e.g., user ratings). This novel setting offers a refreshing perspective towards optimizing training data mixtures via feedback from an unseen evaluation task, which prior data mixing and selection works do not consider. Our paper presents DUET, a novel global-to-local algorithm that interleaves influence function as a data selection method with Bayesian optimization to optimize data mixture via feedback from a specific unseen evaluation task. By analyzing DUET's cumulative regret, we theoretically show that DUET converges to the optimal training data mixture for an unseen task even without any data knowledge of the task. Finally, our experiments across a variety of language tasks demonstrate that DUET outperforms existing data selection and mixing methods in the unseen-task setting.
Related papers
- Optimizing Pre-Training Data Mixtures with Mixtures of Data Expert Models [24.396525123797073]
We propose a method to optimize language model pre-training data mixtures through efficient approximation of the cross-entropy loss corresponding to each candidate mixture.
We use this approximation as a source of additional features in a regression model, trained from observations of model loss for a small number of mixtures.
arXiv Detail & Related papers (2025-02-21T21:27:48Z) - A Systematic Examination of Preference Learning through the Lens of Instruction-Following [83.71180850955679]
We use a novel synthetic data generation pipeline to generate 48,000 instruction unique-following prompts.<n>With our synthetic prompts, we use two preference dataset curation methods - rejection sampling (RS) and Monte Carlo Tree Search (MCTS)<n>Experiments reveal that shared prefixes in preference pairs, as generated by MCTS, provide marginal but consistent improvements.<n>High-contrast preference pairs generally outperform low-contrast pairs; however, combining both often yields the best performance.
arXiv Detail & Related papers (2024-12-18T15:38:39Z) - Learning to Verify Summary Facts with Fine-Grained LLM Feedback [15.007479147796403]
Training automatic summary fact verifiers often faces the challenge of a lack of human-labeled data.<n>We introduce FineSumFact, a large-scale dataset containing fine-grained factual feedback on summaries.
arXiv Detail & Related papers (2024-12-14T05:28:44Z) - ROSE: A Reward-Oriented Data Selection Framework for LLM Task-Specific Instruction Tuning [29.001249598245]
We introduce Reward-Oriented inStruction data sElection to optimize data selection for task-specific instruction tuning.<n> ROSE adapts an influence formulation to approximate the influence of training data points relative to a few-shot preference validation set to select the most task-related training data points.
arXiv Detail & Related papers (2024-12-01T01:01:09Z) - Scalable Data Ablation Approximations for Language Models through Modular Training and Merging [27.445079398772904]
We propose an efficient method for approximating data ablations which trains individual models on subsets of a training corpus.
We find that, given an arbitrary evaluation set, the perplexity score of a single model trained on a candidate set of data is strongly correlated with perplexity scores of parameter averages of models trained on distinct partitions of that data.
arXiv Detail & Related papers (2024-10-21T06:03:49Z) - When to Trust Your Data: Enhancing Dyna-Style Model-Based Reinforcement Learning With Data Filter [7.886307329450978]
Dyna-style algorithms combine two approaches by using simulated data from an estimated environmental model to accelerate model-free training.
Previous works address this issue by using model ensembles or pretraining the estimated model with data collected from the real environment.
We introduce an out-of-distribution data filter that removes simulated data from the estimated model that significantly diverges from data collected in the real environment.
arXiv Detail & Related papers (2024-10-16T01:49:03Z) - Entropy Law: The Story Behind Data Compression and LLM Performance [115.70395740286422]
We find that model performance is negatively correlated to the compression ratio of training data, which usually yields a lower training loss.
Based on the findings of the entropy law, we propose a quite efficient and universal data selection method.
We also present an interesting application of entropy law that can detect potential performance risks at the beginning of model training.
arXiv Detail & Related papers (2024-07-09T08:14:29Z) - Distilled Datamodel with Reverse Gradient Matching [74.75248610868685]
We introduce an efficient framework for assessing data impact, comprising offline training and online evaluation stages.
Our proposed method achieves comparable model behavior evaluation while significantly speeding up the process compared to the direct retraining method.
arXiv Detail & Related papers (2024-04-22T09:16:14Z) - Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance [55.872926690722714]
We study the predictability of model performance regarding the mixture proportions in function forms.
We propose nested use of the scaling laws of training steps, model sizes, and our data mixing law.
Our method effectively optimize the training mixture of a 1B model trained for 100B tokens in RedPajama.
arXiv Detail & Related papers (2024-03-25T17:14:00Z) - How to Train Data-Efficient LLMs [56.41105687693619]
We study data-efficient approaches for pre-training language models (LLMs)
We find that Ask-LLM and Density sampling are the best methods in their respective categories.
In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density are the best methods in their respective categories.
arXiv Detail & Related papers (2024-02-15T02:27:57Z) - DsDm: Model-Aware Dataset Selection with Datamodels [81.01744199870043]
Standard practice is to filter for examples that match human notions of data quality.
We find that selecting according to similarity with "high quality" data sources may not increase (and can even hurt) performance compared to randomly selecting data.
Our framework avoids handpicked notions of data quality, and instead models explicitly how the learning process uses train datapoints to predict on the target tasks.
arXiv Detail & Related papers (2024-01-23T17:22:00Z) - Noisy Self-Training with Synthetic Queries for Dense Retrieval [49.49928764695172]
We introduce a novel noisy self-training framework combined with synthetic queries.
Experimental results show that our method improves consistently over existing methods.
Our method is data efficient and outperforms competitive baselines.
arXiv Detail & Related papers (2023-11-27T06:19:50Z) - MLLM-DataEngine: An Iterative Refinement Approach for MLLM [62.30753425449056]
We propose a novel closed-loop system that bridges data generation, model training, and evaluation.
Within each loop, the MLLM-DataEngine first analyze the weakness of the model based on the evaluation results.
For targeting, we propose an Adaptive Bad-case Sampling module, which adjusts the ratio of different types of data.
For quality, we resort to GPT-4 to generate high-quality data with each given data type.
arXiv Detail & Related papers (2023-08-25T01:41:04Z) - Fairness-Aware Data Valuation for Supervised Learning [4.874780144224057]
We propose Fairness-Aware Data vauatiOn (FADO) to incorporate fairness concerns into a series of ML-related tasks.
We show how FADO can be applied as the basis for unfairness mitigation pre-processing techniques.
Our methods achieve promising results -- up to a 40 p.p. improvement in fairness at a less than 1 p.p. loss in performance compared to a baseline.
arXiv Detail & Related papers (2023-03-29T18:51:13Z) - Model-Agnostic Multitask Fine-tuning for Few-shot Vision-Language
Transfer Learning [59.38343286807997]
We propose Model-Agnostic Multitask Fine-tuning (MAMF) for vision-language models on unseen tasks.
Compared with model-agnostic meta-learning (MAML), MAMF discards the bi-level optimization and uses only first-order gradients.
We show that MAMF consistently outperforms the classical fine-tuning method for few-shot transfer learning on five benchmark datasets.
arXiv Detail & Related papers (2022-03-09T17:26:53Z) - Improving Multi-Turn Response Selection Models with Complementary
Last-Utterance Selection by Instance Weighting [84.9716460244444]
We consider utilizing the underlying correlation in the data resource itself to derive different kinds of supervision signals.
We conduct extensive experiments in two public datasets and obtain significant improvement in both datasets.
arXiv Detail & Related papers (2020-02-18T06:29:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.