On the Complementarity of Data Selection and Fine Tuning for Domain
Adaptation
- URL: http://arxiv.org/abs/2109.07591v1
- Date: Wed, 15 Sep 2021 21:49:06 GMT
- Title: On the Complementarity of Data Selection and Fine Tuning for Domain
Adaptation
- Authors: Dan Iter and David Grangier
- Abstract summary: Domain adaptation of neural networks commonly relies on three training phases: pretraining, selected data training and then fine tuning.
Data selection improves target domain generalization by training further on pretraining data identified by relying on a small sample of target domain data.
- Score: 22.178874891042994
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Domain adaptation of neural networks commonly relies on three training
phases: pretraining, selected data training and then fine tuning. Data
selection improves target domain generalization by training further on
pretraining data identified by relying on a small sample of target domain data.
This work examines the benefit of data selection for language modeling and
machine translation. Our experiments assess the complementarity of selection
with fine tuning and result in practical recommendations: (i) selected data
must be similar to the fine-tuning domain but not so much as to erode the
complementary effect of fine-tuning; (ii) there is a trade-off between
selecting little data for fast but limited progress or much data for slow but
long lasting progress; (iii) data selection can be applied early during
pretraining, with performance gains comparable to a long pretraining session;
(iv) data selection from domain classifiers is often more effective than the
popular contrastive data selection method.
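As a concrete illustration of point (iv), the sketch below contrasts the usual contrastive (Moore-Lewis style) score with a domain-classifier score; the interfaces and names are illustrative assumptions, not the authors' implementation.
```python
# Minimal sketch of the two selection scores contrasted in the abstract, following
# the standard formulations from the literature; function names are illustrative.
import math
from typing import Callable, List

def contrastive_scores(
    examples: List[str],
    in_domain_logprob: Callable[[str], float],
    general_logprob: Callable[[str], float],
) -> List[float]:
    """Moore-Lewis style contrastive score: log p_in(x) - log p_general(x).
    Higher means the example looks more like the target domain."""
    return [in_domain_logprob(x) - general_logprob(x) for x in examples]

def classifier_scores(
    examples: List[str],
    domain_classifier_prob: Callable[[str], float],  # P(target domain | x)
) -> List[float]:
    """Domain-classifier score: log-odds that x belongs to the target domain."""
    eps = 1e-9
    return [
        math.log(max(domain_classifier_prob(x), eps))
        - math.log(max(1.0 - domain_classifier_prob(x), eps))
        for x in examples
    ]

def select_top_k(examples: List[str], scores: List[float], k: int) -> List[str]:
    """Keep the k highest-scoring pretraining examples for continued training."""
    ranked = sorted(zip(scores, examples), key=lambda t: t[0], reverse=True)
    return [x for _, x in ranked[:k]]
```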
Related papers
- Compute-Constrained Data Selection [77.06528009072967]
We formalize the problem of data selection with a cost-aware utility function, and model the problem as trading off initial-selection cost for training gain.
We run a comprehensive sweep of experiments across multiple tasks, varying compute budget by scaling finetuning tokens, model sizes, and data selection compute.
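Read schematically, the trade-off can be written as a net utility of selection; the linear cost model and names below are assumptions for illustration, not the paper's formulation.
```python
# Schematic cost-aware utility: training gain from the selected subset minus the
# compute spent scoring the candidate pool. The linear per-token cost is an
# assumption for illustration, not the paper's exact model.
def net_utility(gain_from_selected: float,
                pool_tokens: int,
                cost_per_scored_token: float) -> float:
    selection_cost = pool_tokens * cost_per_scored_token
    return gain_from_selected - selection_cost
```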
arXiv Detail & Related papers (2024-10-21T17:11:21Z)
- A CLIP-Powered Framework for Robust and Generalizable Data Selection [51.46695086779598]
Real-world datasets often contain redundant and noisy data, which negatively impacts training efficiency and model performance.
Data selection has shown promise in identifying the most representative samples from the entire dataset.
We propose a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample selection.
arXiv Detail & Related papers (2024-10-15T03:00:58Z)
- Towards Data-Centric RLHF: Simple Metrics for Preference Dataset Comparison [9.324894567200582]
We systematically study preference datasets through three perspectives: scale, label noise, and information content.
Our work is a first step towards a data-centric approach to alignment by providing perspectives that aid in training efficiency and iterative data collection for RLHF.
arXiv Detail & Related papers (2024-09-15T03:55:03Z)
- MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models [16.654859430784825]
Current data selection methods, which rely on either hand-crafted rules or larger reference models, are conducted statically and do not capture the evolving data preferences during pretraining.
We introduce model-aware data selection with data influence models (MATES), where a data influence model continuously adapts to the evolving data preferences of the pretraining model and then selects the data most effective for the current pretraining progress.
Experiments pretraining 410M and 1B models on the C4 dataset demonstrate that MATES significantly outperforms random data selection across a wide range of downstream tasks.
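The loop described above can be sketched roughly as follows; this is an illustrative reading of the abstract, with assumed helper names, not the MATES implementation.
```python
# A schematic of the model-aware selection loop: a small data influence model is
# periodically refit against the current pretraining model and used to pick the
# next slice of pretraining data. All helper names are illustrative assumptions.
import random
from typing import Callable, List, Sequence, Tuple

def model_aware_selection_round(
    candidate_pool: Sequence[str],
    probe_influence: Callable[[str], float],   # influence of one example on the *current* model
    fit_influence_model: Callable[[List[Tuple[str, float]]], Callable[[str], float]],
    probe_size: int,
    select_size: int,
) -> List[str]:
    # 1) Probe a small random subset to get fresh influence labels for the current model state.
    probe_examples = random.sample(list(candidate_pool), min(probe_size, len(candidate_pool)))
    labels = [(x, probe_influence(x)) for x in probe_examples]
    # 2) Fit a cheap influence model on those labels so the whole pool can be scored.
    influence_model = fit_influence_model(labels)
    # 3) Keep the examples predicted to be most useful for the current training stage.
    ranked = sorted(candidate_pool, key=influence_model, reverse=True)
    return list(ranked[:select_size])
```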
arXiv Detail & Related papers (2024-06-10T06:27:42Z)
- TextGram: Towards a better domain-adaptive pretraining [0.3769303106863454]
In NLP, pre-training involves using a large amount of text data to gain prior knowledge for performing downstream tasks.
We propose our own domain-adaptive data selection method - TextGram.
We show that the proposed strategy outperforms other selection methods.
arXiv Detail & Related papers (2024-04-28T15:44:57Z)
- LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that exemplifies the necessary reasoning skills for the intended downstream application.
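A simplified sketch of gradient-similarity selection in this spirit: per-example gradients are compressed with a fixed random projection and ranked by cosine similarity to the gradient of a small target set. Shapes and names are assumptions, not the authors' code.
```python
# Sketch of low-rank gradient-similarity data selection; shapes and names are
# illustrative assumptions rather than the LESS implementation.
import numpy as np

def low_rank_features(grads: np.ndarray, proj: np.ndarray) -> np.ndarray:
    """Project per-example gradients (n, d) to (n, k) with a fixed random matrix (d, k)."""
    return grads @ proj

def rank_by_target_similarity(train_grads: np.ndarray,
                              target_grad: np.ndarray,
                              proj: np.ndarray) -> np.ndarray:
    """Return training indices sorted by cosine similarity to the target gradient."""
    f_train = low_rank_features(train_grads, proj)   # (n, k)
    f_target = target_grad @ proj                    # (k,)
    f_train /= np.linalg.norm(f_train, axis=1, keepdims=True) + 1e-9
    f_target /= np.linalg.norm(f_target) + 1e-9
    return np.argsort(-(f_train @ f_target))

# Usage idea: keep only the top-ranked few percent of examples for instruction tuning.
```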
arXiv Detail & Related papers (2024-02-06T19:18:04Z)
- Efficient Online Data Mixing For Language Model Pre-Training [101.45242332613944]
Existing data selection methods suffer from slow and computationally expensive processes.
Data mixing, on the other hand, reduces the complexity of data selection by grouping data points together.
We develop an efficient algorithm for Online Data Mixing (ODM) that combines elements from both data selection and data mixing.
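A bandit-style update of the domain sampling distribution is one natural way to realize such group-level online mixing; the EXP3-flavoured sketch below is an illustrative assumption rather than the exact ODM algorithm.
```python
# Online data mixing sketched as a multi-armed bandit over data domains: the
# mixing distribution is updated from a per-domain reward (e.g. recent training
# loss), so selection happens at the group level instead of per example.
import numpy as np

class OnlineMixer:
    def __init__(self, n_domains: int, lr: float = 0.1):
        self.weights = np.zeros(n_domains)  # log-weights per domain
        self.lr = lr

    def mixing_probs(self) -> np.ndarray:
        """Current sampling distribution over domains (softmax of log-weights)."""
        w = self.weights - self.weights.max()
        p = np.exp(w)
        return p / p.sum()

    def update(self, domain: int, reward: float) -> None:
        """Importance-weighted update after observing a reward for one sampled domain."""
        p = self.mixing_probs()
        self.weights[domain] += self.lr * reward / max(p[domain], 1e-9)
```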
arXiv Detail & Related papers (2023-12-05T00:42:35Z)
- Examining the Effect of Pre-training on Time Series Classification [21.38211396933795]
This study investigates how pre-training affects the subsequent fine-tuning process.
We conducted a thorough examination of 150 classification datasets.
We find that pre-training can only help improve the optimization process for models that fit the data poorly.
Adding more pre-training data does not improve generalization, but it can strengthen the advantage of pre-training on the original data volume.
arXiv Detail & Related papers (2023-09-11T06:26:57Z)
- Analyzing domain shift when using additional data for the MICCAI KiTS23 Challenge [5.745796568988237]
We study techniques that ameliorate domain shift during training so that the additional data can be used more effectively, together with the original data, for preprocessing and training.
Our results show that transforming the additional data with histogram matching yields better results than simple normalization.
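A minimal sketch of that preprocessing step, assuming numpy volumes; skimage.exposure.match_histograms is a standard implementation of the transform, and the baseline normalization is included only for comparison.
```python
# Histogram-matching preprocessing: remap each additional-data volume so its
# intensity histogram matches a reference volume from the original dataset.
# Array names are illustrative.
import numpy as np
from skimage.exposure import match_histograms

def harmonize(additional_volume: np.ndarray, reference_volume: np.ndarray) -> np.ndarray:
    """Match the intensity histogram of the additional volume to the reference volume."""
    return match_histograms(additional_volume, reference_volume)

def simple_normalize(volume: np.ndarray) -> np.ndarray:
    """Baseline for comparison: zero-mean, unit-variance normalization."""
    return (volume - volume.mean()) / (volume.std() + 1e-9)
```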
arXiv Detail & Related papers (2023-09-05T07:31:22Z)
- Improved Fine-tuning by Leveraging Pre-training Data: Theory and Practice [52.11183787786718]
Fine-tuning a pre-trained model on the target data is widely used in many deep learning applications.
Recent studies have empirically shown that training from scratch can achieve final performance no worse than this pre-training strategy.
We propose a novel selection strategy to select a subset from pre-training data to help improve the generalization on the target task.
arXiv Detail & Related papers (2021-11-24T06:18:32Z)
- Improving Multi-Turn Response Selection Models with Complementary Last-Utterance Selection by Instance Weighting [84.9716460244444]
We consider utilizing the underlying correlation in the data resource itself to derive different kinds of supervision signals.
We conduct extensive experiments in two public datasets and obtain significant improvement in both datasets.
arXiv Detail & Related papers (2020-02-18T06:29:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.