NLU on Data Diets: Dynamic Data Subset Selection for NLP Classification Tasks
- URL: http://arxiv.org/abs/2306.03208v1
- Date: Mon, 5 Jun 2023 19:30:41 GMT
- Title: NLU on Data Diets: Dynamic Data Subset Selection for NLP Classification Tasks
- Authors: Jean-Michel Attendu and Jean-Philippe Corbeil
- Abstract summary: Finetuning large language models inflates the costs of NLU applications.
Recent works in computer vision use data pruning to reduce training time.
We propose a curriculum which periodically scores and discards unimportant examples during finetuning.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Finetuning large language models inflates the costs of NLU applications and
remains the bottleneck of development cycles. Recent works in computer vision
use data pruning to reduce training time. Pruned data selection with static
methods is based on a score calculated for each training example prior to
finetuning, which incurs significant computational overhead. Moreover, the
score may not necessarily be representative of sample importance throughout the
entire training duration. We propose to address these issues with a refined
version of dynamic data pruning, a curriculum which periodically scores and
discards unimportant examples during finetuning. Our method leverages an EL2N
metric that we extend to the joint intent and slot classification task, and an
initial finetuning phase on the full train set. Our results on the GLUE
benchmark and four joint NLU datasets show a better time-accuracy trade-off
compared to static methods. Our method preserves full accuracy while training
on 50% of the data points and reduces computational time by up to 41%. If we
instead tolerate a minor accuracy drop of 1%, we can prune 80% of the training
examples, reducing finetuning time by up to 66%.
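As a concrete illustration of the scoring step described in the abstract, the sketch below computes an EL2N-style score for joint intent and slot classification and keeps the highest-scoring fraction of examples for the next training period. This is a minimal sketch, not the authors' released code: the way the intent-level and slot-level error norms are combined (a simple sum here) and the keep-hardest selection rule are assumptions, since the abstract only states that EL2N is extended to the joint task.

```python
import torch
import torch.nn.functional as F

def el2n_joint(intent_logits, intent_labels, slot_logits, slot_labels, pad_id=-100):
    """Per-example EL2N score: L2 norm of (softmax probabilities - one-hot targets).

    Combines a sentence-level intent term with a token-level slot term averaged
    over non-padding tokens (the combination is an assumption, not from the paper).
    """
    # Intent term: ||p(x) - y||_2 over the intent classes.
    p_int = F.softmax(intent_logits, dim=-1)                      # (batch, n_intents)
    y_int = F.one_hot(intent_labels, p_int.size(-1)).float()
    el2n_intent = (p_int - y_int).norm(dim=-1)                    # (batch,)

    # Slot term: token-level EL2N averaged over real (non-padding) tokens.
    p_slot = F.softmax(slot_logits, dim=-1)                       # (batch, seq, n_slots)
    mask = (slot_labels != pad_id).float()                        # (batch, seq)
    y_slot = F.one_hot(slot_labels.clamp(min=0), p_slot.size(-1)).float()
    el2n_tok = (p_slot - y_slot).norm(dim=-1) * mask              # (batch, seq)
    el2n_slot = el2n_tok.sum(dim=-1) / mask.sum(dim=-1).clamp(min=1.0)

    return el2n_intent + el2n_slot                                # (batch,)

def keep_indices(scores, keep_fraction=0.5):
    """Indices of the highest-scoring (hardest) examples to keep for the next period."""
    k = max(1, int(keep_fraction * scores.numel()))
    return torch.topk(scores, k).indices
```

In a dynamic-pruning curriculum of this kind, one would finetune on the full training set for an initial phase, then periodically recompute the scores with the current model and rebuild the training DataLoader from `keep_indices(scores, keep_fraction)`.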
Related papers
- GDeR: Safeguarding Efficiency, Balancing, and Robustness via Prototypical Graph Pruning [44.401418612374286]
We introduce a novel soft-pruning method, GDeR, designed to dynamically update the training subset during the process using trainable prototypes.
GDeR achieves or surpasses the performance of the full dataset with 30%-50% fewer training samples.
It also outperforms state-of-the-art pruning methods in imbalanced training and noisy training scenarios.
arXiv Detail & Related papers (2024-10-17T16:56:01Z)
- How to Train Data-Efficient LLMs [56.41105687693619]
We study data-efficient approaches for pre-training large language models (LLMs).
In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density are the best methods in their respective categories.
arXiv Detail & Related papers (2024-02-15T02:27:57Z)
- Efficient Online Data Mixing For Language Model Pre-Training [101.45242332613944]
Existing data selection methods suffer from slow and computationally expensive processes.
Data mixing, on the other hand, reduces the complexity of data selection by grouping data points together.
We develop an efficient algorithm for Online Data Mixing (ODM) that combines elements from both data selection and data mixing.
arXiv Detail & Related papers (2023-12-05T00:42:35Z)
- SwiftLearn: A Data-Efficient Training Method of Deep Learning Models using Importance Sampling [3.8330834108666667]
We present SwiftLearn, a data-efficient approach to accelerate the training of deep learning models using a subset of the data.
This subset is selected based on an importance criterion measured over the entire dataset during the warm-up stages.
We show that almost 90% of the data can be dropped, achieving an end-to-end average speedup of 3.36x while keeping the average accuracy drop below 0.92%.
arXiv Detail & Related papers (2023-11-25T22:51:01Z)
- Efficient Grammatical Error Correction Via Multi-Task Training and Optimized Training Schedule [55.08778142798106]
We propose auxiliary tasks that exploit the alignment between the original and corrected sentences.
We formulate each task as a sequence-to-sequence problem and perform multi-task training.
We find that the order of datasets used for training and even individual instances within a dataset may have important effects on the final performance.
arXiv Detail & Related papers (2023-11-20T14:50:12Z)
- KAKURENBO: Adaptively Hiding Samples in Deep Neural Network Training [2.8804804517897935]
We propose a method for hiding the least-important samples during the training of deep neural networks.
We adaptively find samples to exclude in a given epoch based on their contribution to the overall learning process.
Our method can reduce total training time by up to 22%, impacting accuracy by only 0.4% compared to the baseline.
arXiv Detail & Related papers (2023-10-16T06:19:29Z)
- D4: Improving LLM Pretraining via Document De-Duplication and Diversification [38.84592304799403]
We show that careful data selection via pre-trained model embeddings can speed up training.
We also show that repeating data intelligently consistently outperforms baseline training.
arXiv Detail & Related papers (2023-08-23T17:58:14Z)
- Dataset Pruning: Reducing Training Data by Examining Generalization Influence [30.30255670341501]
Do all training data contribute to the model's performance?
How can we construct the smallest subset of the entire training data as a proxy training set without significantly sacrificing the model's performance?
arXiv Detail & Related papers (2022-05-19T05:36:35Z)
- Improved Fine-tuning by Leveraging Pre-training Data: Theory and Practice [52.11183787786718]
Fine-tuning a pre-trained model on the target data is widely used in many deep learning applications.
Recent studies have empirically shown that training from scratch can reach a final performance that is no worse than this pre-training strategy.
We propose a novel selection strategy to select a subset from pre-training data to help improve the generalization on the target task.
arXiv Detail & Related papers (2021-11-24T06:18:32Z)
- Predicting Training Time Without Training [120.92623395389255]
We tackle the problem of predicting the number of optimization steps that a pre-trained deep network needs to converge to a given value of the loss function.
We leverage the fact that the training dynamics of a deep network during fine-tuning are well approximated by those of a linearized model.
We are able to predict the time it takes to fine-tune a model to a given loss without having to perform any training.
arXiv Detail & Related papers (2020-08-28T04:29:54Z)
- Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping [62.78338049381917]
Fine-tuning pretrained contextual word embedding models to supervised downstream tasks has become commonplace in natural language processing.
We experiment with four datasets from the GLUE benchmark, fine-tuning BERT hundreds of times on each while varying only the random seeds.
We find substantial performance increases compared to previously reported results, and we quantify how the performance of the best-found model varies as a function of the number of fine-tuning trials.
arXiv Detail & Related papers (2020-02-15T02:40:10Z)