Efficient NLP Model Finetuning via Multistage Data Filtering
- URL: http://arxiv.org/abs/2207.14386v2
- Date: Fri, 19 May 2023 02:40:55 GMT
- Title: Efficient NLP Model Finetuning via Multistage Data Filtering
- Authors: Xu Ouyang, Shahina Mohd Azam Ansari, Felix Xiaozhu Lin, Yangfeng Ji
- Abstract summary: We set out to filter training examples in a streaming fashion, in tandem with training the target model.
Our key techniques are (1) automatically determining a training loss threshold for skipping backward training passes and (2) running a meta predictor for further skipping forward training passes.
Our method reduces the required training examples by up to 5.3$\times$ and training time by up to 6.8$\times$, while only seeing minor accuracy degradation.
- Score: 11.058786955754004
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As model finetuning is central to modern NLP, we set out to maximize its
efficiency. Motivated by redundancy in training examples and the sheer sizes of
pretrained models, we exploit a key opportunity: training only on important
data. To this end, we set out to filter training examples in a streaming fashion,
in tandem with training the target model. Our key techniques are two: (1)
automatically determining a training loss threshold for skipping backward
training passes; (2) running a meta predictor for further skipping forward
training passes. We integrate the above techniques in a holistic, three-stage
training process. On a diverse set of benchmarks, our method reduces the required
training examples by up to 5.3$\times$ and training time by up to 6.8$\times$,
while only seeing minor accuracy degradation. Our method is effective even when
training for one epoch, where each training example is encountered only once. It is
simple to implement and is compatible with existing finetuning techniques.
Code is available at: https://github.com/xo28/efficient-NLP-multistage-training
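Below is a minimal PyTorch sketch of the two filtering ideas summarized in the abstract. The threshold-update rule, the gating rule, and the meta-predictor features are illustrative assumptions rather than the paper's exact three-stage design; the authors' implementation is in the linked repository.

    # Hedged sketch: skip backward passes for low-loss examples and skip
    # forward passes when a cheap meta predictor deems an example unimportant.
    import torch

    def finetune_with_filtering(model, meta_predictor, loader, optimizer,
                                init_threshold=0.5, momentum=0.9):
        threshold = init_threshold
        for inputs, labels, cheap_feats in loader:  # cheap_feats: e.g. length
            # Skip the forward pass if the meta predictor says "unimportant"
            # (hypothetical gating rule).
            with torch.no_grad():
                keep_score = torch.sigmoid(meta_predictor(cheap_feats)).mean()
            if keep_score < 0.5:
                continue

            loss = torch.nn.functional.cross_entropy(model(inputs), labels)

            # Skip the backward pass for "easy" examples whose loss already
            # falls below an automatically maintained threshold.
            threshold = momentum * threshold + (1 - momentum) * loss.item()
            if loss.item() < threshold:
                continue

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

The actual method trains the meta predictor online and organizes these skips into the three-stage schedule mentioned in the abstract, which this sketch does not reproduce.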
Related papers
- Boosting Meta-Training with Base Class Information for Few-Shot Learning [35.144099160883606]
We propose an end-to-end training paradigm consisting of two alternating loops.
In the outer loop, we calculate cross entropy loss on the entire training set while updating only the final linear layer.
This training paradigm not only converges quickly but also outperforms existing baselines, indicating that information from the overall training set and the meta-learning training paradigm could mutually reinforce one another.
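A rough PyTorch sketch of the outer-loop step described above, where cross-entropy on the whole training set updates only the final linear layer; the inner meta-learning loop and all other details are omitted and assumed.

    # Outer-loop sketch: compute cross-entropy over the training set while
    # updating only the final linear classifier; the backbone is left frozen
    # in this step (an assumption for this illustration).
    import torch

    def outer_loop_epoch(backbone, final_linear, train_loader, lr=1e-3):
        optimizer = torch.optim.SGD(final_linear.parameters(), lr=lr)
        for images, labels in train_loader:
            with torch.no_grad():            # no gradient into the backbone
                feats = backbone(images)
            logits = final_linear(feats)
            loss = torch.nn.functional.cross_entropy(logits, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()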
arXiv Detail & Related papers (2024-03-06T05:13:23Z) - Time-, Memory- and Parameter-Efficient Visual Adaptation [75.28557015773217]
We propose an adaptation method which does not backpropagate gradients through the backbone.
We achieve this by designing a lightweight network in parallel that operates on features from the frozen, pretrained backbone.
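One way to realize the idea above is a small side network trained on features from the frozen backbone, so no gradients are ever backpropagated through the backbone; the sketch below is an assumed general recipe, not the paper's exact architecture.

    # Hedged sketch: lightweight adapter on frozen, pretrained features.
    import torch
    import torch.nn as nn

    class SideAdapter(nn.Module):
        def __init__(self, feat_dim, num_classes, hidden=256):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, num_classes))

        def forward(self, feats):
            return self.net(feats)

    def adapt(backbone, adapter, loader, lr=1e-3):
        backbone.eval()
        optimizer = torch.optim.Adam(adapter.parameters(), lr=lr)
        for x, y in loader:
            with torch.no_grad():            # backbone never sees gradients
                feats = backbone(x)
            loss = nn.functional.cross_entropy(adapter(feats), y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()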
arXiv Detail & Related papers (2024-02-05T10:55:47Z) - SwiftLearn: A Data-Efficient Training Method of Deep Learning Models
using Importance Sampling [3.8330834108666667]
We present SwiftLearn, a data-efficient approach that accelerates training of deep learning models by using only a subset of the data.
This subset is selected based on an importance criterion measured over the entire dataset during the warm-up stages.
We show that almost 90% of the data can be dropped, achieving an end-to-end average speedup of 3.36x while keeping the average accuracy drop below 0.92%.
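A hedged sketch of importance-based subset selection after warm-up; scoring examples by their loss is an assumption made here, and SwiftLearn's actual criterion may differ.

    # After warm-up, score every example once and keep only the most
    # important fraction for the rest of training.
    import torch
    from torch.utils.data import DataLoader, Subset

    @torch.no_grad()
    def select_important_subset(model, dataset, keep_fraction=0.10):
        model.eval()
        scores = []
        for x, y in DataLoader(dataset, batch_size=64):
            losses = torch.nn.functional.cross_entropy(model(x), y,
                                                       reduction="none")
            scores.append(losses)
        scores = torch.cat(scores)
        k = max(1, int(keep_fraction * len(scores)))
        keep_idx = torch.topk(scores, k).indices.tolist()
        return Subset(dataset, keep_idx)   # train on this subset afterwards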
arXiv Detail & Related papers (2023-11-25T22:51:01Z) - Fast Propagation is Better: Accelerating Single-Step Adversarial
Training via Sampling Subnetworks [69.54774045493227]
A drawback of adversarial training is the computational overhead introduced by the generation of adversarial examples.
We propose to exploit the interior building blocks of the model to improve efficiency.
Compared with previous methods, our method not only reduces the training cost but also achieves better model robustness.
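The sketch below is a loose, hypothetical reading of that idea: a single-step (FGSM-style) perturbation is generated by propagating through only a randomly sampled subset of shape-preserving residual blocks, so example generation is cheaper than a full forward-backward pass.

    # Heavily hedged sketch: craft the single-step perturbation with a random
    # subnetwork (subset of residual blocks), then train on the result.
    # Assumes shape-preserving blocks, a head that accepts block outputs,
    # and inputs normalized to [0, 1].
    import random
    import torch
    import torch.nn as nn

    def fgsm_with_sampled_blocks(blocks, head, x, y, eps=8 / 255, keep=0.5):
        x_adv = x.clone().detach().requires_grad_(True)
        h = x_adv
        for block in blocks:
            if random.random() < keep:       # randomly drop blocks
                h = block(h)
        loss = nn.functional.cross_entropy(head(h), y)
        grad, = torch.autograd.grad(loss, x_adv)
        return (x + eps * grad.sign()).clamp(0, 1).detach()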
arXiv Detail & Related papers (2023-10-24T01:36:20Z) - FTFT: Efficient and Robust Fine-Tuning by Transferring Training Dynamics [7.58472343957521]
We show that training dynamics are highly transferable across model sizes and pre-training methods.
We propose a novel fine-tuning approach: Fine-Tuning by transFerring Training dynamics (FTFT).
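A hedged sketch of one way to transfer training dynamics: record per-example gold-label confidence with a small reference model (dataset-cartography style), then fine-tune the larger model only on the least-confident examples. The scoring and selection rules are assumptions; FTFT's exact recipe may differ.

    # Record per-example dynamics with a cheap reference model, then reuse
    # them to pick data for the expensive model.
    import torch
    from torch.utils.data import DataLoader, Subset

    def record_mean_confidence(ref_model, dataset, optimizer, epochs=3):
        loader = DataLoader(dataset, batch_size=32, shuffle=False)  # keep order
        per_epoch = []
        for _ in range(epochs):
            confs = []
            for x, y in loader:
                logits = ref_model(x)
                probs = torch.softmax(logits.detach(), dim=-1)
                confs.append(probs.gather(1, y.unsqueeze(1)).squeeze(1))
                loss = torch.nn.functional.cross_entropy(logits, y)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            per_epoch.append(torch.cat(confs))
        return torch.stack(per_epoch).mean(dim=0)    # one score per example

    def hardest_subset(dataset, mean_conf, keep_fraction=0.33):
        k = max(1, int(keep_fraction * len(mean_conf)))
        idx = torch.topk(-mean_conf, k).indices.tolist()
        return Subset(dataset, idx)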
arXiv Detail & Related papers (2023-10-10T12:53:48Z) - Knowledge Distillation as Efficient Pre-training: Faster Convergence,
Higher Data-efficiency, and Better Transferability [53.27240222619834]
Knowledge Distillation as Efficient Pre-training aims to efficiently transfer the learned feature representation from pre-trained models to new student models for future downstream tasks.
Our method performs comparably with supervised pre-training counterparts in 3 downstream tasks and 9 downstream datasets, while requiring 10x less data and 5x less pre-training time.
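A hedged sketch of feature-level distillation used as a substitute for pre-training: the student (plus a small projection to match dimensions) regresses the teacher's features. The MSE objective and the projector are illustrative assumptions.

    # Distill the pretrained teacher's features into a fresh student as a
    # cheap "pre-training" stage; downstream finetuning happens afterwards.
    import torch
    import torch.nn as nn

    def distill_pretrain(teacher, student, projector, loader, lr=1e-3):
        teacher.eval()
        params = list(student.parameters()) + list(projector.parameters())
        optimizer = torch.optim.Adam(params, lr=lr)
        for x, _ in loader:                  # labels are not used
            with torch.no_grad():
                t_feat = teacher(x)
            s_feat = projector(student(x))   # match the teacher's feature dim
            loss = nn.functional.mse_loss(s_feat, t_feat)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()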
arXiv Detail & Related papers (2022-03-10T06:23:41Z) - LiST: Lite Self-training Makes Efficient Few-shot Learners [91.28065455714018]
LiST improves by 35% over classic fine-tuning methods and 6% over prompt-tuning, with a 96% reduction in the number of trainable parameters, when fine-tuned with no more than 30 labeled examples from each target domain.
arXiv Detail & Related papers (2021-10-12T18:47:18Z) - Jigsaw Clustering for Unsupervised Visual Representation Learning [68.09280490213399]
We propose a new jigsaw clustering pretext task in this paper.
Our method makes use of both intra-image and inter-image information.
It is even comparable to contrastive learning methods when only half of the training batches are used.
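A simplified, hypothetical sketch of a jigsaw-style pretext objective: patches are cut from every image in a batch, shuffled across the batch, and the model is trained to predict which source image each patch came from. The paper's actual intra-/inter-image clustering losses are more involved.

    # Simplified jigsaw-style pretext step; assumes a fixed batch size so the
    # prediction head can have one output per source image, and an encoder
    # that accepts patch-sized inputs.
    import torch
    import torch.nn as nn

    def jigsaw_step(encoder, head, images, optimizer, grid=2):
        B, C, H, W = images.shape
        ph, pw = H // grid, W // grid
        # Cut each image into grid x grid patches, remembering its source.
        patches = images.unfold(2, ph, ph).unfold(3, pw, pw)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(-1, C, ph, pw)
        source = torch.arange(B).repeat_interleave(grid * grid)
        # Shuffle patches across the whole batch (the "jigsaw" mixing).
        perm = torch.randperm(patches.size(0))
        patches, source = patches[perm], source[perm]
        loss = nn.functional.cross_entropy(head(encoder(patches)), source)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()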
arXiv Detail & Related papers (2021-04-01T08:09:26Z) - A Practical Incremental Method to Train Deep CTR Models [37.54660958085938]
We introduce a practical incremental method to train deep CTR models, which consists of three decoupled modules.
Our method achieves performance comparable to conventional batch-mode training with much better training efficiency.
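A hedged sketch of the basic incremental loop implied above: warm-start from the previous model and update only on newly arrived data, instead of retraining on the full history in batch mode. The paper's three decoupled modules are not reproduced here.

    # Incremental update: reuse the latest checkpoint and train on new data only.
    import torch

    def incremental_update(model, prev_state_dict, new_data_loader, lr=1e-4):
        model.load_state_dict(prev_state_dict)        # warm start
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        for features, clicks in new_data_loader:      # only the fresh slice
            logits = model(features).squeeze(-1)
            loss = torch.nn.functional.binary_cross_entropy_with_logits(
                logits, clicks.float())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        return model.state_dict()                     # next warm start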
arXiv Detail & Related papers (2020-09-04T12:35:42Z) - A Novel DNN Training Framework via Data Sampling and Multi-Task
Optimization [7.001799696806368]
We propose a novel framework to train DNN models.
It generates multiple pairs of training and validation sets from the full training set via random splitting.
Among all trained models, it outputs the one with the best overall performance across the validation sets from all pairs.
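The selection scheme above can be sketched directly: draw several random train/validation splits, train one model per split, and keep the model whose average validation accuracy across all splits is best. Names and hyperparameters below are assumptions.

    # Sketch: train one model per random split and select the model with the
    # best average accuracy over all validation sets.
    import torch
    from torch.utils.data import DataLoader, random_split

    def train_and_select(make_model, dataset, num_pairs=3, val_ratio=0.2):
        n_val = int(len(dataset) * val_ratio)
        splits = [random_split(dataset, [len(dataset) - n_val, n_val])
                  for _ in range(num_pairs)]
        candidates = []
        for train_set, _ in splits:
            model = make_model()
            optimizer = torch.optim.Adam(model.parameters())
            for x, y in DataLoader(train_set, batch_size=32, shuffle=True):
                loss = torch.nn.functional.cross_entropy(model(x), y)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            candidates.append(model)

        @torch.no_grad()
        def avg_val_accuracy(model):
            accs = []
            for _, val_set in splits:        # evaluate on every validation set
                correct = total = 0
                for x, y in DataLoader(val_set, batch_size=64):
                    correct += (model(x).argmax(-1) == y).sum().item()
                    total += y.numel()
                accs.append(correct / total)
            return sum(accs) / len(accs)

        return max(candidates, key=avg_val_accuracy)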
arXiv Detail & Related papers (2020-07-02T10:58:57Z) - The Right Tool for the Job: Matching Model and Instance Complexities [62.95183777679024]
As NLP models become larger, executing a trained model requires significant computational resources, incurring monetary and environmental costs.
We propose a modification to contextual representation fine-tuning which, during inference, allows for an early (and fast) "exit".
We test our proposed modification on five different datasets in two tasks: three text classification datasets and two natural language inference benchmarks.
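A hedged sketch of confidence-based early exit at inference time: a small classifier is attached after each layer, and prediction stops as soon as its softmax confidence clears a threshold, so easy instances use fewer layers. The exit criterion and calibration here are assumptions.

    # Early exit for a single instance (batch size 1 is assumed); hard
    # instances fall through to the full depth of the model.
    import torch

    @torch.no_grad()
    def early_exit_predict(layers, exit_heads, x, threshold=0.9):
        h = x
        for layer, head in zip(layers, exit_heads):
            h = layer(h)
            probs = torch.softmax(head(h), dim=-1)
            confidence, prediction = probs.max(dim=-1)
            if confidence.item() >= threshold:   # easy instance: stop here
                return prediction
        return prediction                        # used the whole model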
arXiv Detail & Related papers (2020-04-16T04:28:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.