INGENIOUS: Using Informative Data Subsets for Efficient Pre-Training of
Language Models
- URL: http://arxiv.org/abs/2305.06677v2
- Date: Thu, 19 Oct 2023 19:55:20 GMT
- Title: INGENIOUS: Using Informative Data Subsets for Efficient Pre-Training of
Language Models
- Authors: H S V N S Kowndinya Renduchintala, Krishnateja Killamsetty, Sumit
Bhatia, Milan Aggarwal, Ganesh Ramakrishnan, Rishabh Iyer, Balaji
Krishnamurthy
- Abstract summary: We show how we can employ submodular optimization to select highly representative subsets of the training corpora.
We show that the resulting models achieve up to $\sim99\%$ of the performance of the fully-trained models.
- Score: 40.54353850357839
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A salient characteristic of pre-trained language models (PTLMs) is a
remarkable improvement in their generalization capability and emergence of new
capabilities with increasing model capacity and pre-training dataset size.
Consequently, we are witnessing the development of enormous models pushing the
state-of-the-art. It is, however, imperative to realize that this inevitably
leads to prohibitively long training times, extortionate computing costs, and a
detrimental environmental impact. Significant efforts are underway to make PTLM
training more efficient through innovations in model architectures, training
pipelines, and loss function design, with scant attention being paid to
optimizing the utility of training data. The key question we ask is whether it is
possible to train PTLMs using only highly informative subsets of the training
data while maintaining downstream performance. Building
upon the recent progress in informative data subset selection, we show how we
can employ submodular optimization to select highly representative subsets of
the training corpora and demonstrate that the proposed framework can be applied
to efficiently train multiple PTLMs (BERT, BioBERT, GPT-2) using only a
fraction of data. Further, we perform a rigorous empirical evaluation to show
that the resulting models achieve up to $\sim99\%$ of the performance of the
fully-trained models. We made our framework publicly available at
https://github.com/Efficient-AI/ingenious.
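To make the selection idea concrete, the following is a minimal, illustrative sketch of greedy maximization of a facility-location function, a standard submodular objective for choosing representative subsets. It assumes precomputed per-example embeddings; the function name, the cosine similarity (clipped at zero), and the budget are illustrative assumptions, not necessarily the exact objective or implementation used by INGENIOUS (see the linked repository for the authors' code).

```python
import numpy as np

def greedy_facility_location(embeddings, budget):
    """Greedily maximize the facility-location function
    f(S) = sum_i max_{j in S} sim(i, j), a classic monotone submodular objective.

    embeddings: (n, d) array of per-example representations (assumed precomputed).
    budget: number of examples to select.
    Returns the indices of the selected subset.
    """
    # Cosine similarities, clipped at zero so the objective stays monotone.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = np.maximum(normed @ normed.T, 0.0)  # (n, n)

    n = sim.shape[0]
    selected = []
    coverage = np.zeros(n)  # best similarity of each example to the current subset
    for _ in range(min(budget, n)):
        # Marginal gain of adding candidate j: sum_i max(0, sim[i, j] - coverage[i]).
        gains = np.maximum(sim - coverage[:, None], 0.0).sum(axis=0)
        gains[selected] = -np.inf  # never reselect an element
        best = int(np.argmax(gains))
        selected.append(best)
        coverage = np.maximum(coverage, sim[:, best])
    return selected

# Hypothetical usage: pick 10,000 representative examples from embedded sentences.
# subset_idx = greedy_facility_location(sentence_embeddings, budget=10_000)
```

The greedy rule above carries the textbook (1 - 1/e) approximation guarantee for monotone submodular maximization under a cardinality constraint; at pre-training-corpus scale one would typically rely on lazy or stochastic greedy variants rather than the dense n x n similarity matrix built here.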
Related papers
- Machine Unlearning on Pre-trained Models by Residual Feature Alignment Using LoRA [15.542668474378633]
We propose a novel and efficient machine unlearning method for pre-trained models.
We leverage LoRA to decompose the model's intermediate features into pre-trained features and residual features.
The method aims to learn zero residuals on the retained set and shifted residuals on the unlearning set.
arXiv Detail & Related papers (2024-11-13T08:56:35Z)
- Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration [90.41908331897639]
Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data.
We present a novel approach, ReverseGen, designed to automatically generate effective training samples.
arXiv Detail & Related papers (2024-10-22T06:43:28Z)
- POINTS: Improving Your Vision-language Model with Affordable Strategies [28.611705477757454]
We train a robust baseline model using the latest advancements in vision-language models.
We filter pre-training data using perplexity, selecting the lowest-perplexity data for training.
During visual instruction tuning, we apply model soup across different datasets when adding more datasets yields only marginal improvements.
arXiv Detail & Related papers (2024-09-07T13:41:37Z)
- Rethinking Overlooked Aspects in Vision-Language Models [32.525916879333145]
Recent advancements in large vision-language models (LVLMs) have been substantial.
Recent works mainly focus on introducing more pre-training and instruction-tuning data to improve model performance.
This paper delves into the often-neglected aspects of data efficiency during pre-training and the selection process for instruction tuning datasets.
arXiv Detail & Related papers (2024-05-20T07:53:41Z)
- Step-On-Feet Tuning: Scaling Self-Alignment of LLMs via Bootstrapping [53.454408491386886]
Bootstrapping self-alignment markedly surpasses the single-round approach.
We propose Step-On-Feet Tuning (SOFT), which leverages the model's continuously enhanced few-shot ability to boost zero-shot and one-shot performance.
Based on an easy-to-hard training recipe, we propose SOFT+, which further boosts self-alignment performance.
arXiv Detail & Related papers (2024-02-12T12:30:42Z)
- Efficient Grammatical Error Correction Via Multi-Task Training and Optimized Training Schedule [55.08778142798106]
We propose auxiliary tasks that exploit the alignment between the original and corrected sentences.
We formulate each task as a sequence-to-sequence problem and perform multi-task training.
We find that the order of the datasets used for training, and even of individual instances within a dataset, may have important effects on the final performance.
arXiv Detail & Related papers (2023-11-20T14:50:12Z)
- SPDF: Sparse Pre-training and Dense Fine-tuning for Large Language Models [4.114555639014612]
We show the benefits of using unstructured weight sparsity to train only a subset of weights during pre-training.
We demonstrate that we can induce up to 75% sparsity in a 1.3B-parameter GPT-3 XL model, resulting in a 2.5x reduction in pre-training FLOPs.
arXiv Detail & Related papers (2023-03-18T17:56:01Z)
- Self-Distillation for Further Pre-training of Transformers [83.84227016847096]
We propose self-distillation as a regularization for a further pre-training stage.
We empirically validate the efficacy of self-distillation on a variety of benchmark datasets for image and text classification tasks.
arXiv Detail & Related papers (2022-09-30T02:25:12Z)
- BERT WEAVER: Using WEight AVERaging to enable lifelong learning for transformer-based models in biomedical semantic search engines [49.75878234192369]
We present WEAVER, a simple yet efficient post-processing method that infuses old knowledge into the new model.
We show that applying WEAVER sequentially results in word embedding distributions similar to those obtained by combined training on all the data at once.
arXiv Detail & Related papers (2022-02-21T10:34:41Z)
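As a closing illustration of the WEAVER entry above, here is a minimal sketch of the general weight-averaging idea: blending the parameters of an old and a newly trained checkpoint of the same architecture. The function name, the single mixing coefficient alpha, and the plain 50/50 default are illustrative assumptions; the paper's exact weighting scheme may differ.

```python
import torch

def average_weights(old_state, new_state, alpha=0.5):
    """Blend two checkpoints of the same architecture, parameter by parameter.

    alpha is the weight given to the old model; 0.5 is a plain average.
    (Illustrative only; the exact weighting used by WEAVER may differ.)
    """
    averaged = {}
    for name, old_param in old_state.items():
        new_param = new_state[name]
        if torch.is_floating_point(old_param):
            averaged[name] = alpha * old_param + (1.0 - alpha) * new_param
        else:
            # Integer buffers (e.g. position ids) are copied from the new model.
            averaged[name] = new_param
    return averaged

# Hypothetical usage with two compatible checkpoints saved as state dicts:
# old_state = torch.load("bert_old.pt")
# new_state = torch.load("bert_new_domain.pt")
# model.load_state_dict(average_weights(old_state, new_state, alpha=0.5))
```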