A Compact Pretraining Approach for Neural Language Models
- URL: http://arxiv.org/abs/2208.12367v2
- Date: Mon, 29 Aug 2022 00:54:42 GMT
- Title: A Compact Pretraining Approach for Neural Language Models
- Authors: Shahriar Golchin, Mihai Surdeanu, Nazgol Tavabi, Ata Kiapour
- Abstract summary: We show that pretrained NLMs learn in-domain information more effectively and faster from a compact subset of the data.
We construct these compact subsets from the unstructured data using a combination of abstractive summaries and extractive keywords.
Our strategy reduces pretraining time by up to five times compared to vanilla pretraining.
- Score: 21.767174489837828
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Domain adaptation for large neural language models (NLMs) is coupled with
massive amounts of unstructured data in the pretraining phase. In this study,
however, we show that pretrained NLMs learn in-domain information more
effectively and faster from a compact subset of the data that focuses on the
key information in the domain. We construct these compact subsets from the
unstructured data using a combination of abstractive summaries and extractive
keywords. In particular, we rely on BART to generate abstractive summaries, and
KeyBERT to extract keywords from these summaries (or the original unstructured
text directly). We evaluate our approach using six different settings: three
datasets combined with two distinct NLMs. Our results reveal that the
task-specific classifiers trained on top of NLMs pretrained using our method
outperform methods based on traditional pretraining, i.e., random masking on
the entire data, as well as methods without pretraining. Further, we show that
our strategy reduces pretraining time by up to five times compared to vanilla
pretraining. The code for all of our experiments is publicly available at
https://github.com/shahriargolchin/compact-pretraining.
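The pipeline described in the abstract can be sketched in a few lines. The snippet below is a minimal illustration, assuming the Hugging Face transformers summarization pipeline with a BART checkpoint and the keybert package; the checkpoint name, summary length, number of keywords, and the way summaries and keywords are concatenated are assumptions for illustration, not the authors' exact configuration (see the linked repository for the official implementation).

```python
from transformers import pipeline
from keybert import KeyBERT

# Abstractive summarizer (BART) and extractive keyword model (KeyBERT).
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
kw_model = KeyBERT()

def compact_example(document: str,
                    max_summary_tokens: int = 128,
                    top_n_keywords: int = 10) -> str:
    """Turn one unstructured in-domain document into a compact pretraining example."""
    # 1) Abstractive summary of the raw document.
    summary = summarizer(
        document,
        max_length=max_summary_tokens,
        min_length=16,
        truncation=True,
    )[0]["summary_text"]

    # 2) Extractive keywords from the summary; the paper also evaluates a
    #    variant that extracts keywords from the original text directly.
    keywords = [
        kw
        for kw, _score in kw_model.extract_keywords(
            summary,
            keyphrase_ngram_range=(1, 2),
            stop_words="english",
            top_n=top_n_keywords,
        )
    ]

    # 3) Combine summary and keywords into a single compact training example.
    return summary + " " + " ".join(keywords)

# Hypothetical in-domain documents; in practice this would be the full
# unstructured domain corpus.
raw_domain_documents = [
    "Patients presenting with acute chest pain were triaged and monitored overnight.",
]
compact_corpus = [compact_example(doc) for doc in raw_domain_documents]
```

The resulting compact corpus is then used for continued masked-language-model pretraining of the target NLM before the task-specific classifier is trained on top; that step can reuse a standard MLM pretraining pipeline and is omitted here.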
Related papers
- Bucket Pre-training is All You Need [9.332544709626875]
Large language models (LLMs) have demonstrated exceptional performance across various natural language processing tasks.
The conventional fixed-length data composition strategy for pretraining, which involves concatenating and splitting documents, can introduce noise and limit the model's ability to capture long-range dependencies.
We propose a multi-bucket data composition method that moves beyond the fixed-length paradigm, offering a more flexible and efficient approach to pretraining.
arXiv Detail & Related papers (2024-07-10T09:27:23Z)
- Unsupervised Pre-training with Language-Vision Prompts for Low-Data Instance Segmentation [105.23631749213729]
We propose a novel method for unsupervised pre-training in low-data regimes.
Inspired by the recently successful prompting technique, we introduce a new method, Unsupervised Pre-training with Language-Vision Prompts.
We show that our method can converge faster and perform better than CNN-based models in low-data regimes.
arXiv Detail & Related papers (2024-05-22T06:48:43Z)
- Towards Efficient Active Learning in NLP via Pretrained Representations [1.90365714903665]
Fine-tuning Large Language Models (LLMs) is now a common approach for text classification in a wide range of applications.
We drastically expedite this process by using pretrained representations of LLMs within the active learning loop.
Our strategy yields similar performance to fine-tuning all the way through the active learning loop but is orders of magnitude less computationally expensive.
arXiv Detail & Related papers (2024-02-23T21:28:59Z)
- DST-Det: Simple Dynamic Self-Training for Open-Vocabulary Object Detection [72.25697820290502]
This work introduces a straightforward and efficient strategy to identify potential novel classes through zero-shot classification.
We refer to this approach as the self-training strategy, which enhances recall and accuracy for novel classes without requiring extra annotations, datasets, or re-training.
Empirical evaluations on three datasets, including LVIS, V3Det, and COCO, demonstrate significant improvements over the baseline performance.
arXiv Detail & Related papers (2023-10-02T17:52:24Z)
- ReGen: Zero-Shot Text Classification via Training Data Generation with Progressive Dense Retrieval [22.882301169283323]
We propose a retrieval-enhanced framework to create training data from a general-domain unlabeled corpus.
Experiments on nine datasets demonstrate that REGEN achieves a 4.3% gain over the strongest baselines and saves around 70% of the time compared to baselines that use large NLG models.
arXiv Detail & Related papers (2023-05-18T04:30:09Z)
- M-Tuning: Prompt Tuning with Mitigated Label Bias in Open-Set Scenarios [103.6153593636399]
We propose a vision-language prompt tuning method with mitigated label bias (M-Tuning).
It introduces open words from WordNet to extend the range of words that form the prompt texts beyond the closed-set label words, so that prompts are tuned in a simulated open-set scenario.
Our method achieves the best performance on datasets with various scales, and extensive ablation studies also validate its effectiveness.
arXiv Detail & Related papers (2023-03-09T09:05:47Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- Beyond prompting: Making Pre-trained Language Models Better Zero-shot Learners by Clustering Representations [24.3378487252621]
We show that zero-shot text classification can be improved simply by clustering texts in the embedding spaces of pre-trained language models.
Our approach achieves an average of 20% absolute improvement over prompt-based zero-shot learning.
arXiv Detail & Related papers (2022-10-29T16:01:51Z)
- Towards General and Efficient Active Learning [20.888364610175987]
Active learning aims to select the most informative samples to exploit limited annotation budgets.
We propose a novel general and efficient active learning (GEAL) method in this paper.
Our method can conduct data selection processes on different datasets with a single-pass inference of the same model.
arXiv Detail & Related papers (2021-12-15T08:35:28Z)
- Few-Shot Named Entity Recognition: A Comprehensive Study [92.40991050806544]
We investigate three schemes to improve the model generalization ability for few-shot settings.
We perform empirical comparisons on 10 public NER datasets with various proportions of labeled data.
We establish new state-of-the-art results in both few-shot and training-free settings.
arXiv Detail & Related papers (2020-12-29T23:43:16Z)
- Neural Semi-supervised Learning for Text Classification Under Large-Scale Pretraining [51.19885385587916]
We conduct studies on semi-supervised learning in the task of text classification under the context of large-scale LM pretraining.
Our work marks an initial step in understanding the behavior of semi-supervised learning models under the context of large-scale pretraining.
arXiv Detail & Related papers (2020-11-17T13:39:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.