Analysing The Impact of Sequence Composition on Language Model Pre-Training
- URL: http://arxiv.org/abs/2402.13991v1
- Date: Wed, 21 Feb 2024 18:23:16 GMT
- Title: Analysing The Impact of Sequence Composition on Language Model Pre-Training
- Authors: Yu Zhao, Yuanbin Qu, Konrad Staniszewski, Szymon Tworkowski, Wei Liu, Piotr Miłoś, Yuxiang Wu, Pasquale Minervini
- Abstract summary: We study the influence of the pre-training sequence composition strategy on the generalisation properties of the model.
Applying causal masking can lead to the inclusion of distracting information from previous documents during pre-training.
In intra-document causal masking, the likelihood of each token is only conditioned on the previous tokens in the same document.
- Score: 20.929800523719187
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Most language model pre-training frameworks concatenate multiple documents
into fixed-length sequences and use causal masking to compute the likelihood of
each token given its context; this strategy is widely adopted due to its
simplicity and efficiency. However, to this day, the influence of the
pre-training sequence composition strategy on the generalisation properties of
the model remains under-explored. In this work, we find that applying causal
masking can lead to the inclusion of distracting information from previous
documents during pre-training, which negatively impacts the performance of the
models on language modelling and downstream tasks. In intra-document causal
masking, the likelihood of each token is only conditioned on the previous
tokens in the same document, eliminating potential distracting information from
previous documents and significantly improving performance. Furthermore, we
find that concatenating related documents can reduce some potential
distractions during pre-training, and our proposed efficient retrieval-based
sequence construction method, BM25Chunk, can improve in-context learning
(+11.6%), knowledge memorisation (+9.8%), and context utilisation (+7.2%)
abilities of language models without sacrificing efficiency.
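To make the intra-document causal masking idea concrete, the sketch below builds the corresponding attention mask for a packed sequence, assuming each token is labelled with the index of the document it came from. The function name and tensor layout are illustrative assumptions, not the authors' implementation: a token may attend to a position only if that position precedes it and belongs to the same document.

```python
import torch

def intra_document_causal_mask(doc_ids: torch.Tensor) -> torch.Tensor:
    """Build a boolean attention mask for a packed training sequence.

    doc_ids: (seq_len,) tensor giving the document index of each token,
             e.g. tensor([0, 0, 0, 1, 1, 2, 2, 2]) for three packed documents.
    Returns a (seq_len, seq_len) mask where mask[i, j] is True iff token i may
    attend to token j: j must not come after i AND must belong to the same
    document, so tokens never see earlier, unrelated documents.
    """
    seq_len = doc_ids.shape[0]
    causal = torch.tril(torch.ones(seq_len, seq_len)).bool()
    same_doc = doc_ids.unsqueeze(0) == doc_ids.unsqueeze(1)
    return causal & same_doc

if __name__ == "__main__":
    doc_ids = torch.tensor([0, 0, 0, 1, 1, 2, 2, 2])
    mask = intra_document_causal_mask(doc_ids)
    # The first token of document 1 (position 3) attends only to itself,
    # not to the three tokens of document 0 that precede it in the sequence.
    print(mask.int())
```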
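The abstract does not detail the BM25Chunk procedure, but the general idea of retrieval-based sequence construction can be sketched as greedily grouping each document with its highest-scoring BM25 neighbours before packing. The snippet below uses the rank_bm25 package; the grouping heuristic and function name are assumptions for illustration, not the paper's exact algorithm.

```python
# Hedged sketch of retrieval-based sequence construction in the spirit of
# BM25Chunk: greedily group each unused document with its top BM25 neighbours
# so that packed pre-training sequences contain related text.
from rank_bm25 import BM25Okapi

def build_related_chunks(documents: list[str], group_size: int = 4) -> list[list[str]]:
    tokenized = [doc.split() for doc in documents]
    bm25 = BM25Okapi(tokenized)

    unused = set(range(len(documents)))
    chunks = []
    while unused:
        seed = min(unused)  # pick any remaining document as the seed
        unused.remove(seed)
        scores = bm25.get_scores(tokenized[seed])
        # Take the highest-scoring still-unused neighbours of the seed document.
        neighbours = sorted(unused, key=lambda i: scores[i], reverse=True)[: group_size - 1]
        for i in neighbours:
            unused.remove(i)
        chunks.append([documents[j] for j in [seed] + neighbours])
    return chunks
```

Each returned group would then be concatenated and split into fixed-length training sequences, so that most tokens in a sequence share topically related context rather than arbitrary neighbouring documents.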
Related papers
- Scalable Influence and Fact Tracing for Large Language Model Pretraining [14.598556308631018]
Training data attribution (TDA) methods aim to attribute model outputs back to specific training examples.
This paper refines existing gradient-based methods to work effectively at scale.
arXiv Detail & Related papers (2024-10-22T20:39:21Z) - Manual Verbalizer Enrichment for Few-Shot Text Classification [1.860409237919611]
MAVE is an approach for verbalizer construction by enrichment of class labels.
Our model achieves state-of-the-art results while using significantly fewer resources.
arXiv Detail & Related papers (2024-10-08T16:16:47Z) - CASA: Class-Agnostic Shared Attributes in Vision-Language Models for Efficient Incremental Object Detection [30.46562066023117]
We propose a novel method utilizing attributes in vision-language foundation models for incremental object detection.
Our method constructs a Class-Agnostic Shared Attribute base (CASA) to capture common semantic information among incremental classes.
Through parameter-efficient fine-tuning, our method adds only 0.7% to parameter storage while significantly enhancing its scalability and adaptability.
arXiv Detail & Related papers (2024-10-08T08:36:12Z) - In-context Pretraining: Language Modeling Beyond Document Boundaries [137.53145699439898]
In-Context Pretraining is a new approach where language models are pretrained on a sequence of related documents.
We introduce approximate algorithms for finding related documents with efficient nearest neighbor search.
We see notable improvements in tasks that require more complex contextual reasoning.
arXiv Detail & Related papers (2023-10-16T17:57:12Z) - Causal Document-Grounded Dialogue Pre-training [81.16429056652483]
We present a causally-complete dataset construction strategy for building million-level DocGD pre-training corpora.
Experiments on three benchmark datasets demonstrate that our causal pre-training achieves considerable and consistent improvements under fully-supervised, low-resource, few-shot, and zero-shot settings.
arXiv Detail & Related papers (2023-05-18T12:39:25Z) - Revisiting text decomposition methods for NLI-based factuality scoring of summaries [9.044665059626958]
We show that fine-grained decomposition is not always a winning strategy for factuality scoring.
We also show that small changes to previously proposed entailment-based scoring methods can result in better performance.
arXiv Detail & Related papers (2022-11-30T09:54:37Z) - Improving Pre-trained Language Model Fine-tuning with Noise Stability Regularization [94.4409074435894]
We propose a novel and effective fine-tuning framework, named Layerwise Noise Stability Regularization (LNSR).
Specifically, we propose to inject standard Gaussian noise and regularize the hidden representations of the fine-tuned model.
We demonstrate the advantages of the proposed method over other state-of-the-art algorithms including L2-SP, Mixout and SMART.
arXiv Detail & Related papers (2022-06-12T04:42:49Z) - A Generative Language Model for Few-shot Aspect-Based Sentiment Analysis [90.24921443175514]
We focus on aspect-based sentiment analysis, which involves extracting aspect terms and categories and predicting their corresponding polarities.
We propose to reformulate the extraction and prediction tasks as a single sequence generation task, using a generative language model with unidirectional attention.
Our approach outperforms the previous state-of-the-art (based on BERT) in average performance by a large margin in both few-shot and full-shot settings.
arXiv Detail & Related papers (2022-04-11T18:31:53Z) - Pre-training via Paraphrasing [96.79972492585112]
We introduce MARGE, a pre-trained sequence-to-sequence model learned with an unsupervised multi-lingual paraphrasing objective.
We show it is possible to jointly learn to do retrieval and reconstruction, given only a random initialization.
For example, with no additional task-specific training we achieve BLEU scores of up to 35.8 for document translation.
arXiv Detail & Related papers (2020-06-26T14:43:43Z) - Pre-training Text Representations as Meta Learning [113.3361289756749]
We introduce a learning algorithm that directly optimizes the model's ability to learn text representations for effective learning of downstream tasks.
We show that there is an intrinsic connection between multi-task pre-training and model-agnostic meta-learning with a sequence of meta-train steps.
arXiv Detail & Related papers (2020-04-12T09:05:47Z)