Analysing The Impact of Sequence Composition on Language Model Pre-Training
- URL: http://arxiv.org/abs/2402.13991v1
- Date: Wed, 21 Feb 2024 18:23:16 GMT
- Title: Analysing The Impact of Sequence Composition on Language Model Pre-Training
- Authors: Yu Zhao, Yuanbin Qu, Konrad Staniszewski, Szymon Tworkowski, Wei Liu, Piotr Miłoś, Yuxiang Wu, Pasquale Minervini
- Abstract summary: We study the influence of the pre-training sequence composition strategy on the generalisation properties of the model.
Applying causal masking can lead to the inclusion of distracting information from previous documents during pre-training.
In intra-document causal masking, the likelihood of each token is only conditioned on the previous tokens in the same document.
- Score: 20.929800523719187
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Most language model pre-training frameworks concatenate multiple documents
into fixed-length sequences and use causal masking to compute the likelihood of
each token given its context; this strategy is widely adopted due to its
simplicity and efficiency. However, to this day, the influence of the
pre-training sequence composition strategy on the generalisation properties of
the model remains under-explored. In this work, we find that applying causal
masking can lead to the inclusion of distracting information from previous
documents during pre-training, which negatively impacts the performance of the
models on language modelling and downstream tasks. In intra-document causal
masking, the likelihood of each token is only conditioned on the previous
tokens in the same document, eliminating potential distracting information from
previous documents and significantly improving performance. Furthermore, we
find that concatenating related documents can reduce some potential
distractions during pre-training, and our proposed efficient retrieval-based
sequence construction method, BM25Chunk, can improve in-context learning
(+11.6%), knowledge memorisation (+9.8%), and context utilisation (+7.2%)
abilities of language models without sacrificing efficiency.
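To make the intra-document causal masking idea concrete, the sketch below builds the corresponding attention mask for a packed sequence, assuming each token is labelled with the index of the document it came from. The function name and tensor layout are illustrative assumptions, not the authors' implementation: a token may attend to a position only if that position precedes it and belongs to the same document.

```python
import torch

def intra_document_causal_mask(doc_ids: torch.Tensor) -> torch.Tensor:
    """Build a boolean attention mask for a packed training sequence.

    doc_ids: (seq_len,) tensor giving the document index of each token,
             e.g. tensor([0, 0, 0, 1, 1, 2, 2, 2]) for three packed documents.
    Returns a (seq_len, seq_len) mask where mask[i, j] is True iff token i may
    attend to token j: j must not come after i AND must belong to the same
    document, so tokens never see earlier, unrelated documents.
    """
    seq_len = doc_ids.shape[0]
    causal = torch.tril(torch.ones(seq_len, seq_len)).bool()
    same_doc = doc_ids.unsqueeze(0) == doc_ids.unsqueeze(1)
    return causal & same_doc

if __name__ == "__main__":
    doc_ids = torch.tensor([0, 0, 0, 1, 1, 2, 2, 2])
    mask = intra_document_causal_mask(doc_ids)
    # The first token of document 1 (position 3) attends only to itself,
    # not to the three tokens of document 0 that precede it in the sequence.
    print(mask.int())
```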
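The abstract does not detail the BM25Chunk procedure, but the general idea of retrieval-based sequence construction can be sketched as greedily grouping each document with its highest-scoring BM25 neighbours before packing. The snippet below uses the rank_bm25 package; the grouping heuristic and function name are assumptions for illustration, not the paper's exact algorithm.

```python
# Hedged sketch of retrieval-based sequence construction in the spirit of
# BM25Chunk: greedily group each unused document with its top BM25 neighbours
# so that packed pre-training sequences contain related text.
from rank_bm25 import BM25Okapi

def build_related_chunks(documents: list[str], group_size: int = 4) -> list[list[str]]:
    tokenized = [doc.split() for doc in documents]
    bm25 = BM25Okapi(tokenized)

    unused = set(range(len(documents)))
    chunks = []
    while unused:
        seed = min(unused)  # pick any remaining document as the seed
        unused.remove(seed)
        scores = bm25.get_scores(tokenized[seed])
        # Take the highest-scoring still-unused neighbours of the seed document.
        neighbours = sorted(unused, key=lambda i: scores[i], reverse=True)[: group_size - 1]
        for i in neighbours:
            unused.remove(i)
        chunks.append([documents[j] for j in [seed] + neighbours])
    return chunks
```

Each returned group would then be concatenated and split into fixed-length training sequences, so that most tokens in a sequence share topically related context rather than arbitrary neighbouring documents.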
Related papers
- Scalable Influence and Fact Tracing for Large Language Model Pretraining [14.598556308631018]
Training data attribution (TDA) methods aim to attribute model outputs back to specific training examples.
This paper refines existing gradient-based methods to work effectively at scale.
arXiv Detail & Related papers (2024-10-22T20:39:21Z) - Manual Verbalizer Enrichment for Few-Shot Text Classification [1.860409237919611]
MAVE is an approach for verbalizer construction by enrichment of class labels.
Our model achieves state-of-the-art results while using significantly fewer resources.
arXiv Detail & Related papers (2024-10-08T16:16:47Z) - CASA: Class-Agnostic Shared Attributes in Vision-Language Models for Efficient Incremental Object Detection [30.46562066023117]
We propose a novel method utilizing attributes in vision-language foundation models for incremental object detection.
Our method constructs a Class-Agnostic Shared Attribute base (CASA) to capture common semantic information among incremental classes.
Through parameter-efficient fine-tuning, our method adds only 0.7% to parameter storage while significantly enhancing its scalability and adaptability.
arXiv Detail & Related papers (2024-10-08T08:36:12Z) - In-context Pretraining: Language Modeling Beyond Document Boundaries [137.53145699439898]
In-Context Pretraining is a new approach where language models are pretrained on a sequence of related documents.
We introduce approximate algorithms for finding related documents with efficient nearest neighbor search.
We see notable improvements in tasks that require more complex contextual reasoning.
arXiv Detail & Related papers (2023-10-16T17:57:12Z) - Causal Document-Grounded Dialogue Pre-training [81.16429056652483]
We present a causally-complete dataset construction strategy for building million-level DocGD pre-training corpora.
Experiments on three benchmark datasets demonstrate that our causal pre-training achieves considerable and consistent improvements under fully-supervised, low-resource, few-shot, and zero-shot settings.
arXiv Detail & Related papers (2023-05-18T12:39:25Z) - Revisiting text decomposition methods for NLI-based factuality scoring of summaries [9.044665059626958]
We show that fine-grained decomposition is not always a winning strategy for factuality scoring.
We also show that small changes to previously proposed entailment-based scoring methods can result in better performance.
arXiv Detail & Related papers (2022-11-30T09:54:37Z) - Improving Pre-trained Language Model Fine-tuning with Noise Stability Regularization [94.4409074435894]
We propose a novel and effective fine-tuning framework, named Layerwise Noise Stability Regularization (LNSR).
Specifically, we propose to inject standard Gaussian noise and regularize the hidden representations of the fine-tuned model.
We demonstrate the advantages of the proposed method over other state-of-the-art algorithms including L2-SP, Mixout and SMART.
arXiv Detail & Related papers (2022-06-12T04:42:49Z) - A Generative Language Model for Few-shot Aspect-Based Sentiment Analysis [90.24921443175514]
We focus on aspect-based sentiment analysis, which involves extracting aspect terms and categories and predicting their corresponding polarities.
We propose to reformulate the extraction and prediction tasks as a single sequence generation task, using a generative language model with unidirectional attention.
Our approach outperforms the previous state-of-the-art (based on BERT) in average performance by a large margin in both few-shot and full-shot settings.
arXiv Detail & Related papers (2022-04-11T18:31:53Z) - Pre-training via Paraphrasing [96.79972492585112]
We introduce MARGE, a pre-trained sequence-to-sequence model learned with an unsupervised multi-lingual paraphrasing objective.
We show it is possible to jointly learn to do retrieval and reconstruction, given only a random initialization.
For example, with no additional task-specific training we achieve BLEU scores of up to 35.8 for document translation.
arXiv Detail & Related papers (2020-06-26T14:43:43Z) - Pre-training Text Representations as Meta Learning [113.3361289756749]
We introduce a learning algorithm that directly optimizes the model's ability to learn text representations for effective learning of downstream tasks.
We show that there is an intrinsic connection between multi-task pre-training and model-agnostic meta-learning with a sequence of meta-train steps.
arXiv Detail & Related papers (2020-04-12T09:05:47Z)