Analysing The Impact of Sequence Composition on Language Model
Pre-Training
- URL: http://arxiv.org/abs/2402.13991v1
- Date: Wed, 21 Feb 2024 18:23:16 GMT
- Title: Analysing The Impact of Sequence Composition on Language Model
Pre-Training
- Authors: Yu Zhao, Yuanbin Qu, Konrad Staniszewski, Szymon Tworkowski, Wei Liu,
Piotr Mi{\l}o\'s, Yuxiang Wu, Pasquale Minervini
- Abstract summary: We study the influence of the pre-training sequence composition strategy on the generalisation properties of the model.
Applying causal masking can lead to the inclusion of distracting information from previous documents during pre-training.
In intra-document causal masking, the likelihood of each token is only conditioned on the previous tokens in the same document.
- Score: 20.929800523719187
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Most language model pre-training frameworks concatenate multiple documents
into fixed-length sequences and use causal masking to compute the likelihood of
each token given its context; this strategy is widely adopted due to its
simplicity and efficiency. However, to this day, the influence of the
pre-training sequence composition strategy on the generalisation properties of
the model remains under-explored. In this work, we find that applying causal
masking can lead to the inclusion of distracting information from previous
documents during pre-training, which negatively impacts the performance of the
models on language modelling and downstream tasks. In intra-document causal
masking, the likelihood of each token is only conditioned on the previous
tokens in the same document, eliminating potential distracting information from
previous documents and significantly improving performance. Furthermore, we
find that concatenating related documents can reduce some potential
distractions during pre-training, and our proposed efficient retrieval-based
sequence construction method, BM25Chunk, can improve in-context learning
(+11.6\%), knowledge memorisation (+9.8\%), and context utilisation (+7.2\%)
abilities of language models without sacrificing efficiency.
Related papers
- In-context Pretraining: Language Modeling Beyond Document Boundaries [137.53145699439898]
In-Context Pretraining is a new approach where language models are pretrained on a sequence of related documents.
We introduce approximate algorithms for finding related documents with efficient nearest neighbor search.
We see notable improvements in tasks that require more complex contextual reasoning.
arXiv Detail & Related papers (2023-10-16T17:57:12Z) - Causal Document-Grounded Dialogue Pre-training [81.16429056652483]
We present a causally-complete dataset construction strategy for building million-level DocGD pre-training corpora.
Experiments on three benchmark datasets demonstrate that our causal pre-training achieves considerable and consistent improvements under fully-supervised, low-resource, few-shot, and zero-shot settings.
arXiv Detail & Related papers (2023-05-18T12:39:25Z) - Class Enhancement Losses with Pseudo Labels for Zero-shot Semantic
Segmentation [40.09476732999614]
Mask proposal models have significantly improved the performance of zero-shot semantic segmentation.
The use of a background' embedding during training in these methods is problematic as the resulting model tends to over-learn and assign all unseen classes as the background class instead of their correct labels.
This paper proposes novel class enhancement losses to bypass the use of the background embbedding during training, and simultaneously exploit the semantic relationship between text embeddings and mask proposals by ranking the similarity scores.
arXiv Detail & Related papers (2023-01-18T06:55:02Z) - Revisiting text decomposition methods for NLI-based factuality scoring
of summaries [9.044665059626958]
We show that fine-grained decomposition is not always a winning strategy for factuality scoring.
We also show that small changes to previously proposed entailment-based scoring methods can result in better performance.
arXiv Detail & Related papers (2022-11-30T09:54:37Z) - Using Deep Mixture-of-Experts to Detect Word Meaning Shift for TempoWiC [0.9543943371833467]
This paper describes the dma submission to the TempoWiC task, which achieves a macro-F1 score of 77.05%.
For further improvement, we integrate POS information and word semantic representation using a Mixture-of-Experts (MoE) approach.
arXiv Detail & Related papers (2022-11-07T11:28:34Z) - Improving Pre-trained Language Model Fine-tuning with Noise Stability
Regularization [94.4409074435894]
We propose a novel and effective fine-tuning framework, named Layerwise Noise Stability Regularization (LNSR)
Specifically, we propose to inject the standard Gaussian noise and regularize hidden representations of the fine-tuned model.
We demonstrate the advantages of the proposed method over other state-of-the-art algorithms including L2-SP, Mixout and SMART.
arXiv Detail & Related papers (2022-06-12T04:42:49Z) - A Generative Language Model for Few-shot Aspect-Based Sentiment Analysis [90.24921443175514]
We focus on aspect-based sentiment analysis, which involves extracting aspect term, category, and predicting their corresponding polarities.
We propose to reformulate the extraction and prediction tasks into the sequence generation task, using a generative language model with unidirectional attention.
Our approach outperforms the previous state-of-the-art (based on BERT) on average performance by a large margins in few-shot and full-shot settings.
arXiv Detail & Related papers (2022-04-11T18:31:53Z) - Pre-training via Paraphrasing [96.79972492585112]
We introduce MARGE, a pre-trained sequence-to-sequence model learned with an unsupervised multi-lingual paraphrasing objective.
We show it is possible to jointly learn to do retrieval and reconstruction, given only a random initialization.
For example, with no additional task-specific training we achieve BLEU scores of up to 35.8 for document translation.
arXiv Detail & Related papers (2020-06-26T14:43:43Z) - Pre-training Text Representations as Meta Learning [113.3361289756749]
We introduce a learning algorithm which directly optimize model's ability to learn text representations for effective learning of downstream tasks.
We show that there is an intrinsic connection between multi-task pre-training and model-agnostic meta-learning with a sequence of meta-train steps.
arXiv Detail & Related papers (2020-04-12T09:05:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.