Related papers: Extending Input Contexts of Language Models through Training on Segmented Sequences

Extending Input Contexts of Language Models through Training on Segmented Sequences

URL: http://arxiv.org/abs/2310.14633v3
Date: Wed, 19 Jun 2024 14:00:27 GMT
Title: Extending Input Contexts of Language Models through Training on Segmented Sequences
Authors: Petros Karypis, Julian McAuley, George Karypis,
Abstract summary: We develop a training procedure to extend the input context size of pretrained models with no architectural changes. We demonstrate our method can extend input contexts by a factor of 4x while improving perplexity.
Score: 34.42433279419559
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Effectively training language models on long inputs poses many technical challenges. As a cost consideration, languages models are pretrained on a fixed sequence length before being adapted to longer sequences. We explore various methods for adapting models to longer inputs by training on segmented sequences and an interpolation-based method for extending absolute positional embeddings. We develop a training procedure to extend the input context size of pretrained models with no architectural changes and no additional memory costs than training on the original input lengths. By sub-sampling segments from long inputs while maintaining their original position the model is able to learn new positional interactions. Our method benefits both models trained with absolute positional embeddings, by extending their input contexts, as well as popular relative positional embedding methods showing a reduced perplexity on sequences longer than they were trained on. We demonstrate our method can extend input contexts by a factor of 4x while improving perplexity.

Related papers

Pretraining Language Models for Diachronic Linguistic Change Discovery [8.203894221271302]
We show that efficient pretraining techniques can produce useful models over corpora too large for easy manual inspection. We employ a novel date-attribution pipeline in order to obtain a temporally-segmented dataset of five 10-million-word slices. We find that the pretrained models are faster to train than the finetuned baselines and that they better respect the historical divisions of our corpus.
arXiv Detail & Related papers (2025-04-07T21:51:32Z)
Dataset Decomposition: Faster LLM Training with Variable Sequence Length Curriculum [30.46329559544246]
We introduce dataset decomposition, a novel variable sequence length training technique. We train an 8k context-length 1B model at the same cost as a 2k context-length model trained with the baseline approach. Experiments on a web-scale corpus demonstrate that our approach significantly enhances performance on standard language evaluations and long-context benchmarks.
arXiv Detail & Related papers (2024-05-21T22:26:01Z)
TAMS: Translation-Assisted Morphological Segmentation [3.666125285899499]
We present a sequence-to-sequence model for canonical morpheme segmentation. Our model outperforms the baseline in a super-low resource setting but yields mixed results on training splits with more data. While further work is needed to make translations useful in higher-resource settings, our model shows promise in severely resource-constrained settings.
arXiv Detail & Related papers (2024-03-21T21:23:35Z)
Text-to-Code Generation with Modality-relative Pre-training [6.546893206010636]
We investigate how sequence tokens can be adapted depending on which modality they belong to. We focus on text-to-code generation and observe consistent improvements across two backbone models and two test sets.
arXiv Detail & Related papers (2024-02-08T16:17:24Z)
A Frustratingly Easy Improvement for Position Embeddings via Random Padding [68.75670223005716]
In this paper, we propose a simple but effective strategy, Random Padding, without any modifications to existing pre-trained language models. Experiments show that Random Padding can significantly improve model performance on the instances whose answers are located at rear positions.
arXiv Detail & Related papers (2023-05-08T17:08:14Z)
Language Model Pre-Training with Sparse Latent Typing [66.75786739499604]
We propose a new pre-training objective, Sparse Latent Typing, which enables the model to sparsely extract sentence-level keywords with diverse latent types. Experimental results show that our model is able to learn interpretable latent type categories in a self-supervised manner without using any external knowledge.
arXiv Detail & Related papers (2022-10-23T00:37:08Z)
Token-wise Curriculum Learning for Neural Machine Translation [94.93133801641707]
Existing curriculum learning approaches to Neural Machine Translation (NMT) require sufficient sampling amounts of "easy" samples from training data at the early training stage. We propose a novel token-wise curriculum learning approach that creates sufficient amounts of easy samples. Our approach can consistently outperform baselines on 5 language pairs, especially for low-resource languages.
arXiv Detail & Related papers (2021-03-20T03:57:59Z)
Active Learning for Sequence Tagging with Deep Pre-trained Models and Bayesian Uncertainty Estimates [52.164757178369804]
Recent advances in transfer learning for natural language processing in conjunction with active learning open the possibility to significantly reduce the necessary annotation budget. We conduct an empirical study of various Bayesian uncertainty estimation methods and Monte Carlo dropout options for deep pre-trained models in the active learning framework. We also demonstrate that to acquire instances during active learning, a full-size Transformer can be substituted with a distilled version, which yields better computational performance.
arXiv Detail & Related papers (2021-01-20T13:59:25Z)
Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting. Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking. We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z)
Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary$-$typically selected before training and permanently fixed later$-$affects its size. We propose a fully compositional output embedding layer for language models. To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
arXiv Detail & Related papers (2020-09-24T07:21:14Z)

This list is automatically generated from the titles and abstracts of the papers in this site.