Related papers: Dataset Decomposition: Faster LLM Training with Variable Sequence Length Curriculum

Dataset Decomposition: Faster LLM Training with Variable Sequence Length Curriculum

URL: http://arxiv.org/abs/2405.13226v1
Date: Tue, 21 May 2024 22:26:01 GMT
Title: Dataset Decomposition: Faster LLM Training with Variable Sequence Length Curriculum
Authors: Hadi Pouransari, Chun-Liang Li, Jen-Hao Rick Chang, Pavan Kumar Anasosalu Vasu, Cem Koc, Vaishaal Shankar, Oncel Tuzel,
Abstract summary: We introduce dataset decomposition, a novel variable sequence length training technique. We train an 8k context-length 1B model at the same cost as a 2k context-length model trained with the baseline approach. Experiments on a web-scale corpus demonstrate that our approach significantly enhances performance on standard language evaluations and long-context benchmarks.
Score: 30.46329559544246
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) are commonly trained on datasets consisting of fixed-length token sequences. These datasets are created by randomly concatenating documents of various lengths and then chunking them into sequences of a predetermined target length. However, this method of concatenation can lead to cross-document attention within a sequence, which is neither a desirable learning signal nor computationally efficient. Additionally, training on long sequences becomes computationally prohibitive due to the quadratic cost of attention. In this study, we introduce dataset decomposition, a novel variable sequence length training technique, to tackle these challenges. We decompose a dataset into a union of buckets, each containing sequences of the same size extracted from a unique document. During training, we use variable sequence length and batch size, sampling simultaneously from all buckets with a curriculum. In contrast to the concat-and-chunk baseline, which incurs a fixed attention cost at every step of training, our proposed method incurs a penalty proportional to the actual document lengths at each step, resulting in significant savings in training time. We train an 8k context-length 1B model at the same cost as a 2k context-length model trained with the baseline approach. Experiments on a web-scale corpus demonstrate that our approach significantly enhances performance on standard language evaluations and long-context benchmarks, reaching target accuracy 3x faster compared to the baseline. Our method not only enables efficient pretraining on long sequences but also scales effectively with dataset size. Lastly, we shed light on a critical yet less studied aspect of training large language models: the distribution and curriculum of sequence lengths, which results in a non-negligible difference in performance.

Related papers

Hierarchical Balance Packing: Towards Efficient Supervised Fine-tuning for Long-Context LLM [49.2709992932292]
Training Long-Context Large Language Models (LLMs) is challenging, as hybrid training with long-context and short-context data often leads to workload imbalances.<n>Existing works mainly use data packing to alleviate this issue, but fail to consider imbalanced attention computation and wasted communication overhead.<n>This paper proposes Hierarchical Balance Packing (HBP), which designs a novel batch-construction method and training recipe to address those inefficiencies.
arXiv Detail & Related papers (2025-03-10T10:52:50Z)
How to Train Long-Context Language Models (Effectively) [75.5418485597276]
We study continued training and supervised fine-tuning (SFT) of a language model (LM) to make effective use of long-context information. ProLong-8B, which is from Llama-3 and trained on 40B tokens, demonstrates state-of-the-art long-context performance among similarly sized models at a length of 128K.
arXiv Detail & Related papers (2024-10-03T16:46:52Z)
Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data [76.90128359866462]
We introduce an extended concept of memorization, distributional memorization, which measures the correlation between the output probabilities and the pretraining data frequency. This study demonstrates that memorization plays a larger role in simpler, knowledge-intensive tasks, while generalization is the key for harder, reasoning-based tasks.
arXiv Detail & Related papers (2024-07-20T21:24:40Z)
Bucket Pre-training is All You Need [9.332544709626875]
Large language models (LLMs) have demonstrated exceptional performance across various natural language processing tasks. The conventional fixed-length data composition strategy for pretraining, which involves concatenating and splitting documents, can introduce noise and limit the model's ability to capture long-range dependencies. We propose a multi-bucket data composition method that moves beyond the fixed-length paradigm, offering a more flexible and efficient approach to pretraining.
arXiv Detail & Related papers (2024-07-10T09:27:23Z)
Equipping Transformer with Random-Access Reading for Long-Context Understanding [9.433800833564279]
Long-context modeling presents a significant challenge for transformer-based large language models. We propose a novel reading strategy that enables transformers to efficiently process long documents without examining every token.
arXiv Detail & Related papers (2024-05-21T21:41:07Z)
LongAlign: A Recipe for Long Context Alignment of Large Language Models [61.85923382850057]
LongAlign is a recipe of the instruction data, training, and evaluation for long context alignment. We construct a long instruction-following dataset using Self-Instruct. We adopt the packing and sorted strategies to speed up supervised fine-tuning on data with varied length distributions.
arXiv Detail & Related papers (2024-01-31T18:29:39Z)
E^2-LLM: Efficient and Extreme Length Extension of Large Language Models [74.1254067728251]
We propose an Efficient and Extreme length extension method for Large Language Models, called E 2 -LLM, with only one training procedure and dramatically reduced cost. Comprehensive experimental results on multiple benchmark datasets demonstrate the effectiveness of our E 2 -LLM on challenging long-context tasks.
arXiv Detail & Related papers (2024-01-13T02:11:20Z)
Effective Long-Context Scaling of Foundation Models [90.57254298730923]
We present a series of long-context LLMs that support effective context windows of up to 32,768 tokens. Our models achieve consistent improvements on most regular tasks and significant improvements on long-context tasks over Llama 2.
arXiv Detail & Related papers (2023-09-27T21:41:49Z)
DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models [34.74093040678323]
We introduce DeepSpeed-Ulysses, a novel, portable and effective methodology for enabling highly efficient and scalable LLM training. DeepSpeed-Ulysses at its core partitions input data along the sequence dimension and employs an efficient all-to-all collective communication for attention. Experiments show that DeepSpeed-Ulysses trains 2.5x faster with 4x longer sequence length than the existing method SOTA baseline.
arXiv Detail & Related papers (2023-09-25T20:15:57Z)
PoSE: Efficient Context Window Extension of LLMs via Positional Skip-wise Training [91.99700930388998]
We propose Positional Skip-wisE training that simulates long inputs using a fixed context window. PoSE greatly reduces memory and time overhead compared with Full-length fine-tuning. We have successfully extended the LLaMA model to 128k tokens using a 2k training context window.
arXiv Detail & Related papers (2023-09-19T08:03:38Z)
Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data. We design a simple but effective ensemble-based framework that combines various transfer learning techniques. We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
Adapting Pretrained Text-to-Text Models for Long Text Sequences [39.62224414485055]
We adapt an existing pretrained text-to-text model for long-sequence inputs. We build a long-context model that achieves competitive performance on long-text QA tasks.
arXiv Detail & Related papers (2022-09-21T00:41:07Z)
Sequence Length is a Domain: Length-based Overfitting in Transformer Models [0.0]
In machine translation, the neural-based systems perform worse on very long sequences when compared to the preceding phrase-based translation approaches. We show that the observed drop in performance is due to the hypothesis length corresponding to the lengths seen by the model during training rather than the length of the input sequence.
arXiv Detail & Related papers (2021-09-15T13:25:19Z)

This list is automatically generated from the titles and abstracts of the papers in this site.