Data Engineering for Scaling Language Models to 128K Context
- URL: http://arxiv.org/abs/2402.10171v1
- Date: Thu, 15 Feb 2024 18:19:16 GMT
- Title: Data Engineering for Scaling Language Models to 128K Context
- Authors: Yao Fu, Rameswar Panda, Xinyao Niu, Xiang Yue, Hannaneh Hajishirzi,
Yoon Kim and Hao Peng
- Abstract summary: We study the continual pretraining recipe for scaling language models' context lengths to 128K.
We find that naively upsampling longer data in certain domains like books, a common practice of existing work, gives suboptimal performance.
Our recipe outperforms strong open-source long-context models and closes the gap to frontier models like GPT-4 128K.
- Score: 98.41554785106902
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study the continual pretraining recipe for scaling language models'
context lengths to 128K, with a focus on data engineering. We hypothesize that
long context modeling, in particular the ability to utilize information at
arbitrary input locations, is a capability that is mostly already acquired
through large-scale pretraining, and that this capability can be readily
extended to contexts substantially longer than seen during training (e.g., 4K
to 128K) through lightweight continual pretraining on an appropriate data mixture.
We investigate the quantity and quality of the data for continual pretraining:
(1) for quantity, we show that 500 million to 5 billion tokens are enough to
enable the model to retrieve information anywhere within the 128K context;
(2) for quality, our results equally emphasize domain balance and length
upsampling. Concretely, we find that naively upsampling longer data in certain
domains like books, a common practice of existing work, gives suboptimal
performance, and that a balanced domain mixture is important. We demonstrate
that continual pretraining of the full model on 1B-5B tokens of such data is an
effective and affordable strategy for scaling the context length of language
models to 128K. Our recipe outperforms strong open-source long-context models
and closes the gap to frontier models like GPT-4 128K.
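To make the quantity/quality recipe concrete, here is a minimal Python sketch of a data-mixture builder that keeps the original per-domain token ratios while upsampling long documents within each domain. The document format, thresholds, upsampling factor, and function names are illustrative assumptions, not values taken from the paper.

```python
import random
from collections import defaultdict

def build_long_context_mixture(docs, domain_weights, long_threshold=32_000,
                               long_upsample=4.0, target_tokens=1_000_000_000):
    """Sketch: sample a continual-pretraining mixture that preserves the
    original per-domain token ratios but upsamples long documents *within*
    each domain (illustrative parameters, not the paper's exact settings).

    docs: list of dicts like {"domain": "books", "tokens": [...]}
    domain_weights: e.g. {"web": 0.67, "books": 0.05, ...} (original ratios)
    """
    by_domain = defaultdict(list)
    for d in docs:
        by_domain[d["domain"]].append(d)

    mixture = []
    for domain, weight in domain_weights.items():
        budget = int(weight * target_tokens)      # keep the domain balance fixed
        pool = by_domain[domain]
        # within-domain sampling weights: long documents are drawn more often
        w = [long_upsample if len(d["tokens"]) >= long_threshold else 1.0
             for d in pool]
        taken = 0
        while taken < budget and pool:
            doc = random.choices(pool, weights=w, k=1)[0]
            mixture.append(doc)
            taken += len(doc["tokens"])
    random.shuffle(mixture)
    return mixture
```

The point of this design, per the abstract, is that length upsampling happens inside each domain, so the overall domain mixture stays balanced rather than drifting toward long-document sources such as books.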
Related papers
- InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning [58.7966588457529]
InfiMM-WebMath-40B is a high-quality dataset of interleaved image-text documents.
It comprises 24 million web pages, 85 million associated image URLs, and 40 billion text tokens, all meticulously extracted and filtered from CommonCrawl.
Our evaluations on text-only benchmarks show that, despite utilizing only 40 billion tokens, our dataset significantly enhances the performance of our 1.3B model.
Our models set a new state-of-the-art among open-source models on multi-modal math benchmarks such as MathVerse and We-Math.
arXiv Detail & Related papers (2024-09-19T08:41:21Z) - Untie the Knots: An Efficient Data Augmentation Strategy for Long-Context Pre-Training in Language Models [21.90388980448712]
Training models to handle long contexts presents significant challenges.
We introduce Untie the Knots (UtK), a novel data augmentation strategy employed during the continued pre-training phase.
We conduct extensive experiments on models with 7B and 72B parameters, trained on 20 billion tokens, demonstrating that UtK achieves 75% and 84.5% accuracy on RULER at 128K context length.
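As a rough illustration of what such an augmentation might look like, the sketch below chunks several documents, shuffles the chunks, and packs them into one long training sequence so the model must track dependencies across distant, out-of-order spans. This is a hedged reading of the abstract, not the authors' exact procedure, and all sizes are illustrative.

```python
import random

def untie_the_knots_sequence(documents, chunk_tokens=512, seq_len=131_072):
    """Hedged sketch: build one long training sequence by chunking several
    documents and interleaving the chunks in shuffled order.
    `documents` is a list of token-id lists; sizes are illustrative."""
    chunks = []
    for doc_id, toks in enumerate(documents):
        for start in range(0, len(toks), chunk_tokens):
            chunks.append((doc_id, toks[start:start + chunk_tokens]))
    random.shuffle(chunks)                      # "tie the knots"

    sequence = []
    for doc_id, chunk in chunks:
        if len(sequence) + len(chunk) > seq_len:
            break
        sequence.extend(chunk)                  # the model learns to "untie" them
    return sequence
```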
arXiv Detail & Related papers (2024-09-07T09:28:55Z) - Yi: Open Foundation Models by 01.AI [42.94680878285869]
The Yi model family is based on 6B and 34B pretrained language models, which we extend to chat models, 200K long-context models, depth-upscaled models, and vision-language models.
Our base models achieve strong performance on a wide range of benchmarks like MMLU, and our finetuned chat models deliver strong human preference rates on major evaluation platforms like AlpacaEval and Arena.
arXiv Detail & Related papers (2024-03-07T16:52:49Z) - Training-Free Long-Context Scaling of Large Language Models [114.53296002607993]
We propose Dual Chunk Attention (DCA), which enables Llama2 70B to support context windows of more than 100k tokens without continual training.
By decomposing the attention for long sequences into chunk-based modules, DCA manages to effectively capture the relative positional information of tokens.
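A hedged sketch of the underlying chunk-based position idea follows: effective relative distances are kept inside the pretrained window, so the frozen position encoding never sees out-of-distribution offsets. This is a simplification of DCA (which uses a three-way intra-/inter-/successive-chunk decomposition); the function name and sizes are illustrative.

```python
import numpy as np

def chunked_relative_positions(seq_len, chunk_size=4096, trained_window=4096):
    """Hedged sketch of chunk-based relative positions (simplified DCA).
    Causal masking is assumed to be applied separately; use small seq_len
    for illustration since this builds a dense seq_len x seq_len matrix."""
    pos = np.arange(seq_len)
    q, k = pos[:, None], pos[None, :]
    raw = q - k                                   # true relative distance
    same_chunk = (q // chunk_size) == (k // chunk_size)
    # intra-chunk: keep the true distance (already < chunk_size)
    # inter-chunk: clamp so no distance exceeds what was seen during training
    return np.where(same_chunk, raw, np.clip(raw, 0, trained_window - 1))

# e.g. chunked_relative_positions(16, chunk_size=4, trained_window=4)
```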
arXiv Detail & Related papers (2024-02-27T12:39:23Z) - Delving Deeper into Data Scaling in Masked Image Modeling [145.36501330782357]
We conduct an empirical study on the scaling capability of masked image modeling (MIM) methods for visual recognition.
Specifically, we utilize the web-collected Coyo-700M dataset.
Our goal is to investigate how the performance changes on downstream tasks when scaling with different sizes of data and models.
arXiv Detail & Related papers (2023-05-24T15:33:46Z) - Pre-Training to Learn in Context [138.0745138788142]
The in-context learning ability of language models is not fully exploited because they are not explicitly trained to learn in context.
We propose PICL (Pre-training for In-Context Learning), a framework to enhance language models' in-context learning ability.
Our experiments show that PICL is more effective and task-generalizable than a range of baselines, outperforming larger language models with nearly 4x the parameters.
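A minimal sketch of how such in-context pretraining instances could be assembled is shown below, assuming paragraphs have already been grouped by a retriever into shared intrinsic tasks; the function name, grouping format, and sizes are illustrative, not the paper's.

```python
import random

def build_picl_instances(task_groups, k_demos=8, sep="\n\n"):
    """Hedged sketch of PICL-style data construction: `task_groups` maps an
    (automatically retrieved) intrinsic-task id to paragraphs judged to share
    that task. Each training instance concatenates several such paragraphs so
    that earlier ones act as in-context demonstrations for the last, and the
    ordinary language-modeling loss is applied to the whole instance."""
    instances = []
    for task_id, paras in task_groups.items():
        random.shuffle(paras)
        for i in range(0, len(paras) - k_demos, k_demos + 1):
            instances.append(sep.join(paras[i:i + k_demos + 1]))
    return instances
```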
arXiv Detail & Related papers (2023-05-16T03:38:06Z) - Efficient Speech Translation with Pre-trained Models [13.107314023500349]
We investigate efficient strategies to build cascaded and end-to-end speech translation systems based on pre-trained models.
While the end-to-end models show superior translation performance to cascaded ones, the application of this technology is limited by the need for additional end-to-end training data.
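For illustration, a cascaded system can be assembled from off-the-shelf pre-trained models: an ASR model transcribes the audio and an MT model translates the transcript. The sketch below uses Hugging Face transformers pipelines; the specific checkpoints are illustrative choices, not the ones used in the paper.

```python
from transformers import pipeline

# Cascaded speech translation sketch: speech -> source text -> target text.
# Checkpoints are illustrative, not the paper's.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
mt = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")

def cascaded_speech_translation(audio_path: str) -> str:
    transcript = asr(audio_path)["text"]            # transcribe the audio
    return mt(transcript)[0]["translation_text"]    # translate the transcript

# print(cascaded_speech_translation("talk.wav"))
```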
arXiv Detail & Related papers (2022-11-09T15:07:06Z) - BERTIN: Efficient Pre-Training of a Spanish Language Model using Perplexity Sampling [0.0]
Common Crawl might contain enough noise to make pre-training on it sub-optimal.
We present a novel data-centric technique which enables the pre-training of language models in roughly half the amount of steps.
Our work is proof of the versatility of Transformers, and paves the way for small teams to train their models on a limited budget.
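A minimal sketch of perplexity sampling under stated assumptions: documents are weighted by how close their perplexity (computed with a small reference language model) lies to the corpus median, so that both very repetitive boilerplate and very noisy text are downweighted. The weighting function, parameters, and names are illustrative, not the paper's exact scheme.

```python
import math
import random

def perplexity_sample(docs, ppl, keep_fraction=0.5):
    """Hedged sketch of perplexity sampling: weight each document by a
    Gaussian centred on the median perplexity, with spread set from the
    interquartile range, then sample (with replacement) a subset to keep.
    `ppl[i]` is the reference-model perplexity of `docs[i]`."""
    sorted_ppl = sorted(ppl)
    median = sorted_ppl[len(sorted_ppl) // 2]
    spread = (sorted_ppl[3 * len(sorted_ppl) // 4]
              - sorted_ppl[len(sorted_ppl) // 4])
    weights = [math.exp(-((p - median) / (spread + 1e-9)) ** 2) for p in ppl]
    k = int(keep_fraction * len(docs))
    return random.choices(docs, weights=weights, k=k)
```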
arXiv Detail & Related papers (2022-07-14T10:48:42Z) - Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary, typically selected before training and permanently fixed afterwards, affects its size.
We propose a fully compositional output embedding layer for language models.
To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
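A hedged sketch of a vocabulary-size-independent output layer follows, composing each word's output embedding from hashed character n-grams so the parameter count depends on the number of hash buckets rather than on the training vocabulary. The paper additionally grounds embeddings in lexical resources, which is omitted here; all names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class CompositionalOutputEmbedding(nn.Module):
    """Sketch: output word embeddings composed on the fly from hashed
    character n-gram embeddings (a stable hash would be used in practice)."""

    def __init__(self, dim=512, n_buckets=100_000, n=3):
        super().__init__()
        self.ngram_emb = nn.Embedding(n_buckets, dim)
        self.n_buckets, self.n = n_buckets, n

    def word_embedding(self, word: str) -> torch.Tensor:
        padded = f"<{word}>"
        ids = [hash(padded[i:i + self.n]) % self.n_buckets
               for i in range(len(padded) - self.n + 1)]
        return self.ngram_emb(torch.tensor(ids)).mean(dim=0)

    def logits(self, hidden: torch.Tensor, candidate_words: list[str]) -> torch.Tensor:
        # score hidden states against the composed candidate embeddings
        W = torch.stack([self.word_embedding(w) for w in candidate_words])
        return hidden @ W.T
```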
arXiv Detail & Related papers (2020-09-24T07:21:14Z)