Beyond Length: Quantifying Long-Range Information for Long-Context LLM Pretraining Data
- URL: http://arxiv.org/abs/2510.25804v1
- Date: Wed, 29 Oct 2025 06:21:08 GMT
- Title: Beyond Length: Quantifying Long-Range Information for Long-Context LLM Pretraining Data
- Authors: Haoran Deng, Yingyu Lin, Zhenghao Lin, Xiao Liu, Yizhou Sun, Yi-An Ma, Yeyun Gong,
- Abstract summary: We introduce LongFilter, a framework for curating training data tailored to long-context pretraining. LongFilter measures the information gain provided by extended context by contrasting model predictions under long-context versus short-context settings. Experiments with LLaMA-3-8B, extending its context length from 8K to 64K, show that LongFilter efficiently selects high-quality data and yields substantial improvements on benchmarks such as HELMET, LongBench, and RULER.
- Score: 67.46386646195818
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Long-context language models unlock advanced capabilities in reasoning, code generation, and document summarization by leveraging dependencies across extended spans of text. However, a significant portion of readily available long-text data lacks meaningful long-distance dependencies; most spans can be predicted using only local context. Training on such data is inefficient, making careful data selection crucial. Therefore, we introduce LongFilter, a framework for curating training data tailored to long-context pretraining. LongFilter measures the information gain provided by extended context by contrasting model predictions under long-context versus short-context settings, thereby identifying samples where long-range dependencies are essential. Experiments with LLaMA-3-8B, extending its context length from 8K to 64K, show that LongFilter efficiently selects high-quality data and yields substantial improvements on benchmarks such as HELMET, LongBench, and RULER.
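The abstract does not spell out the scoring rule, so the following is only a minimal sketch of the long- versus short-context contrast it describes, assuming the gain is measured as the drop in mean negative log-likelihood on a document's final tokens once the full preceding text is visible; the scoring model, window sizes, function names, and threshold are illustrative, not the authors' implementation.

```python
# Sketch only: contrast predictions under long vs. short context and keep documents
# where the extra context measurably helps. All names and hyperparameters are
# illustrative assumptions, not LongFilter's actual implementation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B"  # any causal LM can serve as the scoring model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")
model.eval()

@torch.no_grad()
def mean_nll_on_tail(ids: torch.Tensor, tail: int) -> float:
    """Mean negative log-likelihood of the last `tail` tokens, given everything before them."""
    out = model(ids.unsqueeze(0).to(model.device))
    logits = out.logits[0, :-1]                       # position t predicts token t+1
    targets = ids[1:].to(logits.device)
    nll = torch.nn.functional.cross_entropy(logits, targets, reduction="none")
    return nll[-tail:].mean().item()

def long_context_gain(text: str, short_ctx: int = 2048, tail: int = 512) -> float:
    """Information gain of extended context: NLL(short context) - NLL(long context) on the same tail."""
    ids = tok(text, return_tensors="pt").input_ids[0]
    if ids.numel() < short_ctx + tail:
        return 0.0                                    # too short for a meaningful contrast
    nll_long = mean_nll_on_tail(ids, tail)                          # full document as context
    nll_short = mean_nll_on_tail(ids[-(short_ctx + tail):], tail)   # only local context
    return nll_short - nll_long                       # large gain => long-range information matters

# Selection step (threshold is an assumption): keep documents whose tail becomes
# noticeably easier to predict once the long context is visible.
# selected = [doc for doc in corpus if long_context_gain(doc) > 0.1]
```

In this reading, documents whose tail loss does not drop when the long context is available would be filtered out before long-context pretraining.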
Related papers
- LongAttn: Selecting Long-context Training Data via Token-level Attention [16.30530770590871]
LongAttn is a token-level framework for measuring the long-range dependencies of training data.
We filter LongABC-32K from open-source long-context datasets (ArXiv, Book, and Code). A toy sketch of an attention-based dependency score appears after this entry.
arXiv Detail & Related papers (2025-02-24T05:51:53Z)
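The one-line summary above does not give LongAttn's exact token-level statistic; the sketch below assumes one plausible attention-based proxy, namely the share of attention mass each token places beyond a local window, and the model choice, window size, and aggregation are all illustrative rather than taken from the paper.

```python
# Sketch of an attention-based long-range dependency signal (assumed proxy, not
# LongAttn's published statistic): how much attention mass tokens spend on
# positions further back than a local window.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")           # small stand-in scoring model
model = AutoModelForCausalLM.from_pretrained("gpt2", attn_implementation="eager")
model.eval()

@torch.no_grad()
def long_range_attention_score(text: str, local_window: int = 128) -> float:
    """Average fraction of attention mass placed on tokens more than `local_window` positions back."""
    ids = tok(text, return_tensors="pt", truncation=True, max_length=1024).input_ids
    out = model(ids, output_attentions=True)          # tuple of (batch, heads, seq, seq), one per layer
    attn = torch.stack(out.attentions).mean(dim=(0, 2))[0]   # average over layers and heads -> (seq, seq)
    seq_len = attn.size(0)
    pos = torch.arange(seq_len)
    dist = pos.unsqueeze(1) - pos.unsqueeze(0)        # dist[i, j] = i - j under causal attention
    long_mass = (attn * (dist > local_window)).sum(dim=-1)   # per-token mass on distant positions
    valid = pos > local_window                        # only tokens that actually have distant history
    return long_mass[valid].mean().item() if valid.any() else 0.0

# Documents (or tokens) could then be ranked by this score and the top fraction kept.
```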
- NExtLong: Toward Effective Long-Context Training without Long Documents [28.002824369635768]
We propose NExtLong, a novel framework for synthesizing long-context data through Negative document Extension.
NExtLong decomposes a document into multiple meta-chunks and extends the context by interleaving hard negative distractors retrieved from pretraining corpora.
Extensive experiments demonstrate that NExtLong achieves significant performance improvements compared to existing long-context synthesis approaches. A toy sketch of the interleaving step appears after this entry.
arXiv Detail & Related papers (2025-01-22T10:01:54Z)
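A toy sketch of the meta-chunk interleaving described in the NExtLong entry above; the chunk size, the lexical-overlap retriever standing in for a real hard-negative retriever, and all names are illustrative assumptions rather than the paper's implementation.

```python
# Sketch only: split a document into meta-chunks and interleave retrieved
# "hard negative" distractors between them to synthesize a longer document.
from typing import List

def split_into_meta_chunks(doc: str, chunk_words: int = 256) -> List[str]:
    """Split a document into fixed-size word chunks (granularity is an assumption)."""
    words = doc.split()
    return [" ".join(words[i:i + chunk_words]) for i in range(0, len(words), chunk_words)]

def lexical_overlap(a: str, b: str) -> float:
    """Toy stand-in for a real retriever: Jaccard overlap of word sets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(1, len(wa | wb))

def retrieve_hard_negatives(chunk: str, corpus: List[str], k: int = 2) -> List[str]:
    """Pick the corpus passages most similar to the chunk, a crude proxy for hard negatives."""
    ranked = sorted(corpus, key=lambda passage: lexical_overlap(chunk, passage), reverse=True)
    return ranked[:k]

def extend_with_negatives(doc: str, corpus: List[str], k: int = 2) -> str:
    """Interleave each meta-chunk with distractors so dependencies between chunks become long-range."""
    pieces: List[str] = []
    for chunk in split_into_meta_chunks(doc):
        pieces.append(chunk)
        pieces.extend(retrieve_hard_negatives(chunk, corpus, k))
    return "\n\n".join(pieces)
```

The point of the interleaving is that the model must learn to look past the distractors and back to earlier meta-chunks, which turns an ordinary document into long-dependency training data.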
- LIFT: Improving Long Context Understanding Through Long Input Fine-Tuning [35.31849814789343]
This paper introduces Long Input Fine-Tuning (LIFT) for long context modeling.
LIFT enables efficient processing of lengthy inputs without the computational burden of offline long-context adaptation.
The framework is further enhanced by integrating in-context learning and pre-LIFT supervised fine-tuning. A minimal sketch of the fine-tune-on-the-input idea appears after this entry.
arXiv Detail & Related papers (2024-12-18T09:04:55Z)
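The summary suggests that LIFT adapts the model on the long input itself rather than relying on an extended context window; the sketch below is a heavily simplified reading of that idea, with chunking, optimizer settings, and names chosen for illustration, and it omits the in-context learning integration and pre-LIFT supervised fine-tuning mentioned in the entry.

```python
# Sketch only: absorb a long document into the weights by running a few
# language-modeling steps on sliding windows of it, then query the adapted model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")           # small stand-in; any causal LM works
model = AutoModelForCausalLM.from_pretrained("gpt2")

def lift_adapt(long_text: str, window: int = 512, stride: int = 256,
               lr: float = 5e-5, epochs: int = 1) -> None:
    """Fine-tune on overlapping windows of the long input (all hyperparameters are assumptions)."""
    ids = tok(long_text, return_tensors="pt").input_ids[0]
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for start in range(0, max(1, ids.numel() - window), stride):
            chunk = ids[start:start + window].unsqueeze(0)
            loss = model(chunk, labels=chunk).loss    # standard causal-LM loss on this window
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    model.eval()

# After adaptation, a question about the document can be answered within the
# model's native context window instead of feeding the entire document at once.
```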
- How to Train Long-Context Language Models (Effectively) [75.5418485597276]
We study continued training and supervised fine-tuning (SFT) of a language model (LM) to make effective use of long-context information.
We find that code repositories and books are excellent sources of long data, but it is crucial to combine them with high-quality short-context data.
Our final model, ProLong-8B, demonstrates state-of-the-art long-context performance among similarly sized models at a length of 128K.
arXiv Detail & Related papers (2024-10-03T16:46:52Z)
- LongSkywork: A Training Recipe for Efficiently Extending Context Length in Large Language Models [61.12177317970258]
LongSkywork is a long-context Large Language Model capable of processing up to 200,000 tokens.
We develop two novel methods for creating synthetic data.
LongSkywork achieves outstanding performance on a variety of long-context benchmarks.
arXiv Detail & Related papers (2024-06-02T03:34:41Z)
- Long Context is Not Long at All: A Prospector of Long-Dependency Data for Large Language Models [13.091271774417867]
Long-context modeling capabilities are important for large language models (LLMs) in various applications.
We propose a data mining framework, ProLong, that can assign each training sample a long dependency score.
Comprehensive experiments on multiple benchmarks indicate that ProLong effectively identifies documents that carry long dependencies.
arXiv Detail & Related papers (2024-05-28T07:36:56Z)
- Long Context Alignment with Short Instructions and Synthesized Positions [56.1267385315404]
This paper introduces Step-Skipping Alignment (SkipAlign), a new technique designed to enhance the long-context capabilities of Large Language Models (LLMs).
With a careful selection of the base model and alignment datasets, SkipAlign with only 6B parameters achieves its best performance, comparable to strong baselines such as GPT-3.5-Turbo-16K on LongBench.
arXiv Detail & Related papers (2024-05-07T01:56:22Z)
- LongAlign: A Recipe for Long Context Alignment of Large Language Models [61.85923382850057]
LongAlign is a recipe for the instruction data, training, and evaluation needed for long context alignment.
We construct a long instruction-following dataset using Self-Instruct.
We adopt packing and sorted batching strategies to speed up supervised fine-tuning on data with varied length distributions. A toy sketch of the packing step appears after this entry.
arXiv Detail & Related papers (2024-01-31T18:29:39Z)
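A toy sketch of the sorted-then-packed batching mentioned in the LongAlign entry above: sort examples by length and greedily pack them into fixed-size sequences so little compute is wasted on padding. The greedy policy and names are illustrative assumptions; the paper's full recipe, including how loss is balanced across packed sequences, is not reproduced here.

```python
# Sketch only: length-sorted greedy packing of variable-length examples into
# fixed-size training sequences.
from typing import List

def pack_sorted(example_lengths: List[int], max_len: int = 8192) -> List[List[int]]:
    """Return groups of example indices whose total token count fits within max_len."""
    order = sorted(range(len(example_lengths)), key=lambda i: example_lengths[i])  # sorted batching
    packs: List[List[int]] = []
    current: List[int] = []
    used = 0
    for idx in order:
        length = min(example_lengths[idx], max_len)   # assumes over-long examples were truncated upstream
        if current and used + length > max_len:
            packs.append(current)                     # close the current pack and start a new one
            current, used = [], 0
        current.append(idx)
        used += length
    if current:
        packs.append(current)
    return packs

# Example: token lengths -> index groups, each fitting in one 8K sequence.
print(pack_sorted([1200, 300, 7000, 2500, 4000], max_len=8192))  # [[1, 0, 3, 4], [2]]
```

Within a pack, examples are typically concatenated and attention is masked so tokens cannot attend across example boundaries.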
- Effective Long-Context Scaling of Foundation Models [90.57254298730923]
We present a series of long-context LLMs that support effective context windows of up to 32,768 tokens.
Our models achieve consistent improvements on most regular tasks and significant improvements on long-context tasks over Llama 2.
arXiv Detail & Related papers (2023-09-27T21:41:49Z)