Beyond Length: Quantifying Long-Range Information for Long-Context LLM Pretraining Data
- URL: http://arxiv.org/abs/2510.25804v1
- Date: Wed, 29 Oct 2025 06:21:08 GMT
- Title: Beyond Length: Quantifying Long-Range Information for Long-Context LLM Pretraining Data
- Authors: Haoran Deng, Yingyu Lin, Zhenghao Lin, Xiao Liu, Yizhou Sun, Yi-An Ma, Yeyun Gong,
- Abstract summary: We introduce LongFilter, a framework for curating training data tailored to long-context pretraining. LongFilter measures the information gain provided by extended context by contrasting model predictions under long-context versus short-context settings. Experiments with LLaMA-3-8B, extending its context length from 8K to 64K, show that LongFilter efficiently selects high-quality data and yields substantial improvements on benchmarks such as HELMET, LongBench, and RULER.
- Score: 67.46386646195818
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Long-context language models unlock advanced capabilities in reasoning, code generation, and document summarization by leveraging dependencies across extended spans of text. However, a significant portion of readily available long-text data lacks meaningful long-distance dependencies; most spans can be predicted using only local context. Training on such data is inefficient, making careful data selection crucial. Therefore, we introduce LongFilter, a framework for curating training data tailored to long-context pretraining. LongFilter measures the information gain provided by extended context by contrasting model predictions under long-context versus short-context settings, thereby identifying samples where long-range dependencies are essential. Experiments with LLaMA-3-8B, extending its context length from 8K to 64K, show that LongFilter efficiently selects high-quality data and yields substantial improvements on benchmarks such as HELMET, LongBench, and RULER.
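The abstract does not spell out the scoring rule, so the following is only a minimal sketch of the long- versus short-context contrast it describes, assuming the gain is measured as the drop in mean negative log-likelihood on a document's final tokens once the full preceding text is visible; the scoring model, window sizes, function names, and threshold are illustrative, not the authors' implementation.

```python
# Sketch only: contrast predictions under long vs. short context and keep documents
# where the extra context measurably helps. All names and hyperparameters are
# illustrative assumptions, not LongFilter's actual implementation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B"  # any causal LM can serve as the scoring model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")
model.eval()

@torch.no_grad()
def mean_nll_on_tail(ids: torch.Tensor, tail: int) -> float:
    """Mean negative log-likelihood of the last `tail` tokens, given everything before them."""
    out = model(ids.unsqueeze(0).to(model.device))
    logits = out.logits[0, :-1]                       # position t predicts token t+1
    targets = ids[1:].to(logits.device)
    nll = torch.nn.functional.cross_entropy(logits, targets, reduction="none")
    return nll[-tail:].mean().item()

def long_context_gain(text: str, short_ctx: int = 2048, tail: int = 512) -> float:
    """Information gain of extended context: NLL(short context) - NLL(long context) on the same tail."""
    ids = tok(text, return_tensors="pt").input_ids[0]
    if ids.numel() < short_ctx + tail:
        return 0.0                                    # too short for a meaningful contrast
    nll_long = mean_nll_on_tail(ids, tail)                          # full document as context
    nll_short = mean_nll_on_tail(ids[-(short_ctx + tail):], tail)   # only local context
    return nll_short - nll_long                       # large gain => long-range information matters

# Selection step (threshold is an assumption): keep documents whose tail becomes
# noticeably easier to predict once the long context is visible.
# selected = [doc for doc in corpus if long_context_gain(doc) > 0.1]
```

In this reading, documents whose tail loss does not drop when the long context is available would be filtered out before long-context pretraining.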
Related papers
- LongAttn: Selecting Long-context Training Data via Token-level Attention [16.30530770590871]
LongAttn is a token-level framework for measuring the long-range dependencies of training data.
We filter LongABC-32K from open-source long-context datasets (ArXiv, Book, and Code). A toy sketch of an attention-based dependency score appears after this entry.
arXiv Detail & Related papers (2025-02-24T05:51:53Z)
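The one-line summary above does not give LongAttn's exact token-level statistic; the sketch below assumes one plausible attention-based proxy, namely the share of attention mass each token places beyond a local window, and the model choice, window size, and aggregation are all illustrative rather than taken from the paper.

```python
# Sketch of an attention-based long-range dependency signal (assumed proxy, not
# LongAttn's published statistic): how much attention mass tokens spend on
# positions further back than a local window.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")           # small stand-in scoring model
model = AutoModelForCausalLM.from_pretrained("gpt2", attn_implementation="eager")
model.eval()

@torch.no_grad()
def long_range_attention_score(text: str, local_window: int = 128) -> float:
    """Average fraction of attention mass placed on tokens more than `local_window` positions back."""
    ids = tok(text, return_tensors="pt", truncation=True, max_length=1024).input_ids
    out = model(ids, output_attentions=True)          # tuple of (batch, heads, seq, seq), one per layer
    attn = torch.stack(out.attentions).mean(dim=(0, 2))[0]   # average over layers and heads -> (seq, seq)
    seq_len = attn.size(0)
    pos = torch.arange(seq_len)
    dist = pos.unsqueeze(1) - pos.unsqueeze(0)        # dist[i, j] = i - j under causal attention
    long_mass = (attn * (dist > local_window)).sum(dim=-1)   # per-token mass on distant positions
    valid = pos > local_window                        # only tokens that actually have distant history
    return long_mass[valid].mean().item() if valid.any() else 0.0

# Documents (or tokens) could then be ranked by this score and the top fraction kept.
```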
- NExtLong: Toward Effective Long-Context Training without Long Documents [28.002824369635768]
We propose NExtLong, a novel framework for synthesizing long-context data through Negative document Extension.
NExtLong decomposes a document into multiple meta-chunks and extends the context by interleaving hard negative distractors retrieved from pretraining corpora.
Extensive experiments demonstrate that NExtLong achieves significant performance improvements compared to existing long-context synthesis approaches. A toy sketch of the interleaving step appears after this entry.
arXiv Detail & Related papers (2025-01-22T10:01:54Z)
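A toy sketch of the meta-chunk interleaving described in the NExtLong entry above; the chunk size, the lexical-overlap retriever standing in for a real hard-negative retriever, and all names are illustrative assumptions rather than the paper's implementation.

```python
# Sketch only: split a document into meta-chunks and interleave retrieved
# "hard negative" distractors between them to synthesize a longer document.
from typing import List

def split_into_meta_chunks(doc: str, chunk_words: int = 256) -> List[str]:
    """Split a document into fixed-size word chunks (granularity is an assumption)."""
    words = doc.split()
    return [" ".join(words[i:i + chunk_words]) for i in range(0, len(words), chunk_words)]

def lexical_overlap(a: str, b: str) -> float:
    """Toy stand-in for a real retriever: Jaccard overlap of word sets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(1, len(wa | wb))

def retrieve_hard_negatives(chunk: str, corpus: List[str], k: int = 2) -> List[str]:
    """Pick the corpus passages most similar to the chunk, a crude proxy for hard negatives."""
    ranked = sorted(corpus, key=lambda passage: lexical_overlap(chunk, passage), reverse=True)
    return ranked[:k]

def extend_with_negatives(doc: str, corpus: List[str], k: int = 2) -> str:
    """Interleave each meta-chunk with distractors so dependencies between chunks become long-range."""
    pieces: List[str] = []
    for chunk in split_into_meta_chunks(doc):
        pieces.append(chunk)
        pieces.extend(retrieve_hard_negatives(chunk, corpus, k))
    return "\n\n".join(pieces)
```

The point of the interleaving is that the model must learn to look past the distractors and back to earlier meta-chunks, which turns an ordinary document into long-dependency training data.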
- LIFT: Improving Long Context Understanding Through Long Input Fine-Tuning [35.31849814789343]
This paper introduces Long Input Fine-Tuning (LIFT) for long context modeling.
LIFT enables efficient processing of lengthy inputs without the computational burden of offline long-context adaptation.
The framework is further enhanced by integrating in-context learning and pre-LIFT supervised fine-tuning. A minimal sketch of the fine-tune-on-the-input idea appears after this entry.
arXiv Detail & Related papers (2024-12-18T09:04:55Z)
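The summary suggests that LIFT adapts the model on the long input itself rather than relying on an extended context window; the sketch below is a heavily simplified reading of that idea, with chunking, optimizer settings, and names chosen for illustration, and it omits the in-context learning integration and pre-LIFT supervised fine-tuning mentioned in the entry.

```python
# Sketch only: absorb a long document into the weights by running a few
# language-modeling steps on sliding windows of it, then query the adapted model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")           # small stand-in; any causal LM works
model = AutoModelForCausalLM.from_pretrained("gpt2")

def lift_adapt(long_text: str, window: int = 512, stride: int = 256,
               lr: float = 5e-5, epochs: int = 1) -> None:
    """Fine-tune on overlapping windows of the long input (all hyperparameters are assumptions)."""
    ids = tok(long_text, return_tensors="pt").input_ids[0]
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for start in range(0, max(1, ids.numel() - window), stride):
            chunk = ids[start:start + window].unsqueeze(0)
            loss = model(chunk, labels=chunk).loss    # standard causal-LM loss on this window
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    model.eval()

# After adaptation, a question about the document can be answered within the
# model's native context window instead of feeding the entire document at once.
```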
- How to Train Long-Context Language Models (Effectively) [75.5418485597276]
We study continued training and supervised fine-tuning (SFT) of a language model (LM) to make effective use of long-context information.
We find that code repositories and books are excellent sources of long data, but it is crucial to combine them with high-quality short-context data.
Our final model, ProLong-8B, demonstrates state-of-the-art long-context performance among similarly sized models at a length of 128K.
arXiv Detail & Related papers (2024-10-03T16:46:52Z)
- LongSkywork: A Training Recipe for Efficiently Extending Context Length in Large Language Models [61.12177317970258]
LongSkywork is a long-context Large Language Model capable of processing up to 200,000 tokens.
We develop two novel methods for creating synthetic data.
LongSkywork achieves outstanding performance on a variety of long-context benchmarks.
arXiv Detail & Related papers (2024-06-02T03:34:41Z)
- Long Context is Not Long at All: A Prospector of Long-Dependency Data for Large Language Models [13.091271774417867]
Long-context modeling capabilities are important for large language models (LLMs) in various applications.
We propose a data mining framework, ProLong, that can assign each training sample a long dependency score.
Comprehensive experiments on multiple benchmarks indicate that ProLong effectively identifies documents that carry long dependencies.
arXiv Detail & Related papers (2024-05-28T07:36:56Z)
- Long Context Alignment with Short Instructions and Synthesized Positions [56.1267385315404]
This paper introduces Step-Skipping Alignment (SkipAlign), a new technique designed to enhance the long-context capabilities of Large Language Models (LLMs).
With a careful selection of the base model and alignment datasets, SkipAlign with only 6B parameters achieves its best performance, comparable to strong baselines such as GPT-3.5-Turbo-16K on LongBench.
arXiv Detail & Related papers (2024-05-07T01:56:22Z)
- LongAlign: A Recipe for Long Context Alignment of Large Language Models [61.85923382850057]
LongAlign is a recipe for the instruction data, training, and evaluation needed for long context alignment.
We construct a long instruction-following dataset using Self-Instruct.
We adopt packing and sorted batching strategies to speed up supervised fine-tuning on data with varied length distributions. A toy sketch of the packing step appears after this entry.
arXiv Detail & Related papers (2024-01-31T18:29:39Z)
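A toy sketch of the sorted-then-packed batching mentioned in the LongAlign entry above: sort examples by length and greedily pack them into fixed-size sequences so little compute is wasted on padding. The greedy policy and names are illustrative assumptions; the paper's full recipe, including how loss is balanced across packed sequences, is not reproduced here.

```python
# Sketch only: length-sorted greedy packing of variable-length examples into
# fixed-size training sequences.
from typing import List

def pack_sorted(example_lengths: List[int], max_len: int = 8192) -> List[List[int]]:
    """Return groups of example indices whose total token count fits within max_len."""
    order = sorted(range(len(example_lengths)), key=lambda i: example_lengths[i])  # sorted batching
    packs: List[List[int]] = []
    current: List[int] = []
    used = 0
    for idx in order:
        length = min(example_lengths[idx], max_len)   # assumes over-long examples were truncated upstream
        if current and used + length > max_len:
            packs.append(current)                     # close the current pack and start a new one
            current, used = [], 0
        current.append(idx)
        used += length
    if current:
        packs.append(current)
    return packs

# Example: token lengths -> index groups, each fitting in one 8K sequence.
print(pack_sorted([1200, 300, 7000, 2500, 4000], max_len=8192))  # [[1, 0, 3, 4], [2]]
```

Within a pack, examples are typically concatenated and attention is masked so tokens cannot attend across example boundaries.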
- Effective Long-Context Scaling of Foundation Models [90.57254298730923]
We present a series of long-context LLMs that support effective context windows of up to 32,768 tokens.
Our models achieve consistent improvements on most regular tasks and significant improvements on long-context tasks over Llama 2.
arXiv Detail & Related papers (2023-09-27T21:41:49Z)