WildLong: Synthesizing Realistic Long-Context Instruction Data at Scale
- URL: http://arxiv.org/abs/2502.16684v1
- Date: Sun, 23 Feb 2025 18:59:09 GMT
- Title: WildLong: Synthesizing Realistic Long-Context Instruction Data at Scale
- Authors: Jiaxi Li, Xingxing Zhang, Xun Wang, Xiaolong Huang, Li Dong, Liang Wang, Si-Qing Chen, Wei Lu, Furu Wei,
- Abstract summary: WildLong extracts meta-information from real user queries to produce scalable data.<n>It supports multi-document reasoning, such as cross-document comparison and aggregation.<n>It surpasses existing open-source long-context-optimized models across benchmarks.
- Score: 86.25450054683172
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) with extended context windows enable tasks requiring extensive information integration but are limited by the scarcity of high-quality, diverse datasets for long-context instruction tuning. Existing data synthesis methods focus narrowly on objectives like fact retrieval and summarization, restricting their generalizability to complex, real-world tasks. WildLong extracts meta-information from real user queries, models co-occurrence relationships via graph-based methods, and employs adaptive generation to produce scalable data. It extends beyond single-document tasks to support multi-document reasoning, such as cross-document comparison and aggregation. Our models, finetuned on 150K instruction-response pairs synthesized using WildLong, surpasses existing open-source long-context-optimized models across benchmarks while maintaining strong performance on short-context tasks without incorporating supplementary short-context data. By generating a more diverse and realistic long-context instruction dataset, WildLong enhances LLMs' ability to generalize to complex, real-world reasoning over long contexts, establishing a new paradigm for long-context data synthesis.
Related papers
- Scaling Instruction-Tuned LLMs to Million-Token Contexts via Hierarchical Synthetic Data Generation [15.975325252309554]
We introduce a novel post-training synthetic data generation strategy designed to efficiently extend the context window of Large Language Models.
Our approach scalably extends to arbitrarily long context lengths, unconstrained by the length of available real-world data.
We demonstrate that our model, with a context length of up to 1M tokens, performs well on the RULER benchmark and InfiniteBench.
arXiv Detail & Related papers (2025-04-17T04:46:57Z) - Generalizing From Short to Long: Effective Data Synthesis for Long-Context Instruction Tuning [103.65680870130839]
We investigate how to design instruction data for the post-training phase of a long context pre-trained model.<n>Our controlled study reveals that models instruction-tuned on short contexts can effectively generalize to longer ones.<n>Based on these findings, we propose context synthesis, a novel data synthesis framework.
arXiv Detail & Related papers (2025-02-21T17:02:40Z) - Bootstrap Your Own Context Length [74.61148597039248]
We introduce a bootstrapping approach to train long-context language models by exploiting their short-context capabilities only.<n>The proposed data synthesis workflow requires only a short-context language model, a text retriever, and a document collection.<n>We conduct experiments with the open-source Llama-3 family of models and demonstrate that our method can successfully extend the context length to up to 1M tokens.
arXiv Detail & Related papers (2024-12-25T10:08:54Z) - Understanding Synthetic Context Extension via Retrieval Heads [51.8869530817334]
We investigate fine-tuning on synthetic data for three long-context tasks that require retrieval and reasoning.<n>We find that models trained on synthetic data fall short of the real data, but surprisingly, the mismatch can be interpreted.<n>Our results shed light on how to interpret synthetic data fine-tuning performance and how to approach creating better data for learning real-world capabilities over long contexts.
arXiv Detail & Related papers (2024-10-29T17:55:00Z) - ACER: Automatic Language Model Context Extension via Retrieval [36.40066695682234]
Current open-weight generalist long-context models are still lacking in practical long-context processing tasks.
We build an textbfautomatic data synthesis pipeline that mimics this process using short-context LMs.
The short-context LMs are further tuned using these self-generated data to obtain task-specific long-context capabilities.
arXiv Detail & Related papers (2024-10-11T17:57:06Z) - LongSkywork: A Training Recipe for Efficiently Extending Context Length in Large Language Models [61.12177317970258]
LongSkywork is a long-context Large Language Model capable of processing up to 200,000 tokens.
We develop two novel methods for creating synthetic data.
LongSkywork achieves outstanding performance on a variety of long-context benchmarks.
arXiv Detail & Related papers (2024-06-02T03:34:41Z) - Quest: Query-centric Data Synthesis Approach for Long-context Scaling of Large Language Model [22.07414287186125]
Quest is a query-centric data method aggregating semantically relevant yet diverse documents.<n>It uses a generative model to predict potential queries for each document, grouping documents with similar queries and keywords.<n>Experiments demonstrate Quest's superior performance on long-context tasks, achieving remarkable results with context lengths of up to 1M tokens.
arXiv Detail & Related papers (2024-05-30T08:50:55Z) - Long Context Alignment with Short Instructions and Synthesized Positions [56.1267385315404]
This paper introduces Step-Skipping Alignment (SkipAlign)
It is a new technique designed to enhance the long-context capabilities of Large Language Models (LLMs)
With a careful selection of the base model and alignment datasets, SkipAlign with only 6B parameters achieves it's best performance and comparable with strong baselines like GPT-3.5-Turbo-16K on LongBench.
arXiv Detail & Related papers (2024-05-07T01:56:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.