Related papers: LongMagpie: A Self-synthesis Method for Generating Large-scale Long-context Instructions

LongMagpie: A Self-synthesis Method for Generating Large-scale Long-context Instructions

URL: http://arxiv.org/abs/2505.17134v2
Date: Tue, 03 Jun 2025 03:04:17 GMT
Title: LongMagpie: A Self-synthesis Method for Generating Large-scale Long-context Instructions
Authors: Chaochen Gao, Xing Wu, Zijia Lin, Debing Zhang, Songlin Hu,
Abstract summary: LongMagpie is a framework that automatically generates large-scale long-context instruction data.<n>We show that LongMagpie achieves leading performance on long-context tasks while maintaining competitive performance on short-context tasks.
Score: 28.002824369635768
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: High-quality long-context instruction data is essential for aligning long-context large language models (LLMs). Despite the public release of models like Qwen and Llama, their long-context instruction data remains proprietary. Human annotation is costly and challenging, while template-based synthesis methods limit scale, diversity, and quality. We introduce LongMagpie, a self-synthesis framework that automatically generates large-scale long-context instruction data. Our key insight is that aligned long-context LLMs, when presented with a document followed by special tokens preceding a user turn, auto-regressively generate contextually relevant queries. By harvesting these document-query pairs and the model's responses, LongMagpie produces high-quality instructions without human effort. Experiments on HELMET, RULER, and Longbench v2 demonstrate that LongMagpie achieves leading performance on long-context tasks while maintaining competitive performance on short-context tasks, establishing it as a simple and effective approach for open, diverse, and scalable long-context instruction data synthesis.

Related papers

Flora: Effortless Context Construction to Arbitrary Length and Scale [71.12886910497284]
We introduce Flora, an effortless (human/LLM-free) long-context construction strategy.<n>Experiments on Llama3-8B-Instruct and QwQ-32B show that Flora excel in three long-context benchmarks while maintaining strong performances in short-context tasks.
arXiv Detail & Related papers (2025-07-26T04:21:21Z)
WildLong: Synthesizing Realistic Long-Context Instruction Data at Scale [86.25450054683172]
WildLong extracts meta-information from real user queries to produce scalable data.<n>It supports multi-document reasoning, such as cross-document comparison and aggregation.<n>It surpasses existing open-source long-context-optimized models across benchmarks.
arXiv Detail & Related papers (2025-02-23T18:59:09Z)
Generalizing From Short to Long: Effective Data Synthesis for Long-Context Instruction Tuning [103.65680870130839]
We investigate how to design instruction data for the post-training phase of a long context pre-trained model.<n>Our controlled study reveals that models instruction-tuned on short contexts can effectively generalize to longer ones.<n>Based on these findings, we propose context synthesis, a novel data synthesis framework.
arXiv Detail & Related papers (2025-02-21T17:02:40Z)
NExtLong: Toward Effective Long-Context Training without Long Documents [28.002824369635768]
We propose NExtLong, a novel framework for long-context data through Negative document Extension.<n> NExtLong decomposes a document into multiple meta-chunks and extends the context by interleaving hard negative distractors retrieved from pretraining corpora.<n>Extensive experiments demonstrate that NExtLong achieves significant performance improvements compared to existing long-context synthesis approaches.
arXiv Detail & Related papers (2025-01-22T10:01:54Z)
Bootstrap Your Own Context Length [74.61148597039248]
We introduce a bootstrapping approach to train long-context language models by exploiting their short-context capabilities only.<n>The proposed data synthesis workflow requires only a short-context language model, a text retriever, and a document collection.<n>We conduct experiments with the open-source Llama-3 family of models and demonstrate that our method can successfully extend the context length to up to 1M tokens.
arXiv Detail & Related papers (2024-12-25T10:08:54Z)
ACER: Automatic Language Model Context Extension via Retrieval [36.40066695682234]
Current open-weight generalist long-context models are still lacking in practical long-context processing tasks. We build an textbfautomatic data synthesis pipeline that mimics this process using short-context LMs. The short-context LMs are further tuned using these self-generated data to obtain task-specific long-context capabilities.
arXiv Detail & Related papers (2024-10-11T17:57:06Z)
Long Context Alignment with Short Instructions and Synthesized Positions [56.1267385315404]
This paper introduces Step-Skipping Alignment (SkipAlign) It is a new technique designed to enhance the long-context capabilities of Large Language Models (LLMs) With a careful selection of the base model and alignment datasets, SkipAlign with only 6B parameters achieves it's best performance and comparable with strong baselines like GPT-3.5-Turbo-16K on LongBench.
arXiv Detail & Related papers (2024-05-07T01:56:22Z)
LongAlign: A Recipe for Long Context Alignment of Large Language Models [61.85923382850057]
LongAlign is a recipe of the instruction data, training, and evaluation for long context alignment. We construct a long instruction-following dataset using Self-Instruct. We adopt the packing and sorted strategies to speed up supervised fine-tuning on data with varied length distributions.
arXiv Detail & Related papers (2024-01-31T18:29:39Z)

This list is automatically generated from the titles and abstracts of the papers in this site.