Modular Techniques for Synthetic Long-Context Data Generation in Language Model Training and Evaluation
- URL: http://arxiv.org/abs/2509.01185v2
- Date: Thu, 04 Sep 2025 17:22:16 GMT
- Title: Modular Techniques for Synthetic Long-Context Data Generation in Language Model Training and Evaluation
- Authors: Seganrasan Subramanian, Abhigya Verma
- Abstract summary: This work introduces a modular framework for synthetic long-context data generation via prompt-based interaction with large language models (LLMs). The framework supports multiple training and alignment objectives, including Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO). It encompasses four core generation paradigms: multi-turn conversational dialogues, document-grounded input-output pairs, verifiable instruction-response tasks, and long-context reasoning examples.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The ability of large language models (LLMs) to process and reason over long textual inputs is critical for a wide range of real-world applications. However, progress in this area is significantly constrained by the absence of high-quality, diverse, and verifiable long-context datasets suitable for both training and evaluation. This work introduces a modular, extensible framework for synthetic long-context data generation via prompt-based interaction with LLMs. The framework supports multiple training and alignment objectives, including Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO). It encompasses four core generation paradigms: multi-turn conversational dialogues, document-grounded input-output pairs, verifiable instruction-response tasks, and long-context reasoning examples. Through templated prompting, a model-agnostic architecture, and metadata-enriched outputs, the proposed approach facilitates scalable, controllable, and purpose-aligned dataset creation for advancing long-context capabilities in LLMs.
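To make the pipeline described in the abstract concrete, the sketch below shows one way templated prompting, a model-agnostic interface, and metadata-enriched outputs could fit together. This is not the authors' released code: the template strings, the `generate_example` function, and the `call_llm` callable are hypothetical placeholders for whatever prompts and LLM backend a user supplies.

```python
# Minimal sketch (assumed, not from the paper): templated, model-agnostic
# synthetic data generation with metadata-enriched output records.
import json
import uuid
from datetime import datetime, timezone
from typing import Callable

# One illustrative prompt template per generation paradigm named in the abstract.
TEMPLATES = {
    "multi_turn_dialogue": (
        "Generate a {num_turns}-turn conversation grounded in the document "
        "below. Keep references consistent across turns.\n\n{document}"
    ),
    "document_grounded_qa": (
        "Write a question that can only be answered using the document "
        "below, then answer it.\n\n{document}"
    ),
}

def generate_example(
    paradigm: str,
    document: str,
    call_llm: Callable[[str], str],  # model-agnostic: any prompt -> completion function
    objective: str = "SFT",          # e.g. "SFT", "DPO", "GRPO"
    **fields,
) -> dict:
    """Fill a template, query the model, and wrap the result with metadata."""
    prompt = TEMPLATES[paradigm].format(document=document, **fields)
    completion = call_llm(prompt)
    return {
        "id": str(uuid.uuid4()),
        "paradigm": paradigm,
        "objective": objective,
        "prompt": prompt,
        "response": completion,
        "metadata": {
            "created_at": datetime.now(timezone.utc).isoformat(),
            "context_chars": len(document),
        },
    }

if __name__ == "__main__":
    # Stub model call so the sketch runs without any API access.
    echo_model = lambda prompt: f"[model output for {len(prompt)} chars of prompt]"
    record = generate_example(
        "multi_turn_dialogue", "A long source document...", echo_model, num_turns=4
    )
    print(json.dumps(record, indent=2))
```

Because the model call is passed in as a plain function, the same template and record schema can be reused with any backend, which is the kind of model-agnostic, purpose-aligned dataset creation the abstract describes.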
Related papers
- Scaling Beyond Context: A Survey of Multimodal Retrieval-Augmented Generation for Document Understanding [61.36285696607487]
Document understanding is critical for applications from financial analysis to scientific discovery. Current approaches, whether OCR-based pipelines feeding Large Language Models (LLMs) or native Multimodal LLMs (MLLMs), face key limitations. Retrieval-Augmented Generation (RAG) helps ground models in external data, but documents' multimodal nature, combining text, tables, charts, and layout, demands a more advanced paradigm: Multimodal RAG.
arXiv Detail & Related papers (2025-10-17T02:33:16Z) - LiteLong: Resource-Efficient Long-Context Data Synthesis for LLMs [26.566444932983526]
High-quality long-context data is essential for training large language models. We present LiteLong, a resource-efficient method for synthesizing long-context data.
arXiv Detail & Related papers (2025-09-19T04:07:46Z) - GraSP: A Unified Graph-Based Framework for Scalable Generation, Quality Tagging, and Management of Synthetic Data for SFT and DPO [0.10051474951635875]
We present a comprehensive synthetic data generation framework for large language models (LLMs). Our approach employs a modular, configuration-based pipeline capable of modeling complex dialogue flows with minimal manual intervention. The resulting datasets are structured under a flexible schema supporting both SFT and DPO use cases, enabling seamless integration into diverse training pipelines.
arXiv Detail & Related papers (2025-08-21T10:35:41Z) - Chunks as Arms: Multi-Armed Bandit-Guided Sampling for Long-Context LLM Preference Optimization [56.97588709890706]
LongMab-PO is a novel framework that generates high-quality and diverse responses for long-context modeling tasks. Experimental results show that LongMab-PO significantly improves the diversity and quality of preference data pairs.
arXiv Detail & Related papers (2025-08-19T16:33:55Z) - Writing-RL: Advancing Long-form Writing via Adaptive Curriculum Reinforcement Learning [55.41828729623907]
We present Writing-RL, an Adaptive Curriculum Reinforcement Learning framework to advance long-form writing capabilities beyond supervised fine-tuning. The framework consists of three key components: a Margin-aware Data Selection strategy that prioritizes samples with high learning potential, a Pairwise Comparison Reward mechanism that provides discriminative learning signals, and a Dynamic Reference Scheduling approach.
arXiv Detail & Related papers (2025-06-06T05:40:39Z) - WildLong: Synthesizing Realistic Long-Context Instruction Data at Scale [86.25450054683172]
WildLong extracts meta-information from real user queries to produce scalable data. It supports multi-document reasoning, such as cross-document comparison and aggregation. It surpasses existing open-source long-context-optimized models across benchmarks.
arXiv Detail & Related papers (2025-02-23T18:59:09Z) - Generalizing From Short to Long: Effective Data Synthesis for Long-Context Instruction Tuning [103.65680870130839]
We investigate how to design instruction data for the post-training phase of a long-context pre-trained model. Our controlled study reveals that models instruction-tuned on short contexts can effectively generalize to longer ones. Based on these findings, we propose context synthesis, a novel data synthesis framework.
arXiv Detail & Related papers (2025-02-21T17:02:40Z) - Long Context Alignment with Short Instructions and Synthesized Positions [56.1267385315404]
This paper introduces Step-Skipping Alignment (SkipAlign), a new technique designed to enhance the long-context capabilities of Large Language Models (LLMs).
With a careful selection of the base model and alignment datasets, SkipAlign with only 6B parameters achieves its best performance, comparable to strong baselines like GPT-3.5-Turbo-16K on LongBench.
arXiv Detail & Related papers (2024-05-07T01:56:22Z) - Advancing Transformer Architecture in Long-Context Large Language Models: A Comprehensive Survey [18.930417261395906]
Transformer-based Large Language Models (LLMs) have been applied in diverse areas such as knowledge bases, human interfaces, and dynamic agents.
This article offers a survey of the recent advancement in Transformer-based LLM architectures aimed at enhancing the long-context capabilities of LLMs.
arXiv Detail & Related papers (2023-11-21T04:59:17Z) - Stabilized In-Context Learning with Pre-trained Language Models for Few-Shot Dialogue State Tracking [57.92608483099916]
Large pre-trained language models (PLMs) have shown impressive unaided performance across many NLP tasks.
For more complex tasks such as dialogue state tracking (DST), designing prompts that reliably convey the desired intent is nontrivial.
We introduce a saliency model to limit dialogue text length, allowing us to include more exemplars per query.
arXiv Detail & Related papers (2023-02-12T15:05:10Z)