From Real to Synthetic: Synthesizing Millions of Diversified and Complicated User Instructions with Attributed Grounding
- URL: http://arxiv.org/abs/2506.03968v1
- Date: Wed, 04 Jun 2025 14:00:47 GMT
- Authors: Chiwei Zhu, Benfeng Xu, Xiaorui Wang, Zhendong Mao
- Abstract summary: We construct a dataset of 1 million instructions, called SynthQuestions. We demonstrate that models trained on it achieve leading performance on several common benchmarks.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The pursuit of diverse, complex, and large-scale instruction data is crucial for automatically aligning large language models (LLMs). While there are methods capable of generating synthetic instructions at scale, they either suffer from limited grounding sources, leading to a narrow distribution, or rely on trivial extensions that fail to produce meaningful trajectories in terms of complexity. In contrast, instructions that benefit efficient alignment are typically crafted with cognitive insights and grounded in real-world use cases. In this paper, we synthesize such instructions using attributed grounding, which involves 1) a top-down attribution process that grounds a selective set of real instructions to situated users, and 2) a bottom-up synthesis process that leverages web documents to first generate a situation, then a meaningful instruction. This framework allows us to harvest diverse and complex instructions at scale, utilizing the vast range of web documents. Specifically, we construct a dataset of 1 million instructions, called SynthQuestions, and demonstrate that models trained on it achieve leading performance on several common benchmarks, with improvements that continually scale with more web corpora. Data, models and codes will be available at https://github.com/Ignoramus0817/SynthQuestions.
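The abstract describes a two-stage pipeline: a top-down attribution step that grounds real instructions to situated users, and a bottom-up step that turns a web document into a situation and then an instruction. The following is a minimal illustrative sketch of that flow; every function name, prompt, and the `call_llm` stub is an assumption for demonstration, not the authors' actual implementation (see the linked repository for that).

```python
# Illustrative sketch of an attributed-grounding pipeline.
# `call_llm` is a placeholder stub; a real system would query an LLM.

def call_llm(prompt: str) -> str:
    # Stand-in for a real model call; echoes a truncated prompt for demo purposes.
    return f"[LLM output for: {prompt[:40]}...]"

def attribute_instruction(real_instruction: str) -> dict:
    """Top-down: ground a real instruction to a situated user profile."""
    prompt = (
        "Given this user instruction, infer who might ask it and in what "
        f"situation:\n{real_instruction}"
    )
    return {"instruction": real_instruction, "attribution": call_llm(prompt)}

def synthesize_from_document(web_doc: str) -> dict:
    """Bottom-up: web document -> plausible user situation -> new instruction."""
    situation = call_llm(
        f"Describe a realistic user situation grounded in this document:\n{web_doc}"
    )
    instruction = call_llm(
        f"Write an instruction this user would plausibly ask in that situation:\n{situation}"
    )
    return {"document": web_doc, "situation": situation, "instruction": instruction}

# Running the bottom-up stage over a large web corpus is what scales the
# approach to million-instruction datasets like SynthQuestions.
corpus = ["A guide to tuning PostgreSQL indexes.", "Notes on sourdough hydration."]
synthetic = [synthesize_from_document(doc) for doc in corpus]
```

The key design point mirrored here is that diversity comes from the breadth of the grounding documents rather than from perturbing a fixed seed set of instructions.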
Related papers
- CodeEvo: Interaction-Driven Synthesis of Code-centric Data through Hybrid and Iterative Feedback [21.627909324788597]
Acquiring high-quality instruction-code pairs is essential for training Large Language Models. We propose CodeEvo, a framework that synthesizes code data through iterative interactions between two LLM agents.
arXiv Detail & Related papers (2025-07-25T16:12:51Z)
- Scaling Towards the Information Boundary of Instruction Set: InfinityInstruct-Subject Technical Report [11.70656700216213]
Construction of high-quality instruction datasets is crucial for enhancing model performance and generalizability. We propose a systematic instruction data synthesis framework, which integrates a hierarchical labeling system, an informative seed selection algorithm, and a model deficiency diagnosis. Based on this framework, we construct InfinityInstruct-Subject, a high-quality dataset containing 1.5 million instructions.
arXiv Detail & Related papers (2025-07-09T15:59:02Z)
- RouteNator: A Router-Based Multi-Modal Architecture for Generating Synthetic Training Data for Function Calling LLMs [3.41612427812159]
In digital content creation tools, users express their needs through natural language queries that must be mapped to API calls. Existing approaches to synthetic data generation fail to replicate real-world data distributions. We present a novel router-based architecture that generates high-quality synthetic training data.
arXiv Detail & Related papers (2025-05-15T16:53:45Z)
- Instruction-Tuning Data Synthesis from Scratch via Web Reconstruction [83.0216122783429]
Web Reconstruction (WebR) is a fully automated framework for synthesizing high-quality instruction-tuning (IT) data directly from raw web documents. We show that datasets generated by WebR outperform state-of-the-art baselines by up to 16.65% across four instruction-following benchmarks.
arXiv Detail & Related papers (2025-04-22T04:07:13Z)
- Scaling Laws of Synthetic Data for Language Models [132.67350443447611]
We introduce SynthLLM, a scalable framework that transforms pre-training corpora into diverse, high-quality synthetic datasets. Our approach achieves this by automatically extracting and recombining high-level concepts across multiple documents using a graph algorithm.
arXiv Detail & Related papers (2025-03-25T11:07:12Z)
- AIR: Complex Instruction Generation via Automatic Iterative Refinement [29.639832268719363]
Current approaches to generating complex instructions often produce instructions irrelevant to the actual requirements. We propose a novel automatic iterative refinement framework to generate complex instructions with constraints. We construct the AIR-10K dataset with 10K complex instructions and demonstrate that instructions generated with our approach significantly improve the model's ability to follow complex instructions.
arXiv Detail & Related papers (2025-02-25T02:39:57Z)
- EpiCoder: Encompassing Diversity and Complexity in Code Generation [49.170195362149386]
Existing methods for code generation use code snippets as seed data. We introduce a novel feature tree-based synthesis framework, which revolves around hierarchical code features. Our framework provides precise control over the complexity of the generated code, enabling functionalities that range from function-level operations to multi-file scenarios.
arXiv Detail & Related papers (2025-01-08T18:58:15Z)
- Learn2Synth: Learning Optimal Data Synthesis using Hypergradients for Brain Image Segmentation [11.82940051568101]
Domain randomization through synthesis is a powerful strategy to train networks that are unbiased with respect to the domain of the input images. We introduce Learn2Synth, a novel procedure in which synthesis parameters are learned using a small set of real labeled data. We develop parametric and nonparametric strategies to enhance synthetic images in a way that improves the performance of the segmentation network.
arXiv Detail & Related papers (2024-11-23T00:52:49Z)
- Synthetic continued pretraining [29.6872772403251]
We propose synthetic continued pretraining on a small corpus of domain-specific documents.
We instantiate this proposal with EntiGraph, a synthetic data augmentation algorithm.
We show how synthetic data augmentation can "rearrange" knowledge to enable more data-efficient learning.
arXiv Detail & Related papers (2024-09-11T17:21:59Z)
- Genetic Instruct: Scaling up Synthetic Generation of Coding Instructions for Large Language Models [59.60208063956459]
Large Language Models (LLMs) require high-quality instruction data for effective alignment. We present Genetic-Instruct, a scalable algorithm for synthesizing large-scale, high-quality coding instructions.
arXiv Detail & Related papers (2024-07-29T20:42:59Z)
- SynthesizRR: Generating Diverse Datasets with Retrieval Augmentation [55.2480439325792]
We study the synthesis of six datasets, covering topic classification, sentiment analysis, tone detection, and humor.
We find that SynthesizRR greatly improves lexical and semantic diversity, similarity to human-written text, and distillation performance.
arXiv Detail & Related papers (2024-05-16T12:22:41Z)
- Can LLMs Generate Human-Like Wayfinding Instructions? Towards Platform-Agnostic Embodied Instruction Synthesis [51.04181562775778]
We present a novel approach to automatically synthesize "wayfinding instructions" for an embodied robot agent.
Our algorithm uses in-context learning to condition an LLM to generate instructions using just a few references.
We implement our approach on multiple simulation platforms including Matterport3D, AI Habitat and ThreeDWorld.
arXiv Detail & Related papers (2024-03-18T05:38:07Z)
- Unsupervised Domain Adaptive Learning via Synthetic Data for Person Re-identification [101.1886788396803]
Person re-identification (re-ID) has gained more and more attention due to its widespread applications in video surveillance.
Unfortunately, mainstream deep learning methods still need a large quantity of labeled data to train models.
In this paper, we develop a data collector to automatically generate synthetic re-ID samples in a computer game, and construct a data labeler to simultaneously annotate them.
arXiv Detail & Related papers (2021-09-12T15:51:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.