GraSP: A Unified Graph-Based Framework for Scalable Generation, Quality Tagging, and Management of Synthetic Data for SFT and DPO
- URL: http://arxiv.org/abs/2508.15432v1
- Date: Thu, 21 Aug 2025 10:35:41 GMT
- Title: GraSP: A Unified Graph-Based Framework for Scalable Generation, Quality Tagging, and Management of Synthetic Data for SFT and DPO
- Authors: Bidyapati Pradhan, Surajit Dasgupta, Amit Kumar Saha, Omkar Anustoop, Sriram Puttagunta, Vipul Mittal, Gopal Sarda
- Abstract summary: We present a comprehensive synthetic data generation framework for large language models (LLMs). Our approach employs a modular and configuration-based pipeline capable of modeling complex dialogue flows with minimal manual intervention. The resulting datasets are structured under a flexible schema supporting both SFT and DPO use cases, enabling seamless integration into diverse training workflows.
- Score: 0.10051474951635875
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The advancement of large language models (LLMs) is critically dependent on the availability of high-quality datasets for Supervised Fine-Tuning (SFT) and alignment tasks such as Direct Preference Optimization (DPO). In this work, we present a comprehensive synthetic data generation framework that facilitates scalable, configurable, and high-fidelity generation of synthetic data tailored for these training paradigms. Our approach employs a modular and configuration-based pipeline capable of modeling complex dialogue flows with minimal manual intervention. This framework uses a dual-stage quality tagging mechanism, combining heuristic rules and LLM-based evaluations, to automatically filter and score data extracted from OASST-formatted conversations, ensuring the curation of high-quality dialogue samples. The resulting datasets are structured under a flexible schema supporting both SFT and DPO use cases, enabling seamless integration into diverse training workflows. Together, these innovations offer a robust solution for generating and managing synthetic conversational data at scale, significantly reducing the overhead of data preparation in LLM training pipelines.
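The abstract does not include code, but the dual-stage quality tagging and the shared SFT/DPO schema it describes can be sketched as follows. This is a minimal illustration under assumed details: the heuristic rules, the judge stub, the threshold, and the record fields are all invented here, not GraSP's actual API.
```python
# Hedged sketch of dual-stage quality tagging over OASST-style message
# trees. All names, rules, and thresholds are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class Message:
    role: str                      # "prompter" or "assistant", as in OASST
    text: str
    replies: list = field(default_factory=list)

def heuristic_pass(msg: Message) -> bool:
    """Stage 1: cheap rule-based filters (assumed rules)."""
    text = msg.text.strip()
    return 10 <= len(text) <= 8000 and not text.lower().startswith("as an ai")

def score_with_llm(prompt: str, response: str) -> float:
    """Stage 2: LLM-as-judge score in [0, 1]; stubbed here."""
    return 0.9  # replace with a call to a real judge model

def tag_pairs(root: Message, threshold: float = 0.7):
    """Walk the tree and keep (prompt, response, score) triples."""
    kept, stack = [], [root]
    while stack:
        node = stack.pop()
        for reply in node.replies:
            stack.append(reply)
            if node.role == "prompter" and reply.role == "assistant" \
                    and heuristic_pass(node) and heuristic_pass(reply):
                score = score_with_llm(node.text, reply.text)
                if score >= threshold:
                    kept.append((node.text, reply.text, score))
    return kept

def to_sft(prompt, response, score):
    """One possible 'flexible schema' record for SFT."""
    return {"messages": [{"role": "user", "content": prompt},
                         {"role": "assistant", "content": response}],
            "quality": score}

def to_dpo(prompt, chosen, rejected):
    """The same data reshaped for DPO preference training."""
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```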
Related papers
- FABRIC: Framework for Agent-Based Realistic Intelligence Creation [3.940391073007047]
Large language models (LLMs) are increasingly deployed as agents, expected to decompose goals, invoke tools, and verify results in dynamic environments. We present a unified framework for synthesizing agentic data using only LLMs, without any human-in-the-loop supervision.
arXiv Detail & Related papers (2025-10-20T18:20:22Z)
- Dynamic Generation of Multi-LLM Agents Communication Topologies with Graph Diffusion Models [99.85131798240808]
We introduce a novel generative framework called Guided Topology Diffusion (GTD). Inspired by conditional discrete graph diffusion models, GTD formulates topology synthesis as an iterative construction process. At each step, the generation is steered by a lightweight proxy model that predicts multi-objective rewards. Experiments show that GTD can generate highly task-adaptive, sparse, and efficient communication topologies.
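A toy sketch of the proxy-steered iterative construction this summary describes; GTD itself uses a conditional graph diffusion model, and the proxy below is a random stand-in, so everything here is an assumption for illustration.
```python
# Toy proxy-guided topology construction: greedily add the edge whose
# predicted multi-objective reward is highest, stopping when nothing helps.
import itertools
import random

def proxy_reward(edges):
    """Stand-in for the lightweight proxy: random task-fit score
    minus a sparsity penalty."""
    return random.random() - 0.05 * len(edges)

def build_topology(n_agents=4, steps=6):
    edges = set()
    candidates = list(itertools.combinations(range(n_agents), 2))
    for _ in range(steps):
        scored = [(proxy_reward(edges | {e}), e)
                  for e in candidates if e not in edges]
        if not scored:
            break
        best_score, best_edge = max(scored)
        if best_score <= proxy_reward(edges):
            break                    # no candidate improves the prediction
        edges.add(best_edge)
    return edges

print(build_topology())
```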
arXiv Detail & Related papers (2025-10-09T05:28:28Z)
- Modular Techniques for Synthetic Long-Context Data Generation in Language Model Training and Evaluation [0.0]
This work introduces a modular framework for synthetic long-context data generation via prompt-based interaction with large language models (LLMs). The framework supports multiple training and alignment objectives, including Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO). It encompasses four core generation paradigms: multi-turn conversational dialogues, document-grounded input-output pairs, verifiable instruction-response tasks, and long-context reasoning examples.
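A minimal config-driven dispatch that mirrors the four paradigms and three objectives named above; the key names and generator stubs are assumptions, not the paper's interface.
```python
# Hypothetical configuration-to-generator dispatch; names are invented.
def gen_dialogue(cfg):      return {"type": "multi_turn", "turns": []}
def gen_doc_grounded(cfg):  return {"type": "doc_grounded", "pairs": []}
def gen_verifiable(cfg):    return {"type": "verifiable", "tasks": []}
def gen_reasoning(cfg):     return {"type": "long_reasoning", "steps": []}

GENERATORS = {
    "multi_turn_dialogue": gen_dialogue,
    "document_grounded": gen_doc_grounded,
    "verifiable_instruction": gen_verifiable,
    "long_context_reasoning": gen_reasoning,
}

def generate(cfg):
    sample = GENERATORS[cfg["paradigm"]](cfg)
    sample["objective"] = cfg["objective"]   # "SFT", "DPO", or "GRPO"
    return sample

print(generate({"paradigm": "multi_turn_dialogue", "objective": "DPO"}))
```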
arXiv Detail & Related papers (2025-09-01T07:08:45Z)
- Large Language Models for Data Synthesis [17.333852085464176]
Large Language Models (LLMs) have potential as flexible, high-dimensional priors over real-world distributions. We introduce LLMSynthor, a framework for data synthesis that transforms LLMs into structure-aware simulators guided by distributional feedback. By minimizing discrepancies in the summary statistics space, the iterative synthesis loop aligns real and synthetic data.
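A schematic of the distributional-feedback loop sketched in the summary: compare summary statistics of real and synthetic data, and feed the gap back to the sampler until they align. The statistics, tolerance, and sampler interface are assumptions.
```python
# Iterative synthesis loop aligning summary statistics (illustrative only).
import random
import statistics

def summary(xs):
    return (statistics.mean(xs), statistics.pstdev(xs))

def discrepancy(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def synthesize(real, sampler, rounds=10, tol=0.05):
    target = summary(real)
    synthetic = sampler(feedback=None)
    for _ in range(rounds):
        gap = discrepancy(summary(synthetic), target)
        if gap < tol:
            break
        synthetic = sampler(feedback=gap)  # an LLM-backed sampler would adjust
    return synthetic

# Demo with a trivial sampler that ignores the feedback signal.
print(summary(synthesize([1, 2, 3],
                         lambda feedback: [random.gauss(2, 1) for _ in range(100)])))
```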
arXiv Detail & Related papers (2025-05-20T13:35:38Z)
- RouteNator: A Router-Based Multi-Modal Architecture for Generating Synthetic Training Data for Function Calling LLMs [3.41612427812159]
In digital content creation tools, users express their needs through natural language queries that must be mapped to API calls. Existing approaches to synthetic data generation fail to replicate real-world data distributions. We present a novel router-based architecture that generates high-quality synthetic training data.
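A skeleton of what a router in front of several synthetic-data generators might look like; the routing rules and generator names are invented for illustration.
```python
# Illustrative router: map a natural-language query to a generator.
def route(query: str) -> str:
    q = query.lower()
    if "image" in q or "photo" in q:
        return "image_api_generator"
    if "font" in q or "text style" in q:
        return "typography_api_generator"
    return "generic_api_generator"

def synthesize_example(query: str) -> dict:
    # each generator would emit a (query, API call) training pair
    return {"query": query, "generator": route(query)}

print(synthesize_example("add a photo of a beach"))
```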
arXiv Detail & Related papers (2025-05-15T16:53:45Z)
- Scaling Laws of Synthetic Data for Language Models [132.67350443447611]
We introduce SynthLLM, a scalable framework that transforms pre-training corpora into diverse, high-quality synthetic datasets. Our approach achieves this by automatically extracting and recombining high-level concepts across multiple documents using a graph algorithm.
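A toy rendering of the concept-graph idea: link concepts that co-occur within a document, then pose new prompts over concepts connected across documents. The extraction and recombination rules here are assumptions.
```python
# Build a concept co-occurrence graph and recombine linked concepts.
from collections import defaultdict
from itertools import combinations

def build_concept_graph(doc_concepts):
    graph = defaultdict(set)
    for concepts in doc_concepts.values():
        for a, b in combinations(sorted(concepts), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph

def recombine(graph):
    return [f"Explain how {a} relates to {b}."
            for a in graph for b in graph[a] if a < b]

g = build_concept_graph({"doc1": {"gradient descent", "convexity"},
                         "doc2": {"convexity", "duality"}})
print(recombine(g))
```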
arXiv Detail & Related papers (2025-03-25T11:07:12Z)
- Beyond Sample-Level Feedback: Using Reference-Level Feedback to Guide Data Synthesis [54.15152681093108]
We introduce Reference-Level Feedback, a paradigm that extracts desirable characteristics from carefully curated reference samples to guide the synthesis of higher-quality instruction-response pairs. Experiments demonstrate that Reference-Level Feedback consistently outperforms traditional sample-level feedback methods, generalizes across model architectures, and produces high-quality and diverse data at low cost.
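A minimal sketch of the reference-level idea: distill traits from a curated reference sample and inject them into the synthesis prompt. The trait detectors and prompt wording are assumptions.
```python
# Extract assumed traits from a reference sample to steer generation.
def extract_traits(reference: dict) -> list:
    traits = []
    if len(reference["response"].split()) > 150:
        traits.append("thorough, well-developed answer")
    if "def " in reference["response"]:
        traits.append("includes a worked code example")
    return traits

def synthesis_prompt(instruction: str, traits: list) -> str:
    guidance = "; ".join(traits) or "clear and correct"
    return (f"Write a response to: {instruction}\n"
            f"Match these qualities of the reference: {guidance}.")

print(synthesis_prompt("Explain DPO.", ["thorough, well-developed answer"]))
```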
arXiv Detail & Related papers (2025-02-06T21:29:00Z)
- TabularARGN: A Flexible and Efficient Auto-Regressive Framework for Generating High-Fidelity Synthetic Data [0.42881773214459123]
We introduce the Tabular Auto-Regressive Generative Network (TabularARGN), a flexible framework to handle mixed-type, multivariate, and sequential datasets. By training on all possible conditional probabilities, TabularARGN supports advanced features such as fairness-aware generation, imputation, and conditional generation on any subset of columns.
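A conceptual sketch of why learning all conditional probabilities enables any-order, conditional sampling: columns can be generated one at a time with any subset fixed in advance. The column names and the stub sampler are assumptions.
```python
# Any-order autoregressive sampling over discrete columns (stub model).
import random

COLUMNS = ["age_band", "income_band", "region"]
CHOICES = {"age_band": ["18-30", "31-50", "51+"],
           "income_band": ["low", "mid", "high"],
           "region": ["north", "south"]}

def conditional_sampler(column, known):
    """Stand-in for the trained network p(column | known columns)."""
    return random.choice(CHOICES[column])

def sample_row(fixed=None):
    row = dict(fixed or {})           # condition on any subset of columns
    for col in COLUMNS:
        if col not in row:
            row[col] = conditional_sampler(col, row)
    return row

print(sample_row(fixed={"region": "north"}))  # conditional generation
```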
arXiv Detail & Related papers (2025-01-21T10:06:19Z)
- ToolFlow: Boosting LLM Tool-Calling Through Natural and Coherent Dialogue Synthesis [80.34000499166648]
We propose a Graph-based Sampling strategy to sample more relevant tool combinations, and a Planned-generation strategy to create plans that guide the synthesis of coherent dialogues. We apply SFT on LLaMA-3.1-8B using 8,000 synthetic dialogues generated with ToolFlow. Results show that the model achieves tool-calling performance comparable to or even surpassing GPT-4, while maintaining strong general capabilities.
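An illustrative pairing of the two strategies: walk a tool-relevance graph to sample a connected tool set, then draft a turn-by-turn plan for the dialogue generator to follow. The graph contents and plan format are assumptions.
```python
# Graph-based tool sampling plus a simple plan for dialogue synthesis.
import random

TOOL_GRAPH = {                        # edges link tools that plausibly co-occur
    "search_flights": ["book_hotel", "get_weather"],
    "book_hotel": ["search_flights"],
    "get_weather": ["search_flights"],
}

def sample_tool_set(start, k=2):
    tools, frontier = {start}, [start]
    while frontier and len(tools) < k + 1:
        nxt = random.choice(TOOL_GRAPH[frontier.pop()])
        if nxt not in tools:
            tools.add(nxt)
            frontier.append(nxt)
    return tools

def make_plan(tools):
    return [f"turn {i + 1}: user task requiring {t}"
            for i, t in enumerate(sorted(tools))]

print(make_plan(sample_tool_set("search_flights")))
```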
arXiv Detail & Related papers (2024-10-24T05:45:04Z)
- A Framework for Fine-Tuning LLMs using Heterogeneous Feedback [69.51729152929413]
We present a framework for fine-tuning large language models (LLMs) using heterogeneous feedback.
First, we combine the heterogeneous feedback data into a single supervision format, compatible with methods like SFT and RLHF.
Next, given this unified feedback dataset, we extract a high-quality and diverse subset to obtain performance increases.
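The two steps described above can be sketched as a normalize-then-filter pass; the field names, the rating scale, and the quality rule are assumptions, not the paper's schema.
```python
# Unify heterogeneous feedback into one record format, then curate.
def normalize(item):
    if "rating" in item:              # numeric-rating feedback (1-5 assumed)
        return {"prompt": item["prompt"], "response": item["response"],
                "preference": item["rating"] / 5.0}
    if "chosen" in item:              # pairwise-preference feedback
        return {"prompt": item["prompt"], "response": item["chosen"],
                "preference": 1.0}
    return None                       # unknown feedback type: drop

def curate(raw, min_pref=0.8):
    unified = [r for r in map(normalize, raw) if r]
    return [r for r in unified if r["preference"] >= min_pref]

print(curate([{"prompt": "p", "response": "r", "rating": 5},
              {"prompt": "p", "chosen": "a", "rejected": "b"}]))
```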
arXiv Detail & Related papers (2024-08-05T23:20:32Z)
- An Integrated Data Processing Framework for Pretraining Foundation Models [57.47845148721817]
Researchers and practitioners often have to manually curate datasets from different sources.
We propose a data processing framework that integrates a Processing Module and an Analyzing Module.
The proposed framework is easy to use and highly flexible.
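A minimal sketch of a processing module paired with an analyzing module; the operators (deduplication, empty-line removal) and the reported statistics are placeholder choices, not the paper's actual modules.
```python
# Processing Module: clean and deduplicate; Analyzing Module: report stats.
def processing_module(docs):
    seen, cleaned = set(), []
    for doc in docs:
        doc = doc.strip()
        if doc and doc not in seen:   # drop empties and exact duplicates
            seen.add(doc)
            cleaned.append(doc)
    return cleaned

def analyzing_module(docs):
    lengths = [len(d.split()) for d in docs]
    return {"n_docs": len(docs),
            "avg_words": sum(lengths) / max(len(lengths), 1)}

corpus = processing_module(["a b", "a b", "  ", "c d e"])
print(analyzing_module(corpus))
```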
arXiv Detail & Related papers (2024-02-26T07:22:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.