Fugu-MT Paper Translation (Abstract): GraSP: A Unified Graph-Based Framework for Scalable Generation, Quality Tagging, and Management of Synthetic Data for SFT and DPO

Paper overview: GraSP: A Unified Graph-Based Framework for Scalable Generation, Quality Tagging, and Management of Synthetic Data for SFT and DPO

arxiv url: http://arxiv.org/abs/2508.15432v1
Date: Thu, 21 Aug 2025 10:35:41 GMT
Status: Translation complete
Last updated in system: 2025-08-22 16:26:46.28164
Title: GraSP: A Unified Graph-Based Framework for Scalable Generation, Quality Tagging, and Management of Synthetic Data for SFT and DPO
Title (reference translation): GraSP: A unified graph-based framework for scalable generation, quality tagging, and management of synthetic data for SFT and DPO
Authors: Bidyapati Pradhan, Surajit Dasgupta, Amit Kumar Saha, Omkar Anustoop, Sriram Puttagunta, Vipul Mittal, Gopal Sarda
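Abstract summary: We propose a comprehensive synthetic data generation framework for large language models (LLMs). The approach uses a modular, configuration-based pipeline that can model complex dialogue flows with minimal manual intervention. The resulting datasets are structured under a flexible schema that supports both SFT and DPO use cases, enabling seamless integration into diverse training workflows.
Reference score (independently computed attention score): 0.10051474951635875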
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The advancement of large language models (LLMs) is critically dependent on the availability of high-quality datasets for Supervised Fine-Tuning (SFT), alignment tasks like Direct Preference Optimization (DPO), etc. In this work, we present a comprehensive synthetic data generation framework that facilitates scalable, configurable, and high-fidelity generation of synthetic data tailored for these training paradigms. Our approach employs a modular and configuration-based pipeline capable of modeling complex dialogue flows with minimal manual intervention. This framework uses a dual-stage quality tagging mechanism, combining heuristic rules and LLM-based evaluations, to automatically filter and score data extracted from OASST-formatted conversations, ensuring the curation of high-quality dialogue samples. The resulting datasets are structured under a flexible schema supporting both SFT and DPO use cases, enabling seamless integration into diverse training workflows. Together, these innovations offer a robust solution for generating and managing synthetic conversational data at scale, significantly reducing the overhead of data preparation in LLM training pipelines.
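Abstract (reference translation): Progress in large language models (LLMs) depends heavily on the availability of high-quality datasets for supervised fine-tuning (SFT) and for alignment tasks such as Direct Preference Optimization (DPO). In this work, we propose a comprehensive synthetic data generation framework that facilitates scalable, configurable, and high-fidelity generation of synthetic data suited to these training paradigms. The approach uses a modular, configuration-based pipeline that can model complex dialogue flows with minimal manual intervention. The framework combines heuristic rules with LLM-based evaluation to automatically filter and score data extracted from OASST-formatted conversations, ensuring the curation of high-quality dialogue samples. The resulting datasets are structured under a flexible schema that supports both SFT and DPO use cases, enabling seamless integration into diverse training workflows. Together, these innovations provide a robust solution for generating and managing synthetic conversational data at scale, significantly reducing the overhead of data preparation in LLM training pipelines.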
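
The dual-stage quality tagging and the shared SFT/DPO schema mentioned in the abstract can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not the GraSP implementation: the `Record` fields, the heuristic rules, the score threshold, and the `stub_judge` function (a stand-in for a real LLM evaluation call) are all hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Record:
    """Hypothetical unified record: `rejected` is None for SFT, set for DPO."""
    prompt: str
    chosen: str
    rejected: Optional[str] = None
    tags: dict = field(default_factory=dict)


def heuristic_stage(rec: Record, min_words: int = 3) -> bool:
    """Stage 1: cheap rule-based checks (illustrative rules only)."""
    long_enough = len(rec.chosen.split()) >= min_words
    not_echo = rec.chosen.strip().lower() != rec.prompt.strip().lower()
    rec.tags["heuristic_pass"] = long_enough and not_echo
    return rec.tags["heuristic_pass"]


def llm_stage(rec: Record, judge: Callable[[str, str], float]) -> float:
    """Stage 2: score the response with a judge callable (assumed interface)."""
    rec.tags["llm_score"] = judge(rec.prompt, rec.chosen)
    return rec.tags["llm_score"]


def tag_and_filter(records: List[Record],
                   judge: Callable[[str, str], float],
                   threshold: float = 0.5) -> List[Record]:
    """Run both stages; only records that pass stage 1 reach the judge."""
    kept = []
    for rec in records:
        if not heuristic_stage(rec):
            continue
        if llm_stage(rec, judge) >= threshold:
            kept.append(rec)
    return kept


def stub_judge(prompt: str, response: str) -> float:
    """Placeholder for a real LLM call: longer answers score higher, capped at 1.0."""
    return min(len(response.split()) / 10.0, 1.0)


records = [
    Record("What is SFT?",
           "Supervised fine-tuning trains a model on labeled prompt and response pairs."),
    Record("What is DPO?",
           "It optimizes a policy directly from preference pairs.",
           rejected="Not sure."),
    Record("Hi", "Hi"),
]
kept = tag_and_filter(records, stub_judge)
```

Running the cheap heuristic stage first means the more expensive judge is called only on records that survive it. Keeping `rejected` optional lets the same record type serve both SFT (prompt plus one response) and DPO (prompt plus a preferred and a rejected response).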

Related papers