AutoSynth: Automated Workflow Optimization for High-Quality Synthetic Dataset Generation via Monte Carlo Tree Search
- URL: http://arxiv.org/abs/2511.09488v1
- Date: Thu, 13 Nov 2025 01:57:51 GMT
- Title: AutoSynth: Automated Workflow Optimization for High-Quality Synthetic Dataset Generation via Monte Carlo Tree Search
- Authors: Shuzhen Bi, Chang Song, Siyu Song, Jinze Lv, Jian Chen, Xinyun Wang, Aimin Zhou, Hao Hao
- Abstract summary: Supervised fine-tuning (SFT) of large language models (LLMs) for specialized tasks requires high-quality datasets. Existing automated workflow methods face a cold start problem: they require labeled datasets for reward modeling. We introduce AutoSynth, a framework that automates workflow discovery and optimization without reference datasets.
- Score: 19.631058407921728
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Supervised fine-tuning (SFT) of large language models (LLMs) for specialized tasks requires high-quality datasets, but manual curation is prohibitively expensive. Synthetic data generation offers scalability, but its effectiveness relies on complex, multi-stage workflows that integrate prompt engineering and model orchestration. Existing automated workflow methods face a cold start problem: they require labeled datasets for reward modeling, which is especially problematic for subjective, open-ended tasks with no objective ground truth. We introduce AutoSynth, a framework that automates workflow discovery and optimization without reference datasets by reframing the problem as a Monte Carlo Tree Search guided by a novel dataset-free hybrid reward. This reward enables meta-learning through two LLM-as-judge components: one evaluates sample quality using dynamically generated task-specific metrics, and another assesses workflow code and prompt quality. Experiments on subjective educational tasks show that while expert-designed workflows achieve higher human preference rates (96-99% win rates vs. AutoSynth's 40-51%), models trained on AutoSynth-generated data dramatically outperform baselines (40-51% vs. 2-5%) and match or surpass expert workflows on certain metrics, suggesting discovery of quality dimensions beyond human intuition. These results are achieved while reducing human effort from 5-7 hours to just 30 minutes (>90% reduction). AutoSynth tackles the cold start issue in data-centric AI, offering a scalable, cost-effective method for subjective LLM tasks. Code: https://github.com/bisz9918-maker/AutoSynth.
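The core loop the abstract describes, MCTS over candidate workflows scored by a dataset-free hybrid reward from two LLM-as-judge components, can be illustrated with a minimal Python sketch. Everything below (`llm`, `mutate`, `run_workflow`, the 0-10 rating prompts, the reward weighting `alpha`) is an illustrative assumption, not AutoSynth's actual interface; see the linked repository for the real implementation.

```python
import math
from dataclasses import dataclass, field

# Minimal, hypothetical sketch of the loop the abstract describes. `llm` is
# any callable mapping a prompt string to a response string; `mutate` and
# `run_workflow` are assumed helpers, not AutoSynth's real interfaces.

def judge_sample_quality(llm, sample: str, metrics: list[str]) -> float:
    """LLM-as-judge #1: score one generated sample on task-specific metrics."""
    prompt = (
        "Rate the following synthetic sample from 0 to 10 overall, "
        f"considering these metrics: {', '.join(metrics)}.\n"
        f"Sample:\n{sample}\nReply with a single number."
    )
    return float(llm(prompt)) / 10.0  # assumes the judge replies with a bare number

def judge_workflow_quality(llm, workflow_code: str) -> float:
    """LLM-as-judge #2: score the workflow's code and prompts themselves."""
    prompt = (
        "Rate this data-generation workflow (code and prompts) from 0 to 10 "
        f"for clarity, coverage, and robustness:\n{workflow_code}\n"
        "Reply with a single number."
    )
    return float(llm(prompt)) / 10.0

def hybrid_reward(llm, workflow_code, samples, metrics, alpha=0.5):
    """Dataset-free reward: blend sample-level and workflow-level judgments."""
    sample_score = sum(judge_sample_quality(llm, s, metrics) for s in samples) / len(samples)
    return alpha * sample_score + (1 - alpha) * judge_workflow_quality(llm, workflow_code)

@dataclass
class Node:
    workflow_code: str
    parent: "Node | None" = None
    children: "list[Node]" = field(default_factory=list)
    visits: int = 0
    total_reward: float = 0.0

def uct(node: Node, c: float = 1.4) -> float:
    """Upper-confidence bound used to pick which child to descend into."""
    if node.visits == 0:
        return float("inf")
    exploit = node.total_reward / node.visits
    explore = c * math.sqrt(math.log(node.parent.visits) / node.visits)
    return exploit + explore

def mcts_step(root: Node, llm, mutate, run_workflow, metrics):
    # Selection: follow the highest-UCT child down to a leaf.
    node = root
    while node.children:
        node = max(node.children, key=uct)
    # Expansion: propose a variant workflow (e.g., LLM-rewritten code/prompts).
    child = Node(mutate(llm, node.workflow_code), parent=node)
    node.children.append(child)
    # Simulation: run the candidate and score its output with the hybrid reward.
    samples = run_workflow(child.workflow_code)
    reward = hybrid_reward(llm, child.workflow_code, samples, metrics)
    # Backpropagation: update statistics along the path back to the root.
    while child is not None:
        child.visits += 1
        child.total_reward += reward
        child = child.parent
```

Per the abstract, the task-specific metrics are themselves generated dynamically by an LLM before search begins; in this sketch they would simply be passed in as strings.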
Related papers
- SDQM: Synthetic Data Quality Metric for Object Detection Dataset Evaluation [3.2150327776278576]
This paper introduces the Synthetic Dataset Quality Metric (SDQM) to assess data quality for object detection tasks. In our experiments, SDQM demonstrated a strong correlation with the mean Average Precision (mAP) scores of YOLOv11, a leading object detection model. It provides actionable insights for improving dataset quality, minimizing the need for costly iterative training.
arXiv Detail & Related papers (2025-10-08T03:01:26Z)
- CoT-Self-Instruct: Building high-quality synthetic prompts for reasoning and non-reasoning tasks [59.69339605157168]
CoT-Self-Instruct is a synthetic data generation method that instructs LLMs to first reason and plan via Chain-of-Thought. In verifiable reasoning, our synthetic data significantly outperforms existing training datasets. For non-verifiable instruction-following tasks, our method surpasses the performance of both human and standard Self-Instruct training data.
arXiv Detail & Related papers (2025-07-31T17:38:50Z)
- AgentSynth: Scalable Task Generation for Generalist Computer-Use Agents [60.881609323604685]
AgentSynth is a scalable and cost-efficient pipeline for automatically synthesizing high-quality tasks and trajectory datasets. Our pipeline achieves a low average cost of $0.60 per trajectory, orders of magnitude cheaper than human annotations.
arXiv Detail & Related papers (2025-06-17T05:46:52Z)
- MDCrow: Automating Molecular Dynamics Workflows with Large Language Models [0.6130124744675498]
We introduce MDCrow, an agentic LLM assistant capable of automating molecular dynamics simulations. We assess MDCrow's performance across 25 tasks that vary in required subtasks and difficulty, and we evaluate the agent's robustness to both difficulty and prompt style.
arXiv Detail & Related papers (2025-02-13T18:19:20Z)
- Multi-Agent Sampling: Scaling Inference Compute for Data Synthesis with Tree Search-Based Agentic Collaboration [81.45763823762682]
This work aims to bridge the gap by investigating the problem of data synthesis through multi-agent sampling. We introduce Tree Search-based Orchestrated Agents (TOA), where the workflow evolves iteratively during the sequential sampling process. Our experiments on alignment, machine translation, and mathematical reasoning demonstrate that multi-agent sampling significantly outperforms single-agent sampling as inference compute scales.
arXiv Detail & Related papers (2024-12-22T15:16:44Z)
- Star-Agents: Automatic Data Optimization with LLM Agents for Instruction Tuning [71.2981957820888]
We propose a novel Star-Agents framework, which automates the enhancement of data quality across datasets.
The framework initially generates diverse instruction data with multiple LLM agents through a bespoke sampling method.
The generated data undergo a rigorous evaluation using a dual-model method that assesses both difficulty and quality.
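As a hedged sketch of the dual-model evaluation described above, the snippet below has two judge models score each generated sample for difficulty and quality and keeps only samples that clear both thresholds. The prompts, thresholds, and the `llm_a`/`llm_b` interfaces are illustrative assumptions, not the Star-Agents implementation.

```python
# Hypothetical dual-model filter: two independent judge LLMs each rate a
# sample on two axes; a sample survives only if the averaged ratings clear
# both thresholds. `llm_a` and `llm_b` map a prompt string to a reply string.

def rate(llm, sample: str, axis: str) -> float:
    """Ask one judge for a 0-10 rating along a single axis."""
    prompt = (
        f"Rate this instruction-tuning sample from 0 to 10 for {axis}:\n"
        f"{sample}\nReply with a single number."
    )
    return float(llm(prompt))

def dual_model_filter(llm_a, llm_b, samples, min_quality=6.0, min_difficulty=4.0):
    kept = []
    for s in samples:
        quality = (rate(llm_a, s, "quality") + rate(llm_b, s, "quality")) / 2
        difficulty = (rate(llm_a, s, "difficulty") + rate(llm_b, s, "difficulty")) / 2
        if quality >= min_quality and difficulty >= min_difficulty:
            kept.append(s)
    return kept
```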
arXiv Detail & Related papers (2024-11-21T02:30:53Z)
- Little Giants: Synthesizing High-Quality Embedding Data at Scale [71.352883755806]
We introduce SPEED, a framework that aligns open-source small models to efficiently generate large-scale embedding data.
SPEED uses less than 1/10 of the GPT API calls while outperforming the state-of-the-art embedding model E5_mistral when both are trained solely on their synthetic data.
arXiv Detail & Related papers (2024-10-24T10:47:30Z)
- Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration [90.41908331897639]
Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data.
We present a novel approach, ReverseGen, designed to automatically generate effective training samples.
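Reading the title's "failure-inducing exploration" together with this summary, one plausible shape for the method is a loop that proposes queries, keeps the ones the target model answers incorrectly, and returns those failures as training data. A minimal sketch under that assumption, with `proposer`, `target`, and `is_correct` as assumed interfaces:

```python
# Hypothetical failure-inducing loop: a proposer LLM hunts for prompts the
# target model gets wrong; the failing prompts become training samples.

def failure_inducing_exploration(proposer, target, is_correct, seed_topics, rounds=3):
    failures = []
    for _ in range(rounds):
        for topic in seed_topics:
            # Ask the proposer for a query likely to trip up the target model.
            query = proposer(
                f"Write one challenging question about '{topic}' that a "
                "weaker model is likely to answer incorrectly."
            )
            answer = target(query)
            if not is_correct(query, answer):
                # The target failed, so this query is an effective training sample.
                failures.append({"prompt": query, "bad_answer": answer})
    return failures
```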
arXiv Detail & Related papers (2024-10-22T06:43:28Z)
- Efficacy of Synthetic Data as a Benchmark [3.2968976262860408]
We investigate the effectiveness of generating synthetic data through large language models (LLMs).
Our experiments show that while synthetic data can effectively capture performance of various methods for simpler tasks, it falls short for more complex tasks like named entity recognition.
We propose a new metric called the bias factor, which evaluates the biases introduced when the same LLM is used to both generate benchmarking data and to perform the tasks.
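The summary does not spell out how the bias factor is computed, so the following is only one plausible formalization under stated assumptions: the generator model's score on its own synthetic benchmark divided by the mean score of the other models on that same benchmark, with values well above 1 suggesting self-favoring bias.

```python
# Assumed formalization (not necessarily the paper's definition): ratio of
# the generator model's benchmark score to the mean score of all other models.

def bias_factor(scores: dict[str, float], generator: str) -> float:
    others = [v for name, v in scores.items() if name != generator]
    if not others:
        raise ValueError("need at least one non-generator model")
    return scores[generator] / (sum(others) / len(others))

# Example: the model that generated the benchmark also scores highest on it.
print(bias_factor({"generator-llm": 0.91, "model-b": 0.74, "model-c": 0.70}, "generator-llm"))
# -> ~1.26, hinting the benchmark may favor its own generator
```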
arXiv Detail & Related papers (2024-09-18T13:20:23Z)
- AutoAct: Automatic Agent Learning from Scratch for QA via Self-Planning [54.47116888545878]
AutoAct is an automatic agent learning framework for QA.
It does not rely on large-scale annotated data or synthetic planning trajectories from closed-source models.
arXiv Detail & Related papers (2024-01-10T16:57:24Z)
- Exploiting Asymmetry for Synthetic Training Data Generation: SynthIE and the Case of Information Extraction [28.51694365908817]
This work shows that useful data can be synthetically generated even for tasks that cannot be solved directly by large language models.
We synthetically generate a dataset of 1.8M data points and establish its superior quality compared to existing datasets in a human evaluation.
arXiv Detail & Related papers (2023-03-07T18:48:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.