DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI
- URL: http://arxiv.org/abs/2512.16676v1
- Date: Thu, 18 Dec 2025 15:46:15 GMT
- Title: DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI
- Authors: Hao Liang, Xiaochen Ma, Zhou Liu, Zhen Hao Wong, Zhengyang Zhao, Zimo Meng, Runming He, Chengyu Shen, Qifeng Cai, Zhaoyang Han, Meiyi Qiang, Yalin Feng, Tianyi Bai, Zewei Pan, Ziyi Guo, Yizhen Jiang, Jingwen Deng, Qijie You, Peichao Lai, Tianyu Guo, Chi Hsu Tsai, Hengyi Feng, Rui Hu, Wenkai Yu, Junbo Niu, Bohan Zeng, Ruichuan An, Lu Ma, Jihao Huang, Yaowei Zheng, Conghui He, Linpeng Tang, Bin Cui, Weinan E, Wentao Zhang,
- Abstract summary: DataFlow is a unified and LLM-driven data preparation framework.<n>System-level abstractions enable modular, reusable, and composable data transformations.<n>DataFlow consistently improves downstream Large Language Models performance.
- Score: 42.191938707504406
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapidly growing demand for high-quality data in Large Language Models (LLMs) has intensified the need for scalable, reliable, and semantically rich data preparation pipelines. However, current practices remain dominated by ad-hoc scripts and loosely specified workflows, which lack principled abstractions, hinder reproducibility, and offer limited support for model-in-the-loop data generation. To address these challenges, we present DataFlow, a unified and extensible LLM-driven data preparation framework. DataFlow is designed with system-level abstractions that enable modular, reusable, and composable data transformations, and provides a PyTorch-style pipeline construction API for building debuggable and optimizable dataflows. The framework consists of nearly 200 reusable operators and six domain-general pipelines spanning text, mathematical reasoning, code, Text-to-SQL, agentic RAG, and large-scale knowledge extraction. To further improve usability, we introduce DataFlow-Agent, which automatically translates natural-language specifications into executable pipelines via operator synthesis, pipeline planning, and iterative verification. Across six representative use cases, DataFlow consistently improves downstream LLM performance. Our math, code, and text pipelines outperform curated human datasets and specialized synthetic baselines, achieving up to +3\% execution accuracy in Text-to-SQL over SynSQL, +7\% average improvements on code benchmarks, and 1--3 point gains on MATH, GSM8K, and AIME. Moreover, a unified 10K-sample dataset produced by DataFlow enables base models to surpass counterparts trained on 1M Infinity-Instruct data. These results demonstrate that DataFlow provides a practical and high-performance substrate for reliable, reproducible, and scalable LLM data preparation, and establishes a system-level foundation for future data-centric AI development.
Related papers
- SemPipes -- Optimizable Semantic Data Operators for Tabular Machine Learning Pipelines [12.816711873869984]
We introduce SemPipes, a novel declarative programming model that integrates semantic data operators into ML pipelines.<n>SemPipes synthesizes custom operator implementations based on data characteristics, operator instructions, and pipeline context.<n>We show that semantic operators substantially improve end-to-end predictive performance for both expert-designed and agent-generated pipelines.
arXiv Detail & Related papers (2026-02-04T23:36:29Z) - Can LLMs Clean Up Your Mess? A Survey of Application-Ready Data Preparation with LLMs [66.63911043019294]
Data preparation aims to denoise raw datasets, uncover cross-dataset relationships, and extract valuable insights from them.<n>This paper focuses on the use of LLM techniques to prepare data for diverse downstream tasks.<n>We introduce a task-centric taxonomy that organizes the field into three major tasks: data cleaning, standardization, error processing, imputation, data integration, and data enrichment.
arXiv Detail & Related papers (2026-01-22T12:02:45Z) - LLM/Agent-as-Data-Analyst: A Survey [54.08761322298559]
Large language models (LLMs) and agent techniques have brought a fundamental shift in the functionality and development paradigm of data analysis tasks.<n>LLMs enable complex data understanding, natural language, semantic analysis functions, and autonomous pipeline orchestration.
arXiv Detail & Related papers (2025-09-28T17:31:38Z) - FlowETL: An Autonomous Example-Driven Pipeline for Data Engineering [1.3599496385950987]
FlowETL is an example-based autonomous pipeline architecture designed to automatically standardise and prepare input datasets.<n>A Planning Engine uses a paired input-output datasets sample to construct a transformation plan, which is then applied by an worker to the source.<n>The results show promising generalisation capabilities across 14 datasets of various domains, file structures, and file sizes.
arXiv Detail & Related papers (2025-07-30T21:46:22Z) - VerilogDB: The Largest, Highest-Quality Dataset with a Preprocessing Framework for LLM-based RTL Generation [1.0798445660490976]
Large Language Models (LLMs) are gaining popularity for hardware design automation, particularly through Register Transfer Level (RTL) code generation.<n>We construct a robust Verilog dataset through an automated three-pronged process involving database (DB) creation and management.<n>The resulting dataset comprises 20,392 Verilog samples, 751 MB of Verilog code data, which is the largest high-quality Verilog dataset for fine-tuning to our knowledge.
arXiv Detail & Related papers (2025-07-09T17:06:54Z) - SWE-Flow: Synthesizing Software Engineering Data in a Test-Driven Manner [53.54568352375669]
We introduce **SWE-Flow**, a novel data synthesis framework grounded in Test-Driven Development (TDD)<n>Unlike existing software engineering data that rely on human-submitted issues, **SWE-Flow** automatically infers incremental development steps directly from unit tests.<n>We generated 16,061 training instances and 2,020 test instances from real-world GitHub projects, creating the **SWE-Flow-Eval** benchmark.
arXiv Detail & Related papers (2025-06-10T17:23:33Z) - KramaBench: A Benchmark for AI Systems on Data-to-Insight Pipelines over Data Lakes [17.76903247601012]
We introduce KRAMABENCH: a benchmark composed of 104 manually-curated real-world data science pipelines.<n>We show that these pipelines test the end-to-end capabilities of AI systems on data processing.<n>Our results show that, although the models are sufficiently capable of solving well-specified data science code generation tasks, existing out-of-box models fall short.
arXiv Detail & Related papers (2025-06-06T21:18:45Z) - ELT-Bench: An End-to-End Benchmark for Evaluating AI Agents on ELT Pipelines [4.556817293680431]
We introduce ELT-Bench, an end-to-end benchmark to assess the capabilities of AI agents to build Extract-Load-Transform pipelines.<n>ELT-Bench consists of 100 pipelines, including 835 source tables and 203 data models across various domains.<n>We evaluate two representative code agent frameworks, Spider-Agent and SWE-Agent, using six popular Large Language Models (LLMs) on ELT-Bench.
arXiv Detail & Related papers (2025-04-07T08:03:36Z) - WorkflowLLM: Enhancing Workflow Orchestration Capability of Large Language Models [105.46456444315693]
We presentLLM, a data-centric framework to enhance the capability of large language models in workflow orchestration.
It first constructs a large-scale fine-tuningBench with 106,763 samples, covering 1,503 APIs from 83 applications across 28 categories.
LlamaLlama demonstrates a strong capacity to orchestrate complex APIs, while also achieving notable generalization performance.
arXiv Detail & Related papers (2024-11-08T09:58:02Z) - Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLM) have been used for diverse tasks, but do not capture the correct correlation between the features and the target variable.
We propose a LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data.
Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
arXiv Detail & Related papers (2024-10-29T04:14:32Z) - MLLM-DataEngine: An Iterative Refinement Approach for MLLM [62.30753425449056]
We propose a novel closed-loop system that bridges data generation, model training, and evaluation.
Within each loop, the MLLM-DataEngine first analyze the weakness of the model based on the evaluation results.
For targeting, we propose an Adaptive Bad-case Sampling module, which adjusts the ratio of different types of data.
For quality, we resort to GPT-4 to generate high-quality data with each given data type.
arXiv Detail & Related papers (2023-08-25T01:41:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.