Text-to-Pipeline: Bridging Natural Language and Data Preparation Pipelines
- URL: http://arxiv.org/abs/2505.15874v2
- Date: Mon, 10 Nov 2025 14:42:35 GMT
- Title: Text-to-Pipeline: Bridging Natural Language and Data Preparation Pipelines
- Authors: Yuhang Ge, Yachuan Liu, Zhangyan Ye, Yuren Mao, Yunjun Gao,
- Abstract summary: We introduce Text-to-Pipeline, a new task that translates NL data preparation instructions into DP pipelines. PARROT is a large-scale benchmark to support systematic evaluation. PARROT is built by mining transformation patterns from production pipelines and instantiating them on 23,009 real-world tables.
- Score: 18.75611679837171
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data preparation (DP) transforms raw data into a form suitable for downstream applications, typically by composing operations into executable pipelines. Building such pipelines is time-consuming and requires sophisticated programming skills, posing a significant barrier for non-experts. To lower this barrier, we introduce Text-to-Pipeline, a new task that translates NL data preparation instructions into DP pipelines, and PARROT, a large-scale benchmark to support systematic evaluation. To ensure realistic DP scenarios, PARROT is built by mining transformation patterns from production pipelines and instantiating them on 23,009 real-world tables, resulting in ~18,000 tasks spanning 16 core operators. Our empirical evaluation on PARROT reveals a critical failure mode in cutting-edge LLMs: they struggle not only with multi-step compositional logic but also with semantic parameter grounding. We thus establish a strong baseline with Pipeline-Agent, an execution-aware agent that iteratively reflects on intermediate states. While it achieves state-of-the-art performance, a significant gap remains, underscoring the deep, unsolved challenges PARROT poses. PARROT provides the essential, large-scale testbed for developing and evaluating the next generation of autonomous agentic systems for data preparation.
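To make the task concrete, here is a toy sketch of the kind of NL-to-pipeline mapping the benchmark evaluates. The instruction, table, and operator chain below are invented for illustration and are not drawn from PARROT; pandas operators stand in for the benchmark's 16 core operators.

```python
import pandas as pd

# Hypothetical Text-to-Pipeline instance: a natural language instruction
# paired with the data preparation pipeline it should translate into.
instruction = ("Drop rows with a missing price, add a total column "
               "as price times quantity, and sort by total descending.")

raw = pd.DataFrame({
    "item": ["pen", "book", "desk", "lamp"],
    "price": [1.5, None, 120.0, 35.0],
    "quantity": [10, 3, 1, 2],
})

# The target pipeline is a composition of core operators
# (dropna, assign, sort_values stand in for benchmark operators).
pipeline = (
    raw.dropna(subset=["price"])
       .assign(total=lambda df: df["price"] * df["quantity"])
       .sort_values("total", ascending=False)
)

print(pipeline["item"].tolist())  # ['desk', 'lamp', 'pen']
```

The hard part the abstract highlights is exactly this mapping: choosing the right operators in the right order (compositional logic) and filling in column names and values correctly (semantic parameter grounding).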
Related papers
- NGDB-Zoo: Towards Efficient and Scalable Neural Graph Databases Training [55.35217340229661]
We present NGDB-Zoo, a unified framework that resolves bottlenecks by synergizing operator-level training with semantic augmentation. We demonstrate that NGDB-Zoo maintains high GPU utilization across diverse logical patterns and significantly mitigates friction in hybrid neuro-symbolic reasoning.
arXiv Detail & Related papers (2026-02-25T05:46:42Z) - SemPipes -- Optimizable Semantic Data Operators for Tabular Machine Learning Pipelines [12.816711873869984]
We introduce SemPipes, a novel declarative programming model that integrates semantic data operators into ML pipelines. SemPipes synthesizes custom operator implementations based on data characteristics, operator instructions, and pipeline context. We show that semantic operators substantially improve end-to-end predictive performance for both expert-designed and agent-generated pipelines.
arXiv Detail & Related papers (2026-02-04T23:36:29Z) - Autonomous Data Processing using Meta-Agents [2.3732259124656907]
We present Autonomous Data Processing using Meta-agents (ADP-MA), a framework that dynamically constructs, executes, and iteratively refines data processing pipelines. ADP-MA emphasizes context-aware optimization, adaptive workload partitioning, and progressive sampling for scalability. We demonstrate ADP-MA through an interactive demo that showcases pipeline construction, execution monitoring, and adaptive refinement across representative data processing tasks.
arXiv Detail & Related papers (2026-01-30T20:58:17Z) - MontePrep: Monte-Carlo-Driven Automatic Data Preparation without Target Data Instances [25.78808887206003]
In commercial systems, a pervasive task for automatic data preparation (ADP) is to transfer data from disparate sources to targets with standardized schema specifications. We propose an effective end-to-end ADP framework, MontePrep, which enables training-free pipeline synthesis with zero target-instance requirements. MontePrep is formulated as a tree-structured search problem powered by an open-source large language model (LLM).
arXiv Detail & Related papers (2025-09-22T09:17:41Z) - Leveraging Machine Learning and Enhanced Parallelism Detection for BPMN Model Generation from Text [75.77648333476776]
This paper introduces an automated pipeline for extracting BPMN models from text. A key contribution of this work is the introduction of a newly annotated dataset. We augment the dataset with 15 newly annotated documents containing 32 parallel gateways for model training.
arXiv Detail & Related papers (2025-07-11T07:25:55Z) - Benchmarking Deep Search over Heterogeneous Enterprise Data [73.55304268238474]
We present a new benchmark for evaluating a form of retrieval-augmented generation (RAG) that requires source-aware, multi-hop reasoning over diverse, sparse, but related sources. We build it using a synthetic data pipeline that simulates business activities across product planning, development, and support stages.
arXiv Detail & Related papers (2025-06-29T08:34:59Z) - FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language [48.79534869177174]
We introduce a new pre-training dataset curation pipeline based on FineWeb. We show that our pipeline can be used to create non-English corpora that produce more performant models than prior datasets. We scale our pipeline to over 1000 languages using almost 100 Common Crawl snapshots to produce FineWeb2, a new 20 terabyte (5 billion document) multilingual dataset.
arXiv Detail & Related papers (2025-06-26T01:01:47Z) - Pipeline and Dataset Generation for Automated Fact-checking in Almost Any Language [0.0]
This article presents a pipeline for automated fact-checking leveraging publicly available Language Models and data.
The pipeline consists of two main modules -- the evidence retrieval and the claim veracity evaluation.
We provide open access to all data and fine-tuned models for Czech, English, Polish, and Slovak pipelines.
arXiv Detail & Related papers (2023-12-15T19:43:41Z) - DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines [44.772892598128784]
We introduce DSPy, a programming model that abstracts LM pipelines as text transformation graphs.
Within minutes of compiling, a few lines of DSPy allow GPT-3.5 and llama2-13b-chat to self-bootstrap pipelines.
arXiv Detail & Related papers (2023-10-05T17:37:25Z) - Deep Pipeline Embeddings for AutoML [11.168121941015015]
AutoML is a promising direction for democratizing AI by automatically deploying Machine Learning systems with minimal human expertise.
Existing Pipeline Optimization techniques fail to explore deep interactions between pipeline stages/components.
This paper proposes a novel neural architecture that captures the deep interaction between the components of a Machine Learning pipeline.
arXiv Detail & Related papers (2023-05-23T12:40:38Z) - Demonstrate-Search-Predict: Composing retrieval and language models for knowledge-intensive NLP [77.817293104436]
We propose a framework that relies on passing natural language texts in sophisticated pipelines between an LM and an RM.
We have written novel DSP programs for answering questions in open-domain, multi-hop, and conversational settings.
arXiv Detail & Related papers (2022-12-28T18:52:44Z) - Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines [77.45213180689952]
Preprocessing pipelines in deep learning aim to provide sufficient data throughput to keep the training processes busy.
We introduce a new perspective on efficiently preparing datasets for end-to-end deep learning pipelines.
We obtain an increased throughput of 3x to 13x compared to an untuned system.
arXiv Detail & Related papers (2022-02-17T14:31:58Z) - SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
Existing approaches do not supply the needed procedures and pipelines for the actual deployment of machine learning capabilities in real production grade systems. In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using basic cross-platform tensor framework and script language engines.
arXiv Detail & Related papers (2021-12-22T14:45:37Z) - PipeTransformer: Automated Elastic Pipelining for Distributed Training of Transformers [47.194426122333205]
PipeTransformer is a distributed training algorithm for Transformer models.
It automatically adjusts the pipelining and data parallelism by identifying and freezing some layers during the training.
We evaluate PipeTransformer using Vision Transformer (ViT) on ImageNet and BERT on GLUE and SQuAD datasets.
arXiv Detail & Related papers (2021-02-05T13:39:31Z) - AutoWeka4MCPS-AVATAR: Accelerating Automated Machine Learning Pipeline Composition and Optimisation [13.116806430326513]
We propose a novel method to evaluate the validity of ML pipelines, without their execution, using a surrogate model (AVATAR).
The AVATAR generates a knowledge base by automatically learning the capabilities and effects of ML algorithms on datasets' characteristics.
Instead of executing the original ML pipeline to evaluate its validity, the AVATAR evaluates its surrogate model constructed by capabilities and effects of the ML pipeline components.
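The surrogate idea can be sketched in a few lines. The capability and effect tables below are invented for illustration and are not AVATAR's actual knowledge base; they only show how a pipeline can be validated by symbolic state propagation instead of execution.

```python
# Each component declares which data characteristics it requires and how it
# changes them. A pipeline is valid if every step's requirements are met by
# the state produced by the preceding steps, so no step runs on real data.
# (Hypothetical components and traits, for illustration only.)

REQUIRES = {
    "imputer":    {"numeric"},
    "scaler":     {"numeric", "no_missing"},
    "classifier": {"numeric", "no_missing"},
}
EFFECTS = {
    "imputer":    {"add": {"no_missing"}, "remove": set()},
    "scaler":     {"add": set(), "remove": set()},
    "classifier": {"add": set(), "remove": set()},
}

def is_valid(pipeline, dataset_traits):
    """Check pipeline validity by propagating dataset traits symbolically."""
    state = set(dataset_traits)
    for component in pipeline:
        if not REQUIRES[component] <= state:
            return False  # a requirement is unmet at this step
        state |= EFFECTS[component]["add"]
        state -= EFFECTS[component]["remove"]
    return True

traits = {"numeric"}  # numeric data that still contains missing values
print(is_valid(["imputer", "scaler", "classifier"], traits))  # True
print(is_valid(["scaler", "classifier"], traits))             # False
```

The second pipeline fails because the scaler requires missing values to be gone, and no imputer has run first; the check costs set operations rather than a training run.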
arXiv Detail & Related papers (2020-11-21T14:05:49Z) - Unsupervised Parallel Corpus Mining on Web Data [53.74427402568838]
We present a pipeline to mine the parallel corpus from the Internet in an unsupervised manner.
Our system produces new state-of-the-art results, 39.81 and 38.95 BLEU scores, even compared with supervised approaches.
arXiv Detail & Related papers (2020-09-18T02:38:01Z) - POINTER: Constrained Progressive Text Generation via Insertion-based Generative Pre-training [93.79766670391618]
We present POINTER, a novel insertion-based approach for hard-constrained text generation.
The proposed method operates by progressively inserting new tokens between existing tokens in a parallel manner.
The resulting coarse-to-fine hierarchy makes the generation process intuitive and interpretable.
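A minimal sketch of the insertion-based idea, with hand-written insertion rules standing in for the trained model: start from a few keyword tokens and insert new tokens between adjacent pairs in parallel, coarse to fine.

```python
# Toy illustration of insertion-based progressive generation: one pass
# inserts a token into every gap for which a rule exists. The rules here
# are hypothetical; POINTER learns the insertions via pre-training.

def refine(tokens, rules):
    """One coarse-to-fine pass: insert between each adjacent pair."""
    out = [tokens[0]]
    for nxt in tokens[1:]:
        filler = rules.get((out[-1], nxt))
        if filler:
            out.append(filler)
        out.append(nxt)
    return out

rules = {("the", "sat"): "cat", ("sat", "mat"): "on the"}
seq = refine(["the", "sat", "mat"], rules)
print(" ".join(seq))  # the cat sat on the mat
```

Repeating `refine` with progressively finer rule sets yields the coarse-to-fine hierarchy the abstract describes, where early passes place content words and later passes fill in function words.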
arXiv Detail & Related papers (2020-05-01T18:11:54Z) - AVATAR -- Machine Learning Pipeline Evaluation Using Surrogate Model [10.83607599315401]
We propose a novel method to evaluate the validity of ML pipelines using a surrogate model (AVATAR).
Our experiments show that the AVATAR is more efficient in evaluating complex pipelines in comparison with the traditional evaluation approaches requiring their execution.
arXiv Detail & Related papers (2020-01-30T02:53:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.