MontePrep: Monte-Carlo-Driven Automatic Data Preparation without Target Data Instances
- URL: http://arxiv.org/abs/2509.17553v1
- Date: Mon, 22 Sep 2025 09:17:41 GMT
- Title: MontePrep: Monte-Carlo-Driven Automatic Data Preparation without Target Data Instances
- Authors: Congcong Ge, Yachuan Liu, Yixuan Tang, Yifan Zhu, Yaofeng Tu, Yunjun Gao
- Abstract summary: In commercial systems, a pervasive requirement for automatic data preparation (ADP) is to transfer data from disparate sources to targets with standardized schema specifications. We propose an effective end-to-end ADP framework, MontePrep, which enables training-free pipeline synthesis with zero target-instance requirements. MontePrep is formulated as an open-source large language model (LLM) powered tree-structured search problem.
- Score: 25.78808887206003
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In commercial systems, a pervasive requirement for automatic data preparation (ADP) is to transfer relational data from disparate sources to targets with standardized schema specifications. Previous methods rely on labor-intensive supervision signals or on access permissions to target table data, limiting their use in real-world scenarios. To tackle these challenges, we propose an effective end-to-end ADP framework, MontePrep, which enables training-free pipeline synthesis with zero target-instance requirements. MontePrep is formulated as an open-source large language model (LLM) powered tree-structured search problem. It consists of three pivotal components: a data preparation action sandbox (DPAS), a fundamental pipeline generator (FPG), and an execution-aware pipeline optimizer (EPO). We first introduce DPAS, a lightweight action sandbox, to navigate search-based pipeline generation; its design circumvents the exploration of infeasible pipelines. We then present FPG, which incrementally builds executable DP pipelines by exploring the predefined action sandbox via LLM-powered Monte Carlo Tree Search. Furthermore, we propose EPO, which evaluates the reliability of the pipelines generated by FPG using the results of executing them from sources to targets. In this way, unreasonable pipelines are eliminated, improving both the efficiency and effectiveness of the search. Extensive experimental results demonstrate the superiority of MontePrep, with significant improvements over five state-of-the-art competitors.
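The search loop described above can be illustrated with a minimal, self-contained sketch. This is not MontePrep's actual implementation: the action names, the toy schemas, and the execution-aware reward below are hypothetical, and the LLM-guided expansion of FPG is replaced here with plain UCT selection over a fixed action sandbox.

```python
import math
import random

# Toy action sandbox (in the spirit of DPAS): each action maps a schema
# (column name -> type) to a new schema. All names here are illustrative.
ACTIONS = {
    "rename_cust_id": lambda s: {("customer_id" if k == "cust_id" else k): v
                                 for k, v in s.items()},
    "cast_age_to_int": lambda s: {k: ("int" if k == "age" else v)
                                  for k, v in s.items()},
    "drop_tmp": lambda s: {k: v for k, v in s.items() if k != "tmp"},
}

def reward(schema, target):
    """Execution-aware score: fraction of target columns matched with the right type."""
    hits = sum(1 for col, typ in target.items() if schema.get(col) == typ)
    return hits / len(target)

def mcts_pipeline(source, target, iters=200, max_depth=3, c=1.4, seed=0):
    """Search for an action sequence transforming `source` into `target` via UCT."""
    rng = random.Random(seed)
    stats = {}  # pipeline prefix (tuple of action names) -> [visits, total reward]
    best, best_r = (), reward(source, target)
    for _ in range(iters):
        pipeline, schema = (), dict(source)
        for _ in range(max_depth):
            n_parent = 1 + sum(stats.get(pipeline + (a,), [0, 0.0])[0]
                               for a in ACTIONS)

            def uct(a):
                n, w = stats.get(pipeline + (a,), [0, 0.0])
                if n == 0:
                    return float("inf")  # always expand unvisited children first
                return w / n + c * math.sqrt(math.log(n_parent) / n)

            action = max(ACTIONS, key=lambda a: (uct(a), rng.random()))
            pipeline += (action,)
            schema = ACTIONS[action](schema)
            r = reward(schema, target)
            n, w = stats.get(pipeline, [0, 0.0])
            stats[pipeline] = [n + 1, w + r]
            if r > best_r:
                best, best_r = pipeline, r
            if best_r == 1.0:
                return best, best_r  # a fully matching pipeline was found
    return best, best_r
```

For example, `mcts_pipeline({"cust_id": "str", "age": "str", "tmp": "str"}, {"customer_id": "str", "age": "int"})` returns a pipeline containing both the rename and the cast action with a perfect score, since low-reward branches (mirroring EPO's role) are deprioritized by the averaged reward in the UCT term.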
Related papers
- Autonomous Data Processing using Meta-Agents [2.3732259124656907]
We present Autonomous Data Processing using Meta-agents (ADP-MA), a framework that dynamically constructs, executes, and iteratively refines data processing pipelines. ADP-MA emphasizes context-aware optimization, adaptive workload partitioning, and progressive sampling for scalability. We demonstrate ADP-MA through an interactive demo that showcases pipeline construction, execution monitoring, and adaptive refinement across representative data processing tasks.
arXiv Detail & Related papers (2026-01-30T20:58:17Z) - AdaPtis: Reducing Pipeline Bubbles with Adaptive Pipeline Parallelism on Heterogeneous Models [59.7059443712562]
AdaPtis is a training system for large language models (LLMs) that supports adaptive pipeline parallelism. Extensive experiments show that AdaPtis achieves an average speedup of 1.42x (up to 2.14x) over Megatron-LM I-1F1B.
arXiv Detail & Related papers (2025-09-28T08:05:13Z) - Data-Centric Elastic Pipeline Parallelism for Efficient Long-Context LLM Training [40.67232484556671]
Elastic Pipeline Parallelism (EPP) orchestrates token-level PP and batch-level PP to adapt to resource and workload heterogeneity. InfiniPipe achieves a 1.69x speedup over state-of-the-art systems.
arXiv Detail & Related papers (2025-09-25T15:01:25Z) - Text-to-Pipeline: Bridging Natural Language and Data Preparation Pipelines [23.421567721746765]
We introduce Text-to-Pipeline, a task that translates data preparation instructions into DP pipelines. We also develop a benchmark named PARROT to support systematic evaluation. Despite this improvement, there remains substantial room for progress on Text-to-Pipeline.
arXiv Detail & Related papers (2025-05-21T15:40:53Z) - Learning to Reason and Navigate: Parameter Efficient Action Planning with Large Language Models [63.765846080050906]
This paper proposes a novel parameter-efficient action planner using large language models (PEAP-LLM) to generate a single-step instruction at each location. Experiments show the superiority of our proposed model on REVERIE compared to the previous state-of-the-art.
arXiv Detail & Related papers (2025-05-12T12:38:20Z) - PipeSpec: Breaking Stage Dependencies in Hierarchical LLM Decoding [4.734824660843965]
PipeSpec is a framework that generalizes speculative decoding to k models arranged in a hierarchical pipeline. We show that PipeSpec achieves up to a 2.54x speedup while outperforming state-of-the-art methods.
arXiv Detail & Related papers (2025-05-02T20:29:31Z) - PROMPTEVALS: A Dataset of Assertions and Guardrails for Custom Production Large Language Model Pipelines [0.8148009849453334]
Large language models (LLMs) are increasingly deployed in specialized production data processing pipelines across diverse domains. To improve reliability in these applications, creating assertions or guardrails for LLM outputs to run alongside the pipelines is essential. In this paper, we introduce PROMPTEVALS, a dataset of 2087 pipeline prompts with 12623 corresponding assertion criteria.
arXiv Detail & Related papers (2025-04-20T21:04:23Z) - Self-Augmented Preference Optimization: Off-Policy Paradigms for Language Model Alignment [104.18002641195442]
We introduce Self-Augmented Preference Optimization (SAPO), an effective and scalable training paradigm that does not require existing paired data.
Building on the self-play concept, which autonomously generates negative responses, we further incorporate an off-policy learning pipeline to enhance data exploration and exploitation.
arXiv Detail & Related papers (2024-05-31T14:21:04Z) - Deep Pipeline Embeddings for AutoML [11.168121941015015]
AutoML is a promising direction for democratizing AI by automatically deploying Machine Learning systems with minimal human expertise.
Existing Pipeline Optimization techniques fail to explore deep interactions between pipeline stages/components.
This paper proposes a novel neural architecture that captures the deep interaction between the components of a Machine Learning pipeline.
arXiv Detail & Related papers (2023-05-23T12:40:38Z) - SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
Existing approaches, however, do not supply the procedures and pipelines needed for the actual deployment of machine learning capabilities in real production-grade systems. In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using basic cross-platform tensor frameworks and script-language engines.
arXiv Detail & Related papers (2021-12-22T14:45:37Z) - Omnidata: A Scalable Pipeline for Making Multi-Task Mid-Level Vision Datasets from 3D Scans [103.92680099373567]
This paper introduces a pipeline to parametrically sample and render multi-task vision datasets from comprehensive 3D scans from the real world.
Changing the sampling parameters allows one to "steer" the generated datasets to emphasize specific information.
Common architectures trained on a generated starter dataset reached state-of-the-art performance on multiple common vision tasks and benchmarks.
arXiv Detail & Related papers (2021-10-11T04:21:46Z) - TODS: An Automated Time Series Outlier Detection System [70.88663649631857]
TODS is a highly modular system that supports easy pipeline construction. TODS supports 70 primitives, including data processing, time series processing, feature analysis, detection algorithms, and a reinforcement module.
arXiv Detail & Related papers (2020-09-18T15:36:43Z) - Inverting the Pose Forecasting Pipeline with SPF2: Sequential Pointcloud Forecasting for Sequential Pose Forecasting [106.3504366501894]
Self-driving vehicles and robotic manipulation systems often forecast future object poses by first detecting and tracking objects.
This detect-then-forecast pipeline is expensive to scale, as pose forecasting algorithms typically require labeled sequences of object poses.
We propose to first forecast 3D sensor data and then detect/track objects on the predicted point cloud sequences to obtain future poses.
This makes it less expensive to scale pose forecasting, as the sensor data forecasting task requires no labels.
arXiv Detail & Related papers (2020-03-18T17:54:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.