AutoPipeline: Synthesize Data Pipelines By-Target Using Reinforcement
Learning and Search
- URL: http://arxiv.org/abs/2106.13861v1
- Date: Fri, 25 Jun 2021 19:44:01 GMT
- Title: AutoPipeline: Synthesize Data Pipelines By-Target Using Reinforcement
Learning and Search
- Authors: Junwen Yang, Yeye He, Surajit Chaudhuri
- Abstract summary: We propose to automate complex data pipelines with both string transformations and table-manipulation operators.
We propose a novel "by-target" paradigm that allows users to easily specify the desired pipeline.
We develop an Auto-Pipeline system that learns to synthesize pipelines using reinforcement learning and search.
- Score: 19.53147565613595
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent work has made significant progress in helping users to automate single
data preparation steps, such as string-transformations and table-manipulation
operators (e.g., Join, GroupBy, Pivot, etc.). We in this work propose to
automate multiple such steps end-to-end, by synthesizing complex data pipelines
with both string transformations and table-manipulation operators. We propose a
novel "by-target" paradigm that allows users to easily specify the desired
pipeline, which is a significant departure from the traditional by-example
paradigm. Using by-target, users would provide input tables (e.g., csv or json
files), and point us to a "target table" (e.g., an existing database table or
BI dashboard) to demonstrate how the output from the desired pipeline would
schematically "look like". While the problem is seemingly underspecified, our
unique insight is that implicit table constraints such as FDs and keys can be
exploited to significantly constrain the space to make the problem tractable.
We develop an Auto-Pipeline system that learns to synthesize pipelines using
reinforcement learning and search. Experiments on large numbers of real
pipelines crawled from GitHub suggest that Auto-Pipeline can successfully
synthesize 60-70% of these complex pipelines (up to 10 steps) in 10-20 seconds
on average.
Related papers
- FlowETL: An Autonomous Example-Driven Pipeline for Data Engineering [1.3599496385950987]
FlowETL is an example-based autonomous pipeline architecture designed to automatically standardise and prepare input datasets.<n>A Planning Engine uses a paired input-output datasets sample to construct a transformation plan, which is then applied by an worker to the source.<n>The results show promising generalisation capabilities across 14 datasets of various domains, file structures, and file sizes.
arXiv Detail & Related papers (2025-07-30T21:46:22Z) - ToolACE: Winning the Points of LLM Function Calling [139.07157814653638]
ToolACE is an automatic agentic pipeline designed to generate accurate, complex, and diverse tool-learning data.
We demonstrate that models trained on our synthesized data, even with only 8B parameters, achieve state-of-the-art performance on the Berkeley Function-Calling Leaderboard.
arXiv Detail & Related papers (2024-09-02T03:19:56Z) - Instrumentation and Analysis of Native ML Pipelines via Logical Query Plans [3.2362171533623054]
We envision highly-automated software platforms to assist data scientists with developing, validating, monitoring, and analysing their Machine Learning pipelines.
We extract "logical query plans" from ML pipeline code relying on popular libraries.
Based on these plans, we automatically infer pipeline semantics and instrument and rewrite the ML pipelines to enable diverse use cases without requiring data scientists to manually annotate or rewrite their code.
arXiv Detail & Related papers (2024-07-10T11:35:02Z) - Introducing Routing Functions to Vision-Language Parameter-Efficient Fine-Tuning with Low-Rank Bottlenecks [54.31708859631821]
We propose a family of operations, called routing functions, to enhance vision-language (VL) alignment in low-rank bottlenecks.
In various VL PEFT settings, the routing functions significantly improve performance of the original PEFT methods.
arXiv Detail & Related papers (2024-03-14T13:27:42Z) - Deep Pipeline Embeddings for AutoML [11.168121941015015]
AutoML is a promising direction for democratizing AI by automatically deploying Machine Learning systems with minimal human expertise.
Existing Pipeline Optimization techniques fail to explore deep interactions between pipeline stages/components.
This paper proposes a novel neural architecture that captures the deep interaction between the components of a Machine Learning pipeline.
arXiv Detail & Related papers (2023-05-23T12:40:38Z) - Towards Personalized Preprocessing Pipeline Search [52.59156206880384]
ClusterP3S is a novel framework for Personalized Preprocessing Pipeline Search via Clustering.
We propose a hierarchical search strategy to jointly learn the clusters and search for the optimal pipelines.
Experiments on benchmark classification datasets demonstrate the effectiveness of enabling feature-wise preprocessing pipeline search.
arXiv Detail & Related papers (2023-02-28T05:45:05Z) - A sequence-to-sequence approach for document-level relation extraction [4.906513405712846]
Document-level relation extraction (DocRE) requires integrating information within and across sentences.
Seq2rel can learn the subtasks of DocRE end-to-end, replacing a pipeline of task-specific components.
arXiv Detail & Related papers (2022-04-03T16:03:19Z) - SapientML: Synthesizing Machine Learning Pipelines by Learning from
Human-Written Solutions [28.718446733713183]
We propose an AutoML SapientML that can learn from a corpus of existing datasets and their human-written pipelines.
We have created a training corpus of 1094 pipelines spanning 170 datasets, and evaluated SapientML on a set of 41 benchmark datasets.
Our evaluation shows that SapientML produces the best or comparable accuracy on 27 of the benchmarks while the second best tool fails to even produce a pipeline on 9 of the instances.
arXiv Detail & Related papers (2022-02-18T20:45:47Z) - Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning
Preprocessing Pipelines [77.45213180689952]
Preprocessing pipelines in deep learning aim to provide sufficient data throughput to keep the training processes busy.
We introduce a new perspective on efficiently preparing datasets for end-to-end deep learning pipelines.
We obtain an increased throughput of 3x to 13x compared to an untuned system.
arXiv Detail & Related papers (2022-02-17T14:31:58Z) - PipeTransformer: Automated Elastic Pipelining for Distributed Training
of Transformers [47.194426122333205]
PipeTransformer is a distributed training algorithm for Transformer models.
It automatically adjusts the pipelining and data parallelism by identifying and freezing some layers during the training.
We evaluate PipeTransformer using Vision Transformer (ViT) on ImageNet and BERT on GLUE and SQuAD datasets.
arXiv Detail & Related papers (2021-02-05T13:39:31Z) - MLCask: Efficient Management of Component Evolution in Collaborative
Data Analytics Pipelines [29.999324319722508]
We address two main challenges that arise during the deployment of machine learning pipelines, and address them with the design of versioning for an end-to-end analytics system MLCask.
We define and accelerate the metric-driven merge operation by pruning the pipeline search tree using reusable history records and pipeline compatibility information.
The effectiveness of MLCask is evaluated through an extensive study over several real-world deployment cases.
arXiv Detail & Related papers (2020-10-17T13:34:48Z) - Task-Oriented Dialogue as Dataflow Synthesis [158.77123205487334]
We describe an approach to task-oriented dialogue in which dialogue state is represented as a dataflow graph.
A dialogue agent maps each user utterance to a program that extends this graph.
We introduce a new dataset, SMCalFlow, featuring complex dialogues about events, weather, places, and people.
arXiv Detail & Related papers (2020-09-24T00:35:26Z) - TODS: An Automated Time Series Outlier Detection System [70.88663649631857]
TODS is a highly modular system that supports easy pipeline construction.<n>Tods supports 70 primitives, including data processing, time series processing, feature analysis, detection algorithms, and a reinforcement module.
arXiv Detail & Related papers (2020-09-18T15:36:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.