Related papers: SapientML: Synthesizing Machine Learning Pipelines by Learning from Human-Written Solutions

SapientML: Synthesizing Machine Learning Pipelines by Learning from Human-Written Solutions

URL: http://arxiv.org/abs/2202.10451v1
Date: Fri, 18 Feb 2022 20:45:47 GMT
Title: SapientML: Synthesizing Machine Learning Pipelines by Learning from Human-Written Solutions
Authors: Ripon K. Saha, Akira Ura, Sonal Mahajan, Chenguang Zhu, Linyi Li, Yang Hu, Hiroaki Yoshida, Sarfraz Khurshid, Mukul R. Prasad
Abstract summary: We propose an AutoML SapientML that can learn from a corpus of existing datasets and their human-written pipelines. We have created a training corpus of 1094 pipelines spanning 170 datasets, and evaluated SapientML on a set of 41 benchmark datasets. Our evaluation shows that SapientML produces the best or comparable accuracy on 27 of the benchmarks while the second best tool fails to even produce a pipeline on 9 of the instances.
Score: 28.718446733713183
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Automatic machine learning, or AutoML, holds the promise of truly democratizing the use of machine learning (ML), by substantially automating the work of data scientists. However, the huge combinatorial search space of candidate pipelines means that current AutoML techniques, generate sub-optimal pipelines, or none at all, especially on large, complex datasets. In this work we propose an AutoML technique SapientML, that can learn from a corpus of existing datasets and their human-written pipelines, and efficiently generate a high-quality pipeline for a predictive task on a new dataset. To combat the search space explosion of AutoML, SapientML employs a novel divide-and-conquer strategy realized as a three-stage program synthesis approach, that reasons on successively smaller search spaces. The first stage uses a machine-learned model to predict a set of plausible ML components to constitute a pipeline. In the second stage, this is then refined into a small pool of viable concrete pipelines using syntactic constraints derived from the corpus and the machine-learned model. Dynamically evaluating these few pipelines, in the third stage, provides the best solution. We instantiate SapientML as part of a fully automated tool-chain that creates a cleaned, labeled learning corpus by mining Kaggle, learns from it, and uses the learned models to then synthesize pipelines for new predictive tasks. We have created a training corpus of 1094 pipelines spanning 170 datasets, and evaluated SapientML on a set of 41 benchmark datasets, including 10 new, large, real-world datasets from Kaggle, and against 3 state-of-the-art AutoML tools and 2 baselines. Our evaluation shows that SapientML produces the best or comparable accuracy on 27 of the benchmarks while the second best tool fails to even produce a pipeline on 9 of the instances.

Related papers

AutoML-Agent: A Multi-Agent LLM Framework for Full-Pipeline AutoML [56.565200973244146]
Automated machine learning (AutoML) accelerates AI development by automating tasks in the development pipeline. Recent works have started exploiting large language models (LLM) to lessen such burden. This paper proposes AutoML-Agent, a novel multi-agent framework tailored for full-pipeline AutoML.
arXiv Detail & Related papers (2024-10-03T20:01:09Z)
ToolACE: Winning the Points of LLM Function Calling [139.07157814653638]
ToolACE is an automatic agentic pipeline designed to generate accurate, complex, and diverse tool-learning data. We demonstrate that models trained on our synthesized data, even with only 8B parameters, achieve state-of-the-art performance on the Berkeley Function-Calling Leaderboard.
arXiv Detail & Related papers (2024-09-02T03:19:56Z)
Instrumentation and Analysis of Native ML Pipelines via Logical Query Plans [3.2362171533623054]
We envision highly-automated software platforms to assist data scientists with developing, validating, monitoring, and analysing their Machine Learning pipelines. We extract "logical query plans" from ML pipeline code relying on popular libraries. Based on these plans, we automatically infer pipeline semantics and instrument and rewrite the ML pipelines to enable diverse use cases without requiring data scientists to manually annotate or rewrite their code.
arXiv Detail & Related papers (2024-07-10T11:35:02Z)
From Summary to Action: Enhancing Large Language Models for Complex Tasks with Open World APIs [62.496139001509114]
We introduce a novel tool invocation pipeline designed to control massive real-world APIs. This pipeline mirrors the human task-solving process, addressing complicated real-life user queries. Empirical evaluations of our Sum2Act pipeline on the ToolBench benchmark show significant performance improvements.
arXiv Detail & Related papers (2024-02-28T08:42:23Z)
Deep Pipeline Embeddings for AutoML [11.168121941015015]
AutoML is a promising direction for democratizing AI by automatically deploying Machine Learning systems with minimal human expertise. Existing Pipeline Optimization techniques fail to explore deep interactions between pipeline stages/components. This paper proposes a novel neural architecture that captures the deep interaction between the components of a Machine Learning pipeline.
arXiv Detail & Related papers (2023-05-23T12:40:38Z)
SubStrat: A Subset-Based Strategy for Faster AutoML [5.833272638548153]
SubStrat is an AutoML optimization strategy that tackles the data size, rather than configuration space. It wraps existing AutoML tools, and instead of executing them directly on the entire dataset, SubStrat uses a genetic-based algorithm to find a small subset. It then employs the AutoML tool on the small subset, and finally, it refines the resulted pipeline by executing a restricted, much shorter, AutoML process on the large dataset.
arXiv Detail & Related papers (2022-06-07T07:44:06Z)
Benchmarking Multimodal AutoML for Tabular Data with Text Fields [83.43249184357053]
We assemble 18 multimodal data tables that each contain some text fields. Our benchmark enables researchers to evaluate their own methods for supervised learning with numeric, categorical, and text features.
arXiv Detail & Related papers (2021-11-04T09:29:16Z)
VolcanoML: Speeding up End-to-End AutoML via Scalable Search Space Decomposition [57.06900573003609]
VolcanoML is a framework that decomposes a large AutoML search space into smaller ones. It supports a Volcano-style execution model, akin to the one supported by modern database systems. Our evaluation demonstrates that, not only does VolcanoML raise the level of expressiveness for search space decomposition in AutoML, it also leads to actual findings of decomposition strategies.
arXiv Detail & Related papers (2021-07-19T13:23:57Z)
Incremental Search Space Construction for Machine Learning Pipeline Synthesis [4.060731229044571]
Automated machine learning (AutoML) aims for constructing machine learning (ML) pipelines automatically. We propose a data-centric approach based on meta-features for pipeline construction. We prove the effectiveness and competitiveness of our approach on 28 data sets used in well-established AutoML benchmarks.
arXiv Detail & Related papers (2021-01-26T17:17:49Z)
Evolution of Scikit-Learn Pipelines with Dynamic Structured Grammatical Evolution [1.5224436211478214]
This paper describes a novel grammar-based framework that adapts Dynamic Structured Grammatical Evolution (DSGE) to the evolution of Scikit-Learn classification pipelines. The experimental results include comparing AutoML-DSGE to another grammar-based AutoML framework, Resilient ClassificationPipeline Evolution (RECIPE)
arXiv Detail & Related papers (2020-04-01T09:31:34Z)
Multi-layer Optimizations for End-to-End Data Analytics [71.05611866288196]
We introduce Iterative Functional Aggregate Queries (IFAQ), a framework that realizes an alternative approach. IFAQ treats the feature extraction query and the learning task as one program given in the IFAQ's domain-specific language. We show that a Scala implementation of IFAQ can outperform mlpack, Scikit, and specialization by several orders of magnitude for linear regression and regression tree models over several relational datasets.
arXiv Detail & Related papers (2020-01-10T16:14:44Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.