Large Language Models for Automated Data Science: Introducing CAAFE for
Context-Aware Automated Feature Engineering
- URL: http://arxiv.org/abs/2305.03403v5
- Date: Thu, 28 Sep 2023 21:13:21 GMT
- Title: Large Language Models for Automated Data Science: Introducing CAAFE for
Context-Aware Automated Feature Engineering
- Authors: Noah Hollmann, Samuel Müller and Frank Hutter
- Abstract summary: We introduce Context-Aware Automated Feature Engineering (CAAFE) to generate semantically meaningful features for datasets.
Despite being methodologically simple, CAAFE improves performance on 11 out of 14 datasets.
We highlight the significance of context-aware solutions that can extend the scope of AutoML systems to semantic AutoML.
- Score: 52.09178018466104
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As the field of automated machine learning (AutoML) advances, it becomes
increasingly important to incorporate domain knowledge into these systems. We
present an approach for doing so by harnessing the power of large language
models (LLMs). Specifically, we introduce Context-Aware Automated Feature
Engineering (CAAFE), a feature engineering method for tabular datasets that
utilizes an LLM to iteratively generate additional semantically meaningful
features for tabular datasets based on the description of the dataset. The
method produces both Python code for creating new features and explanations for
the utility of the generated features. Despite being methodologically simple,
CAAFE improves performance on 11 out of 14 datasets -- boosting mean ROC AUC
performance from 0.798 to 0.822 across all datasets -- similar to the improvement
achieved by using a random forest instead of logistic regression on our
datasets. Furthermore, CAAFE is interpretable by providing a textual
explanation for each generated feature. CAAFE paves the way for more extensive
semi-automation in data science tasks and emphasizes the significance of
context-aware solutions that can extend the scope of AutoML systems to semantic
AutoML. We release our code (https://github.com/automl/CAAFE), a simple
demo (https://colab.research.google.com/drive/1mCA8xOAJZ4MaB_alZvyARTMjhl6RZf0a),
and a Python package (https://pypi.org/project/caafe/).
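The iterative loop described in the abstract can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the authors' implementation and not the API of the caafe package. The names `roc_auc`, `evaluate` and `feature_loop` are hypothetical, the hand-written candidate list stands in for features an LLM would propose from the dataset description, and the "keep a feature only if the validation score improves" rule and the single-column evaluator are assumptions made for this example.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, float]
# (feature name, textual explanation, code that computes the feature)
Candidate = Tuple[str, str, Callable[[Row], float]]


def roc_auc(labels: List[int], scores: List[float]) -> float:
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def evaluate(rows: List[Row], labels: List[int]) -> float:
    """Stand-in for training a model: best single-column ROC AUC, either direction."""
    best = 0.5
    for col in rows[0]:
        auc = roc_auc(labels, [r[col] for r in rows])
        best = max(best, auc, 1.0 - auc)
    return best


def feature_loop(val_rows: List[Row], labels: List[int],
                 candidates: List[Candidate]):
    """Try candidates one at a time; keep each only if the score improves."""
    rows = [dict(r) for r in val_rows]
    best = evaluate(rows, labels)
    kept: List[str] = []
    for name, _explanation, fn in candidates:
        trial = [{**r, name: fn(r)} for r in rows]
        score = evaluate(trial, labels)
        if score > best:  # assumed acceptance rule
            rows, best = trial, score
            kept.append(name)
    return rows, kept, best


# Toy data: the label depends on weight relative to height, not on either alone.
data = [
    {"height_m": 1.60, "weight_kg": 70.0},
    {"height_m": 1.90, "weight_kg": 80.0},
    {"height_m": 1.70, "weight_kg": 60.0},
    {"height_m": 1.80, "weight_kg": 90.0},
    {"height_m": 1.65, "weight_kg": 72.0},
    {"height_m": 1.85, "weight_kg": 75.0},
]
labels = [1, 0, 0, 1, 1, 0]

# Hand-written stand-ins for LLM-proposed features, each with an explanation.
candidates: List[Candidate] = [
    ("height_plus_weight", "Sum of the raw columns (not meaningful).",
     lambda r: r["height_m"] + r["weight_kg"]),
    ("bmi", "Body mass index relates weight to height.",
     lambda r: r["weight_kg"] / r["height_m"] ** 2),
]

new_rows, kept, best = feature_loop(data, labels, candidates)
print(kept, best)
```

In this toy run the semantically meaningful ratio is retained and the arbitrary sum is discarded; a real setup would score candidates with a trained classifier on held-out data rather than the single-column proxy used here.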
Related papers
- ToolACE: Winning the Points of LLM Function Calling [139.07157814653638]
ToolACE is an automatic agentic pipeline designed to generate accurate, complex, and diverse tool-learning data.
We demonstrate that models trained on our synthesized data, even with only 8B parameters, achieve state-of-the-art performance on the Berkeley Function-Calling Leaderboard.
arXiv Detail & Related papers (2024-09-02T03:19:56Z)
- Enhancing Knowledge Retrieval with In-Context Learning and Semantic Search through Generative AI [3.9773527114058855]
We propose a novel methodology that combines the generative capabilities of Large Language Models with the fast and accurate retrieval capabilities of vector databases.
The developed model, Generative Text Retrieval (GTR), is adaptable to both unstructured and structured data with minor refinement.
The refined model, Generative Tabular Text Retrieval (GTR-T), demonstrated its efficiency in large database querying.
arXiv Detail & Related papers (2024-06-13T23:08:06Z)
- Long-Span Question-Answering: Automatic Question Generation and QA-System Ranking via Side-by-Side Evaluation [65.16137964758612]
We explore the use of long-context capabilities in large language models to create synthetic reading comprehension data from entire books.
Our objective is to test the capabilities of LLMs to analyze, understand, and reason over problems that require a detailed comprehension of long spans of text.
arXiv Detail & Related papers (2024-05-31T20:15:10Z)
- FIND: A Function Description Benchmark for Evaluating Interpretability Methods [86.80718559904854]
This paper introduces FIND (Function INterpretation and Description), a benchmark suite for evaluating automated interpretability methods.
FIND contains functions that resemble components of trained neural networks, and accompanying descriptions of the kind we seek to generate.
We evaluate methods that use pretrained language models to produce descriptions of function behavior in natural language and code.
arXiv Detail & Related papers (2023-09-07T17:47:26Z)
- AutoML-GPT: Automatic Machine Learning with GPT [74.30699827690596]
We propose developing task-oriented prompts and automatically utilizing large language models (LLMs) to automate the training pipeline.
We present AutoML-GPT, which employs GPT as the bridge to diverse AI models and dynamically trains models with optimized hyperparameters.
This approach achieves remarkable results in computer vision, natural language processing, and other challenging areas.
arXiv Detail & Related papers (2023-05-04T02:09:43Z)
- STREAMLINE: A Simple, Transparent, End-To-End Automated Machine Learning Pipeline Facilitating Data Analysis and Algorithm Comparison [0.49034553215430216]
STREAMLINE is a simple, transparent, end-to-end AutoML pipeline.
It is specifically designed to compare performance between datasets, ML algorithms, and other AutoML tools.
arXiv Detail & Related papers (2022-06-23T22:40:58Z)
- Automatic Componentwise Boosting: An Interpretable AutoML System [1.1709030738577393]
We propose an AutoML system that constructs an interpretable additive model that can be fitted using a highly scalable componentwise boosting algorithm.
Our system provides tools for easy model interpretation such as visualizing partial effects and pairwise interactions.
Despite its restriction to an interpretable model space, our system is competitive in terms of predictive performance on most data sets.
arXiv Detail & Related papers (2021-09-12T18:34:33Z)
- Privileged Zero-Shot AutoML [16.386335031156]
This work improves the quality of automated machine learning (AutoML) systems by using dataset and function descriptions.
We show that zero-shot AutoML reduces running and prediction times from minutes to milliseconds, consistently across datasets.
arXiv Detail & Related papers (2021-06-25T16:31:05Z)
- DriveML: An R Package for Driverless Machine Learning [7.004573941239386]
DriveML helps in implementing some of the pillars of an automated machine learning pipeline.
The main benefits of DriveML are development time savings, fewer developer errors, and optimal tuning of machine learning models.
arXiv Detail & Related papers (2020-05-01T16:40:25Z)
- AutoFIS: Automatic Feature Interaction Selection in Factorization Models for Click-Through Rate Prediction [75.16836697734995]
We propose a two-stage algorithm called Automatic Feature Interaction Selection (AutoFIS).
AutoFIS can automatically identify important feature interactions for factorization models with computational cost just equivalent to training the target model to convergence.
AutoFIS has been deployed onto the training platform of Huawei App Store recommendation service.
arXiv Detail & Related papers (2020-03-25T06:53:54Z)