Large Language Models for Automated Data Science: Introducing CAAFE for
Context-Aware Automated Feature Engineering
- URL: http://arxiv.org/abs/2305.03403v5
- Date: Thu, 28 Sep 2023 21:13:21 GMT
- Authors: Noah Hollmann, Samuel Müller and Frank Hutter
- Abstract summary: We introduce Context-Aware Automated Feature Engineering (CAAFE) to generate semantically meaningful features for datasets.
Despite being methodologically simple, CAAFE improves performance on 11 out of 14 datasets.
We highlight the significance of context-aware solutions that can extend the scope of AutoML systems to semantic AutoML.
- Score: 52.09178018466104
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As the field of automated machine learning (AutoML) advances, it becomes
increasingly important to incorporate domain knowledge into these systems. We
present an approach for doing so by harnessing the power of large language
models (LLMs). Specifically, we introduce Context-Aware Automated Feature
Engineering (CAAFE), a feature engineering method for tabular datasets that
utilizes an LLM to iteratively generate additional semantically meaningful
features for tabular datasets based on the description of the dataset. The
method produces both Python code for creating new features and explanations for
the utility of the generated features. Despite being methodologically simple,
CAAFE improves performance on 11 out of 14 datasets -- boosting mean ROC AUC
performance from 0.798 to 0.822 across all datasets -- similar to the improvement
achieved by using a random forest instead of logistic regression on our
datasets. Furthermore, CAAFE is interpretable by providing a textual
explanation for each generated feature. CAAFE paves the way for more extensive
semi-automation in data science tasks and emphasizes the significance of
context-aware solutions that can extend the scope of AutoML systems to semantic
AutoML. We release our $\href{https://github.com/automl/CAAFE}{code}$, a simple
$\href{https://colab.research.google.com/drive/1mCA8xOAJZ4MaB_alZvyARTMjhl6RZf0a}{demo}$
and a $\href{https://pypi.org/project/caafe/}{python\ package}$.
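The iterative loop the abstract describes can be sketched as follows. This is an illustrative reconstruction, not the released caafe package API: `query_llm` is a hypothetical stand-in for the LLM call that returns pandas feature-engineering code plus a textual explanation, and the baseline classifier and hold-out split are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def evaluate(df: pd.DataFrame, target: str) -> float:
    """Hold-out ROC AUC of a simple baseline on the current feature set."""
    X = df.drop(columns=[target]).to_numpy(dtype=float)
    y = df[target].to_numpy()
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

def caafe_loop(df, target, description, query_llm, iterations=3):
    """Sketch of CAAFE's loop: ask an LLM for feature code based on the
    dataset description, run it, and keep the feature only if the
    downstream score improves."""
    best = evaluate(df, target)
    for _ in range(iterations):
        code, explanation = query_llm(description, list(df.columns))
        candidate = df.copy()
        # Execute the LLM-generated pandas code against a copy of the data.
        exec(code, {"pd": pd, "np": np, "df": candidate})
        score = evaluate(candidate, target)
        if score > best:  # keep the new feature only if it helps
            df, best = candidate, score
    return df, best
```

Because each accepted feature comes with its generated explanation, the loop stays auditable: a practitioner can inspect both the code and the rationale before trusting a feature.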
Related papers
- ToolACE: Winning the Points of LLM Function Calling [139.07157814653638]
ToolACE is an automatic agentic pipeline designed to generate accurate, complex, and diverse tool-learning data.
We demonstrate that models trained on our synthesized data, even with only 8B parameters, achieve state-of-the-art performance on the Berkeley Function-Calling Leaderboard.
arXiv Detail & Related papers (2024-09-02T03:19:56Z)
- Long-Span Question-Answering: Automatic Question Generation and QA-System Ranking via Side-by-Side Evaluation [65.16137964758612]
We explore the use of long-context capabilities in large language models to create synthetic reading comprehension data from entire books.
Our objective is to test the capabilities of LLMs to analyze, understand, and reason over problems that require a detailed comprehension of long spans of text.
arXiv Detail & Related papers (2024-05-31T20:15:10Z)
- FIND: A Function Description Benchmark for Evaluating Interpretability Methods [86.80718559904854]
This paper introduces FIND (Function INterpretation and Description), a benchmark suite for evaluating automated interpretability methods.
FIND contains functions that resemble components of trained neural networks, and accompanying descriptions of the kind we seek to generate.
We evaluate methods that use pretrained language models to produce descriptions of function behavior in natural language and code.
arXiv Detail & Related papers (2023-09-07T17:47:26Z)
- AutoML-GPT: Automatic Machine Learning with GPT [74.30699827690596]
We propose developing task-oriented prompts and automatically utilizing large language models (LLMs) to automate the training pipeline.
We present AutoML-GPT, which employs GPT as the bridge to diverse AI models and dynamically trains models with optimized hyperparameters.
This approach achieves remarkable results in computer vision, natural language processing, and other challenging areas.
arXiv Detail & Related papers (2023-05-04T02:09:43Z)
- STREAMLINE: A Simple, Transparent, End-To-End Automated Machine Learning Pipeline Facilitating Data Analysis and Algorithm Comparison [0.49034553215430216]
STREAMLINE is a simple, transparent, end-to-end AutoML pipeline.
It is specifically designed to compare performance between datasets, ML algorithms, and other AutoML tools.
arXiv Detail & Related papers (2022-06-23T22:40:58Z)
- Automatic Componentwise Boosting: An Interpretable AutoML System [1.1709030738577393]
We propose an AutoML system that constructs an interpretable additive model that can be fitted using a highly scalable componentwise boosting algorithm.
Our system provides tools for easy model interpretation such as visualizing partial effects and pairwise interactions.
Despite its restriction to an interpretable model space, our system is competitive in terms of predictive performance on most data sets.
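A minimal sketch of componentwise L2 boosting, the family of algorithm that summary refers to. This is a generic textbook version with one univariate least-squares base learner per feature, not the authors' system:

```python
import numpy as np

def componentwise_l2_boost(X, y, n_iter=100, lr=0.1):
    """Componentwise L2 boosting: each round fits a univariate
    least-squares learner per feature on the current residuals and
    updates only the single best feature, yielding a sparse,
    interpretable additive model."""
    n, p = X.shape
    coef = np.zeros(p)
    intercept = y.mean()
    resid = y - intercept
    for _ in range(n_iter):
        best_j, best_sse, best_b = 0, np.inf, 0.0
        for j in range(p):
            xj = X[:, j]
            b = xj @ resid / (xj @ xj)  # univariate least squares
            sse = np.sum((resid - b * xj) ** 2)
            if sse < best_sse:
                best_j, best_sse, best_b = j, sse, b
        coef[best_j] += lr * best_b  # small, additive update
        resid -= lr * best_b * X[:, best_j]
    return intercept, coef
```

Because every update touches a single coefficient, the fitted model decomposes into per-feature partial effects, which is what makes visualizing them (as the summary mentions) straightforward.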
arXiv Detail & Related papers (2021-09-12T18:34:33Z)
- Privileged Zero-Shot AutoML [16.386335031156]
This work improves the quality of automated machine learning (AutoML) systems by using dataset and function descriptions.
We show that zero-shot AutoML reduces running and prediction times from minutes to milliseconds, consistently across datasets.
arXiv Detail & Related papers (2021-06-25T16:31:05Z)
- DriveML: An R Package for Driverless Machine Learning [7.004573941239386]
DriveML helps in implementing some of the pillars of an automated machine learning pipeline.
The main benefits of DriveML are savings in development time, fewer developer errors, and optimal tuning of machine learning models.
arXiv Detail & Related papers (2020-05-01T16:40:25Z)
- Evolution of Scikit-Learn Pipelines with Dynamic Structured Grammatical Evolution [1.5224436211478214]
This paper describes a novel grammar-based framework that adapts Dynamic Structured Grammatical Evolution (DSGE) to the evolution of Scikit-Learn classification pipelines.
The experimental results include comparing AutoML-DSGE to another grammar-based AutoML framework, Resilient Classification Pipeline Evolution (RECIPE).
arXiv Detail & Related papers (2020-04-01T09:31:34Z)
- AutoFIS: Automatic Feature Interaction Selection in Factorization Models for Click-Through Rate Prediction [75.16836697734995]
We propose a two-stage algorithm called Automatic Feature Interaction Selection (AutoFIS).
AutoFIS can automatically identify important feature interactions for factorization models with computational cost just equivalent to training the target model to convergence.
AutoFIS has been deployed onto the training platform of Huawei App Store recommendation service.
arXiv Detail & Related papers (2020-03-25T06:53:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.