FeatNavigator: Automatic Feature Augmentation on Tabular Data
- URL: http://arxiv.org/abs/2406.09534v1
- Date: Thu, 13 Jun 2024 18:44:48 GMT
- Title: FeatNavigator: Automatic Feature Augmentation on Tabular Data
- Authors: Jiaming Liang, Chuan Lei, Xiao Qin, Jiani Zhang, Asterios Katsifodimos, Christos Faloutsos, Huzefa Rangwala
- Abstract summary: FeatNavigator is a framework that explores and integrates high-quality features in relational tables for machine learning (ML) models.
We show that FeatNavigator outperforms state-of-the-art solutions on five public datasets by up to 40.1% in ML model performance.
- Score: 29.913561808461612
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data-centric AI focuses on understanding and utilizing high-quality, relevant data in training machine learning (ML) models, thereby increasing the likelihood of producing accurate and useful results. Automatic feature augmentation, aiming to augment the initial base table with useful features from other tables, is critical in data preparation as it improves model performance, robustness, and generalizability. While recent works have investigated automatic feature augmentation, most of them have limited capabilities in utilizing all useful features as many of them are in candidate tables not directly joinable with the base table. Worse yet, with numerous join paths leading to these distant features, existing solutions fail to fully exploit them within a reasonable compute budget. We present FeatNavigator, an effective and efficient framework that explores and integrates high-quality features in relational tables for ML models. FeatNavigator evaluates a feature from two aspects: (1) the intrinsic value of a feature towards an ML task (i.e., feature importance) and (2) the efficacy of a join path connecting the feature to the base table (i.e., integration quality). FeatNavigator strategically selects a small set of available features and their corresponding join paths to train a feature importance estimation model and an integration quality prediction model. Furthermore, FeatNavigator's search algorithm exploits both estimated feature importance and integration quality to identify the optimized feature augmentation plan. Our experimental results show that FeatNavigator outperforms state-of-the-art solutions on five public datasets by up to 40.1% in ML model performance.
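The search strategy described in the abstract can be sketched in a few lines. The sketch below is a hedged illustration, assuming a multiplicative importance-times-quality score and a join-path-length cost proxy; the class and function names are invented for illustration and are not FeatNavigator's API.

```python
# Minimal sketch of the scoring-and-search idea described in the abstract.
# All names (CandidateFeature, plan) and the cost proxy are illustrative
# assumptions, not FeatNavigator's actual implementation.
from dataclasses import dataclass

@dataclass
class CandidateFeature:
    name: str            # feature column in some candidate table
    join_path: list      # sequence of joins connecting it to the base table
    importance: float    # estimated value of the feature for the ML task
    quality: float       # estimated integration quality of its join path (0..1)

def plan(candidates, budget):
    """Greedily pick features by importance discounted by join-path quality.

    A real system would train models to estimate both quantities from a
    small sample of joins; here both are taken as given.
    """
    scored = sorted(candidates, key=lambda c: c.importance * c.quality, reverse=True)
    selected, cost = [], 0
    for c in scored:
        path_cost = len(c.join_path)  # proxy: longer paths cost more to materialize
        if cost + path_cost <= budget:
            selected.append(c)
            cost += path_cost
    return selected

candidates = [
    CandidateFeature("avg_order_value", ["orders"], importance=0.9, quality=0.95),
    CandidateFeature("region_income", ["orders", "customers", "regions"],
                     importance=0.8, quality=0.6),
]
print([c.name for c in plan(candidates, budget=3)])
```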
Related papers
- LLM-Select: Feature Selection with Large Language Models [64.5099482021597]
Large language models (LLMs) are capable of selecting the most predictive features, with performance rivaling the standard tools of data science.
Our findings suggest that LLMs may be useful not only for selecting the best features for training but also for deciding which features to collect in the first place.
arXiv Detail & Related papers (2024-07-02T22:23:40Z)
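A minimal sketch of the prompting pattern the LLM-Select summary suggests, assuming a generic chat-completion call; `complete` is a placeholder stub, not the paper's code.

```python
# Illustrative sketch of LLM-based feature selection in the spirit of
# LLM-Select. `complete` stands in for any chat-completion API call.
def complete(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def llm_select_features(task: str, feature_names: list[str], k: int) -> list[str]:
    prompt = (
        f"Task: {task}\n"
        f"Candidate features: {', '.join(feature_names)}\n"
        f"Return the {k} most predictive features, one per line, no commentary."
    )
    reply = complete(prompt)
    ranked = [line.strip() for line in reply.splitlines() if line.strip()]
    # Keep only names the model did not hallucinate, preserving its order.
    valid = [f for f in ranked if f in feature_names]
    return valid[:k]
```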
- AvaTaR: Optimizing LLM Agents for Tool Usage via Contrastive Reasoning [93.96463520716759]
Large language model (LLM) agents have demonstrated impressive capabilities in utilizing external tools and knowledge to boost accuracy and reduce hallucinations.
Here, we introduce AvaTaR, a novel and automated framework that optimizes an LLM agent to effectively leverage provided tools, improving performance on a given task.
arXiv Detail & Related papers (2024-06-17T04:20:02Z)
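Reading "contrastive reasoning" from the title, one plausible sketch contrasts successful and failed agent traces to refine tool-usage instructions; `complete` and `run_agent` are stubs assumed for illustration, not AvaTaR's implementation.

```python
# Hedged sketch of a contrastive optimization loop: compare trajectories
# where the agent succeeded vs. failed and ask an LLM to distill improved
# tool-usage instructions.
def complete(prompt: str) -> str:
    raise NotImplementedError("LLM call goes here")

def run_agent(instructions: str, task) -> tuple[bool, str]:
    raise NotImplementedError("agent rollout goes here")

def optimize_instructions(instructions: str, tasks, rounds: int = 3) -> str:
    for _ in range(rounds):
        wins, losses = [], []
        for t in tasks:
            ok, trace = run_agent(instructions, t)
            (wins if ok else losses).append(trace)
        if not losses:
            break
        instructions = complete(
            "Here are successful agent traces:\n" + "\n".join(wins) +
            "\nand failed ones:\n" + "\n".join(losses) +
            "\nRewrite the agent's tool-usage instructions to keep what the "
            "successes do and avoid the failures:\n" + instructions
        )
    return instructions
```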
- LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that exhibits the necessary reasoning skills for the intended downstream application.
arXiv Detail & Related papers (2024-02-06T19:18:04Z)
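The LESS summary names the core mechanism: low-rank gradient similarity search. A toy numpy rendering, where random matrices stand in for real per-example gradients and everything beyond the projection-and-cosine scoring is an assumption:

```python
# Simplified illustration of LESS-style selection: represent each training
# example by a low-dimensional (randomly projected) gradient feature and keep
# the 5% most similar to the target task's average gradient.
import numpy as np

rng = np.random.default_rng(0)
n_train, n_val, d, k = 1000, 50, 4096, 64   # toy sizes; d = gradient dim

train_grads = rng.normal(size=(n_train, d))  # stand-ins for per-example grads
val_grads = rng.normal(size=(n_val, d))      # grads from the target task

P = rng.normal(size=(d, k)) / np.sqrt(k)     # random projection (low-rank)
zt = train_grads @ P
zv = (val_grads @ P).mean(axis=0)            # target-task gradient summary

cos = (zt @ zv) / (np.linalg.norm(zt, axis=1) * np.linalg.norm(zv) + 1e-9)
top5pct = np.argsort(-cos)[: n_train // 20]  # indices of the selected 5%
print(top5pct[:10])
```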
- OutRank: Speeding up AutoML-based Model Search for Large Sparse Data Sets with Cardinality-aware Feature Ranking [0.0]
We introduce OutRank, a system for versatile feature ranking and data quality-related anomaly detection.
The proposed approach enables exploration of up to 300% larger feature spaces compared to AutoML-only approaches.
arXiv Detail & Related papers (2023-09-04T12:07:20Z)
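The title's "cardinality-aware feature ranking" can be illustrated by discounting a relevance score by column cardinality, so near-unique ID columns don't dominate; the specific discount below is an assumption for illustration, not OutRank's scoring rule.

```python
# Illustrative cardinality-aware ranking: penalize high-cardinality columns.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=2000)
X = np.column_stack([
    y ^ rng.integers(0, 2, size=2000),        # noisy but informative
    rng.integers(0, 1000, size=2000),         # high-cardinality junk
    rng.integers(0, 3, size=2000),            # low-cardinality noise
])

mi = mutual_info_classif(X, y, discrete_features=True, random_state=0)
cardinality = np.array([len(np.unique(col)) for col in X.T])
score = mi / np.log2(cardinality + 1)         # cardinality-aware discount
print(np.argsort(-score))                     # informative feature ranks first
```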
- FeatGeNN: Improving Model Performance for Tabular Data with Correlation-based Feature Extraction [0.22792085593908193]
FeatGeNN is a convolutional method that extracts and creates new features using correlation as a pooling function.
We evaluate our method on various benchmark datasets and demonstrate that FeatGeNN outperforms existing AutoFE approaches in model performance.
arXiv Detail & Related papers (2023-08-15T01:48:11Z)
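One way to read "correlation as a pooling function" is to collapse a group of columns using correlation-derived weights rather than max or mean pooling; this interpretation is an assumption, not FeatGeNN's exact operator.

```python
# Pool a group of related columns into one constructed feature, weighting
# each column by the magnitude of its correlation with the target.
import numpy as np

def corr_pool(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Pool the columns of X into a single feature, weighted by |corr(x_i, y)|."""
    w = np.array([abs(np.corrcoef(col, y)[0, 1]) for col in X.T])
    w = w / (w.sum() + 1e-12)
    return X @ w

rng = np.random.default_rng(2)
y = rng.normal(size=500)
group = np.column_stack([y + rng.normal(scale=s, size=500) for s in (0.1, 1.0, 5.0)])
pooled = corr_pool(group, y)
print(abs(np.corrcoef(pooled, y)[0, 1]))  # pooled feature tracks y closely
```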
- Unified Embedding: Battle-Tested Feature Representations for Web-Scale ML Systems [29.53535556926066]
Learning high-quality feature embeddings efficiently and effectively is critical for the performance of web-scale machine learning systems.
This work introduces a simple yet highly effective framework, Feature Multiplexing, where a single representation space is shared across many different categorical features.
We propose a highly practical approach called Unified Embedding with three major benefits: simplified feature configuration, strong adaptation to dynamic data distributions, and compatibility with modern hardware.
arXiv Detail & Related papers (2023-05-20T05:35:40Z)
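A compact PyTorch sketch of feature multiplexing as summarized above: one embedding table shared by all categorical features, addressed by hashing (feature id, value) pairs. The table size and hash constants are illustrative assumptions.

```python
# One shared embedding table for many categorical features.
import torch
import torch.nn as nn

class UnifiedEmbedding(nn.Module):
    def __init__(self, table_size: int = 2**20, dim: int = 16):
        super().__init__()
        self.table = nn.Embedding(table_size, dim)
        self.table_size = table_size

    def forward(self, feature_id: int, values: torch.Tensor) -> torch.Tensor:
        # Mix the feature id into the hash so different features land in
        # different (but shared) regions of the single table.
        idx = (values * 2654435761 + feature_id * 97) % self.table_size
        return self.table(idx)

emb = UnifiedEmbedding()
user_ids = torch.randint(0, 10**6, (4,))
cities = torch.randint(0, 5000, (4,))
x = torch.cat([emb(0, user_ids), emb(1, cities)], dim=-1)  # one table, two features
print(x.shape)  # torch.Size([4, 32])
```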
- Traceable Automatic Feature Transformation via Cascading Actor-Critic Agents [25.139229855367088]
Feature transformation is an essential task for boosting the effectiveness and interpretability of machine learning (ML) models.
We formulate the feature transformation task as an iterative, nested process of feature generation and selection.
We show 24.7% improvements in F1 scores over state-of-the-art methods, along with robustness on high-dimensional data.
arXiv Detail & Related papers (2022-12-27T08:20:19Z)
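The iterative, nested generate-then-select loop can be sketched with a random policy standing in for the paper's cascading actor-critic agents; the RL machinery is omitted, and only the loop structure is shown.

```python
# Iterative feature generation (apply an op to two columns) and selection
# (keep the new column only if cross-validated score improves).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 4))
y = ((X[:, 0] * X[:, 1]) > 0).astype(int)   # target needs a generated feature

def score(X, y):
    return cross_val_score(RandomForestClassifier(n_estimators=50, random_state=0),
                           X, y, cv=3).mean()

ops = {"add": np.add, "mul": np.multiply, "sub": np.subtract}
best = score(X, y)
trace = []  # keeps the transformation traceable
for step in range(20):
    i, j = rng.integers(0, X.shape[1], size=2)       # "actor" picks operands
    name, op = list(ops.items())[rng.integers(0, len(ops))]
    cand = np.column_stack([X, op(X[:, i], X[:, j])])
    s = score(cand, y)                               # "critic" evaluates
    if s > best:                                     # selection step keeps it
        X, best = cand, s
        trace.append((name, int(i), int(j)))
print(best, trace)
```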
- Towards Open-World Feature Extrapolation: An Inductive Graph Learning Approach [80.8446673089281]
We propose a new learning paradigm based on graph representation and learning.
Our framework contains two modules: 1) a backbone network (e.g., a feedforward neural net) as a lower model that takes features as input and outputs predicted labels; 2) a graph neural network as an upper model that learns to extrapolate embeddings for new features via message passing over a feature-data graph built from observed data.
arXiv Detail & Related papers (2021-10-09T09:02:45Z)
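A toy numpy version of the feature-data graph in the second module: data points and feature columns are nodes, observations are edges, and a new feature's embedding is extrapolated by mean-pooling over its neighboring data nodes. A trained GNN replaces this single averaging step in the paper.

```python
# Bipartite feature-data graph with one round of mean-pooling message passing.
import numpy as np

rng = np.random.default_rng(4)
n, f_old, d = 100, 8, 16
X_old = (rng.random((n, f_old)) < 0.3).astype(float)   # observed binary features

feat_emb = rng.normal(size=(f_old, d))                 # learned feature embeddings
# Data-node embeddings: mean of embeddings of the features each point has.
deg = X_old.sum(axis=1, keepdims=True) + 1e-9
data_emb = (X_old @ feat_emb) / deg

# A brand-new feature column arrives at test time.
x_new = (rng.random(n) < 0.3).astype(float)
new_feat_emb = x_new @ data_emb / (x_new.sum() + 1e-9)  # extrapolated embedding
print(new_feat_emb.shape)  # (16,)
```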
- Lightweight Single-Image Super-Resolution Network with Attentive Auxiliary Feature Learning [73.75457731689858]
We develop a computationally efficient yet accurate network for SISR based on the proposed attentive auxiliary features (A$^2$F).
Experimental results on large-scale datasets demonstrate the effectiveness of the proposed model against state-of-the-art (SOTA) SR methods.
arXiv Detail & Related papers (2020-11-13T06:01:46Z)
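A hedged PyTorch sketch of the attentive-auxiliary-features idea: features from earlier blocks are re-weighted by learned channel attention before being fused into the current block. Layer shapes are illustrative, not the A$^2$F architecture.

```python
# Re-weight auxiliary feature maps with learned attention, then fuse.
import torch
import torch.nn as nn

class AttentiveAuxFusion(nn.Module):
    def __init__(self, channels: int, n_aux: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(n_aux * channels, n_aux, kernel_size=1),
            nn.Sigmoid(),
        )
        self.fuse = nn.Conv2d(n_aux * channels, channels, kernel_size=1)

    def forward(self, aux: list[torch.Tensor]) -> torch.Tensor:
        cat = torch.cat(aux, dim=1)                    # (B, n_aux*C, H, W)
        w = self.gate(cat)                             # (B, n_aux, 1, 1)
        weighted = [a * w[:, i:i + 1] for i, a in enumerate(aux)]
        return self.fuse(torch.cat(weighted, dim=1))   # back to C channels

block = AttentiveAuxFusion(channels=32, n_aux=3)
aux = [torch.randn(2, 32, 24, 24) for _ in range(3)]
print(block(aux).shape)  # torch.Size([2, 32, 24, 24])
```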
- AutoFIS: Automatic Feature Interaction Selection in Factorization Models for Click-Through Rate Prediction [75.16836697734995]
We propose a two-stage algorithm called Automatic Feature Interaction Selection (AutoFIS).
AutoFIS can automatically identify important feature interactions for factorization models, with computational cost just equivalent to training the target model to convergence.
AutoFIS has been deployed on the training platform of Huawei App Store's recommendation service.
arXiv Detail & Related papers (2020-03-25T06:53:54Z)
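A sketch of the AutoFIS search stage: each pairwise interaction in a factorization model gets a learnable gate, and near-zero gates are pruned before retraining. Plain parameters stand in here for the paper's GRDA-optimized architecture parameters.

```python
# Factorization machine whose pairwise interactions carry learnable gates.
import itertools
import torch
import torch.nn as nn

class GatedFM(nn.Module):
    def __init__(self, field_dims: list[int], dim: int = 8):
        super().__init__()
        self.emb = nn.ModuleList(nn.Embedding(n, dim) for n in field_dims)
        self.pairs = list(itertools.combinations(range(len(field_dims)), 2))
        self.alpha = nn.Parameter(torch.ones(len(self.pairs)))  # interaction gates

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        vs = [e(x[:, i]) for i, e in enumerate(self.emb)]
        out = 0.0
        for g, (i, j) in zip(self.alpha, self.pairs):
            out = out + g * (vs[i] * vs[j]).sum(dim=1)
        return out

fm = GatedFM([100, 50, 20])
x = torch.randint(0, 20, (4, 3))
print(fm(x).shape, fm.alpha)  # after training, prune pairs with tiny |alpha|
```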
- ARDA: Automatic Relational Data Augmentation for Machine Learning [23.570173866941612]
We present ARDA, an end-to-end system that takes as input a dataset and a data repository, and outputs an augmented data set.
Our system has two distinct components: (1) a framework to search for and join data with the input data, based on various attributes of the input, and (2) an efficient feature selection algorithm that prunes noisy or irrelevant features from the resulting join.
arXiv Detail & Related papers (2020-03-21T21:55:22Z)
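The two ARDA components map onto a short pandas/scikit-learn sketch: a key-based join followed by importance-based pruning. The tables, key, and threshold are illustrative assumptions; a plain importance filter stands in for the paper's actual selection algorithm.

```python
# (1) join candidate tables onto the base table; (2) prune weak features.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(5)
base = pd.DataFrame({"id": range(200), "y": rng.integers(0, 2, size=200)})
candidate = pd.DataFrame({
    "id": range(200),
    "useful": base["y"] + rng.normal(scale=0.3, size=200),
    "noise": rng.normal(size=200),
})

# (1) search-and-join: here, a single left join on the shared key column
joined = base.merge(candidate, on="id", how="left")

# (2) feature selection: keep features whose importance clears a threshold
X = joined.drop(columns=["id", "y"]).fillna(0.0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, joined["y"])
keep = X.columns[clf.feature_importances_ > 0.2]
print(list(keep))  # surviving features (typically just 'useful')
```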