Related papers: FeatAug: Automatic Feature Augmentation From One-to-Many Relationship Tables

URL: http://arxiv.org/abs/2403.06367v1
Date: Mon, 11 Mar 2024 01:44:14 GMT
Title: FeatAug: Automatic Feature Augmentation From One-to-Many Relationship Tables
Authors: Danrui Qi, Weiling Zheng, Jiannan Wang
Abstract summary: Feature augmentation from one-to-many relationship tables is a critical but challenging problem in ML model development. We propose FEATAUG, a new feature augmentation framework that automatically extracts predicate-aware queries from one-to-many relationship tables. Our experiments on four real-world datasets demonstrate that FeatAug extracts more effective features compared to Featuretools.
Score: 4.058220332950672
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Feature augmentation from one-to-many relationship tables is a critical but challenging problem in ML model development. To augment good features, data scientists need to come up with SQL queries manually, which is time-consuming. Featuretools [1] is a widely used tool by the data science community to automatically augment the training data by extracting new features from relevant tables. It represents each feature as a group-by aggregation SQL query on relevant tables and can automatically generate these SQL queries. However, it does not include predicates in these queries, which significantly limits its application in many real-world scenarios. To overcome this limitation, we propose FEATAUG, a new feature augmentation framework that automatically extracts predicate-aware SQL queries from one-to-many relationship tables. This extension is not trivial because considering predicates will exponentially increase the number of candidate queries. As a result, the original Featuretools framework, which materializes all candidate queries, will not work and needs to be redesigned. We formally define the problem and model it as a hyperparameter optimization problem. We discuss how Bayesian Optimization can be applied here and propose a novel warm-up strategy to optimize it. To make our algorithm more practical, we also study how to identify promising attribute combinations for predicates. We show how the beam search idea can partially solve the problem and propose several techniques to further optimize it. Our experiments on four real-world datasets demonstrate that FeatAug extracts more effective features compared to Featuretools and other baselines. The code is open-sourced at https://github.com/sfu-db/FeatAug
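Illustration: the abstract contrasts predicate-free group-by aggregation queries with predicate-aware ones. The following is a minimal sketch of that difference using Python's built-in sqlite3 module. The tables (users, orders), their columns, and the predicate category = 'games' are hypothetical and chosen only for illustration; this is not FeatAug's code, API, or search procedure, and it does not show how candidate predicates are selected.

```python
import sqlite3

# Hypothetical one-to-many schema: one row per user, many order rows per user.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (user_id INTEGER PRIMARY KEY, label INTEGER);
CREATE TABLE orders (user_id INTEGER, category TEXT, amount REAL);
INSERT INTO users VALUES (1, 1), (2, 0);
INSERT INTO orders VALUES
    (1, 'books', 10.0), (1, 'games', 40.0), (1, 'games', 20.0),
    (2, 'books', 5.0), (2, 'books', 15.0);
""")

# Predicate-free feature: a plain group-by aggregation over all related rows.
plain_query = """
SELECT u.user_id, AVG(o.amount) AS avg_amount
FROM users u
LEFT JOIN orders o ON o.user_id = u.user_id
GROUP BY u.user_id
ORDER BY u.user_id
"""

# Predicate-aware feature: the same aggregation restricted to rows that
# satisfy a predicate (here, an assumed filter on the category column).
predicate_query = """
SELECT u.user_id, AVG(o.amount) AS avg_games_amount
FROM users u
LEFT JOIN orders o
    ON o.user_id = u.user_id AND o.category = 'games'
GROUP BY u.user_id
ORDER BY u.user_id
"""

plain_rows = conn.execute(plain_query).fetchall()
predicate_rows = conn.execute(predicate_query).fetchall()

print(plain_rows)
print(predicate_rows)
```

Placing the predicate in the join condition rather than in a WHERE clause keeps every user in the result, so a user with no matching rows gets a NULL feature value instead of disappearing from the training table. Each additional attribute or constant that may appear in a predicate adds further candidate queries, which is the growth in the search space that the abstract refers to.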

Related papers

Weaver: Interweaving SQL and LLM for Table Reasoning [63.09519234853953]
Weaver generates a flexible, step-by-step plan that combines SQL for structured data retrieval with LLMs for semantic processing. Weaver consistently outperforms state-of-the-art methods across four TableQA datasets, reducing both API calls and error rates.
arXiv Detail & Related papers (2025-05-25T03:27:37Z)
Learning Metadata-Agnostic Representations for Text-to-SQL In-Context Example Selection [0.3277163122167434]
In-context learning (ICL) is a powerful paradigm where large language models (LLMs) benefit from task demonstrations added to the prompt. We propose a method to align representations of natural language questions and those of queries in a shared embedding space. Our technique, dubbed MARLO, uses query structure to model querying intent without over-indexing on underlying database metadata.
arXiv Detail & Related papers (2024-10-17T21:45:55Z)
SynTQA: Synergistic Table-based Question Answering via Mixture of Text-to-SQL and E2E TQA [25.09488366689108]
Text-to-SQL parsing and end-to-end question answering (E2E TQA) are two main approaches for the Table-based Question Answering task. Despite success on multiple benchmarks, they have yet to be compared and their synergy remains unexplored. We identify different strengths and weaknesses through evaluating state-of-the-art models on benchmark datasets.
arXiv Detail & Related papers (2024-09-25T07:18:45Z)
RoundTable: Leveraging Dynamic Schema and Contextual Autocomplete for Enhanced Query Precision in Tabular Question Answering [11.214912072391108]
Real-world datasets often feature a vast array of attributes and complex values. Traditional methods cannot fully relay the dataset's size and complexity to the Large Language Models. We propose a novel framework that leverages Full-Text Search (FTS) on the input table.
arXiv Detail & Related papers (2024-08-22T13:13:06Z)
UQE: A Query Engine for Unstructured Databases [71.49289088592842]
We investigate the potential of Large Language Models to enable unstructured data analytics. We propose a new Universal Query Engine (UQE) that directly interrogates and draws insights from unstructured data collections.
arXiv Detail & Related papers (2024-06-23T06:58:55Z)
Database-Augmented Query Representation for Information Retrieval [59.57065228857247]
We present a novel retrieval framework called Database-Augmented Query representation (DAQu). DAQu augments the original query with various (query-related) metadata across multiple tables. We validate DAQu in diverse retrieval scenarios that can incorporate metadata from the relational database.
arXiv Detail & Related papers (2024-06-23T05:02:21Z)
Augment before You Try: Knowledge-Enhanced Table Question Answering via Table Expansion [57.53174887650989]
Table question answering is a popular task that assesses a model's ability to understand and interact with structured data. Existing methods either convert both the table and external knowledge into text, which neglects the structured nature of the table. We propose a simple yet effective method to integrate external information in a given table.
arXiv Detail & Related papers (2024-01-28T03:37:11Z)
JoinGym: An Efficient Query Optimization Environment for Reinforcement Learning [58.71541261221863]
Join order selection (JOS) is the problem of ordering join operations to minimize total query execution cost. We present JoinGym, a query optimization environment for bushy reinforcement learning (RL). Under the hood, JoinGym simulates a query plan's cost by looking up intermediate result cardinalities from a pre-computed dataset.
arXiv Detail & Related papers (2023-07-21T17:00:06Z)
Improving Text-to-SQL Semantic Parsing with Fine-grained Query Understanding [84.04706075621013]
We present a general-purpose, modular neural semantic parsing framework based on token-level fine-grained query understanding. Our framework consists of three modules: named entity recognizer (NER), neural entity linker (NEL) and neural semantic parser (NSP).
arXiv Detail & Related papers (2022-09-28T21:00:30Z)
S$^2$SQL: Injecting Syntax to Question-Schema Interaction Graph Encoder for Text-to-SQL Parsers [66.78665327694625]
We propose S$^2$SQL, injecting Syntax to the question-Schema graph encoder for Text-to-SQL parsers. We also employ the decoupling constraint to induce diverse edge embedding, which further improves the network's performance. Experiments on the Spider and robustness setting Spider-Syn demonstrate that the proposed approach outperforms all existing methods when pre-training models are used.
arXiv Detail & Related papers (2022-03-14T09:49:15Z)
"What makes my queries slow?": Subgroup Discovery for SQL Workload Analysis [1.3124513975412255]
We introduce an original approach rooted in Subgroup Discovery. We show how to instantiate and develop this generic data-mining framework. We also provide a visualization tool for interactive knowledge discovery.
arXiv Detail & Related papers (2021-08-09T09:44:13Z)
TableQnA: Answering List Intent Queries With Web Tables [12.941073798838167]
We focus on answering two classes of queries with HTML tables: those seeking lists of entities and those seeking superlative entities. Existing approaches train machine learning models to select the answer from the candidates. We develop novel features to compute structure-aware match and train a machine learning model.
arXiv Detail & Related papers (2020-01-10T01:43:54Z)

This list is automatically generated from the titles and abstracts of the papers on this site.