FeatAug: Automatic Feature Augmentation From One-to-Many Relationship Tables
- URL: http://arxiv.org/abs/2403.06367v1
- Date: Mon, 11 Mar 2024 01:44:14 GMT
- Title: FeatAug: Automatic Feature Augmentation From One-to-Many Relationship Tables
- Authors: Danrui Qi, Weiling Zheng, Jiannan Wang
- Abstract summary: Feature augmentation from one-to-many relationship tables is a critical but challenging problem in ML model development.
We propose FeatAug, a new feature augmentation framework that automatically extracts predicate-aware SQL queries from one-to-many relationship tables.
Our experiments on four real-world datasets demonstrate that FeatAug extracts more effective features compared to Featuretools.
- Score: 4.058220332950672
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Feature augmentation from one-to-many relationship tables is a critical but challenging problem in ML model development. To augment the training data with good features, data scientists need to come up with SQL queries manually, which is time-consuming. Featuretools [1] is a tool widely used by the data science community to automatically augment the training data by extracting new features from relevant tables. It represents each feature as a group-by aggregation SQL query on relevant tables and can automatically generate these SQL queries. However, it does not include predicates in these queries, which significantly limits its application in many real-world scenarios. To overcome this limitation, we propose FeatAug, a new feature augmentation framework that automatically extracts predicate-aware SQL queries from one-to-many relationship tables. This extension is not trivial because considering predicates exponentially increases the number of candidate queries. As a result, the original Featuretools framework, which materializes all candidate queries, will not work and needs to be redesigned. We formally define the problem and model it as a hyperparameter optimization problem. We discuss how Bayesian Optimization can be applied here and propose a novel warm-up strategy to optimize it. To make our algorithm more practical, we also study how to identify promising attribute combinations for predicates. We show how the beam search idea can partially solve the problem and propose several techniques to further optimize it. Our experiments on four real-world datasets demonstrate that FeatAug extracts more effective features compared to Featuretools and other baselines. The code is open-sourced at https://github.com/sfu-db/FeatAug
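The contrast the abstract draws between a plain group-by aggregation feature and a predicate-aware one can be illustrated with a minimal sketch. It is not FeatAug's or Featuretools' actual code: the `users` and `orders` tables, their columns, and the `aggregate_feature` helper are hypothetical names chosen for illustration, and only Python's standard `sqlite3` module is used.

```python
import sqlite3

# Hypothetical one-to-many schema: one row per user, many orders per user.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (user_id INTEGER PRIMARY KEY, label INTEGER);
CREATE TABLE orders (user_id INTEGER, amount REAL, category TEXT);
INSERT INTO users VALUES (1, 1), (2, 0);
INSERT INTO orders VALUES
  (1, 10.0, 'book'), (1, 30.0, 'toy'), (1, 20.0, 'book'),
  (2, 5.0, 'toy'), (2, 15.0, 'toy');
""")

def aggregate_feature(conn, agg, column, predicate=None, params=()):
    """Return one feature value per user from a group-by aggregation query.

    `predicate` is an optional WHERE clause over the orders table; leaving it
    out gives the predicate-free style of query described in the abstract.
    Illustrative helper only: agg/column/predicate are trusted strings here.
    """
    where = f"WHERE {predicate}" if predicate else ""
    sql = f"""
        SELECT u.user_id, f.value
        FROM users u
        LEFT JOIN (
            SELECT user_id, {agg}({column}) AS value
            FROM orders {where}
            GROUP BY user_id
        ) f ON u.user_id = f.user_id
        ORDER BY u.user_id
    """
    return conn.execute(sql, params).fetchall()

# Feature without a predicate: average order amount per user.
plain = aggregate_feature(conn, "AVG", "amount")

# Predicate-aware feature: average amount restricted to 'book' orders.
filtered = aggregate_feature(conn, "AVG", "amount", "category = ?", ("book",))

print(plain)     # [(1, 20.0), (2, 10.0)]
print(filtered)  # [(1, 15.0), (2, None)]
```

Each choice of aggregation function, aggregated column, predicate attribute, and predicate value yields a different candidate query, which is why the candidate space grows so quickly once predicates are allowed.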
Related papers
- UQE: A Query Engine for Unstructured Databases [71.49289088592842]
We investigate the potential of Large Language Models to enable unstructured data analytics.
We propose a new Universal Query Engine (UQE) that directly interrogates and draws insights from unstructured data collections.
arXiv Detail & Related papers (2024-06-23T06:58:55Z)
- Database-Augmented Query Representation for Information Retrieval [59.57065228857247]
We present a novel retrieval framework called Database-Augmented Query representation (DAQu).
DAQu augments the original query with various (query-related) metadata across multiple tables.
We validate DAQu in diverse retrieval scenarios that can incorporate metadata from the relational database.
arXiv Detail & Related papers (2024-06-23T05:02:21Z)
- CHESS: Contextual Harnessing for Efficient SQL Synthesis [1.9506402593665235]
We propose a new pipeline that retrieves relevant data and context, selects an efficient schema, and synthesizes correct and efficient queries.
Our method achieves new state-of-the-art performance on the cross-domain challenging BIRD dataset.
arXiv Detail & Related papers (2024-05-27T01:54:16Z)
- Augment before You Try: Knowledge-Enhanced Table Question Answering via Table Expansion [57.53174887650989]
Table question answering is a popular task that assesses a model's ability to understand and interact with structured data.
Some existing methods convert both the table and external knowledge into text, which neglects the structured nature of the table.
We propose a simple yet effective method to integrate external information in a given table.
arXiv Detail & Related papers (2024-01-28T03:37:11Z)
- JoinGym: An Efficient Query Optimization Environment for Reinforcement Learning [58.71541261221863]
Join order selection (JOS) is the problem of ordering join operations to minimize total query execution cost.
We present JoinGym, a query optimization environment for reinforcement learning (RL) that supports bushy join plans.
Under the hood, JoinGym simulates a query plan's cost by looking up intermediate result cardinalities from a pre-computed dataset.
arXiv Detail & Related papers (2023-07-21T17:00:06Z)
- BitE: Accelerating Learned Query Optimization in a Mixed-Workload Environment [0.36700088931938835]
BitE is a novel ensemble learning model that uses database statistics and metadata to tune a learned query optimizer for better performance.
Our model achieves 19.6% more improved queries and 15.8% fewer regressed queries compared to existing traditional methods.
arXiv Detail & Related papers (2023-06-01T16:05:33Z)
- Improving Text-to-SQL Semantic Parsing with Fine-grained Query Understanding [84.04706075621013]
We present a general-purpose, modular neural semantic parsing framework based on token-level fine-grained query understanding.
Our framework consists of three modules: a named entity recognizer (NER), a neural entity linker (NEL), and a neural semantic parser (NSP).
arXiv Detail & Related papers (2022-09-28T21:00:30Z)
- S$^2$SQL: Injecting Syntax to Question-Schema Interaction Graph Encoder for Text-to-SQL Parsers [66.78665327694625]
We propose S$^2$SQL, which injects Syntax into the question-Schema graph encoder for Text-to-SQL parsers.
We also employ the decoupling constraint to induce diverse edge embedding, which further improves the network's performance.
Experiments on Spider and on the robustness setting Spider-Syn demonstrate that the proposed approach outperforms all existing methods when pre-trained models are used.
arXiv Detail & Related papers (2022-03-14T09:49:15Z)
- Pay More Attention to History: A Context Modeling Strategy for Conversational Text-to-SQL [8.038535788630542]
One of the most intractable problems in conversational Text-to-SQL is modeling the semantics of multi-turn queries.
This paper shows that explicitly modeling the semantic changes added by each turn, together with a summarization of the whole context, can bring better performance.
arXiv Detail & Related papers (2021-12-16T09:41:04Z)
- "What makes my queries slow?": Subgroup Discovery for SQL Workload Analysis [1.3124513975412255]
We introduce an original approach rooted in Subgroup Discovery.
We show how to instantiate and develop this generic data-mining framework.
We also provide a visualization tool for interactive knowledge discovery.
arXiv Detail & Related papers (2021-08-09T09:44:13Z)
- TableQnA: Answering List Intent Queries With Web Tables [12.941073798838167]
We focus on answering two classes of queries with HTML tables: those seeking lists of entities and those seeking superlative entities.
Existing approaches train machine learning models to select the answer from the candidates.
We develop novel features to compute structure-aware match and train a machine learning model.
arXiv Detail & Related papers (2020-01-10T01:43:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.