GAN-based Tabular Data Generator for Constructing Synopsis in
Approximate Query Processing: Challenges and Solutions
- URL: http://arxiv.org/abs/2212.09015v2
- Date: Thu, 9 Nov 2023 14:53:21 GMT
- Title: GAN-based Tabular Data Generator for Constructing Synopsis in
Approximate Query Processing: Challenges and Solutions
- Authors: Mohammadali Fallahian, Mohsen Dorodchi, Kyle Kreth
- Abstract summary: Approximate Query Processing (AQP) is a technique for providing approximate answers to aggregate queries based on a summary of the data (synopsis)
This study explores the novel utilization of Generative Adversarial Networks (GANs) in the generation of tabular data that can be employed in AQP for synopsis construction.
Our findings demonstrate that advanced GAN variations exhibit a promising capacity to generate high-fidelity synopses, potentially transforming the efficiency and effectiveness of AQP in data-driven systems.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In data-driven systems, data exploration is imperative for making real-time
decisions. However, big data is stored in massive databases that are difficult
to retrieve. Approximate Query Processing (AQP) is a technique for providing
approximate answers to aggregate queries based on a summary of the data
(synopsis) that closely replicates the behavior of the actual data, which can
be useful where an approximate answer to the queries would be acceptable in a
fraction of the real execution time. This study explores the novel utilization
of Generative Adversarial Networks (GANs) in the generation of tabular data
that can be employed in AQP for synopsis construction. We thoroughly
investigate the unique challenges posed by the synopsis construction process,
including maintaining data distribution characteristics, handling bounded
continuous and categorical data, and preserving semantic relationships and then
introduce the advancement of tabular GAN architectures that overcome these
challenges. Furthermore, we propose and validate a suite of statistical metrics
tailored for assessing the reliability of the GAN-generated synopses. Our
findings demonstrate that advanced GAN variations exhibit a promising capacity
to generate high-fidelity synopses, potentially transforming the efficiency and
effectiveness of AQP in data-driven systems.
Related papers
- Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLM) have been used for diverse tasks, but do not capture the correct correlation between the features and the target variable.
We propose a LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data.
Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
arXiv Detail & Related papers (2024-10-29T04:14:32Z) - A Survey on Data Synthesis and Augmentation for Large Language Models [35.59526251210408]
This paper reviews and summarizes data generation techniques throughout the lifecycle of Large Language Models.
We discuss the current constraints faced by these methods and investigate potential pathways for future development and research.
arXiv Detail & Related papers (2024-10-16T16:12:39Z) - Retriever-and-Memory: Towards Adaptive Note-Enhanced Retrieval-Augmented Generation [72.70046559930555]
We propose a generic RAG approach called Adaptive Note-Enhanced RAG (Adaptive-Note) for complex QA tasks.
Specifically, Adaptive-Note introduces an overarching view of knowledge growth, iteratively gathering new information in the form of notes.
In addition, we employ an adaptive, note-based stop-exploration strategy to decide "what to retrieve and when to stop" to encourage sufficient knowledge exploration.
arXiv Detail & Related papers (2024-10-11T14:03:29Z) - DiscoveryBench: Towards Data-Driven Discovery with Large Language Models [50.36636396660163]
We present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery.
Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering.
Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
arXiv Detail & Related papers (2024-07-01T18:58:22Z) - CMDBench: A Benchmark for Coarse-to-fine Multimodal Data Discovery in Compound AI Systems [10.71630696651595]
Compound AI systems (CASs) that employ LLMs as agents to accomplish knowledge-intensive tasks have garnered significant interest within database and AI communities.
silos of multimodal data sources make it difficult to identify appropriate data sources for accomplishing the task at hand.
We propose CMDBench, a benchmark modeling the complexity of enterprise data platforms.
arXiv Detail & Related papers (2024-06-02T01:10:41Z) - Navigating Tabular Data Synthesis Research: Understanding User Needs and Tool Capabilities [1.2535250082638645]
We survey the state of the art in Tabular Data Synthesis ( TDS)
We examine the needs of users by defining a set of functional and non-functional requirements.
We develop a decision guide to help users find suitable TDS tools for their applications.
arXiv Detail & Related papers (2024-05-31T16:00:43Z) - A Correlation- and Mean-Aware Loss Function and Benchmarking Framework to Improve GAN-based Tabular Data Synthesis [2.2451409468083114]
We propose a novel correlation- and mean-aware loss function for generative adversarial networks (GANs)
The proposed loss function demonstrates statistically significant improvements over existing methods in capturing the true data distribution.
The benchmarking framework shows that the enhanced synthetic data quality leads to improved performance in downstream machine learning tasks.
arXiv Detail & Related papers (2024-05-27T09:08:08Z) - DACO: Towards Application-Driven and Comprehensive Data Analysis via Code Generation [83.30006900263744]
Data analysis is a crucial analytical process to generate in-depth studies and conclusive insights.
We propose to automatically generate high-quality answer annotations leveraging the code-generation capabilities of LLMs.
Our DACO-RL algorithm is evaluated by human annotators to produce more helpful answers than SFT model in 57.72% cases.
arXiv Detail & Related papers (2024-03-04T22:47:58Z) - Evaluating the Capabilities of Multi-modal Reasoning Models with
Synthetic Task Data [0.0]
We leverage advances in high resolution text-to-image generation to develop a framework for generating evaluation data for multi-modal reasoning tasks.
We apply this framework to generate context-dependent anomaly data, creating a synthetic dataset on a challenging task.
We demonstrate that while the task is tractable, the model performs significantly worse on the context-dependent anomaly detection task than on standard VQA tasks.
arXiv Detail & Related papers (2023-06-01T20:56:34Z) - STAR: Boosting Low-Resource Information Extraction by Structure-to-Text
Data Generation with Large Language Models [56.27786433792638]
STAR is a data generation method that leverages Large Language Models (LLMs) to synthesize data instances.
We design fine-grained step-by-step instructions to obtain the initial data instances.
Our experiments show that the data generated by STAR significantly improve the performance of low-resource event extraction and relation extraction tasks.
arXiv Detail & Related papers (2023-05-24T12:15:19Z) - Abstractive Query Focused Summarization with Query-Free Resources [60.468323530248945]
In this work, we consider the problem of leveraging only generic summarization resources to build an abstractive QFS system.
We propose Marge, a Masked ROUGE Regression framework composed of a novel unified representation for summaries and queries.
Despite learning from minimal supervision, our system achieves state-of-the-art results in the distantly supervised setting.
arXiv Detail & Related papers (2020-12-29T14:39:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.