Navigating Tabular Data Synthesis Research: Understanding User Needs and Tool Capabilities
- URL: http://arxiv.org/abs/2405.20959v1
- Date: Fri, 31 May 2024 16:00:43 GMT
- Title: Navigating Tabular Data Synthesis Research: Understanding User Needs and Tool Capabilities
- Authors: Maria F. Davila R., Sven Groen, Fabian Panse, Wolfram Wingerath
- Abstract summary: We survey the state of the art in Tabular Data Synthesis (TDS).
We examine the needs of users by defining a set of functional and non-functional requirements.
We develop a decision guide to help users find suitable TDS tools for their applications.
- Score: 1.2535250082638645
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In an era of rapidly advancing data-driven applications, there is a growing demand for data in both research and practice. Synthetic data have emerged as an alternative when no real data are available (e.g., due to privacy regulations). Synthesizing tabular data presents unique and complex challenges, especially handling (i) missing values, (ii) dataset imbalance, (iii) diverse column types, and (iv) complex data distributions, as well as preserving (i) column correlations, (ii) temporal dependencies, and (iii) integrity constraints (e.g., functional dependencies) present in the original dataset. While substantial progress has been made recently in the context of generative models, there is no one-size-fits-all solution for tabular data today, and choosing the right tool for a given task is therefore non-trivial. In this paper, we survey the state of the art in Tabular Data Synthesis (TDS), examine the needs of users by defining a set of functional and non-functional requirements, and compile the challenges associated with meeting those needs. In addition, we evaluate the reported performance of 36 popular research TDS tools against these requirements and develop a decision guide to help users find suitable TDS tools for their applications. The resulting decision guide also identifies significant research gaps.
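To make two of these requirements concrete, here is a minimal sketch (using pandas and hypothetical column names; not part of the paper's decision guide) of how a user might check whether a synthetic table preserves a functional dependency and the column correlations of the original data:

```python
import pandas as pd

def fd_holds(df: pd.DataFrame, lhs: str, rhs: str) -> bool:
    # A functional dependency lhs -> rhs holds if every lhs value
    # maps to exactly one rhs value.
    return bool((df.groupby(lhs)[rhs].nunique() <= 1).all())

def correlation_gap(real: pd.DataFrame, synthetic: pd.DataFrame) -> float:
    # Largest absolute difference between the numeric correlation
    # matrices of the real and synthetic tables (0 = perfectly preserved).
    diff = real.corr(numeric_only=True) - synthetic.corr(numeric_only=True)
    return float(diff.abs().max().max())

# Hypothetical usage: a dependency like zip_code -> city should
# survive synthesis, and the correlation gap should stay small.
# ok = fd_holds(synthetic_df, "zip_code", "city")
# gap = correlation_gap(real_df, synthetic_df)
```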
Related papers
- TACT: Advancing Complex Aggregative Reasoning with Information Extraction Tools [51.576974932743596]
Large Language Models (LLMs) often do not perform well on queries that require the aggregation of information across texts.
TACT contains challenging instructions that demand stitching information scattered across one or more texts.
We construct this dataset by leveraging an existing dataset of texts and their associated tables.
We demonstrate that all contemporary LLMs perform poorly on this dataset, achieving an accuracy below 38%.
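For intuition, the toy example below (hypothetical data, not drawn from TACT) shows the kind of aggregative query the benchmark targets: the answer cannot be read off either table alone and must be computed by joining and aggregating.

```python
import pandas as pd

# Two tables extracted from different texts (hypothetical values).
gdp = pd.DataFrame({"country": ["A", "B", "C"], "gdp_busd": [500, 1200, 300]})
pop = pd.DataFrame({"country": ["A", "B", "C"], "pop_m": [10, 40, 5]})

# "Which country has the highest GDP per capita?" requires stitching
# both tables together and computing a derived quantity.
merged = gdp.merge(pop, on="country")
merged["gdp_per_capita"] = merged["gdp_busd"] * 1e9 / (merged["pop_m"] * 1e6)
print(merged.loc[merged["gdp_per_capita"].idxmax(), "country"])  # -> "C"
```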
arXiv Detail & Related papers (2024-06-05T20:32:56Z) - CMDBench: A Benchmark for Coarse-to-fine Multimodal Data Discovery in Compound AI Systems [10.71630696651595]
Compound AI systems (CASs) that employ LLMs as agents to accomplish knowledge-intensive tasks have garnered significant interest within database and AI communities.
However, silos of multimodal data sources make it difficult to identify appropriate data sources for accomplishing the task at hand.
We propose CMDBench, a benchmark modeling the complexity of enterprise data platforms.
arXiv Detail & Related papers (2024-06-02T01:10:41Z) - Better Synthetic Data by Retrieving and Transforming Existing Datasets [63.875064274379824]
We introduce DataTune, a method to make better use of publicly available datasets to improve automatic dataset generation.
On a diverse set of language-based tasks, we find that finetuning language models via DataTune improves over a few-shot prompting baseline by 49%.
We find that dataset transformation significantly increases the diversity and difficulty of generated data on many tasks.
arXiv Detail & Related papers (2024-04-22T17:15:32Z) - Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A
Comprehensive Benchmark [56.8042116967334]
Synthetic data serve as an alternative for training machine learning models.
However, ensuring that synthetic data mirror the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z) - Continual Release of Differentially Private Synthetic Data from Longitudinal Data Collections [19.148874215745135]
We study the problem of continually releasing differentially private synthetic data from longitudinal data collections.
We introduce a model where, at every time step, each individual reports a new data element.
We give continual synthetic data generation algorithms that preserve two basic types of queries.
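As a rough illustration of the setting (not the paper's algorithms, which are more refined), the sketch below releases a running count at every time step under event-level differential privacy by adding Laplace noise to each per-step increment:

```python
import numpy as np

def continual_dp_counts(increments, epsilon, seed=None):
    # Release a noisy running count after every time step. Each
    # report affects exactly one increment, so adding Laplace(1/eps)
    # noise per step gives event-level differential privacy; the
    # error of the running count grows with the stream length.
    rng = np.random.default_rng(seed)
    running, releases = 0.0, []
    for inc in increments:
        running += inc + rng.laplace(scale=1.0 / epsilon)
        releases.append(running)
    return releases

# Hypothetical stream: number of individuals reporting a given
# value at each of 8 time steps.
print(continual_dp_counts([5, 3, 8, 2, 6, 4, 7, 1], epsilon=1.0))
```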
arXiv Detail & Related papers (2023-06-13T16:22:08Z) - DataFinder: Scientific Dataset Recommendation from Natural Language
Descriptions [100.52917027038369]
We operationalize the task of recommending datasets given a short natural language description.
To facilitate this task, we build the DataFinder dataset which consists of a larger automatically-constructed training set and a smaller expert-annotated evaluation set.
A retrieval system trained on the DataFinder dataset finds more relevant search results than existing third-party dataset search engines.
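The paper trains a dedicated retrieval system on DataFinder; the sketch below is only a TF-IDF baseline over hypothetical dataset descriptions, included to make the task format concrete:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical corpus of dataset descriptions.
datasets = {
    "squad": "reading comprehension questions about Wikipedia articles",
    "mnist": "images of handwritten digits for classification",
    "librispeech": "read English speech audio for speech recognition",
}
query = "I need data to train a speech-to-text model"

# Rank datasets by cosine similarity between query and description.
vectorizer = TfidfVectorizer().fit(list(datasets.values()) + [query])
scores = cosine_similarity(
    vectorizer.transform([query]),
    vectorizer.transform(datasets.values()),
).ravel()
ranked = sorted(zip(datasets, scores), key=lambda kv: kv[1], reverse=True)
print(ranked[0][0])  # -> "librispeech"
```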
arXiv Detail & Related papers (2023-05-26T05:22:36Z) - GAN-based Tabular Data Generator for Constructing Synopsis in
Approximate Query Processing: Challenges and Solutions [0.0]
Approximate Query Processing (AQP) is a technique for providing approximate answers to aggregate queries based on a summary of the data (synopsis).
This study explores the novel utilization of Generative Adversarial Networks (GANs) in the generation of tabular data that can be employed in AQP for synopsis construction.
Our findings demonstrate that advanced GAN variations exhibit a promising capacity to generate high-fidelity synopses, potentially transforming the efficiency and effectiveness of AQP in data-driven systems.
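To illustrate the AQP setting, the sketch below answers an aggregate query from a small synopsis instead of scanning the full table. In the paper's setting a trained GAN would produce the synopsis; here a noisy resampler stands in so the example stays self-contained, and all data are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical base table: 100k sales rows.
real = pd.DataFrame({
    "region": rng.choice(["north", "south"], size=100_000),
    "amount": rng.gamma(2.0, 50.0, size=100_000),
})

def sample_synopsis(df: pd.DataFrame, n: int) -> pd.DataFrame:
    # Stand-in for a trained GAN generator: resample rows and perturb
    # the numeric column so the synopsis is not a verbatim subset.
    syn = df.sample(n, replace=True, random_state=0).copy()
    syn["amount"] += rng.normal(0.0, 1.0, size=n)
    return syn

# AQP: run the aggregate query on the 1k-row synopsis instead of the
# 100k-row table and compare against the exact answer.
synopsis = sample_synopsis(real, n=1_000)
print(pd.DataFrame({
    "exact": real.groupby("region")["amount"].mean(),
    "approx": synopsis.groupby("region")["amount"].mean(),
}))
```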
arXiv Detail & Related papers (2022-12-18T05:11:04Z) - A Survey of Dataset Refinement for Problems in Computer Vision Datasets [11.45536223418548]
Large-scale datasets have played a crucial role in the advancement of computer vision.
They often suffer from problems such as class imbalance, noisy labels, dataset bias, or high resource costs.
Various data-centric solutions have been proposed to address these problems.
They improve the quality of datasets by re-organizing them, which we call dataset refinement.
arXiv Detail & Related papers (2022-10-21T03:58:43Z) - PIETS: Parallelised Irregularity Encoders for Forecasting with
Heterogeneous Time-Series [5.911865723926626]
Heterogeneity and irregularity of multi-source data sets present a significant challenge to time-series analysis.
In this work, we design a novel architecture, PIETS, to model heterogeneous time-series.
We show that PIETS is able to effectively model heterogeneous temporal data and outperforms other state-of-the-art approaches in the prediction task.
arXiv Detail & Related papers (2021-09-30T20:01:19Z) - Partially-Aligned Data-to-Text Generation with Distant Supervision [69.15410325679635]
We propose a new generation task called Partially-Aligned Data-to-Text Generation (PADTG).
It is more practical since it utilizes automatically annotated data for training and thus considerably expands the application domains.
Our framework outperforms all baseline models and verifies the feasibility of utilizing partially-aligned data.
arXiv Detail & Related papers (2020-10-03T03:18:52Z) - DeGAN : Data-Enriching GAN for Retrieving Representative Samples from a
Trained Classifier [58.979104709647295]
We bridge the gap between the abundance of available data and the lack of relevant data for the future learning tasks of a trained network.
We use the available data, that may be an imbalanced subset of the original training dataset, or a related domain dataset, to retrieve representative samples.
We demonstrate that data from a related domain can be leveraged to achieve state-of-the-art performance.
arXiv Detail & Related papers (2019-12-27T02:05:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.