STaSy: Score-based Tabular data Synthesis
- URL: http://arxiv.org/abs/2210.04018v4
- Date: Mon, 29 May 2023 06:37:50 GMT
- Title: STaSy: Score-based Tabular data Synthesis
- Authors: Jayoung Kim, Chaejeong Lee, Noseong Park
- Abstract summary: We present a new model named Score-based Tabular data Synthesis (STaSy).
Our training strategy includes a self-paced learning technique and a fine-tuning strategy.
In our experiments, our method outperforms existing methods in terms of task-dependent evaluations and diversity.
- Score: 10.292096717484698
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Tabular data synthesis is a long-standing research topic in machine learning.
Many different methods have been proposed over the past decades, ranging from
statistical methods to deep generative methods. However, it has not always been
successful due to the complicated nature of real-world tabular data. In this
paper, we present a new model named Score-based Tabular data Synthesis (STaSy)
and its training strategy based on the paradigm of score-based generative
modeling. Despite the fact that score-based generative models have resolved
many issues in generative models, there still exists room for improvement in
tabular data synthesis. Our proposed training strategy includes a self-paced
learning technique and a fine-tuning strategy, which further increases the
sampling quality and diversity by stabilizing the denoising score matching
training. Furthermore, we also conduct rigorous experimental studies in terms
of the generative task trilemma: sampling quality, diversity, and time. In our
experiments with 15 benchmark tabular datasets and 7 baselines, our method
outperforms existing methods in terms of task-dependent evaluations and
diversity. Code is available at https://github.com/JayoungKim408/STaSy.
Related papers
- CorrSynth -- A Correlated Sampling Method for Diverse Dataset Generation from LLMs [5.89889361990138]
Large language models (LLMs) have demonstrated remarkable performance in diverse tasks using zero-shot and few-shot prompting.
In this work, we tackle the challenge of generating datasets with high diversity, upon which a student model is trained for downstream tasks.
Taking the route of decoding-time guidance-based approaches, we propose CorrSynth, which generates data that is more diverse and faithful to the input prompt using a correlated sampling strategy.
arXiv Detail & Related papers (2024-11-13T12:09:23Z) - Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLMs) have been used for diverse tasks, but they do not capture the correct correlation between the features and the target variable.
We propose an LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data.
Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
arXiv Detail & Related papers (2024-10-29T04:14:32Z) - How Hard is this Test Set? NLI Characterization by Exploiting Training Dynamics [49.9329723199239]
We propose a method for the automated creation of a challenging test set without relying on the manual construction of artificial and unrealistic examples.
We categorize the test set of popular NLI datasets into three difficulty levels by leveraging methods that exploit training dynamics.
When our characterization method is applied to the training set, models trained with only a fraction of the data achieve comparable performance to those trained on the full dataset.
arXiv Detail & Related papers (2024-10-04T13:39:21Z) - Noisy Self-Training with Synthetic Queries for Dense Retrieval [49.49928764695172]
We introduce a novel noisy self-training framework combined with synthetic queries.
Experimental results show that our method improves consistently over existing methods.
Our method is data efficient and outperforms competitive baselines.
arXiv Detail & Related papers (2023-11-27T06:19:50Z) - Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A
Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative in training machine learning models.
Ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z) - A Discrepancy Aware Framework for Robust Anomaly Detection [51.710249807397695]
We present a Discrepancy Aware Framework (DAF), which demonstrates robust performance consistently with simple and cheap strategies.
Our method leverages an appearance-agnostic cue to guide the decoder in identifying defects, thereby alleviating its reliance on synthetic appearance.
Under the simple synthesis strategies, it outperforms existing methods by a large margin. Furthermore, it also achieves the state-of-the-art localization performance.
arXiv Detail & Related papers (2023-10-11T15:21:40Z) - Towards Automated Imbalanced Learning with Deep Hierarchical
Reinforcement Learning [57.163525407022966]
Imbalanced learning is a fundamental challenge in data mining, where there is a disproportionate ratio of training samples in each class.
Over-sampling is an effective technique to tackle imbalanced learning through generating synthetic samples for the minority class.
We propose AutoSMOTE, an automated over-sampling algorithm that can jointly optimize different levels of decisions.
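The over-sampling that AutoSMOTE automates builds on the basic SMOTE interpolation rule: each synthetic minority sample is drawn on the line segment between a minority point and one of its k nearest minority neighbours. A small self-contained sketch of that rule (the function name and toy data are illustrative, not from the paper, and AutoSMOTE's reinforcement-learned decision levels are not modeled):

```python
import numpy as np

rng = np.random.default_rng(1)

# Imbalanced toy data: 100 majority rows vs. 10 minority rows.
majority = rng.normal(0.0, 1.0, size=(100, 2))
minority = rng.normal(3.0, 0.5, size=(10, 2))

def smote_like(X, n_new, k=3, rng=rng):
    """Synthesize minority samples by interpolating each picked point
    with one of its k nearest minority neighbours (basic SMOTE idea)."""
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X))
        dists = np.linalg.norm(X - X[i], axis=1)
        nbrs = np.argsort(dists)[1:k + 1]   # skip the point itself
        j = rng.choice(nbrs)
        lam = rng.random()                  # interpolation factor in [0, 1]
        out.append(X[i] + lam * (X[j] - X[i]))
    return np.array(out)

synthetic = smote_like(minority, n_new=90)
balanced_minority = np.vstack([minority, synthetic])  # now 100 rows
```

Because every synthetic point is a convex combination of two minority points, it stays inside the per-feature bounding box of the minority class; AutoSMOTE's contribution is to learn, rather than hand-tune, decisions such as how many samples to generate and which neighbours to use.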
arXiv Detail & Related papers (2022-08-26T04:28:01Z) - Efficient Classification with Counterfactual Reasoning and Active
Learning [4.708737212700907]
CCRAL combines causal reasoning, which learns counterfactual samples for the original training samples, with active learning, which selects useful counterfactual samples based on a region of uncertainty.
Experiments show that CCRAL achieves significantly better performance than the baselines in terms of accuracy and AUC.
arXiv Detail & Related papers (2022-07-25T12:03:40Z) - Contemporary Symbolic Regression Methods and their Relative Performance [5.285811942108162]
We assess 14 symbolic regression methods and 7 machine learning methods on a set of 252 diverse regression problems.
For the real-world datasets, we benchmark the ability of each method to learn models with low error and low complexity.
For the synthetic problems, we assess each method's ability to find exact solutions in the presence of varying levels of noise.
arXiv Detail & Related papers (2021-07-29T22:12:59Z) - Foundations of Bayesian Learning from Synthetic Data [1.6249267147413522]
We use a Bayesian paradigm to characterise the updating of model parameters when learning on synthetic data.
Recent results from general Bayesian updating support a novel and robust approach to learning from synthetic data founded on decision theory.
arXiv Detail & Related papers (2020-11-16T21:49:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information and is not responsible for any consequences.