Targeted synthetic data generation for tabular data via hardness characterization
- URL: http://arxiv.org/abs/2410.00759v2
- Date: Sat, 08 Feb 2025 13:33:51 GMT
- Title: Targeted synthetic data generation for tabular data via hardness characterization
- Authors: Tommaso Ferracci, Leonie Tabea Goldmann, Anton Hinel, Francesco Sanna Passino,
- Abstract summary: We introduce a simple augmentation pipeline that generates only high-value training points based on hardness characterization.
Our approach improves the quality of out-of-sample predictions and it is computationally more efficient compared to non-targeted methods.
- Score: 0.0
- License:
- Abstract: Data augmentation via synthetic data generation has been shown to be effective in improving model performance and robustness in the context of scarce or low-quality data. Using the data valuation framework to statistically identify beneficial and detrimental observations, we introduce a simple augmentation pipeline that generates only high-value training points based on hardness characterization, in a computationally efficient manner. We first empirically demonstrate via benchmarks on real data that Shapley-based data valuation methods perform comparably with learning-based methods in hardness characterization tasks, while offering significant computational advantages. Then, we show that synthetic data generators trained on the hardest points outperform non-targeted data augmentation on a number of tabular datasets. Our approach improves the quality of out-of-sample predictions and it is computationally more efficient compared to non-targeted methods.
Related papers
- Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLM) have been used for diverse tasks, but do not capture the correct correlation between the features and the target variable.
We propose a LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data.
Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
arXiv Detail & Related papers (2024-10-29T04:14:32Z) - Unveiling the Flaws: Exploring Imperfections in Synthetic Data and Mitigation Strategies for Large Language Models [89.88010750772413]
Synthetic data has been proposed as a solution to address the issue of high-quality data scarcity in the training of large language models (LLMs)
Our work delves into these specific flaws associated with question-answer (Q-A) pairs, a prevalent type of synthetic data, and presents a method based on unlearning techniques to mitigate these flaws.
Our work has yielded key insights into the effective use of synthetic data, aiming to promote more robust and efficient LLM training.
arXiv Detail & Related papers (2024-06-18T08:38:59Z) - Neural Dynamic Data Valuation [4.286118155737111]
We propose a novel data valuation method from the perspective of optimal control, named the neural dynamic data valuation (NDDV)
Our method has solid theoretical interpretations to accurately identify the data valuation via the sensitivity of the data optimal control state.
In addition, we implement a data re-weighting strategy to capture the unique features of data points, ensuring fairness through the interaction between data points and the mean-field states.
arXiv Detail & Related papers (2024-04-30T13:39:26Z) - Synthetic Information towards Maximum Posterior Ratio for deep learning
on Imbalanced Data [1.7495515703051119]
We propose a technique for data balancing by generating synthetic data for the minority class.
Our method prioritizes balancing the informative regions by identifying high entropy samples.
Our experimental results on forty-one datasets demonstrate the superior performance of our technique.
arXiv Detail & Related papers (2024-01-05T01:08:26Z) - TRIAGE: Characterizing and auditing training data for improved
regression [80.11415390605215]
We introduce TRIAGE, a novel data characterization framework tailored to regression tasks and compatible with a broad class of regressors.
TRIAGE utilizes conformal predictive distributions to provide a model-agnostic scoring method, the TRIAGE score.
We show that TRIAGE's characterization is consistent and highlight its utility to improve performance via data sculpting/filtering, in multiple regression settings.
arXiv Detail & Related papers (2023-10-29T10:31:59Z) - Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A
Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative in training machine learning models.
ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z) - Towards Robust Dataset Learning [90.2590325441068]
We propose a principled, tri-level optimization to formulate the robust dataset learning problem.
Under an abstraction model that characterizes robust vs. non-robust features, the proposed method provably learns a robust dataset.
arXiv Detail & Related papers (2022-11-19T17:06:10Z) - Efficient Classification with Counterfactual Reasoning and Active
Learning [4.708737212700907]
Methods called CCRAL combine causal reasoning to learn counterfactual samples for the original training samples and active learning to select useful counterfactual samples based on a region of uncertainty.
Experiments show that CCRAL achieves significantly better performance than those of the baselines in terms of accuracy and AUC.
arXiv Detail & Related papers (2022-07-25T12:03:40Z) - Exploring the Efficacy of Automatically Generated Counterfactuals for
Sentiment Analysis [17.811597734603144]
We propose an approach to automatically generating counterfactual data for data augmentation and explanation.
A comprehensive evaluation on several different datasets and using a variety of state-of-the-art benchmarks demonstrate how our approach can achieve significant improvements in model performance.
arXiv Detail & Related papers (2021-06-29T10:27:01Z) - Towards Synthetic Multivariate Time Series Generation for Flare
Forecasting [5.098461305284216]
One of the limiting factors in training data-driven, rare-event prediction algorithms is the scarcity of the events of interest.
In this study, we explore the usefulness of the conditional generative adversarial network (CGAN) as a means to perform data-informed oversampling.
arXiv Detail & Related papers (2021-05-16T22:23:23Z) - Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
We experimentally verify that the new dataset can significantly improve the ability of the learned FER model.
To tackle this, we propose to apply a dataset distillation strategy to compress the created dataset into several informative class-wise images.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.