TabEBM: A Tabular Data Augmentation Method with Distinct Class-Specific Energy-Based Models
- URL: http://arxiv.org/abs/2409.16118v3
- Date: Wed, 6 Nov 2024 14:34:16 GMT
- Title: TabEBM: A Tabular Data Augmentation Method with Distinct Class-Specific Energy-Based Models
- Authors: Andrei Margeloiu, Xiangjian Jiang, Nikola Simidjievski, Mateja Jamnik
- Abstract summary: TabEBM is a class-conditional generative method using Energy-Based Models (EBMs).
Our experiments show that TabEBM generates synthetic data with higher quality and better statistical fidelity than existing methods.
- Score: 10.88959673845634
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data collection is often difficult in critical fields such as medicine, physics, and chemistry. As a result, classification methods usually perform poorly with these small datasets, leading to weak predictive performance. Increasing the training set with additional synthetic data, similar to data augmentation in images, is commonly believed to improve downstream classification performance. However, current tabular generative methods that learn either the joint distribution $ p(\mathbf{x}, y) $ or the class-conditional distribution $ p(\mathbf{x} \mid y) $ often overfit on small datasets, resulting in poor-quality synthetic data, usually worsening classification performance compared to using real data alone. To solve these challenges, we introduce TabEBM, a novel class-conditional generative method using Energy-Based Models (EBMs). Unlike existing methods that use a shared model to approximate all class-conditional densities, our key innovation is to create distinct EBM generative models for each class, each modelling its class-specific data distribution individually. This approach creates robust energy landscapes, even in ambiguous class distributions. Our experiments show that TabEBM generates synthetic data with higher quality and better statistical fidelity than existing methods. When used for data augmentation, our synthetic data consistently improves the classification performance across diverse datasets of various sizes, especially small ones. Code is available at https://github.com/andreimargeloiu/TabEBM.
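The paper's key idea is to fit one independent energy-based model per class and then sample each model's energy landscape to produce synthetic rows for that class. Below is a minimal, hedged sketch of that idea in generic PyTorch. It is not the authors' implementation (their code is at the repository above): the energy network architecture, the noise-contrastive training objective, and the Langevin (SGLD) sampler are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class EnergyNet(nn.Module):
    """Scalar energy E(x) for a single class."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

def train_class_ebm(x_real, steps=500, lr=1e-3):
    """Fit one EBM on the (standardised) rows of a single class.
    Objective: logistic discrimination of real rows against Gaussian
    noise, an NCE-style stand-in for the paper's construction."""
    ebm = EnergyNet(x_real.shape[1])
    opt = torch.optim.Adam(ebm.parameters(), lr=lr)
    for _ in range(steps):
        noise = torch.randn_like(x_real)                 # negatives from N(0, I)
        logits = torch.cat([-ebm(x_real), -ebm(noise)])  # low energy => "real"
        labels = torch.cat([torch.ones(len(x_real)), torch.zeros(len(noise))])
        loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return ebm

def sgld_sample(ebm, n, dim, steps=200, step_size=1e-2):
    """Draw n synthetic rows from one class-specific energy landscape
    via stochastic gradient Langevin dynamics."""
    x = torch.randn(n, dim)
    for _ in range(steps):
        x = x.detach().requires_grad_(True)
        grad = torch.autograd.grad(ebm(x).sum(), x)[0]
        x = x - 0.5 * step_size * grad + step_size ** 0.5 * torch.randn_like(x)
    return x.detach()

def augment(X, y, per_class=100):
    """One EBM per class, mirroring the paper's key idea."""
    X_new, y_new = [X], [y]
    for c in torch.unique(y):
        ebm = train_class_ebm(X[y == c])
        X_new.append(sgld_sample(ebm, per_class, X.shape[1]))
        y_new.append(torch.full((per_class,), int(c), dtype=y.dtype))
    return torch.cat(X_new), torch.cat(y_new)
```

Training each class on its own rows is what prevents one class's modes from distorting another class's energy landscape, which is the robustness argument the abstract makes for ambiguous class distributions.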
Related papers
- Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLMs) have been used for diverse tasks, but they do not capture the correct correlation between the features and the target variable.
We propose an LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data.
Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
arXiv Detail & Related papers (2024-10-29T04:14:32Z)
- TabDiff: a Multi-Modal Diffusion Model for Tabular Data Generation [91.50296404732902]
We introduce TabDiff, a joint diffusion framework that models all multi-modal distributions of tabular data in one model.
Our key innovation is the development of a joint continuous-time diffusion process for numerical and categorical data.
TabDiff achieves superior average performance over existing competitive baselines, with up to $22.5\%$ improvement over the state-of-the-art model on pairwise column correlation estimations.
arXiv Detail & Related papers (2024-10-27T22:58:47Z)
- Machine Unlearning using a Multi-GAN based Model [0.0]
This article presents a new machine unlearning approach that utilizes multiple Generative Adversarial Network (GAN)-based models.
The proposed method comprises two phases: i) data reorganization, in which synthetic data generated by the GAN model is introduced with inverted class labels for the forget set, and ii) fine-tuning of the pre-trained model.
arXiv Detail & Related papers (2024-07-26T02:28:32Z)
- FissionFusion: Fast Geometric Generation and Hierarchical Souping for Medical Image Analysis [0.7751705157998379]
The scarcity of well-annotated medical datasets requires leveraging transfer learning from broader datasets like ImageNet or pre-trained models like CLIP.
Model soups average the weights of multiple fine-tuned models, aiming to improve performance on In-Domain (ID) tasks and enhance robustness against Out-of-Distribution (OOD) datasets.
We propose a hierarchical merging approach that involves local and global aggregation of models at various levels.
arXiv Detail & Related papers (2024-03-20T06:48:48Z)
- Improved Distribution Matching for Dataset Condensation [91.55972945798531]
We propose a novel dataset condensation method based on distribution matching.
Our simple yet effective method outperforms most previous optimization-oriented methods with far fewer computational resources.
arXiv Detail & Related papers (2023-07-19T04:07:33Z)
- Self-Evolution Learning for Mixup: Enhance Data Augmentation on Few-Shot Text Classification Tasks [75.42002070547267]
We propose a self-evolution learning (SE) based mixup approach for data augmentation in text classification.
We introduce a novel instance-specific label smoothing approach, which linearly interpolates the model's output and the one-hot labels of the original samples to generate new soft labels for mixup (see the sketch after this list).
arXiv Detail & Related papers (2023-05-22T23:43:23Z)
- Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
- Continual Learning with Optimal Transport based Mixture Model [17.398605698033656]
We propose an online mixture model learning approach based on the nice properties of mature optimal transport theory (OT-MM).
Our proposed method can significantly outperform the current state-of-the-art baselines.
arXiv Detail & Related papers (2022-11-30T06:40:29Z)
- Learning from aggregated data with a maximum entropy model [73.63512438583375]
We show how a new model, similar to a logistic regression, may be learned from aggregated data only by approximating the unobserved feature distribution with a maximum entropy hypothesis.
We present empirical evidence on several public datasets that the model learned this way can achieve performance comparable to that of a logistic model trained with the full unaggregated data.
arXiv Detail & Related papers (2022-10-05T09:17:27Z)
- Efficient Classification with Counterfactual Reasoning and Active Learning [4.708737212700907]
The proposed method, CCRAL, combines causal reasoning, which learns counterfactual samples for the original training samples, with active learning, which selects useful counterfactual samples based on a region of uncertainty.
Experiments show that CCRAL achieves significantly better performance than the baselines in terms of accuracy and AUC.
arXiv Detail & Related papers (2022-07-25T12:03:40Z)
- Field-wise Learning for Multi-field Categorical Data [27.100048708707593]
We propose a new method for learning with multi-field categorical data.
In doing so, the models can be fitted to each category and thus better capture the underlying differences in the data.
Experimental results on two large-scale datasets show the superior performance of our model.
arXiv Detail & Related papers (2020-12-01T01:10:14Z)
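The self-evolution mixup entry above describes a concrete interpolation rule: soften each label by mixing the model's predicted distribution with the one-hot label, then apply standard mixup. Here is a minimal sketch of that rule; the function names and the fixed weighting coefficients are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def self_evolved_soft_label(model_probs, one_hot, alpha=0.5):
    # Assumed form: linearly interpolate the model's predicted
    # distribution with the original one-hot label.
    return alpha * model_probs + (1.0 - alpha) * one_hot

def mixup(x1, soft1, x2, soft2, lam=0.7):
    # Standard mixup applied to the inputs and to the softened labels.
    return lam * x1 + (1.0 - lam) * x2, lam * soft1 + (1.0 - lam) * soft2

# Example with a 3-class sample:
p = np.array([0.7, 0.2, 0.1])            # model output for the sample
y = np.array([1.0, 0.0, 0.0])            # its one-hot label
soft = self_evolved_soft_label(p, y)     # -> [0.85, 0.10, 0.05]
```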