Generating and Imputing Tabular Data via Diffusion and Flow-based
Gradient-Boosted Trees
- URL: http://arxiv.org/abs/2309.09968v3
- Date: Mon, 19 Feb 2024 21:48:33 GMT
- Title: Generating and Imputing Tabular Data via Diffusion and Flow-based Gradient-Boosted Trees
- Authors: Alexia Jolicoeur-Martineau, Kilian Fatras, Tal Kachman
- Abstract summary: Tabular data is hard to acquire and is subject to missing values.
This paper introduces a novel approach for generating and imputing mixed-type (continuous and categorical) data.
In contrast to prior methods that rely on neural networks to learn the score function or the vector field, we adopt XGBoost.
- Score: 11.732842929815401
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Tabular data is hard to acquire and is subject to missing values. This paper
introduces a novel approach for generating and imputing mixed-type (continuous
and categorical) tabular data utilizing score-based diffusion and conditional
flow matching. In contrast to prior methods that rely on neural networks to
learn the score function or the vector field, we adopt XGBoost, a widely used
Gradient-Boosted Tree (GBT) technique. To test our method, we build one of the
most extensive benchmarks for tabular data generation and imputation,
containing 27 diverse datasets and 9 metrics. Through empirical evaluation
across the benchmark, we demonstrate that our approach outperforms
deep-learning generation methods in data generation tasks and remains
competitive in data imputation. Notably, it can be trained in parallel using
CPUs without requiring a GPU. Our Python and R code is available at
https://github.com/SamsungSAILMontreal/ForestDiffusion.
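The core idea in the abstract, using gradient-boosted trees instead of a neural network to regress a conditional-flow-matching vector field, can be sketched roughly as follows. This is a toy reimplementation, not the authors' ForestDiffusion code: it assumes scikit-learn's GradientBoostingRegressor as a stand-in for XGBoost, a synthetic 2-D Gaussian "dataset", and plain Euler integration for sampling.

```python
# Toy conditional flow matching with gradient-boosted trees (one regressor
# per feature learns the vector field u(x_t, t) = x_1 - x_0), then Euler
# integration transports Gaussian noise toward the data distribution.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[2.0, -1.0], scale=0.5, size=(500, 2))  # "real" data
d = X1.shape[1]

# Flow-matching training set: noise x0, data x1, random time t,
# interpolant x_t = (1 - t) x0 + t x1, regression target u = x1 - x0.
X0 = rng.standard_normal(X1.shape)
t = rng.uniform(size=(len(X1), 1))
Xt = (1 - t) * X0 + t * X1
features = np.hstack([Xt, t])   # condition on (x_t, t)
targets = X1 - X0               # vector field to regress

models = []
for j in range(d):              # one boosted-tree regressor per output dim
    m = GradientBoostingRegressor(n_estimators=100, max_depth=3)
    m.fit(features, targets[:, j])
    models.append(m)

# Sampling: integrate dx/dt = u(x, t) from t = 0 (noise) to t = 1 (data).
def sample(n, steps=20):
    x = rng.standard_normal((n, d))
    for k in range(steps):
        tk = np.full((n, 1), k / steps)
        u = np.column_stack([m.predict(np.hstack([x, tk])) for m in models])
        x = x + u / steps
    return x

samples = sample(200)
print(samples.mean(axis=0))  # should land near the data mean [2, -1]
```

The per-feature regressors are what makes CPU-parallel, GPU-free training possible: each boosted model fits independently on the same (x_t, t) features.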
Related papers
- Unmasking Trees for Tabular Data [0.0]
We present UnmaskingTrees, a simple method for tabular imputation (and generation) employing gradient-boosted decision trees.
To solve the conditional generation subproblem, we propose BaltoBot, which fits a balanced tree of boosted tree classifiers.
Unlike older methods, it requires no parametric assumption on the conditional distribution, accommodating features with multimodal distributions.
We finally consider our two approaches as meta-algorithms, demonstrating in-context learning-based generative modeling with TabPFN.
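The "balanced tree of boosted tree classifiers" idea for nonparametric conditional generation might look like the following hypothetical sketch (my reading of the one-line description, not the UnmaskingTrees code): recursively split the target's range at its median and train one boosted classifier per split to predict which side y falls on, so no parametric form is assumed and multimodal conditionals come for free.

```python
# Hypothetical sketch: model p(y | x) by a balanced binary tree over y's
# range; each internal node holds a boosted classifier predicting whether
# y falls in the lower half. Sampling descends the tree stochastically.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(2000, 1))
# Bimodal conditional target: the mode flips with the sign of x.
y = np.where(x[:, 0] > 0, 2.0, -2.0) + rng.normal(0, 0.3, 2000)

def fit_node(x, y, depth):
    if depth == 0 or len(y) < 20:
        return ("leaf", float(y.mean()))
    cut = float(np.median(y))          # balanced split of the y-range
    left = y <= cut
    clf = GradientBoostingClassifier(n_estimators=50, max_depth=2)
    clf.fit(x, left)                   # P(y <= cut | x)
    return ("split", cut, clf,
            fit_node(x[left], y[left], depth - 1),
            fit_node(x[~left], y[~left], depth - 1))

tree = fit_node(x, y, depth=4)

def sample_one(node, xq, rng):
    while node[0] == "split":
        _, cut, clf, lo, hi = node
        p_left = clf.predict_proba(xq.reshape(1, -1))[0, 1]
        node = lo if rng.uniform() < p_left else hi
    return node[1]

draws = np.array([sample_one(tree, np.array([0.8]), rng) for _ in range(50)])
print(draws.mean())  # for x > 0, samples concentrate near +2
```

Because each node only answers a binary "which half" question, the leaves approximate arbitrary conditional distributions, which is the advantage over methods that assume, say, a Gaussian conditional.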
arXiv Detail & Related papers (2024-07-08T04:15:43Z)
- BUFF: Boosted Decision Tree based Ultra-Fast Flow matching [3.23055518616474]
Tabular data is one of the most frequently encountered types in high energy physics.
We adopt the very recent generative modeling class named conditional flow matching and employ different techniques to integrate the usage of Gradient Boosted Trees.
We demonstrate that training and inference for most high-level simulation tasks can be sped up by orders of magnitude.
arXiv Detail & Related papers (2024-04-28T15:31:20Z)
- T5Score: Discriminative Fine-tuning of Generative Evaluation Metrics [94.69907794006826]
We present a framework that combines the best of both worlds, using both supervised and unsupervised signals from whatever data we have available.
We operationalize this idea by training T5Score, a metric that uses these training signals with mT5 as the backbone.
T5Score achieves the best performance on all datasets against existing top-scoring metrics at the segment level.
arXiv Detail & Related papers (2022-12-12T06:29:04Z)
- PARTIME: Scalable and Parallel Processing Over Time with Deep Neural Networks [68.96484488899901]
We present PARTIME, a library designed to speed up neural networks whenever data is continuously streamed over time.
PARTIME starts processing each data sample at the time in which it becomes available from the stream.
Experiments empirically compare PARTIME with classic non-parallel neural computations in online learning.
arXiv Detail & Related papers (2022-10-17T14:49:14Z)
- TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second [48.87527918630822]
We present TabPFN, a trained Transformer that can do supervised classification for small datasets in less than a second.
TabPFN performs in-context learning (ICL): it learns to make predictions from sequences of labeled examples.
We show that our method clearly outperforms boosted trees and performs on par with complex state-of-the-art AutoML systems with up to 230× speedup.
arXiv Detail & Related papers (2022-07-05T07:17:43Z)
- Hopular: Modern Hopfield Networks for Tabular Data [5.470026407471584]
We suggest "Hopular", a novel Deep Learning architecture for medium- and small-sized datasets.
Hopular uses stored data to identify feature-feature, feature-target, and sample-sample dependencies.
In experiments on small-sized datasets with less than 1,000 samples, Hopular surpasses Gradient Boosting, Random Forests, SVMs, and in particular several Deep Learning methods.
arXiv Detail & Related papers (2022-06-01T17:57:44Z)
- A Framework and Benchmark for Deep Batch Active Learning for Regression [2.093287944284448]
We study active learning methods that adaptively select batches of unlabeled data for labeling.
We present a framework for constructing such methods out of (network-dependent) base kernels, kernel transformations, and selection methods.
Our proposed method outperforms the state-of-the-art on our benchmark, scales to large data sets, and works out-of-the-box without adjusting the network architecture or training code.
arXiv Detail & Related papers (2022-03-17T16:11:36Z)
- OCT-GAN: Neural ODE-based Conditional Tabular GANs [8.062118111791495]
We introduce our generator and discriminator based on neural ordinary differential equations (NODEs).
We conduct experiments with 13 datasets, including insurance fraud detection and online news article prediction.
arXiv Detail & Related papers (2021-05-31T13:58:55Z)
- Cherry-Picking Gradients: Learning Low-Rank Embeddings of Visual Data via Differentiable Cross-Approximation [53.95297550117153]
We propose an end-to-end trainable framework that processes large-scale visual data tensors by looking at a fraction of their entries only.
The proposed approach is particularly useful for large-scale multidimensional grid data, and for tasks that require context over a large receptive field.
arXiv Detail & Related papers (2021-05-29T08:39:57Z)
- Heuristic Semi-Supervised Learning for Graph Generation Inspired by Electoral College [80.67842220664231]
We propose a novel pre-processing technique, namely ELectoral COllege (ELCO), which automatically expands new nodes and edges to refine the label similarity within a dense subgraph.
In all setups tested, our method boosts the average score of base models by a large margin of 4.7 points and consistently outperforms the state-of-the-art.
arXiv Detail & Related papers (2020-06-10T14:48:48Z)
- Auto-Encoding Twin-Bottleneck Hashing [141.5378966676885]
This paper proposes an efficient and adaptive code-driven graph.
It is updated by decoding in the context of an auto-encoder.
Experiments on benchmarked datasets clearly show the superiority of our framework over the state-of-the-art hashing methods.
arXiv Detail & Related papers (2020-02-27T05:58:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.