Generating and Imputing Tabular Data via Diffusion and Flow-based
Gradient-Boosted Trees
- URL: http://arxiv.org/abs/2309.09968v3
- Date: Mon, 19 Feb 2024 21:48:33 GMT
- Title: Generating and Imputing Tabular Data via Diffusion and Flow-based Gradient-Boosted Trees
- Authors: Alexia Jolicoeur-Martineau, Kilian Fatras, Tal Kachman
- Abstract summary: Tabular data is hard to acquire and is subject to missing values.
This paper introduces a novel approach for generating and imputing mixed-type (continuous and categorical) data.
In contrast to prior methods that rely on neural networks to learn the score function or the vector field, we adopt XGBoost.
- Score: 11.732842929815401
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Tabular data is hard to acquire and is subject to missing values. This paper
introduces a novel approach for generating and imputing mixed-type (continuous
and categorical) tabular data utilizing score-based diffusion and conditional
flow matching. In contrast to prior methods that rely on neural networks to
learn the score function or the vector field, we adopt XGBoost, a widely used
Gradient-Boosted Tree (GBT) technique. To test our method, we build one of the
most extensive benchmarks for tabular data generation and imputation,
containing 27 diverse datasets and 9 metrics. Through empirical evaluation
across the benchmark, we demonstrate that our approach outperforms
deep-learning generation methods in data generation tasks and remains
competitive in data imputation. Notably, it can be trained in parallel using
CPUs without requiring a GPU. Our Python and R code is available at
https://github.com/SamsungSAILMontreal/ForestDiffusion.
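The core idea in the abstract, using gradient-boosted trees instead of a neural network to regress a conditional-flow-matching vector field, can be sketched roughly as follows. This is a toy reimplementation, not the authors' ForestDiffusion code: it assumes scikit-learn's GradientBoostingRegressor as a stand-in for XGBoost, a synthetic 2-D Gaussian "dataset", and plain Euler integration for sampling.

```python
# Toy conditional flow matching with gradient-boosted trees (one regressor
# per feature learns the vector field u(x_t, t) = x_1 - x_0), then Euler
# integration transports Gaussian noise toward the data distribution.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[2.0, -1.0], scale=0.5, size=(500, 2))  # "real" data
d = X1.shape[1]

# Flow-matching training set: noise x0, data x1, random time t,
# interpolant x_t = (1 - t) x0 + t x1, regression target u = x1 - x0.
X0 = rng.standard_normal(X1.shape)
t = rng.uniform(size=(len(X1), 1))
Xt = (1 - t) * X0 + t * X1
features = np.hstack([Xt, t])   # condition on (x_t, t)
targets = X1 - X0               # vector field to regress

models = []
for j in range(d):              # one boosted-tree regressor per output dim
    m = GradientBoostingRegressor(n_estimators=100, max_depth=3)
    m.fit(features, targets[:, j])
    models.append(m)

# Sampling: integrate dx/dt = u(x, t) from t = 0 (noise) to t = 1 (data).
def sample(n, steps=20):
    x = rng.standard_normal((n, d))
    for k in range(steps):
        tk = np.full((n, 1), k / steps)
        u = np.column_stack([m.predict(np.hstack([x, tk])) for m in models])
        x = x + u / steps
    return x

samples = sample(200)
print(samples.mean(axis=0))  # should land near the data mean [2, -1]
```

The per-feature regressors are what makes CPU-parallel, GPU-free training possible: each boosted model fits independently on the same (x_t, t) features.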
Related papers
- Unmasking Trees for Tabular Data [0.0]
We present UnmaskingTrees, a simple method for tabular imputation (and generation) employing gradient-boosted decision trees.
To solve the conditional generation subproblem, we propose BaltoBot, which fits a balanced tree of boosted tree classifiers.
Unlike older methods, it requires no parametric assumption on the conditional distribution, accommodating features with multimodal distributions.
We finally consider our two approaches as meta-algorithms, demonstrating in-context learning-based generative modeling with TabPFN.
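The "balanced tree of boosted tree classifiers" idea for nonparametric conditional generation might look like the following hypothetical sketch (my reading of the one-line description, not the UnmaskingTrees code): recursively split the target's range at its median and train one boosted classifier per split to predict which side y falls on, so no parametric form is assumed and multimodal conditionals come for free.

```python
# Hypothetical sketch: model p(y | x) by a balanced binary tree over y's
# range; each internal node holds a boosted classifier predicting whether
# y falls in the lower half. Sampling descends the tree stochastically.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(2000, 1))
# Bimodal conditional target: the mode flips with the sign of x.
y = np.where(x[:, 0] > 0, 2.0, -2.0) + rng.normal(0, 0.3, 2000)

def fit_node(x, y, depth):
    if depth == 0 or len(y) < 20:
        return ("leaf", float(y.mean()))
    cut = float(np.median(y))          # balanced split of the y-range
    left = y <= cut
    clf = GradientBoostingClassifier(n_estimators=50, max_depth=2)
    clf.fit(x, left)                   # P(y <= cut | x)
    return ("split", cut, clf,
            fit_node(x[left], y[left], depth - 1),
            fit_node(x[~left], y[~left], depth - 1))

tree = fit_node(x, y, depth=4)

def sample_one(node, xq, rng):
    while node[0] == "split":
        _, cut, clf, lo, hi = node
        p_left = clf.predict_proba(xq.reshape(1, -1))[0, 1]
        node = lo if rng.uniform() < p_left else hi
    return node[1]

draws = np.array([sample_one(tree, np.array([0.8]), rng) for _ in range(50)])
print(draws.mean())  # for x > 0, samples concentrate near +2
```

Because each node only answers a binary "which half" question, the leaves approximate arbitrary conditional distributions, which is the advantage over methods that assume, say, a Gaussian conditional.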
arXiv Detail & Related papers (2024-07-08T04:15:43Z)
- BUFF: Boosted Decision Tree based Ultra-Fast Flow matching [3.23055518616474]
Tabular data is one of the most frequently encountered types in high energy physics.
We adopt the very recent generative modeling class named conditional flow matching and employ different techniques to integrate the usage of Gradient Boosted Trees.
We demonstrate that training and inference for most high-level simulation tasks can be sped up by orders of magnitude.
arXiv Detail & Related papers (2024-04-28T15:31:20Z)
- T5Score: Discriminative Fine-tuning of Generative Evaluation Metrics [94.69907794006826]
We present a framework that combines the best of both worlds, using both supervised and unsupervised signals from whatever data we have available.
We operationalize this idea by training T5Score, a metric that uses these training signals with mT5 as the backbone.
T5Score achieves the best performance on all datasets against existing top-scoring metrics at the segment level.
arXiv Detail & Related papers (2022-12-12T06:29:04Z)
- PARTIME: Scalable and Parallel Processing Over Time with Deep Neural Networks [68.96484488899901]
We present PARTIME, a library designed to speed up neural networks whenever data is continuously streamed over time.
PARTIME starts processing each data sample at the time in which it becomes available from the stream.
Experiments empirically compare PARTIME with classic non-parallel neural computations in online learning.
arXiv Detail & Related papers (2022-10-17T14:49:14Z)
- TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second [48.87527918630822]
We present TabPFN, a trained Transformer that can do supervised classification for small datasets in less than a second.
TabPFN performs in-context learning (ICL): it learns to make predictions from sequences of labeled examples.
We show that our method clearly outperforms boosted trees and performs on par with complex state-of-the-art AutoML systems with up to 230× speedup.
arXiv Detail & Related papers (2022-07-05T07:17:43Z)
- Hopular: Modern Hopfield Networks for Tabular Data [5.470026407471584]
We suggest "Hopular", a novel Deep Learning architecture for medium- and small-sized datasets.
Hopular uses stored data to identify feature-feature, feature-target, and sample-sample dependencies.
In experiments on small-sized datasets with less than 1,000 samples, Hopular surpasses Gradient Boosting, Random Forests, SVMs, and in particular several Deep Learning methods.
arXiv Detail & Related papers (2022-06-01T17:57:44Z)
- A Framework and Benchmark for Deep Batch Active Learning for Regression [2.093287944284448]
We study active learning methods that adaptively select batches of unlabeled data for labeling.
We present a framework for constructing such methods out of (network-dependent) base kernels, kernel transformations, and selection methods.
Our proposed method outperforms the state-of-the-art on our benchmark, scales to large data sets, and works out-of-the-box without adjusting the network architecture or training code.
arXiv Detail & Related papers (2022-03-17T16:11:36Z)
- OCT-GAN: Neural ODE-based Conditional Tabular GANs [8.062118111791495]
We introduce our generator and discriminator based on neural ordinary differential equations (NODEs).
We conduct experiments with 13 datasets, including insurance fraud detection and online news article prediction.
arXiv Detail & Related papers (2021-05-31T13:58:55Z)
- Cherry-Picking Gradients: Learning Low-Rank Embeddings of Visual Data via Differentiable Cross-Approximation [53.95297550117153]
We propose an end-to-end trainable framework that processes large-scale visual data tensors by looking at a fraction of their entries only.
The proposed approach is particularly useful for large-scale multidimensional grid data, and for tasks that require context over a large receptive field.
arXiv Detail & Related papers (2021-05-29T08:39:57Z)
- Heuristic Semi-Supervised Learning for Graph Generation Inspired by Electoral College [80.67842220664231]
We propose a novel pre-processing technique, namely ELectoral COllege (ELCO), which automatically expands new nodes and edges to refine the label similarity within a dense subgraph.
In all setups tested, our method boosts the average score of base models by a large margin of 4.7 points and consistently outperforms the state-of-the-art.
arXiv Detail & Related papers (2020-06-10T14:48:48Z)
- Auto-Encoding Twin-Bottleneck Hashing [141.5378966676885]
This paper proposes an efficient and adaptive code-driven graph.
It is updated by decoding in the context of an auto-encoder.
Experiments on benchmarked datasets clearly show the superiority of our framework over the state-of-the-art hashing methods.
arXiv Detail & Related papers (2020-02-27T05:58:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.