Why do tree-based models still outperform deep learning on tabular data?
- URL: http://arxiv.org/abs/2207.08815v1
- Date: Mon, 18 Jul 2022 08:36:08 GMT
- Title: Why do tree-based models still outperform deep learning on tabular data?
- Authors: Léo Grinsztajn (SODA), Edouard Oyallon (ISIR, CNRS), Gaël Varoquaux (SODA)
- Abstract summary: We show that tree-based models remain state-of-the-art on medium-sized data.
We conduct an empirical investigation into the differing inductive biases of tree-based models and Neural Networks (NNs).
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While deep learning has enabled tremendous progress on text and image
datasets, its superiority on tabular data is not clear. We contribute extensive
benchmarks of standard and novel deep learning methods as well as tree-based
models such as XGBoost and Random Forests, across a large number of datasets
and hyperparameter combinations. We define a standard set of 45 datasets from
varied domains with clear characteristics of tabular data and a benchmarking
methodology accounting for both fitting models and finding good
hyperparameters. Results show that tree-based models remain state-of-the-art on
medium-sized data (~10K samples) even without accounting for their
superior speed. To understand this gap, we conduct an empirical investigation
into the differing inductive biases of tree-based models and Neural Networks
(NNs). This leads to a series of challenges which should guide researchers
aiming to build tabular-specific NNs: 1. be robust to uninformative features,
2. preserve the orientation of the data, and 3. be able to easily learn
irregular functions. To stimulate research on tabular architectures, we
contribute a standard benchmark and raw data for baselines: every point of a
20,000 compute-hour hyperparameter search for each learner.
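The first challenge can be probed directly. Below is a minimal sketch, not the paper's benchmark code: it appends pure-noise columns to a medium-sized tabular regression task and compares a Random Forest against an MLP. The dataset choice and hyperparameters are illustrative assumptions.
```python
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

X, y = fetch_california_housing(return_X_y=True)  # ~20K rows, 8 features
rng = np.random.default_rng(0)
X_noisy = np.hstack([X, rng.normal(size=(X.shape[0], 50))])  # add 50 uninformative columns

for tag, data in [("original", X), ("+50 noise features", X_noisy)]:
    X_tr, X_te, y_tr, y_te = train_test_split(data, y, random_state=0)
    models = {
        "RandomForest": RandomForestRegressor(n_estimators=200, n_jobs=-1, random_state=0),
        "MLP": make_pipeline(StandardScaler(),
                             MLPRegressor(hidden_layer_sizes=(256, 256),
                                          max_iter=300, random_state=0)),
    }
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        print(f"{tag:>20} | {name:>12}: R^2 = {r2_score(y_te, model.predict(X_te)):.3f}")
```
On runs of this kind, the tree ensemble typically degrades far less when noise columns are added, which is the behavior the paper's first challenge asks tabular NNs to match.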
Related papers
- Escaping the Forest: Sparse Interpretable Neural Networks for Tabular Data
We show that our models, Sparse TABular NET (sTAB-Net) with attention mechanisms, are more effective than tree-based models.
They also achieve better performance than post-hoc explanation methods like SHAP.
arXiv Detail & Related papers (2024-10-23T10:50:07Z)
- A Closer Look at Deep Learning on Tabular Data
Tabular data is prevalent across various domains in machine learning.
Deep Neural Network (DNN)-based methods have shown promising performance comparable to tree-based ones.
arXiv Detail & Related papers (2024-07-01T04:24:07Z)
- GRANDE: Gradient-Based Decision Tree Ensembles for Tabular Data
We propose a novel approach for learning hard, axis-aligned decision tree ensembles using end-to-end gradient descent.
GRANDE is based on a dense representation of tree ensembles, which permits the use of backpropagation with a straight-through operator.
We demonstrate that our method outperforms existing gradient-boosting and deep learning frameworks on most datasets.
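The straight-through trick can be illustrated in a few lines. The sketch below is a hypothetical PyTorch rendering of a single hard, axis-aligned split, not GRANDE's actual code: the forward pass uses the hard 0/1 decision while gradients flow through a sigmoid relaxation.
```python
import torch

def hard_split(x, threshold, temperature=1.0):
    """Axis-aligned split: forward pass returns the hard decision,
    backward pass uses the gradient of a sigmoid surrogate."""
    soft = torch.sigmoid((x - threshold) / temperature)  # differentiable surrogate
    hard = (x > threshold).float()                       # non-differentiable decision
    # Straight-through: value of `hard`, gradient of `soft`.
    return hard + (soft - soft.detach())

x = torch.randn(8, requires_grad=True)
threshold = torch.tensor(0.0, requires_grad=True)
out = hard_split(x, threshold)
out.sum().backward()  # gradients reach x and threshold via the surrogate
print(out, threshold.grad)
```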
arXiv Detail & Related papers (2023-09-29T10:49:14Z)
- When Do Neural Nets Outperform Boosted Trees on Tabular Data?
We take a step back and question the importance of the 'NN vs. GBDT' debate.
For a surprisingly high number of datasets, the performance difference between GBDTs and NNs is negligible.
We analyze dozens of metafeatures to determine which properties of a dataset make NNs or GBDTs better suited to it.
Our insights act as a guide for practitioners to determine which techniques may work best on their dataset.
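As a rough illustration of what such an analysis starts from, here is a hedged sketch computing a few generic dataset metafeatures; the specific features and the helper name `metafeatures` are illustrative assumptions, not the paper's actual feature set.
```python
import numpy as np
from scipy import stats

def metafeatures(X, y):
    """A few generic dataset metafeatures; y must be integer class labels."""
    n, d = X.shape
    class_freq = np.bincount(y) / len(y)
    return {
        "n_samples": n,
        "n_features": d,
        "samples_per_feature": n / d,
        "mean_abs_skew": float(np.mean(np.abs(stats.skew(X, axis=0)))),
        "mean_abs_kurtosis": float(np.mean(np.abs(stats.kurtosis(X, axis=0)))),
        "target_entropy": float(stats.entropy(class_freq)),
    }

# Example on a small built-in dataset:
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)
print(metafeatures(X, y))
```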
arXiv Detail & Related papers (2023-05-04T17:04:41Z)
- Graph Neural Network contextual embedding for Deep Learning on Tabular Data
Deep Learning (DL) has constituted a major breakthrough for AI in fields related to human skills like natural language processing.
This paper presents a novel DL model using a Graph Neural Network (GNN), more specifically an Interaction Network (IN).
Its results outperform those of a recently published DL benchmark survey on five public datasets, and are competitive with boosted-tree solutions.
arXiv Detail & Related papers (2023-03-11T17:13:24Z)
- Is margin all you need? An extensive empirical study of active learning on tabular data
We analyze the performance of a variety of active learning algorithms on 69 real-world datasets from the OpenML-CC18 benchmark.
Surprisingly, we find that the classical margin sampling technique matches or outperforms all others, including current state-of-the-art methods.
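Margin sampling itself is compact enough to state in full. This is a standard rendering of the classical technique, not the paper's experiment code: query the unlabeled points whose top-two class probabilities are closest.
```python
import numpy as np

def margin_query(model, X_unlabeled, batch_size=10):
    """Return indices of the `batch_size` most uncertain unlabeled points,
    measured by the gap between the top-two predicted class probabilities."""
    proba = model.predict_proba(X_unlabeled)  # shape (n, n_classes)
    top2 = np.sort(proba, axis=1)[:, -2:]     # two largest probabilities per row
    margins = top2[:, 1] - top2[:, 0]         # small margin = most uncertain
    return np.argsort(margins)[:batch_size]
```
Because it only needs `predict_proba`, this works unchanged with any scikit-learn-style classifier.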
arXiv Detail & Related papers (2022-10-07T21:18:24Z)
- A Large Scale Search Dataset for Unbiased Learning to Rank
We introduce the Baidu-ULTR dataset for unbiased learning to rank.
It comprises 1.2 billion randomly sampled search sessions and 7,008 expert-annotated queries.
It provides: (1) the original semantic features and a pre-trained language model for easy usage; (2) sufficient display information such as position, displayed height, and displayed abstract; and (3) rich user feedback on search result pages (SERPs), such as dwell time.
arXiv Detail & Related papers (2022-07-07T02:37:25Z)
- Transfer Learning with Deep Tabular Models
We show that upstream data gives tabular neural networks a decisive advantage over GBDT models.
We propose a realistic medical diagnosis benchmark for tabular transfer learning.
We propose a pseudo-feature method for cases where the upstream and downstream feature sets differ.
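The pseudo-feature idea can be sketched loosely as follows; the helper name and model choice are assumptions for illustration, not the paper's implementation. When the downstream table lacks a column the upstream model expects, predict that column from the features both tables share.
```python
from sklearn.ensemble import RandomForestRegressor

def add_pseudo_feature(upstream_df, downstream_df, missing_col, shared_cols):
    """Synthesize `missing_col` for the downstream table from features
    both tables share, using a model fit on the upstream table."""
    imputer = RandomForestRegressor(n_estimators=100, random_state=0)
    imputer.fit(upstream_df[shared_cols], upstream_df[missing_col])
    out = downstream_df.copy()
    out[missing_col] = imputer.predict(out[shared_cols])
    return out
```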
arXiv Detail & Related papers (2022-06-30T14:24:32Z)
- A Robust Stacking Framework for Training Deep Graph Models with Multifaceted Node Features
Graph Neural Networks (GNNs) with numerical node features and graph structure as inputs have demonstrated superior performance on various supervised learning tasks with graph data.
However, the best models for standard supervised learning on IID (non-graph) data are not easily incorporated into a GNN.
Here we propose a robust stacking framework that fuses graph-aware propagation with arbitrary models intended for IID data.
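One simple way to picture such a fusion, an assumption on my part rather than the paper's exact framework, is to smooth an IID model's per-node predictions over the graph:
```python
import numpy as np

def propagate(preds, adj, alpha=0.5, steps=2):
    """preds: (n_nodes, n_classes) from any IID model, e.g. a GBDT.
    adj: (n_nodes, n_nodes) row-normalized adjacency matrix."""
    out = preds
    for _ in range(steps):
        out = (1 - alpha) * preds + alpha * adj @ out  # pull toward neighbors
    return out
```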
arXiv Detail & Related papers (2022-06-16T22:46:33Z)
- Hopular: Modern Hopfield Networks for Tabular Data
We suggest "Hopular", a novel Deep Learning architecture for medium- and small-sized datasets.
Hopular uses stored data to identify feature-feature, feature-target, and sample-sample dependencies.
In experiments on small-sized datasets with less than 1,000 samples, Hopular surpasses Gradient Boosting, Random Forests, SVMs, and in particular several Deep Learning methods.
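Hopular builds on the modern Hopfield update of Ramsauer et al. (2020). A minimal numpy sketch of that retrieval step, illustrative rather than Hopular's implementation, where a query state attends over stored training patterns:
```python
import numpy as np

def hopfield_retrieve(query, stored, beta=1.0):
    """One update of the modern Hopfield dynamics:
    xi_new = X^T softmax(beta * X @ xi), with stored patterns X (n, d)."""
    scores = beta * stored @ query            # similarity to each stored pattern
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    weights /= weights.sum()
    return stored.T @ weights                 # weighted recall of stored data
```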
arXiv Detail & Related papers (2022-06-01T17:57:44Z)
- Deep Neural Networks and Tabular Data: A Survey
This work provides an overview of state-of-the-art deep learning methods for tabular data.
We start by categorizing them into three groups: data transformations, specialized architectures, and regularization models.
We then provide a comprehensive overview of the main approaches in each group.
arXiv Detail & Related papers (2021-10-05T09:22:39Z)