Tabular Data: Deep Learning is Not All You Need
- URL: http://arxiv.org/abs/2106.03253v1
- Date: Sun, 6 Jun 2021 21:22:39 GMT
- Title: Tabular Data: Deep Learning is Not All You Need
- Authors: Ravid Shwartz-Ziv and Amitai Armon
- Abstract summary: A key element of AutoML systems is setting the types of models that will be used for each type of task.
For classification and regression problems with tabular data, the use of tree ensemble models (like XGBoost) is usually recommended.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A key element of AutoML systems is setting the types of models that will be
used for each type of task. For classification and regression problems with
tabular data, the use of tree ensemble models (like XGBoost) is usually
recommended. However, several deep learning models for tabular data have
recently been proposed, claiming to outperform XGBoost for some use-cases. In
this paper, we explore whether these deep models should be a recommended option
for tabular data, by rigorously comparing the new deep models to XGBoost on a
variety of datasets. In addition to systematically comparing their accuracy, we
consider the tuning and computation they require. Our study shows that XGBoost
outperforms these deep models across the datasets, including datasets used in
the papers that proposed the deep models. We also demonstrate that XGBoost
requires much less tuning. On the positive side, we show that an ensemble of
the deep models and XGBoost performs better on these datasets than XGBoost
alone.
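The abstract's closing claim, that an ensemble of the deep models and XGBoost beats XGBoost alone, can be sketched as probability averaging. The helper below is a minimal illustration only: the paper does not specify its ensembling procedure here, so the function names and the fixed-weight averaging scheme are assumptions.

```python
def ensemble_proba(p_tree, p_deep, tree_weight=0.5):
    """Combine class-probability vectors from a tree model (e.g. XGBoost)
    and a deep model by weighted averaging.

    p_tree, p_deep: per-class probability lists from each model.
    tree_weight: weight given to the tree model (illustrative choice;
                 the paper does not state a weighting scheme).
    """
    return [tree_weight * t + (1.0 - tree_weight) * d
            for t, d in zip(p_tree, p_deep)]


def ensemble_predict(p_tree, p_deep, tree_weight=0.5):
    """Return the index of the most probable class under the ensemble."""
    combined = ensemble_proba(p_tree, p_deep, tree_weight)
    return max(range(len(combined)), key=combined.__getitem__)
```

For example, with tree-model probabilities [0.8, 0.2] and deep-model probabilities [0.6, 0.4], equal weighting gives a combined [0.7, 0.3] and a predicted class of 0.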
Related papers
- Generative Active Learning for Long-tailed Instance Segmentation [55.66158205855948]
We propose BSGAL, a new algorithm that estimates the contribution of generated data based on cache gradient.
Experiments show that BSGAL outperforms the baseline approach and effectively improves the performance of long-tailed segmentation.
arXiv Detail & Related papers (2024-06-04T15:57:43Z)
- When do Generative Query and Document Expansions Fail? A Comprehensive Study Across Methods, Retrievers, and Datasets [69.28733312110566]
We conduct the first comprehensive analysis of LM-based expansion.
We find that there exists a strong negative correlation between retriever performance and gains from expansion.
Our results suggest the following recipe: use expansions for weaker models, or when the target dataset significantly differs from the training corpus in format.
arXiv Detail & Related papers (2023-09-15T17:05:43Z)
- Challenging the Myth of Graph Collaborative Filtering: a Reasoned and Reproducibility-driven Analysis [50.972595036856035]
We present code that successfully replicates results from six popular and recent graph recommendation models.
We compare these graph models with traditional collaborative filtering models that historically performed well in offline evaluations.
By investigating the information flow from users' neighborhoods, we aim to identify which models are influenced by intrinsic features in the dataset structure.
arXiv Detail & Related papers (2023-08-01T09:31:44Z)
- IGB: Addressing The Gaps In Labeling, Features, Heterogeneity, and Size of Public Graph Datasets for Deep Learning Research [14.191338008898963]
Graph neural networks (GNNs) have shown high potential for a variety of real-world, challenging applications.
One of the major obstacles in GNN research is the lack of large-scale flexible datasets.
We introduce the Illinois Graph Benchmark (IGB), a research dataset tool that developers can use to train, scrutinize, and evaluate GNN models.
arXiv Detail & Related papers (2023-02-27T05:21:35Z)
- Why do tree-based models still outperform deep learning on tabular data? [0.0]
We show that tree-based models remain state-of-the-art on medium-sized data.
We conduct an empirical investigation into the differing inductive biases of tree-based models and Neural Networks (NNs).
arXiv Detail & Related papers (2022-07-18T08:36:08Z)
- A Robust Stacking Framework for Training Deep Graph Models with Multifaceted Node Features [61.92791503017341]
Graph Neural Networks (GNNs) with numerical node features and graph structure as inputs have demonstrated superior performance on various supervised learning tasks with graph data.
The best models for such data types in most standard supervised learning settings with IID (non-graph) data are not easily incorporated into a GNN.
Here we propose a robust stacking framework that fuses graph-aware propagation with arbitrary models intended for IID data.
arXiv Detail & Related papers (2022-06-16T22:46:33Z)
- KGBoost: A Classification-based Knowledge Base Completion Method with Negative Sampling [29.14178162494542]
KGBoost is a new method to train a powerful classifier for missing link prediction.
We conduct experiments on multiple benchmark datasets, and demonstrate that KGBoost outperforms state-of-the-art methods across most datasets.
As compared with models trained by end-to-end optimization, KGBoost works well under the low-dimensional setting so as to allow a smaller model size.
arXiv Detail & Related papers (2021-12-17T06:19:37Z)
- A Simple and Fast Baseline for Tuning Large XGBoost Models [8.203493207581937]
We show that uniform subsampling makes for a simple yet fast baseline to speed up the tuning of large XGBoost models.
We demonstrate the effectiveness of this baseline on large-scale datasets ranging from 15-70 GB in size.
arXiv Detail & Related papers (2021-11-12T20:17:50Z)
- Node Feature Extraction by Self-Supervised Multi-scale Neighborhood Prediction [123.20238648121445]
We propose a new self-supervised learning framework, Graph Information Aided Node feature exTraction (GIANT).
GIANT makes use of the eXtreme Multi-label Classification (XMC) formalism, which is crucial for fine-tuning the language model based on graph information.
We demonstrate the superior performance of GIANT over the standard GNN pipeline on Open Graph Benchmark datasets.
arXiv Detail & Related papers (2021-10-29T19:55:12Z)
- An Efficient Learning Framework For Federated XGBoost Using Secret Sharing And Distributed Optimization [47.70500612425959]
XGBoost is one of the most widely used machine learning models in the industry due to its superior learning accuracy and efficiency.
It is crucial to deploy a secure and efficient federated XGBoost (FedXGB) model to tackle data isolation issues in big data problems.
In this paper, a multi-party federated XGB learning framework is proposed with a security guarantee, which reshapes the XGBoost's split criterion calculation process under a secret sharing setting.
Remarkably, a thorough analysis of model security is provided as well, and multiple numerical results showcase the superiority of the proposed FedXGB.
arXiv Detail & Related papers (2021-05-12T15:04:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.