A Closer Look at Deep Learning Methods on Tabular Datasets
- URL: http://arxiv.org/abs/2407.00956v3
- Date: Wed, 15 Jan 2025 11:19:30 GMT
- Title: A Closer Look at Deep Learning Methods on Tabular Datasets
- Authors: Han-Jia Ye, Si-Yang Liu, Hao-Run Cai, Qi-Le Zhou, De-Chuan Zhan
- Abstract summary: Tabular data is prevalent across diverse domains in machine learning. Deep Neural Network (DNN)-based methods have recently demonstrated promising performance. We compare 32 state-of-the-art deep and tree-based methods, evaluating their average performance across multiple criteria.
- Score: 52.50778536274327
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Tabular data is prevalent across diverse domains in machine learning. While classical methods like tree-based models have long been effective, Deep Neural Network (DNN)-based methods have recently demonstrated promising performance. However, the diverse characteristics of methods and the inherent heterogeneity of tabular datasets make understanding and interpreting tabular methods both challenging and prone to unstable observations. In this paper, we conduct in-depth evaluations and comprehensive analyses of tabular methods, with a particular focus on DNN-based models, using a benchmark of over 300 tabular datasets spanning a wide range of task types, sizes, and domains. First, we perform an extensive comparison of 32 state-of-the-art deep and tree-based methods, evaluating their average performance across multiple criteria. Although method ranks vary across datasets, we empirically find that top-performing methods tend to concentrate within a small subset of tabular models, regardless of the criteria used. Next, we investigate whether the training dynamics of deep tabular models can be predicted based on dataset properties. This approach not only offers insights into the behavior of deep tabular methods but also identifies a core set of "meta-features" that reflect dataset heterogeneity. We also identify a subset of datasets on which method ranks are consistent with the overall benchmark; this subset can act as a reliable probe for further tabular analysis.
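As a concrete illustration of the comparison in the first step, the sketch below ranks methods per dataset and then averages the ranks. The scores and method names are made up; the real benchmark covers 32 methods and 300+ datasets.

```python
# A minimal sketch of rank aggregation across datasets, assuming a
# (datasets x methods) score matrix where higher scores are better.
import numpy as np

# Hypothetical accuracy scores: rows are datasets, columns are methods.
scores = np.array([
    [0.91, 0.88, 0.93],
    [0.75, 0.80, 0.78],
    [0.62, 0.61, 0.66],
])
methods = ["XGBoost", "MLP", "ModernNCA"]

# Rank methods within each dataset (1 = best), then average over datasets.
ranks = scores.shape[1] - scores.argsort(axis=1).argsort(axis=1)
avg_rank = ranks.mean(axis=0)
for name, r in sorted(zip(methods, avg_rank), key=lambda t: t[1]):
    print(f"{name}: average rank {r:.2f}")
```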
Related papers
- Representation Learning for Tabular Data: A Comprehensive Survey [23.606506938919605]
Tabular data, structured as rows and columns, is among the most prevalent data types in machine learning classification and regression applications.
Deep Neural Networks (DNNs) have recently demonstrated promising results through their capability of representation learning.
We organize existing methods into three main categories according to their generalization capabilities.
arXiv Detail & Related papers (2025-04-17T17:58:23Z) - Mambular: A Sequential Model for Tabular Deep Learning [0.7184556517162347]
This paper investigates the use of autoregressive state-space models for tabular data.
We compare their performance against established benchmark models.
Our findings indicate that interpreting features as a sequence and processing them sequentially can lead to significant performance improvements.
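A minimal sketch of the features-as-a-sequence idea follows: each scalar feature is embedded into a token and the token sequence is fed to a sequential model. An LSTM stands in here for the paper's state-space (Mamba) backbone, and all names and sizes are illustrative assumptions.

```python
# Sketch: treat tabular features as a sequence of tokens.
import torch
import torch.nn as nn

class SequentialTabular(nn.Module):
    def __init__(self, n_features: int, d_model: int = 32, n_classes: int = 2):
        super().__init__()
        # One linear embedding per feature turns a scalar into a token.
        self.embed = nn.ModuleList(nn.Linear(1, d_model) for _ in range(n_features))
        self.seq = nn.LSTM(d_model, d_model, batch_first=True)  # stand-in backbone
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x):                          # x: (batch, n_features)
        tokens = torch.stack(
            [emb(x[:, i:i + 1]) for i, emb in enumerate(self.embed)], dim=1
        )                                          # (batch, n_features, d_model)
        out, _ = self.seq(tokens)
        return self.head(out[:, -1])               # classify from the last state

logits = SequentialTabular(n_features=8)(torch.randn(4, 8))
```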
arXiv Detail & Related papers (2024-08-12T16:57:57Z) - Modern Neighborhood Components Analysis: A Deep Tabular Baseline Two Decades Later [59.88557193062348]
We revisit the classic Neighborhood Component Analysis (NCA), designed to learn a linear projection that captures semantic similarities between instances.
We find that minor modifications, such as adjustments to the learning objectives and the integration of deep learning architectures, significantly enhance NCA's performance.
We also introduce a neighbor sampling strategy that improves both the efficiency and predictive accuracy of our proposed ModernNCA.
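For reference, the sketch below implements the classic NCA objective that ModernNCA modernizes: a projection is scored by the expected probability of picking a same-class soft neighbor. Deep variants replace the linear map `A` with a network; this version is a minimal NumPy illustration.

```python
# Classic NCA objective: softmax-over-distances neighbor probabilities.
import numpy as np

def nca_objective(A, X, y):
    Z = X @ A.T                                    # project instances
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)                   # a point is not its own neighbor
    p = np.exp(-d2)
    p /= p.sum(axis=1, keepdims=True)              # p[i, j]: prob. j is i's neighbor
    same = (y[:, None] == y[None, :])
    return (p * same).sum(axis=1).mean()           # expected same-class pick; maximize

rng = np.random.default_rng(0)
X, y = rng.normal(size=(50, 10)), rng.integers(0, 2, size=50)
print(nca_objective(rng.normal(size=(2, 10)), X, y))
```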
arXiv Detail & Related papers (2024-07-03T16:38:57Z) - Is Deep Learning finally better than Decision Trees on Tabular Data? [19.657605376506357]
Tabular data is a ubiquitous data modality due to its versatility and ease of use in many real-world applications.
Recent studies on data offer a unique perspective on the limitations of neural networks in this domain.
Our study categorizes ten state-of-the-art models based on their underlying learning paradigm.
arXiv Detail & Related papers (2024-02-06T12:59:02Z) - Minimally Supervised Learning using Topological Projections in Self-Organizing Maps [55.31182147885694]
We introduce a semi-supervised learning approach based on topological projections in self-organizing maps (SOMs).
Our proposed method first trains SOMs on unlabeled data; a minimal number of available labeled data points are then assigned to key best matching units (BMUs).
Our results indicate that the proposed minimally supervised model significantly outperforms traditional regression techniques.
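A minimal sketch of the BMU labeling step follows, using the third-party `minisom` package as a stand-in SOM; the data, map size, and nearest-labeled-unit fallback are illustrative assumptions, not the paper's exact procedure.

```python
# Sketch: unsupervised SOM training, then label propagation via BMUs.
import numpy as np
from minisom import MiniSom

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                      # unlabeled pool
X_lab = rng.normal(size=(10, 4))                   # the few labeled points
y_lab = rng.integers(0, 2, size=10)

som = MiniSom(6, 6, 4, sigma=1.0, learning_rate=0.5, random_seed=0)
som.train_random(X, num_iteration=1000)            # unsupervised SOM training

bmu_labels = {}                                    # attach labels to key BMUs
for xi, yi in zip(X_lab, y_lab):
    bmu_labels[som.winner(xi)] = yi

def predict(x):
    """Label an instance by its BMU; fall back to the nearest labeled BMU."""
    w = som.winner(x)
    if w in bmu_labels:
        return bmu_labels[w]
    nearest = min(bmu_labels, key=lambda u: (u[0] - w[0])**2 + (u[1] - w[1])**2)
    return bmu_labels[nearest]

print(predict(X[0]))
```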
arXiv Detail & Related papers (2024-01-12T22:51:48Z) - A Weighted K-Center Algorithm for Data Subset Selection [70.49696246526199]
Subset selection is a fundamental problem that can play a key role in identifying smaller portions of the training data.
We develop a novel factor 3-approximation algorithm to compute subsets based on the weighted sum of both k-center and uncertainty sampling objective functions.
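The sketch below conveys the flavor of the combined objective with a greedy farthest-first rule plus an additive uncertainty term; it is not the paper's factor-3 algorithm, and `lam` is an assumed trade-off weight.

```python
# Sketch: greedy selection mixing k-center coverage with uncertainty.
import numpy as np

def greedy_weighted_kcenter(X, uncertainty, k, lam=1.0):
    chosen = [int(np.argmax(uncertainty))]         # seed with most uncertain point
    dist = np.linalg.norm(X - X[chosen[0]], axis=1)
    while len(chosen) < k:
        score = dist + lam * uncertainty           # coverage + informativeness
        score[chosen] = -np.inf                    # never re-pick a chosen point
        nxt = int(np.argmax(score))
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(X - X[nxt], axis=1))
    return chosen

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
print(greedy_weighted_kcenter(X, rng.random(100), k=10))
```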
arXiv Detail & Related papers (2023-12-17T04:41:07Z) - One-Shot Learning as Instruction Data Prospector for Large Language Models [108.81681547472138]
Nuggets uses one-shot learning to select high-quality instruction data from extensive datasets.
We show that instruction tuning with the top 1% of examples curated by Nuggets substantially outperforms conventional methods employing the entire dataset.
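A rough sketch of the one-shot scoring idea: each candidate is used as a single demonstration and scored by how often it raises the model's likelihood on a small anchor set over the zero-shot baseline. The `loglikelihood` stub below stands in for a real LLM scoring call and is a placeholder, not an actual API.

```python
# Sketch: score candidates by their value as one-shot demonstrations.
def loglikelihood(prompt: str, completion: str) -> float:
    return -float(len(completion))                 # placeholder, not a real model

def golden_score(candidate, anchors):
    wins = 0
    for task, answer in anchors:
        zero_shot = loglikelihood(task, answer)
        one_shot = loglikelihood(candidate + "\n\n" + task, answer)
        wins += one_shot > zero_shot               # did the demo help this anchor?
    return wins / len(anchors)

anchors = [("Q: 2+2=", " 4"), ("Translate 'chat' to English:", " cat")]
print(golden_score("Q: 1+1=\nA: 2", anchors))
```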
arXiv Detail & Related papers (2023-12-16T03:33:12Z) - Training-Free Generalization on Heterogeneous Tabular Data via Meta-Representation [67.30538142519067]
We propose Tabular data Pre-Training via Meta-representation (TabPTM).
A deep neural network is then trained to associate these meta-representations with dataset-specific classification confidences.
Experiments validate that TabPTM achieves promising performance on new datasets, even in few-shot scenarios.
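The sketch below illustrates a distance-based meta-representation in the spirit of TabPTM: an instance is encoded by its sorted distances to the k nearest training neighbors of each class, giving a fixed-size encoding independent of the original feature dimension. The details are assumptions, not the authors' exact recipe.

```python
# Sketch: fixed-size, dimension-independent encoding of a tabular instance.
import numpy as np

def meta_representation(x, X_train, y_train, k=4):
    rep = []
    for c in np.unique(y_train):
        d = np.linalg.norm(X_train[y_train == c] - x, axis=1)
        rep.append(np.sort(d)[:k])                 # k nearest same-class distances
    return np.concatenate(rep)                     # shape: (n_classes * k,)

rng = np.random.default_rng(0)
X_train = rng.normal(size=(30, 7))
y_train = rng.integers(0, 2, size=30)
print(meta_representation(rng.normal(size=7), X_train, y_train))
```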
arXiv Detail & Related papers (2023-10-31T18:03:54Z) - GRANDE: Gradient-Based Decision Tree Ensembles for Tabular Data [9.107782510356989]
We propose a novel approach for learning hard, axis-aligned decision tree ensembles using end-to-end gradient descent.
GRANDE is based on a dense representation of tree ensembles, which allows the use of backpropagation with a straight-through operator.
We demonstrate that our method outperforms existing gradient-boosting and deep learning frameworks on most datasets.
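For intuition, here is the generic straight-through pattern for a hard, axis-aligned split: the forward pass uses the hard step, while gradients flow through a soft sigmoid surrogate. GRANDE's full tree parameterization is richer than this minimal sketch.

```python
# Sketch: straight-through estimator for a hard threshold split.
import torch

def straight_through_split(x, threshold, temperature=1.0):
    soft = torch.sigmoid((x - threshold) / temperature)  # differentiable surrogate
    hard = (x > threshold).float()                       # what the forward pass uses
    return hard + (soft - soft.detach())                 # value = hard, grads follow soft

x = torch.randn(5, requires_grad=True)
out = straight_through_split(x, threshold=torch.tensor(0.0))
out.sum().backward()
print(out, x.grad)
```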
arXiv Detail & Related papers (2023-09-29T10:49:14Z) - A Coreset Learning Reality Check [33.002265576337486]
Subsampling algorithms are a natural approach to reduce data size before fitting models on massive datasets.
In recent years, several works have proposed methods for subsampling rows from a data matrix while maintaining relevant information for classification.
We compare multiple methods for logistic regression drawn from the coreset and optimal subsampling literature and discover inconsistencies in their effectiveness.
arXiv Detail & Related papers (2023-01-15T19:26:17Z) - Improving Data Quality with Training Dynamics of Gradient Boosting Decision Trees [1.5605040219256345]
We propose a method based on metrics from training dynamics of Gradient Boosting Decision Trees (GBDTs) to assess the behavior of each training example.
We show results on detecting noisy labels in order to clean datasets, improving models' metrics on synthetic and real public datasets, as well as on an industry case in which we deployed a model based on the proposed solution.
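One way to sketch per-example training dynamics with scikit-learn's GBDT: track each example's predicted probability of its own label across boosting stages, then flag the least confident examples as label-noise candidates. The paper's exact metrics may differ.

```python
# Sketch: training-dynamics metrics from staged GBDT predictions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=300, random_state=0)
gbdt = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X, y)

# (n_stages, n_samples): probability assigned to the true label at each stage.
true_prob = np.array(
    [proba[np.arange(len(y)), y] for proba in gbdt.staged_predict_proba(X)]
)
confidence = true_prob.mean(axis=0)                # high = consistently easy
variability = true_prob.std(axis=0)                # high = learned late / unstable

suspects = np.argsort(confidence)[:10]             # candidates for label noise
print(suspects)
```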
arXiv Detail & Related papers (2022-10-20T15:02:49Z) - TabLLM: Few-shot Classification of Tabular Data with Large Language Models [66.03023402174138]
We study the application of large language models to zero-shot and few-shot classification of tabular data.
We evaluate several serialization methods including templates, table-to-text models, and large language models.
This approach is also competitive with strong traditional baselines like gradient-boosted trees.
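The simplest serialization strategy can be sketched as a text template that turns a row into a natural-language description for the LLM to classify; the column names and prompt wording here are illustrative.

```python
# Sketch: template-based serialization of a tabular row.
def serialize_row(row: dict, target_question: str) -> str:
    facts = ". ".join(f"The {k} is {v}" for k, v in row.items())
    return f"{facts}. {target_question}"

row = {"age": 42, "occupation": "engineer", "hours per week": 50}
print(serialize_row(row, "Does this person earn more than $50K? Answer yes or no."))
```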
arXiv Detail & Related papers (2022-10-19T17:08:13Z) - Is margin all you need? An extensive empirical study of active learning on tabular data [66.18464006872345]
We analyze the performance of a variety of active learning algorithms on 69 real-world datasets from the OpenML-CC18 benchmark.
Surprisingly, we find that the classical margin sampling technique matches or outperforms all others, including the current state of the art.
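Margin sampling itself is only a few lines: query the points whose top-two class probabilities are closest, i.e. where the model is least decided.

```python
# Sketch: classical margin-based query selection for active learning.
import numpy as np

def margin_query(probs, n_queries):
    part = np.sort(probs, axis=1)                  # ascending per row
    margin = part[:, -1] - part[:, -2]             # top-1 minus top-2 probability
    return np.argsort(margin)[:n_queries]          # smallest margins first

probs = np.array([[0.5, 0.5], [0.9, 0.1], [0.6, 0.4]])
print(margin_query(probs, n_queries=2))            # picks the 0.5/0.5 row first
```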
arXiv Detail & Related papers (2022-10-07T21:18:24Z) - Why do tree-based models still outperform deep learning on tabular data? [0.0]
We show that tree-based models remain state-of-the-art on medium-sized data.
We conduct an empirical investigation into the differing inductive biases of tree-based models and Neural Networks (NNs).
arXiv Detail & Related papers (2022-07-18T08:36:08Z) - A Topological Approach for Semi-Supervised Learning [0.0]
We present new semi-supervised learning methods based on techniques from Topological Data Analysis (TDA).
In particular, we have created two semi-supervised learning methods following two different topological approaches.
The results show that the methods developed in this work outperform both models trained only on manually labelled data and classical semi-supervised learning methods.
arXiv Detail & Related papers (2022-05-19T15:23:39Z) - A Topological Data Analysis Based Classifier [1.6668132748773563]
This paper proposes an algorithm that applies Topological Data Analysis directly to multi-class classification problems.
The proposed algorithm builds a filtered simplicial complex on the dataset.
On average, the proposed TDABC method outperformed both KNN and weighted KNN.
arXiv Detail & Related papers (2021-11-09T15:54:16Z) - Estimating leverage scores via rank revealing methods and randomization [50.591267188664666]
We study algorithms for estimating the statistical leverage scores of rectangular dense or sparse matrices of arbitrary rank.
Our approach is based on combining rank revealing methods with compositions of dense and sparse randomized dimensionality reduction transforms.
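For context, the exact quantity these methods estimate is cheap to state: for a tall matrix with full column rank, the leverage scores are the squared row norms of Q from a thin QR decomposition, since they equal diag(A (A^T A)^{-1} A^T).

```python
# Sketch: exact leverage scores via thin QR (assumes full column rank).
import numpy as np

def leverage_scores(A):
    Q, _ = np.linalg.qr(A, mode="reduced")         # thin QR: A = Q R
    return (Q ** 2).sum(axis=1)                    # row i's leverage, in [0, 1]

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 5))
scores = leverage_scores(A)
print(scores.sum())                                # sums to rank(A), here 5.0
```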
arXiv Detail & Related papers (2021-05-23T19:21:55Z) - Evaluating the Disentanglement of Deep Generative Models through Manifold Topology [66.06153115971732]
We present a method for quantifying disentanglement that only uses the generative model.
We empirically evaluate several state-of-the-art models across multiple datasets.
arXiv Detail & Related papers (2020-06-05T20:54:11Z)