A Closer Look at Deep Learning Methods on Tabular Datasets
- URL: http://arxiv.org/abs/2407.00956v4
- Date: Fri, 07 Nov 2025 09:03:35 GMT
- Title: A Closer Look at Deep Learning Methods on Tabular Datasets
- Authors: Han-Jia Ye, Si-Yang Liu, Hao-Run Cai, Qi-Le Zhou, De-Chuan Zhan
- Abstract summary: We present an extensive study on TALENT, a collection of 300+ datasets spanning broad ranges of size. Our evaluation shows that ensembling benefits both tree-based and neural approaches.
- Score: 78.61845513154502
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Tabular data is prevalent across diverse domains in machine learning. With the rapid progress of deep tabular prediction methods, especially pretrained (foundation) models, there is a growing need to evaluate these methods systematically and to understand their behavior. We present an extensive study on TALENT, a collection of 300+ datasets spanning broad ranges of size, feature composition (numerical/categorical mixes), domains, and output types (binary, multi-class, regression). Our evaluation shows that ensembling benefits both tree-based and neural approaches. Traditional gradient-boosted trees remain very strong baselines, yet recent pretrained tabular models now match or surpass them on many tasks, narrowing, but not eliminating, the historical advantage of tree ensembles. Despite architectural diversity, top performance concentrates within a small subset of models, providing practical guidance for method selection. To explain these outcomes, we quantify dataset heterogeneity by learning from meta-features and early training dynamics to predict later validation behavior. This dynamics-aware analysis indicates that heterogeneity, such as the interplay of categorical and numerical attributes, largely determines which family of methods is favored. Finally, we introduce a two-level design beyond the 300 common-size datasets: a compact TALENT-tiny core (45 datasets) for rapid, reproducible evaluation, and a TALENT-extension suite targeting high-dimensional, many-class, and very large-scale settings for stress testing. In summary, these results offer actionable insights into the strengths, limitations, and future directions for improving deep tabular learning.
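The abstract reports that ensembling helps both tree-based and neural models, but it does not say how the ensembles are formed. As a minimal sketch, assuming the simplest scheme (averaging the class probabilities of several independently trained members, for example the same model with different random seeds), the idea might look like this. The function names and the toy probabilities are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def average_ensemble(
    member_probs: Sequence[Sequence[Sequence[float]]],
) -> List[List[float]]:
    """Average class probabilities over ensemble members.

    member_probs[m][i][c] is member m's probability of class c for sample i.
    Assumption: every member scores the same samples and the same classes.
    """
    if not member_probs:
        raise ValueError("need at least one ensemble member")
    n_members = len(member_probs)
    n_samples = len(member_probs[0])
    n_classes = len(member_probs[0][0])
    averaged = []
    for i in range(n_samples):
        row = [
            sum(member_probs[m][i][c] for m in range(n_members)) / n_members
            for c in range(n_classes)
        ]
        averaged.append(row)
    return averaged


def predict_labels(probs: Sequence[Sequence[float]]) -> List[int]:
    """Pick the class with the highest averaged probability per sample."""
    return [max(range(len(row)), key=row.__getitem__) for row in probs]


# Hypothetical outputs of three members on two samples (binary task).
member_probs = [
    [[0.6, 0.4], [0.3, 0.7]],
    [[0.4, 0.6], [0.2, 0.8]],
    [[0.7, 0.3], [0.6, 0.4]],
]
ensemble_probs = average_ensemble(member_probs)
labels = predict_labels(ensemble_probs)
```

In this toy input the second member disagrees with the other two on the first sample, and the third disagrees on the second sample; averaging lets the majority's confidence decide. For regression the same scheme would average predicted values instead of probabilities.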
Related papers
- Long-Tailed Recognition via Information-Preservable Two-Stage Learning [6.2471093754692815]
Imbalance (a long tail) is inherent to many real-world data distributions. We propose a novel two-stage learning approach to mitigate the resulting majority-biased tendency. Our approach achieves state-of-the-art performance across various long-tailed benchmark datasets.
arXiv Detail & Related papers (2025-10-09T21:49:12Z) - Make Still Further Progress: Chain of Thoughts for Tabular Data Leaderboard [27.224577475861214]
Tabular data, a fundamental data format in machine learning, is predominantly utilized in competitions and real-world applications. We propose an in-context ensemble framework for tabular prediction that leverages large language models. Our method constructs a context around each test instance using its nearest neighbors and the predictions from a pool of external models.
arXiv Detail & Related papers (2025-05-19T17:52:58Z) - Representation Learning for Tabular Data: A Comprehensive Survey [23.606506938919605]
Tabular data, structured as rows and columns, is among the most prevalent data types in machine learning classification and regression applications.
Deep Neural Networks (DNNs) have recently demonstrated promising results through their capability of representation learning.
We organize existing methods into three main categories according to their generalization capabilities.
arXiv Detail & Related papers (2025-04-17T17:58:23Z) - In-Context Linear Regression Demystified: Training Dynamics and Mechanistic Interpretability of Multi-Head Softmax Attention [52.159541540613915]
We study how multi-head softmax attention models are trained to perform in-context learning on linear data. Our results reveal that in-context learning ability emerges from the trained transformer as an aggregated effect of its architecture and the underlying data distribution.
arXiv Detail & Related papers (2025-03-17T02:00:49Z) - Learning from Neighbors: Category Extrapolation for Long-Tail Learning [62.30734737735273]
We offer a novel perspective on long-tail learning, inspired by an observation: datasets with finer granularity tend to be less affected by data imbalance. We introduce open-set auxiliary classes that are visually similar to existing ones, aiming to enhance representation learning for both head and tail classes. To prevent the overwhelming presence of auxiliary classes from disrupting training, we introduce a neighbor-silencing loss.
arXiv Detail & Related papers (2024-10-21T13:06:21Z) - Mambular: A Sequential Model for Tabular Deep Learning [0.7184556517162347]
This paper investigates the use of autoregressive state-space models for tabular data.
We compare their performance against established benchmark models.
Our findings indicate that interpreting features as a sequence and processing them can lead to significant performance improvement.
arXiv Detail & Related papers (2024-08-12T16:57:57Z) - Modern Neighborhood Components Analysis: A Deep Tabular Baseline Two Decades Later [59.88557193062348]
We revisit the classic Neighborhood Component Analysis (NCA), designed to learn a linear projection that captures semantic similarities between instances.
We find that minor modifications, such as adjustments to the learning objectives and the integration of deep learning architectures, significantly enhance NCA's performance.
We also introduce a neighbor sampling strategy that improves both the efficiency and predictive accuracy of our proposed ModernNCA.
arXiv Detail & Related papers (2024-07-03T16:38:57Z) - Is Deep Learning finally better than Decision Trees on Tabular Data? [19.657605376506357]
Tabular data is a ubiquitous data modality due to its versatility and ease of use in many real-world applications.
Recent studies on tabular data offer a unique perspective on the limitations of neural networks in this domain.
Our study categorizes ten state-of-the-art models based on their underlying learning paradigm.
arXiv Detail & Related papers (2024-02-06T12:59:02Z) - Minimally Supervised Learning using Topological Projections in
Self-Organizing Maps [55.31182147885694]
We introduce a semi-supervised learning approach based on topological projections in self-organizing maps (SOMs).
Our proposed method first trains SOMs on unlabeled data; a minimal number of available labeled data points are then assigned to key best matching units (BMUs).
Our results indicate that the proposed minimally supervised model significantly outperforms traditional regression techniques.
arXiv Detail & Related papers (2024-01-12T22:51:48Z) - A Weighted K-Center Algorithm for Data Subset Selection [70.49696246526199]
Subset selection is a fundamental problem that can play a key role in identifying smaller portions of the training data.
We develop a novel factor 3-approximation algorithm to compute subsets based on the weighted sum of both k-center and uncertainty sampling objective functions.
arXiv Detail & Related papers (2023-12-17T04:41:07Z) - One-Shot Learning as Instruction Data Prospector for Large Language Models [108.81681547472138]
Nuggets uses one-shot learning to select high-quality instruction data from extensive datasets.
We show that instruction tuning with the top 1% of examples curated by Nuggets substantially outperforms conventional methods employing the entire dataset.
arXiv Detail & Related papers (2023-12-16T03:33:12Z) - Training-Free Generalization on Heterogeneous Tabular Data via Meta-Representation [67.30538142519067]
We propose Tabular data Pre-Training via Meta-representation (TabPTM).
A deep neural network is then trained to associate these meta-representations with dataset-specific classification confidences.
Experiments validate that TabPTM achieves promising performance in new datasets, even under few-shot scenarios.
arXiv Detail & Related papers (2023-10-31T18:03:54Z) - GRANDE: Gradient-Based Decision Tree Ensembles for Tabular Data [9.107782510356989]
We propose a novel approach for learning hard, axis-aligned decision tree ensembles using end-to-end gradient descent.
GRANDE is based on a dense representation of tree ensembles, which makes it possible to use backpropagation with a straight-through operator.
We demonstrate that our method outperforms existing gradient-boosting and deep learning frameworks on most datasets.
arXiv Detail & Related papers (2023-09-29T10:49:14Z) - Deep networks for system identification: a Survey [56.34005280792013]
System identification learns mathematical descriptions of dynamic systems from input-output data.
The main aim of the identified model is to predict new data from previous observations.
We discuss architectures commonly adopted in the literature, like feedforward, convolutional, and recurrent networks.
arXiv Detail & Related papers (2023-01-30T12:38:31Z) - A Coreset Learning Reality Check [33.002265576337486]
Subsampling algorithms are a natural approach to reduce data size before fitting models on massive datasets.
In recent years, several works have proposed methods for subsampling rows from a data matrix while maintaining relevant information for classification.
We compare multiple methods for logistic regression drawn from the coreset and optimal subsampling literature and discover inconsistencies in their effectiveness.
arXiv Detail & Related papers (2023-01-15T19:26:17Z) - Improving Data Quality with Training Dynamics of Gradient Boosting Decision Trees [1.5605040219256345]
We propose a method based on metrics from training dynamics of Gradient Boosting Decision Trees (GBDTs) to assess the behavior of each training example.
We show results on detecting noisy labels in order to clean datasets, improving models' metrics on synthetic and real public datasets, as well as on an industry case in which we deployed a model based on the proposed solution.
arXiv Detail & Related papers (2022-10-20T15:02:49Z) - TabLLM: Few-shot Classification of Tabular Data with Large Language Models [66.03023402174138]
We study the application of large language models to zero-shot and few-shot classification.
We evaluate several serialization methods including templates, table-to-text models, and large language models.
This approach is also competitive with strong traditional baselines like gradient-boosted trees.
arXiv Detail & Related papers (2022-10-19T17:08:13Z) - Is margin all you need? An extensive empirical study of active learning on tabular data [66.18464006872345]
We analyze the performance of a variety of active learning algorithms on 69 real-world datasets from the OpenML-CC18 benchmark.
Surprisingly, we find that the classical margin sampling technique matches or outperforms all others, including the current state of the art.
arXiv Detail & Related papers (2022-10-07T21:18:24Z) - Why do tree-based models still outperform deep learning on tabular data? [0.0]
We show that tree-based models remain state-of-the-art on medium-sized data.
We conduct an empirical investigation into the differing inductive biases of tree-based models and Neural Networks (NNs).
arXiv Detail & Related papers (2022-07-18T08:36:08Z) - A Topological Approach for Semi-Supervised Learning [0.0]
We present new semi-supervised learning methods based on techniques from Topological Data Analysis (TDA).
In particular, we have created two semi-supervised learning methods following two different topological approaches.
The results show that the methods developed in this work outperform both models trained only on manually labelled data and classical semi-supervised learning methods.
arXiv Detail & Related papers (2022-05-19T15:23:39Z) - A Topological Data Analysis Based Classifier [1.6668132748773563]
This paper proposes an algorithm that applies Topological Data Analysis directly to multi-class classification problems.
The proposed algorithm builds a filtered simplicial complex on the dataset.
On average, the proposed TDABC method was better than KNN and weighted-KNN.
arXiv Detail & Related papers (2021-11-09T15:54:16Z) - Towards Open-World Feature Extrapolation: An Inductive Graph Learning Approach [80.8446673089281]
We propose a new learning paradigm with graph representation and learning.
Our framework contains two modules: 1) a backbone network (e.g., feedforward neural nets) as a lower model takes features as input and outputs predicted labels; 2) a graph neural network as an upper model learns to extrapolate embeddings for new features via message passing over a feature-data graph built from observed data.
arXiv Detail & Related papers (2021-10-09T09:02:45Z) - Estimating leverage scores via rank revealing methods and randomization [50.591267188664666]
We study algorithms for estimating the statistical leverage scores of rectangular dense or sparse matrices of arbitrary rank.
Our approach is based on combining rank revealing methods with compositions of dense and sparse randomized dimensionality reduction transforms.
arXiv Detail & Related papers (2021-05-23T19:21:55Z) - Deep tree-ensembles for multi-output prediction [0.0]
We propose a novel deep tree-ensemble (DTE) model, where every layer enriches the original feature set with a representation learning component based on tree-embeddings.
We specifically focus on two structured output prediction tasks, namely multi-label classification and multi-target regression.
arXiv Detail & Related papers (2020-11-03T16:25:54Z) - Evaluating the Disentanglement of Deep Generative Models through Manifold Topology [66.06153115971732]
We present a method for quantifying disentanglement that only uses the generative model.
We empirically evaluate several state-of-the-art models across multiple datasets.
arXiv Detail & Related papers (2020-06-05T20:54:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.