xRFM: Accurate, scalable, and interpretable feature learning models for tabular data
- URL: http://arxiv.org/abs/2508.10053v2
- Date: Thu, 23 Oct 2025 16:47:49 GMT
- Title: xRFM: Accurate, scalable, and interpretable feature learning models for tabular data
- Authors: Daniel Beaglehole, David Holzmüller, Adityanarayanan Radhakrishnan, Mikhail Belkin,
- Abstract summary: In this work, we introduce xRFM, an algorithm that combines feature learning kernel machines with a tree structure to adapt to the local structure of the data.<n>We show that xRFM achieves best performance across $100$ regression datasets and is competitive to the best methods across $200$ classification datasets.
- Score: 20.626069512520587
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Inference from tabular data, collections of continuous and categorical variables organized into matrices, is a foundation for modern technology and science. Yet, in contrast to the explosive changes in the rest of AI, the best practice for these predictive tasks has been relatively unchanged and is still primarily based on variations of Gradient Boosted Decision Trees (GBDTs). Very recently, there has been renewed interest in developing state-of-the-art methods for tabular data based on recent developments in neural networks and feature learning methods. In this work, we introduce xRFM, an algorithm that combines feature learning kernel machines with a tree structure to both adapt to the local structure of the data and scale to essentially unlimited amounts of training data. We show that compared to $31$ other methods, including recently introduced tabular foundation models (TabPFNv2) and GBDTs, xRFM achieves best performance across $100$ regression datasets and is competitive to the best methods across $200$ classification datasets outperforming GBDTs. Additionally, xRFM provides interpretability natively through the Average Gradient Outer Product.
Related papers
- Beyond Softmax: A Natural Parameterization for Categorical Random Variables [61.709831225296305]
We introduce the $textitcatnat$ function, a function composed of a sequence of hierarchical binary splits.<n>A rich set of experiments show that the proposed function improves the learning efficiency and yields models characterized by consistently higher test performance.
arXiv Detail & Related papers (2025-09-29T12:55:50Z) - Data Selection for ERMs [67.57726352698933]
We study how well can $mathcalA$ perform when trained on at most $nll N$ data points selected from a population of $N$ points.<n>Our results include optimal data-selection bounds for mean estimation, linear classification, and linear regression.
arXiv Detail & Related papers (2025-04-20T11:26:01Z) - Zero-shot Meta-learning for Tabular Prediction Tasks with Adversarially Pre-trained Transformer [2.1677183904102257]
We present an Adversarially Pre-trained Transformer (APT) that is able to perform zero-shot meta-learning on tabular prediction tasks without pre-training on any real-world dataset.<n>APT is pre-trained with adversarial synthetic data agents, who deliberately challenge the model with different synthetic datasets.<n>We show that our framework matches state-of-the-art performance on small classification tasks without filtering on dataset characteristics.
arXiv Detail & Related papers (2025-02-06T23:58:11Z) - Drift-Resilient TabPFN: In-Context Learning Temporal Distribution Shifts on Tabular Data [39.40116554523575]
We present Drift-Resilient TabPFN, a fresh approach based on In-Context Learning with a Prior-Data Fitted Network.
It learns to approximate Bayesian inference on synthetic datasets drawn from a prior.
It improves accuracy from 0.688 to 0.744 and ROC AUC from 0.786 to 0.832 while maintaining stronger calibration.
arXiv Detail & Related papers (2024-11-15T23:49:23Z) - Distributionally robust self-supervised learning for tabular data [4.172010719137041]
Learning robust representation in presence of error slices is challenging, due to high cardinality features and the complexity of constructing error sets.<n>Traditional robust representation learning methods are largely focused on improving worst group performance in supervised setting in computer vision.<n>Our approach utilizes an encoder-decoder model trained with Masked Language Modeling (MLM) loss to learn robust latent representations.
arXiv Detail & Related papers (2024-10-11T04:23:56Z) - TabEBM: A Tabular Data Augmentation Method with Distinct Class-Specific Energy-Based Models [10.88959673845634]
TabEBM is a class-conditional generative method using Energy-Based Models (EBMs)
Our experiments show that TabEBM generates synthetic data with higher quality and better statistical fidelity than existing methods.
arXiv Detail & Related papers (2024-09-24T14:25:59Z) - Revisiting Nearest Neighbor for Tabular Data: A Deep Tabular Baseline Two Decades Later [76.66498833720411]
We introduce a differentiable version of $K$-nearest neighbors (KNN) originally designed to learn a linear projection to capture semantic similarities between instances.<n>Surprisingly, our implementation of NCA using SGD and without dimensionality reduction already achieves decent performance on tabular data.<n>We conclude our paper by analyzing the factors behind these improvements, including loss functions, prediction strategies, and deep architectures.
arXiv Detail & Related papers (2024-07-03T16:38:57Z) - A Closer Look at Deep Learning Methods on Tabular Datasets [52.50778536274327]
Tabular data is prevalent across diverse domains in machine learning.<n>Deep Neural Network (DNN)-based methods have recently demonstrated promising performance.<n>We compare 32 state-of-the-art deep and tree-based methods, evaluating their average performance across multiple criteria.
arXiv Detail & Related papers (2024-07-01T04:24:07Z) - Is Deep Learning finally better than Decision Trees on Tabular Data? [19.657605376506357]
Tabular data is a ubiquitous data modality due to its versatility and ease of use in many real-world applications.<n>Recent studies on data offer a unique perspective on the limitations of neural networks in this domain.<n>Our study categorizes ten state-of-the-art models based on their underlying learning paradigm.
arXiv Detail & Related papers (2024-02-06T12:59:02Z) - Robust Learning with Progressive Data Expansion Against Spurious
Correlation [65.83104529677234]
We study the learning process of a two-layer nonlinear convolutional neural network in the presence of spurious features.
Our analysis suggests that imbalanced data groups and easily learnable spurious features can lead to the dominance of spurious features during the learning process.
We propose a new training algorithm called PDE that efficiently enhances the model's robustness for a better worst-group performance.
arXiv Detail & Related papers (2023-06-08T05:44:06Z) - Rank-R FNN: A Tensor-Based Learning Model for High-Order Data
Classification [69.26747803963907]
Rank-R Feedforward Neural Network (FNN) is a tensor-based nonlinear learning model that imposes Canonical/Polyadic decomposition on its parameters.
First, it handles inputs as multilinear arrays, bypassing the need for vectorization, and can thus fully exploit the structural information along every data dimension.
We establish the universal approximation and learnability properties of Rank-R FNN, and we validate its performance on real-world hyperspectral datasets.
arXiv Detail & Related papers (2021-04-11T16:37:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.