TabR: Tabular Deep Learning Meets Nearest Neighbors in 2023
- URL: http://arxiv.org/abs/2307.14338v2
- Date: Thu, 26 Oct 2023 17:59:37 GMT
- Title: TabR: Tabular Deep Learning Meets Nearest Neighbors in 2023
- Authors: Yury Gorishniy, Ivan Rubachev, Nikolay Kartashev, Daniil Shlenskii,
Akim Kotelnikov, Artem Babenko
- Abstract summary: We present TabR -- essentially, a feed-forward network with a custom k-Nearest-Neighbors-like component in the middle.
On a set of public benchmarks with datasets up to several million objects, TabR demonstrates the best average performance.
In addition to its much higher performance, TabR is simple and significantly more efficient.
- Score: 33.70333110327871
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep learning (DL) models for tabular data problems (e.g. classification,
regression) are currently receiving increasing attention from
researchers. However, despite the recent efforts, the non-DL algorithms based
on gradient-boosted decision trees (GBDT) remain a strong go-to solution for
these problems. One of the research directions aimed at improving the position
of tabular DL involves designing so-called retrieval-augmented models. For a
target object, such models retrieve other objects (e.g. the nearest neighbors)
from the available training data and use their features and labels to make a
better prediction.
In this work, we present TabR -- essentially, a feed-forward network with a
custom k-Nearest-Neighbors-like component in the middle. On a set of public
benchmarks with datasets up to several million objects, TabR marks a big step
forward for tabular DL: it demonstrates the best average performance among
tabular DL models, becomes the new state-of-the-art on several datasets, and
even outperforms GBDT models on the recently proposed "GBDT-friendly" benchmark
(see Figure 1). Among the important findings and technical details powering
TabR, the main ones lie in the attention-like mechanism that is responsible for
retrieving the nearest neighbors and extracting valuable signal from them. In
addition to its much higher performance, TabR is simple and significantly more
efficient than prior retrieval-based tabular DL models.
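The attention-like retrieval component described above can be illustrated with a minimal NumPy sketch: candidate objects are scored by a similarity to the query, the top-k are kept, and their value vectors are aggregated with softmax weights. Function and variable names here are illustrative assumptions, not the paper's exact formulation; TabR's actual module operates on learned encodings and label embeddings inside the network.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def retrieval_module(query, cand_keys, cand_values, k=3):
    """Attention-like k-NN retrieval sketch: score candidates by
    negative squared L2 distance, keep the top-k, and return a
    softmax-weighted sum of their value vectors."""
    # Similarity: negative squared Euclidean distance (higher = closer).
    sims = -((cand_keys - query) ** 2).sum(axis=1)
    top_k = np.argsort(sims)[-k:]        # indices of the k nearest candidates
    weights = softmax(sims[top_k])       # attention weights over the neighbors
    return weights @ cand_values[top_k]  # aggregated retrieved signal

# Toy usage: 5 candidates with 4-dim keys and 2-dim values.
rng = np.random.default_rng(0)
keys = rng.normal(size=(5, 4))
values = rng.normal(size=(5, 2))
q = keys[2] + 0.01 * rng.normal(size=4)  # query close to candidate 2
out = retrieval_module(q, keys, values, k=3)
print(out.shape)  # (2,)
```

In the actual model, the query and candidate keys would be learned encodings of the target and training objects, and the values would mix feature and label information, so the retrieved signal is differentiable end to end.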
Related papers
- Team up GBDTs and DNNs: Advancing Efficient and Effective Tabular Prediction with Tree-hybrid MLPs [20.67800392863432]
Tabular datasets play a crucial role in various applications.
Two prominent model types, Gradient-Boosted Decision Trees (GBDTs) and Deep Neural Networks (DNNs), have demonstrated performance advantages on distinct prediction tasks.
This paper proposes a new framework that amalgamates the advantages of both GBDTs and DNNs, resulting in a DNN algorithm that is as efficient as GBDTs and is competitively effective regardless of dataset preferences.
arXiv Detail & Related papers (2024-07-13T07:13:32Z) - Modern Neighborhood Components Analysis: A Deep Tabular Baseline Two Decades Later [59.88557193062348]
We revisit the classic Neighborhood Components Analysis (NCA), designed to learn a linear projection that captures semantic similarities between instances.
We find that minor modifications, such as adjustments to the learning objectives and the integration of deep learning architectures, significantly enhance NCA's performance.
We also introduce a neighbor sampling strategy that improves both the efficiency and predictive accuracy of our proposed ModernNCA.
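For reference, the classic NCA criterion that this line of work builds on can be sketched in a few lines of NumPy: each point stochastically picks a neighbor with probability proportional to exp(-distance^2) in the projected space, and the objective counts the expected number of same-class picks. This is a sketch of the original linear formulation only; the deep encoders and neighbor sampling of ModernNCA are not shown.

```python
import numpy as np

def nca_objective(A, X, y):
    """Classic NCA objective: expected number of correctly classified
    points under stochastic nearest-neighbor assignment in the
    projected space Z = X @ A.T."""
    Z = X @ A.T
    # Pairwise squared distances in the projected space (broadcasting).
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)       # a point never picks itself
    P = np.exp(-d2)
    P /= P.sum(axis=1, keepdims=True)  # p_ij: prob. that i picks j as neighbor
    same = (y[:, None] == y[None, :])
    return (P * same).sum()            # sum over i of p_i

# Toy check: identity projection on two well-separated classes.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
y = np.array([0, 0, 1, 1])
score = nca_objective(np.eye(2), X, y)
print(round(score, 3))  # close to 4.0: each point picks its same-class neighbor
```

Maximizing this objective with respect to A pulls same-class points together in the projected space, which is what makes the learned metric useful for nearest-neighbor prediction.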
arXiv Detail & Related papers (2024-07-03T16:38:57Z) - TabReD: A Benchmark of Tabular Machine Learning in-the-Wild [30.922069185335246]
We show that industry-grade datasets are underrepresented in academic benchmarks for machine learning.
We introduce TabReD, a collection of eight industry-grade datasets covering a wide range of domains.
We show that evaluation on time-based data splits leads to a different ranking of methods than evaluation on the random splits more common in academic benchmarks.
arXiv Detail & Related papers (2024-06-27T17:55:31Z) - 4DBInfer: A 4D Benchmarking Toolbox for Graph-Centric Predictive Modeling on Relational DBs [67.47600679176963]
RDBs store vast amounts of rich, informative data spread across interconnected tables.
Progress in predictive machine learning models for RDBs falls behind advances in other domains such as computer vision and natural language processing.
We explore a class of baseline models predicated on converting multi-table datasets into graphs.
We assemble (i) a diverse collection of large-scale RDB datasets and (ii) coincident predictive tasks.
arXiv Detail & Related papers (2024-04-28T15:04:54Z) - Minimally Supervised Learning using Topological Projections in
Self-Organizing Maps [55.31182147885694]
We introduce a semi-supervised learning approach based on topological projections in self-organizing maps (SOMs).
Our proposed method first trains SOMs on unlabeled data and then assigns a minimal number of available labeled data points to key best matching units (BMUs).
Our results indicate that the proposed minimally supervised model significantly outperforms traditional regression techniques.
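The core lookup in this approach, finding the best matching unit for an input on a trained SOM grid, can be sketched as follows. The grid weights here are a toy untrained example, and the paper's topological-projection and label-propagation steps are not shown.

```python
import numpy as np

def best_matching_unit(som_weights, x):
    """Return the flat index of the SOM unit whose weight vector is
    closest (Euclidean) to input x, i.e. the best matching unit (BMU)."""
    flat = som_weights.reshape(-1, som_weights.shape[-1])
    return int(np.argmin(((flat - x) ** 2).sum(axis=1)))

# Toy 3x3 SOM over 2-dim inputs; unit (i, j) has weight vector [i, j].
grid = np.array([[[i, j] for j in range(3)] for i in range(3)], dtype=float)
bmu = best_matching_unit(grid, np.array([1.9, 2.2]))
print(bmu)  # 8, the flat index of unit (2, 2)
```

In the semi-supervised setting, the few available labels are attached to such BMUs, and unlabeled inputs inherit predictions from nearby units on the map's topology.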
arXiv Detail & Related papers (2024-01-12T22:51:48Z) - When Do Neural Nets Outperform Boosted Trees on Tabular Data? [65.30290020731825]
We take a step back and question the importance of the 'NN vs. GBDT' debate.
For a surprisingly high number of datasets, the performance difference between GBDTs and NNs is negligible.
We analyze dozens of metafeatures to determine what properties of a dataset make NNs or GBDTs better-suited to perform well.
Our insights act as a guide for practitioners to determine which techniques may work best on their dataset.
arXiv Detail & Related papers (2023-05-04T17:04:41Z) - Why do tree-based models still outperform deep learning on tabular data? [0.0]
We show that tree-based models remain state-of-the-art on medium-sized data.
We conduct an empirical investigation into the differing inductive biases of tree-based models and Neural Networks (NNs).
arXiv Detail & Related papers (2022-07-18T08:36:08Z) - A Large Scale Search Dataset for Unbiased Learning to Rank [51.97967284268577]
We introduce the Baidu-ULTR dataset for unbiased learning to rank.
It comprises 1.2 billion randomly sampled search sessions and 7,008 expert-annotated queries.
It provides: (1) the original semantic feature and a pre-trained language model for easy usage; (2) sufficient display information such as position, displayed height, and displayed abstract; and (3) rich user feedback on search result pages (SERPs) like dwell time.
arXiv Detail & Related papers (2022-07-07T02:37:25Z) - Revisiting Deep Learning Models for Tabular Data [40.67427600770095]
It is unclear to both researchers and practitioners which models perform best. The paper highlights two simple yet powerful architectures.
The first is a ResNet-like architecture that turns out to be a strong baseline often missing from prior works.
The second model is our simple adaptation of the Transformer architecture for tabular data, which outperforms other solutions on most tasks.
arXiv Detail & Related papers (2021-06-22T17:58:10Z) - Stance Detection Benchmark: How Robust Is Your Stance Detection? [65.91772010586605]
Stance Detection (StD) aims to detect an author's stance towards a certain topic or claim.
We introduce a StD benchmark that learns from ten StD datasets of various domains in a multi-dataset learning setting.
Within this benchmark setup, we are able to present new state-of-the-art results on five of the datasets.
arXiv Detail & Related papers (2020-01-06T13:37:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.