TabR: Tabular Deep Learning Meets Nearest Neighbors in 2023
- URL: http://arxiv.org/abs/2307.14338v2
- Date: Thu, 26 Oct 2023 17:59:37 GMT
- Title: TabR: Tabular Deep Learning Meets Nearest Neighbors in 2023
- Authors: Yury Gorishniy, Ivan Rubachev, Nikolay Kartashev, Daniil Shlenskii,
Akim Kotelnikov, Artem Babenko
- Abstract summary: We present TabR -- essentially, a feed-forward network with a custom k-Nearest-Neighbors-like component in the middle.
On a set of public benchmarks with datasets up to several million objects, TabR demonstrates the best average performance.
In addition to the much higher performance, TabR is simple and significantly more efficient.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep learning (DL) models for tabular data problems (e.g. classification,
regression) are currently receiving increasingly more attention from
researchers. However, despite the recent efforts, the non-DL algorithms based
on gradient-boosted decision trees (GBDT) remain a strong go-to solution for
these problems. One of the research directions aimed at improving the position
of tabular DL involves designing so-called retrieval-augmented models. For a
target object, such models retrieve other objects (e.g. the nearest neighbors)
from the available training data and use their features and labels to make a
better prediction.
In this work, we present TabR -- essentially, a feed-forward network with a
custom k-Nearest-Neighbors-like component in the middle. On a set of public
benchmarks with datasets up to several million objects, TabR marks a big step
forward for tabular DL: it demonstrates the best average performance among
tabular DL models, becomes the new state-of-the-art on several datasets, and
even outperforms GBDT models on the recently proposed "GBDT-friendly" benchmark
(see Figure 1). Among the important findings and technical details powering
TabR, the main ones lie in the attention-like mechanism that is responsible for
retrieving the nearest neighbors and extracting valuable signal from them. In
addition to the much higher performance, TabR is simple and significantly more
efficient compared to prior retrieval-based tabular DL models.
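The retrieval idea described in the abstract (find the nearest training objects for a target object, then combine their labels with attention-like weights) can be illustrated with a small sketch. This is a toy example, not the TabR architecture: the function name `retrieval_predict`, the `temperature` parameter, the use of raw features instead of learned embeddings, and the sample data are all assumptions made for illustration.

```python
import numpy as np

def retrieval_predict(x_query, X_train, y_train, k=3, temperature=1.0):
    """Toy retrieval-augmented regression (illustrative only, not TabR itself).

    Retrieves the k nearest training objects and returns a softmax-weighted
    average of their labels, plus the neighbor indices and weights.
    """
    # Squared Euclidean distance from the query to every training object.
    dists = np.sum((X_train - x_query) ** 2, axis=1)
    # Indices of the k closest candidates.
    nearest = np.argsort(dists)[:k]
    # Attention-like weights: closer neighbors receive larger weights.
    logits = -dists[nearest] / temperature
    logits = logits - logits.max()  # shift for numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum()
    prediction = float(weights @ y_train[nearest])
    return prediction, nearest, weights

# Hypothetical training data: three objects near the origin and one outlier.
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
y_train = np.array([1.0, 2.0, 3.0, 100.0])

pred, nearest, weights = retrieval_predict(
    np.array([0.1, 0.1]), X_train, y_train, k=3
)
```

In this sketch the distant outlier is never retrieved, so its label cannot influence the prediction. In a real retrieval-augmented model, the similarity would typically be computed on learned representations rather than on raw features.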
Related papers
- TabDPT: Scaling Tabular Foundation Models [20.00390825519329]
We show how to harness the power of real data to improve performance and generalization.
Our model achieves state-of-the-art performance on the CC18 (classification) and CTR23 (regression) benchmarks.
TabDPT also demonstrates strong scaling as both model size and amount of available data increase.
arXiv Detail & Related papers (2024-10-23T18:00:00Z)
- TableRAG: Million-Token Table Understanding with Language Models [53.039560091592215]
TableRAG is a Retrieval-Augmented Generation (RAG) framework specifically designed for LM-based table understanding.
TableRAG leverages query expansion combined with schema and cell retrieval to pinpoint crucial information before providing it to the LMs.
Our results demonstrate that TableRAG achieves the highest retrieval quality, leading to the new state-of-the-art performance on large-scale table understanding.
arXiv Detail & Related papers (2024-10-07T04:15:02Z)
- A Comprehensive Benchmark of Machine and Deep Learning Across Diverse Tabular Datasets [0.6144680854063939]
We introduce a benchmark aimed at better characterizing types of datasets where Deep Learning models excel.
We evaluate 111 datasets with 20 different models, including both regression and classification tasks.
Building on the results of this benchmark, we train a model that predicts scenarios where DL models outperform alternative methods with 86.1% accuracy.
arXiv Detail & Related papers (2024-08-27T06:58:52Z)
- RelBench: A Benchmark for Deep Learning on Relational Databases [78.52438155603781]
We present RelBench, a public benchmark for solving predictive tasks over relational databases with graph neural networks.
We use RelBench to conduct the first comprehensive study of Relational Deep Learning (RDL).
RDL learns better whilst reducing human work needed by more than an order of magnitude.
arXiv Detail & Related papers (2024-07-29T14:46:13Z)
- Modern Neighborhood Components Analysis: A Deep Tabular Baseline Two Decades Later [59.88557193062348]
We revisit the classic Neighborhood Component Analysis (NCA), designed to learn a linear projection that captures semantic similarities between instances.
We find that minor modifications, such as adjustments to the learning objectives and the integration of deep learning architectures, significantly enhance NCA's performance.
We also introduce a neighbor sampling strategy that improves both the efficiency and predictive accuracy of our proposed ModernNCA.
arXiv Detail & Related papers (2024-07-03T16:38:57Z)
- TabSketchFM: Sketch-based Tabular Representation Learning for Data Discovery over Data Lakes [25.169832192255956]
We present TabSketchFM, a neural tabular model for data discovery over data lakes.
We finetune the pretrained model for identifying unionable, joinable, and subset table pairs.
Our results demonstrate significant improvements in F1 scores for search compared to state-of-the-art techniques.
arXiv Detail & Related papers (2024-06-28T17:28:53Z)
- 4DBInfer: A 4D Benchmarking Toolbox for Graph-Centric Predictive Modeling on Relational DBs [67.47600679176963]
Relational databases (RDBs) store vast amounts of rich, informative data spread across interconnected tables.
The progress of predictive machine learning models on RDBs falls behind advances in other domains such as computer vision or natural language processing.
We explore a class of baseline models predicated on converting multi-table datasets into graphs.
We assemble (i) a diverse collection of large-scale RDB datasets and (ii) coincident predictive tasks.
arXiv Detail & Related papers (2024-04-28T15:04:54Z)
- When Do Neural Nets Outperform Boosted Trees on Tabular Data? [65.30290020731825]
We take a step back and question the importance of the 'NN vs. GBDT' debate.
For a surprisingly high number of datasets, the performance difference between GBDTs and NNs is negligible.
We analyze dozens of metafeatures to determine what properties of a dataset make NNs or GBDTs better-suited to perform well.
Our insights act as a guide for practitioners to determine which techniques may work best on their dataset.
arXiv Detail & Related papers (2023-05-04T17:04:41Z)
- Revisiting Deep Learning Models for Tabular Data [40.67427600770095]
It is unclear for both researchers and practitioners what models perform best.
We identify two simple and powerful deep architectures. The first one is a ResNet-like architecture which turns out to be a strong baseline that is often missing in prior works.
The second model is our simple adaptation of the Transformer architecture for tabular data, which outperforms other solutions on most tasks.
arXiv Detail & Related papers (2021-06-22T17:58:10Z)
- Stance Detection Benchmark: How Robust Is Your Stance Detection? [65.91772010586605]
Stance Detection (StD) aims to detect an author's stance towards a certain topic or claim.
We introduce a StD benchmark that learns from ten StD datasets of various domains in a multi-dataset learning setting.
Within this benchmark setup, we are able to present new state-of-the-art results on five of the datasets.
arXiv Detail & Related papers (2020-01-06T13:37:51Z)