Embeddings for Tabular Data: A Survey
- URL: http://arxiv.org/abs/2302.11777v1
- Date: Thu, 23 Feb 2023 04:37:49 GMT
- Title: Embeddings for Tabular Data: A Survey
- Authors: Rajat Singh, Srikanta Bedathur
- Abstract summary: Tabular data comprises rows (samples) with the same set of columns (attributes).
Tables are becoming the natural way of storing data across various industries and academia.
A new line of research applies learning techniques to support various database tasks.
- Score: 8.010589283146222
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Tabular data, comprising rows (samples) with the same set of columns
(attributes), is one of the most widely used data types across various industries,
including financial services, health care, research, retail, and logistics, to
name a few. Tables are becoming the natural way of storing data among various
industries and academia. The data stored in these tables serve as an essential
source of information for making various decisions. As computational power and
internet connectivity increase, the data stored by these companies grow
exponentially, and not only do the databases become vast and challenging to
maintain and operate, but the quantity of database tasks also increases. Thus a
new line of research work has been started, which applies various learning
techniques to support various database tasks for such large and complex tables.
In this work, we split the quest of learning on tabular data into two phases:
The Classical Learning Phase and The Modern Machine Learning Phase. The
classical learning phase consists of the models such as SVMs, linear and
logistic regression, and tree-based methods. These models are best suited for
small-size tables. However, the number of tasks these models can address is
limited to classification and regression. In contrast, the Modern Machine
Learning Phase contains models that use deep learning for learning latent space
representation of table entities. The objective of this survey is to scrutinize
the varied approaches used by practitioners to learn representations for
structured data, and to compare their efficacy.
Related papers
- PORTAL: Scalable Tabular Foundation Models via Content-Specific Tokenization [7.036380633387952]
We introduce PORTAL (Pretraining One-Row-at-a-Time for All tabLes), a framework that handles various data modalities without the need for cleaning or preprocessing.
It can be effectively pre-trained on online-collected datasets and fine-tuned to match state-of-the-art methods on complex classification and regression tasks.
arXiv Detail & Related papers (2024-10-17T13:05:44Z)
- RelBench: A Benchmark for Deep Learning on Relational Databases [78.52438155603781]
We present RelBench, a public benchmark for solving tasks over databases with graph neural networks.
We use RelBench to conduct the first comprehensive study of Relational Deep Learning (RDL).
RDL learns better models whilst reducing the human work needed by more than an order of magnitude.
arXiv Detail & Related papers (2024-07-29T14:46:13Z)
- MambaTab: A Plug-and-Play Model for Learning Tabular Data [13.110156202816112]
This work introduces an innovative approach based on a structured state-space model (SSM), MambaTab, for tabular data.
MambaTab delivers superior performance while requiring significantly fewer parameters, as empirically validated on diverse benchmark datasets.
arXiv Detail & Related papers (2024-01-16T22:44:12Z)
- Relational Deep Learning: Graph Representation Learning on Relational Databases [69.7008152388055]
We introduce an end-to-end representation approach to learn on data laid out across multiple tables.
Message Passing Graph Neural Networks can then automatically learn across the graph to extract representations that leverage all data input.
arXiv Detail & Related papers (2023-12-07T18:51:41Z)
- GFS: Graph-based Feature Synthesis for Prediction over Relational Databases [39.975491511390985]
We propose a novel framework called Graph-based Feature Synthesis (GFS).
GFS formulates a relational database as a heterogeneous graph database.
In an experiment over four real-world multi-table relational databases, GFS outperforms previous methods designed for relational databases.
arXiv Detail & Related papers (2023-12-04T16:54:40Z)
- Testing the Limits of Unified Sequence to Sequence LLM Pretraining on Diverse Table Data Tasks [2.690048852269647]
Our work is the first attempt at studying the advantages of a unified approach to table-specific pretraining when scaled from 770M to 11B sequence-to-sequence models.
arXiv Detail & Related papers (2023-10-01T21:06:15Z)
- Towards Cross-Table Masked Pretraining for Web Data Mining [22.952238405240188]
We propose an innovative, generic, and efficient cross-table pretraining framework, dubbed CM2.
Our experiments demonstrate CM2's state-of-the-art performance and validate that cross-table pretraining can enhance various downstream tasks.
arXiv Detail & Related papers (2023-07-10T02:27:38Z)
- Synthetic Model Combination: An Instance-wise Approach to Unsupervised Ensemble Learning [92.89846887298852]
Consider making a prediction over new test data without any opportunity to learn from a training set of labelled data.
Instead, we are given access to a set of expert models and their predictions, alongside some limited information about the dataset used to train them.
arXiv Detail & Related papers (2022-10-11T10:20:31Z)
- A Graph Representation of Semi-structured Data for Web Question Answering [96.46484690047491]
We propose a novel graph representation of Web tables and lists based on a systematic categorization of the components in semi-structured data as well as their relations.
Our method improves F1 score by 3.90 points over the state-of-the-art baselines.
arXiv Detail & Related papers (2020-10-14T04:01:54Z)
- Relation-Guided Representation Learning [53.60351496449232]
We propose a new representation learning method that explicitly models and leverages sample relations.
Our framework preserves the relations between samples well.
By seeking to embed samples into a subspace, we show that our method can address the large-scale and out-of-sample problems.
arXiv Detail & Related papers (2020-07-11T10:57:45Z)
- DeGAN: Data-Enriching GAN for Retrieving Representative Samples from a Trained Classifier [58.979104709647295]
We bridge the gap between the abundance of available data and lack of relevant data, for the future learning tasks of a trained network.
We use the available data, which may be an imbalanced subset of the original training dataset or a related domain dataset, to retrieve representative samples.
We demonstrate that data from a related domain can be leveraged to achieve state-of-the-art performance.
arXiv Detail & Related papers (2019-12-27T02:05:45Z)