Embeddings for Tabular Data: A Survey
- URL: http://arxiv.org/abs/2302.11777v1
- Date: Thu, 23 Feb 2023 04:37:49 GMT
- Title: Embeddings for Tabular Data: A Survey
- Authors: Rajat Singh, Srikanta Bedathur
- Abstract summary: Tabular data comprises rows (samples) with the same set of columns (attributes).
Tables are becoming the natural way of storing data across industries and academia.
A new line of research applies learning techniques to support various database tasks.
- Score: 8.010589283146222
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Tabular data, comprising rows (samples) with the same set of columns
(attributes), is one of the most widely used data types across industries,
including financial services, health care, research, retail, and logistics, to
name a few. Tables are becoming the natural way of storing data in both
industry and academia. The data stored in these tables serve as an essential
source of information for decision making. As computational power and
internet connectivity increase, the data stored by these companies grow
exponentially; not only do the databases become vast and challenging to
maintain and operate, but the number of database tasks also increases. Thus a
new line of research has emerged that applies learning techniques to support
database tasks over such large and complex tables.
In this work, we split the quest of learning on tabular data into two phases:
The Classical Learning Phase and The Modern Machine Learning Phase. The
classical learning phase consists of models such as SVMs, linear and
logistic regression, and tree-based methods. These models are best suited for
small tables, but the range of tasks they can address is limited to
classification and regression. In contrast, the Modern Machine Learning Phase
contains models that use deep learning to learn latent-space representations
of table entities. The objective of this survey is to scrutinize the varied
approaches practitioners use to learn representations of structured data, and
to compare their efficacy.
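The classical learning phase can be sketched with a toy example: a depth-1 decision tree (a stump), the simplest member of the tree-based family, fit on a small table. The columns, data, and helper names below are illustrative assumptions, not from the survey.

```python
# Minimal sketch of the "classical learning phase": a depth-1 decision
# tree (stump) for binary classification on a small toy table.
# All column names and data below are illustrative, not from the survey.

def fit_stump(rows, labels):
    """Pick the (feature, threshold) split minimizing misclassifications."""
    n_features = len(rows[0])
    best = None  # (errors, feature, threshold, left_label, right_label)
    for f in range(n_features):
        for threshold in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= threshold]
            right = [y for r, y in zip(rows, labels) if r[f] > threshold]
            # Majority label on each side of the split.
            l_lab = max(set(left), key=left.count) if left else 0
            r_lab = max(set(right), key=right.count) if right else 0
            errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, threshold, l_lab, r_lab)
    return best[1:]

def predict_stump(stump, row):
    f, threshold, l_lab, r_lab = stump
    return l_lab if row[f] <= threshold else r_lab

# Toy table: columns = (age, income); label = 1 for "high value" rows.
X = [(25, 30), (32, 45), (47, 80), (51, 90), (38, 60), (29, 35)]
y = [0, 0, 1, 1, 1, 0]

stump = fit_stump(X, y)
print([predict_stump(stump, r) for r in X])  # reproduces the training labels
```

Such single-table models work well at this scale, but, as the abstract notes, they are confined to classification and regression over one fixed schema.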
Related papers
- TabReD: A Benchmark of Tabular Machine Learning in-the-Wild [30.922069185335246]
We show that industry-grade datasets are underrepresented in academic benchmarks for machine learning.
We introduce TabReD, a collection of eight industry-grade datasets covering a wide range of domains.
We show that evaluation on time-based data splits leads to a different ranking of methods compared to evaluation on the random splits more common in academic benchmarks.
arXiv Detail & Related papers (2024-06-27T17:55:31Z)
- LaTable: Towards Large Tabular Models [63.995130144110156]
Tabular generative foundation models are hard to build due to the heterogeneous feature spaces of different datasets.
LaTable is a novel diffusion model that addresses these challenges and can be trained across different datasets.
We find that LaTable outperforms baselines on in-distribution generation, and that finetuning LaTable can generate out-of-distribution datasets better with fewer samples.
arXiv Detail & Related papers (2024-06-25T16:03:50Z)
- Large Language Model for Table Processing: A Survey [9.144614058716083]
Large Language Models (LLMs) offer significant public benefits, garnering interest from academia and industry.
Tables, typically two-dimensional and structured to store large amounts of data, are essential in daily activities like database queries, spreadsheet calculations, and generating reports from web tables.
This survey provides an extensive overview of table tasks, encompassing not only the traditional areas like table question answering (Table QA) and fact verification, but also newly emphasized aspects such as table manipulation and advanced table data analysis.
arXiv Detail & Related papers (2024-02-04T00:47:53Z)
- MambaTab: A Plug-and-Play Model for Learning Tabular Data [13.110156202816112]
This work introduces an innovative approach based on a structured state-space model (SSM), MambaTab, for tabular data.
MambaTab delivers superior performance while requiring significantly fewer parameters, as empirically validated on diverse benchmark datasets.
arXiv Detail & Related papers (2024-01-16T22:44:12Z)
- Relational Deep Learning: Graph Representation Learning on Relational Databases [69.7008152388055]
We introduce an end-to-end representation approach to learn on data laid out across multiple tables.
Message Passing Graph Neural Networks can then automatically learn across the graph to extract representations that leverage all data input.
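The message-passing idea above can be illustrated with a minimal sketch: each "customer" row aggregates the feature vectors of the "order" rows it is linked to via foreign keys. The tables, features, and function names are hypothetical illustrations, not the paper's actual architecture.

```python
# Minimal sketch of one message-passing round over a relational graph:
# each "customer" node averages the features of its linked "order" rows
# (foreign-key edges). Names and data are illustrative, not the paper's
# actual architecture.

def mean_aggregate(node_feats, neighbor_feats, edges):
    """One step: combine each node's features with the mean of its
    neighbors' features (a simple, learnable-free stand-in for a GNN layer)."""
    out = {}
    for v, feat in node_feats.items():
        msgs = [neighbor_feats[u] for u in edges.get(v, [])]
        if msgs:
            mean = [sum(m[d] for m in msgs) / len(msgs) for d in range(len(feat))]
        else:
            mean = [0.0] * len(feat)
        # Combine: average own features with the aggregated message.
        out[v] = [(a + b) / 2 for a, b in zip(feat, mean)]
    return out

# Customers table and orders table, each row a 2-d feature vector.
customers = {"c1": [1.0, 0.0], "c2": [0.0, 1.0]}
orders = {"o1": [2.0, 2.0], "o2": [4.0, 0.0], "o3": [0.0, 6.0]}
# Foreign-key edges: customer -> its orders.
fk = {"c1": ["o1", "o2"], "c2": ["o3"]}

updated = mean_aggregate(customers, orders, fk)
print(updated["c1"])  # [2.0, 0.5]
```

Stacking such rounds (with learned weights in place of the fixed averaging) lets representations flow across all linked tables, which is the core of the relational deep learning approach.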
arXiv Detail & Related papers (2023-12-07T18:51:41Z)
- GFS: Graph-based Feature Synthesis for Prediction over Relational Databases [39.975491511390985]
We propose a novel framework called Graph-based Feature Synthesis (GFS)
GFS formulates a relational database as a heterogeneous graph.
In an experiment over four real-world multi-table relational databases, GFS outperforms previous methods designed for relational databases.
arXiv Detail & Related papers (2023-12-04T16:54:40Z)
- Towards Cross-Table Masked Pretraining for Web Data Mining [22.952238405240188]
We propose an innovative, generic, and efficient cross-table pretraining framework, dubbed as CM2.
Our experiments demonstrate CM2's state-of-the-art performance and validate that cross-table pretraining can enhance various downstream tasks.
arXiv Detail & Related papers (2023-07-10T02:27:38Z)
- Synthetic Model Combination: An Instance-wise Approach to Unsupervised Ensemble Learning [92.89846887298852]
Consider making a prediction over new test data without any opportunity to learn from a training set of labelled data.
We are given access to a set of expert models and their predictions, along with limited information about the datasets used to train them.
arXiv Detail & Related papers (2022-10-11T10:20:31Z)
- A Graph Representation of Semi-structured Data for Web Question Answering [96.46484690047491]
We propose a novel graph representation of Web tables and lists based on a systematic categorization of the components in semi-structured data as well as their relations.
Our method improves F1 score by 3.90 points over the state-of-the-art baselines.
arXiv Detail & Related papers (2020-10-14T04:01:54Z)
- Relation-Guided Representation Learning [53.60351496449232]
We propose a new representation learning method that explicitly models and leverages sample relations.
Our framework well preserves the relations between samples.
By seeking to embed samples into a subspace, we show that our method can address the large-scale and out-of-sample problems.
arXiv Detail & Related papers (2020-07-11T10:57:45Z)
- DeGAN: Data-Enriching GAN for Retrieving Representative Samples from a Trained Classifier [58.979104709647295]
We bridge the gap between the abundance of available data and the lack of relevant data for the future learning tasks of a trained network.
We use the available data, that may be an imbalanced subset of the original training dataset, or a related domain dataset, to retrieve representative samples.
We demonstrate that data from a related domain can be leveraged to achieve state-of-the-art performance.
arXiv Detail & Related papers (2019-12-27T02:05:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.