Retrieval-Based Transformer for Table Augmentation
- URL: http://arxiv.org/abs/2306.11843v1
- Date: Tue, 20 Jun 2023 18:51:21 GMT
- Title: Retrieval-Based Transformer for Table Augmentation
- Authors: Michael Glass, Xueqing Wu, Ankita Rajaram Naik, Gaetano Rossiello,
Alfio Gliozzo
- Abstract summary: We introduce a novel approach toward automatic data wrangling.
We aim to address table augmentation tasks, including row/column population and data imputation.
Our model consistently and substantially outperforms both supervised statistical methods and the current state-of-the-art transformer-based models.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data preparation, also called data wrangling, is considered one of the most
expensive and time-consuming steps when performing analytics or building
machine learning models. Preparing data typically involves collecting and
merging data from complex, heterogeneous, and often large-scale data sources,
such as data lakes. In this paper, we introduce a novel approach toward
automatic data wrangling in an attempt to alleviate the effort of end-users,
e.g. data analysts, in structuring dynamic views from data lakes in the form of
tabular data. We aim to address table augmentation tasks, including row/column
population and data imputation. Given a corpus of tables, we propose a
retrieval augmented self-trained transformer model. Our self-learning strategy
consists in randomly ablating tables from the corpus and training the
retrieval-based model to reconstruct the original values or headers given the
partial tables as input. We adopt this strategy to first train the dense neural
retrieval model encoding table-parts to vectors, and then the end-to-end model
trained to perform table augmentation tasks. We test on EntiTables, the
standard benchmark for table augmentation, as well as introduce a new benchmark
to advance further research: WebTables. Our model consistently and
substantially outperforms both supervised statistical methods and the current
state-of-the-art transformer-based models.
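The self-learning strategy described in the abstract (ablate part of a table, then train a model to reconstruct it from the partial table) can be illustrated with a minimal sketch of how such training pairs might be built. This is not the authors' implementation: the `Table` class, the three `ablate_*` helpers, the `[MASK]` placeholder, and the choice of column 0 as the key column are all hypothetical, and the paper's actual sampling scheme is not given in this summary.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Table:
    """Hypothetical minimal table: a title, header names, and string cells."""
    title: str
    headers: List[str]
    rows: List[List[str]]


def ablate_cell(table: Table, rng: random.Random) -> Tuple[Table, str]:
    """Data imputation example: mask one cell; the target is its original value."""
    r = rng.randrange(len(table.rows))
    c = rng.randrange(len(table.headers))
    rows = [list(row) for row in table.rows]  # copy so the source table is untouched
    target = rows[r][c]
    rows[r][c] = "[MASK]"  # assumed placeholder token
    return Table(table.title, list(table.headers), rows), target


def ablate_column(table: Table, rng: random.Random) -> Tuple[Table, str]:
    """Column population example: drop one column; the target is its header."""
    c = rng.randrange(len(table.headers))
    headers = [h for i, h in enumerate(table.headers) if i != c]
    rows = [[v for i, v in enumerate(row) if i != c] for row in table.rows]
    return Table(table.title, headers, rows), table.headers[c]


def ablate_rows(table: Table, rng: random.Random, key_col: int = 0) -> Tuple[Table, List[str]]:
    """Row population example: keep a prefix of the rows (needs at least two rows);
    the targets are the key-column values of the removed rows."""
    keep = rng.randrange(1, len(table.rows))  # keep between 1 and n-1 rows
    kept = [list(row) for row in table.rows[:keep]]
    targets = [row[key_col] for row in table.rows[keep:]]
    return Table(table.title, list(table.headers), kept), targets


if __name__ == "__main__":
    rng = random.Random(0)
    capitals = Table(
        "Capitals",
        ["Country", "Capital", "Continent"],
        [["France", "Paris", "Europe"],
         ["Japan", "Tokyo", "Asia"],
         ["Kenya", "Nairobi", "Africa"]],
    )
    for make_example in (ablate_cell, ablate_column, ablate_rows):
        partial, target = make_example(capitals, rng)
        print(make_example.__name__, partial.headers, partial.rows, "->", target)
```

Each helper returns a (partial table, target) pair. In a setup like the one the abstract describes, the partial table would serve as the query to the retriever and the input to the end-to-end model, with the target as the reconstruction label.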
Related papers
- Relational Deep Learning: Graph Representation Learning on Relational
Databases [69.7008152388055]
We introduce an end-to-end representation approach to learn on data laid out across multiple tables.
Message Passing Graph Neural Networks can then automatically learn across the graph to extract representations that leverage all data input.
arXiv Detail & Related papers (2023-12-07T18:51:41Z)
- TabRepo: A Large Scale Repository of Tabular Model Evaluations and its AutoML Applications [9.457938949410583]
TabRepo is a new dataset of model evaluations and predictions.
It contains the predictions and metrics of 1310 models evaluated on 200 datasets.
arXiv Detail & Related papers (2023-11-06T09:17:18Z)
- Training-Free Generalization on Heterogeneous Tabular Data via
Meta-Representation [67.30538142519067]
We propose Tabular data Pre-Training via Meta-representation (TabPTM).
A deep neural network is then trained to associate these meta-representations with dataset-specific classification confidences.
Experiments validate that TabPTM achieves promising performance in new datasets, even under few-shot scenarios.
arXiv Detail & Related papers (2023-10-31T18:03:54Z)
- Generative Table Pre-training Empowers Models for Tabular Prediction [71.76829961276032]
We propose TapTap, the first attempt that leverages table pre-training to empower models for tabular prediction.
TapTap can generate high-quality synthetic tables to support various applications, including privacy protection, low resource regime, missing value imputation, and imbalanced classification.
It can be easily combined with various backbone models, including LightGBM, Multilayer Perceptron (MLP) and Transformer.
arXiv Detail & Related papers (2023-05-16T06:37:38Z)
- REaLTabFormer: Generating Realistic Relational and Tabular Data using
Transformers [0.0]
We introduce REaLTabFormer (Realistic and Tabular Transformer), a synthetic data generation model.
It first creates a parent table using an autoregressive GPT-2 model, then generates the relational dataset conditioned on the parent table using a sequence-to-sequence model.
Experiments using real-world datasets show that REaLTabFormer captures the relational structure better than a model baseline.
arXiv Detail & Related papers (2023-02-04T00:32:50Z)
- Leveraging Data Recasting to Enhance Tabular Reasoning [21.970920861791015]
Prior work has mostly relied on two data generation strategies.
The first is human annotation, which yields linguistically diverse data but is difficult to scale.
The second is synthetic generation, which is scalable and cost-effective but lacks inventiveness.
arXiv Detail & Related papers (2022-11-23T00:04:57Z)
- Scientific evidence extraction [0.0]
We propose a new dataset, PubMed Tables One Million (PubTables-1M), and a new class of metric, grid table similarity (GriTS).
PubTables-1M is nearly twice as large as the previous largest comparable dataset.
We show that object detection models trained on PubTables-1M produce excellent results out-of-the-box for all three tasks of detection, structure recognition, and functional analysis.
arXiv Detail & Related papers (2021-09-30T19:42:07Z)
- Capturing Row and Column Semantics in Transformer Based Question
Answering over Tables [9.347393642549806]
We show that one can achieve superior performance on the table QA task without using any of these specialized pre-training techniques.
Experiments on recent benchmarks show that the proposed methods can effectively locate cell values in tables (up to 98% Hit@1 accuracy on Wiki lookup questions).
arXiv Detail & Related papers (2021-04-16T18:22:30Z)
- GraPPa: Grammar-Augmented Pre-Training for Table Semantic Parsing [117.98107557103877]
We present GraPPa, an effective pre-training approach for table semantic parsing.
We construct synthetic question-SQL pairs over high-quality tables via a synchronous context-free grammar.
To maintain the model's ability to represent real-world data, we also include masked language modeling.
arXiv Detail & Related papers (2020-09-29T08:17:58Z)
- Pre-Trained Models for Heterogeneous Information Networks [57.78194356302626]
We propose a self-supervised pre-training and fine-tuning framework, PF-HIN, to capture the features of a heterogeneous information network.
PF-HIN consistently and significantly outperforms state-of-the-art alternatives on each of these tasks, on four datasets.
arXiv Detail & Related papers (2020-07-07T03:36:28Z)
- Multi-layer Optimizations for End-to-End Data Analytics [71.05611866288196]
We introduce Iterative Functional Aggregate Queries (IFAQ), a framework that realizes an alternative approach.
IFAQ treats the feature extraction query and the learning task as one program given in the IFAQ's domain-specific language.
We show that a Scala implementation of IFAQ can outperform mlpack, Scikit, and TensorFlow by several orders of magnitude for linear regression and regression tree models over several relational datasets.
arXiv Detail & Related papers (2020-01-10T16:14:44Z)