Related papers: TABBIE: Pretrained Representations of Tabular Data

URL: http://arxiv.org/abs/2105.02584v1
Date: Thu, 6 May 2021 11:15:16 GMT
Title: TABBIE: Pretrained Representations of Tabular Data
Authors: Hiroshi Iida, Dung Thai, Varun Manjunatha, Mohit Iyyer
Abstract summary: We devise a simple pretraining objective that learns exclusively from tabular data. Unlike competing approaches, our model (TABBIE) provides embeddings of all table substructures. A qualitative analysis of our model's learned cell, column, and row representations shows that it understands complex table semantics and numerical trends.
Score: 22.444607481407633
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Existing work on tabular representation learning jointly models tables and associated text using self-supervised objective functions derived from pretrained language models such as BERT. While this joint pretraining improves tasks involving paired tables and text (e.g., answering questions about tables), we show that it underperforms on tasks that operate over tables without any associated text (e.g., populating missing cells). We devise a simple pretraining objective (corrupt cell detection) that learns exclusively from tabular data and reaches the state-of-the-art on a suite of table based prediction tasks. Unlike competing approaches, our model (TABBIE) provides embeddings of all table substructures (cells, rows, and columns), and it also requires far less compute to train. A qualitative analysis of our model's learned cell, column, and row representations shows that it understands complex table semantics and numerical trends.
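The corrupt cell detection objective named in the abstract can be illustrated with a small data-preparation sketch. It is not the authors' implementation: the abstract does not say how cells are corrupted, so the replacement strategy here (sampling a different value from the same table) is an assumption, and the function name `corrupt_cells` and its parameters `corrupt_prob` and `seed` are hypothetical. The sketch only shows how a table can be turned into a corrupted copy plus per-cell binary labels that a model would be trained to predict.

```python
import random


def corrupt_cells(table, corrupt_prob=0.15, seed=0):
    """Return (corrupted_table, labels) for a table given as a list of rows.

    labels[i][j] is 1 if the cell at row i, column j was replaced, else 0.
    Assumption: replacements are drawn from other cells of the same table.
    """
    rng = random.Random(seed)
    # Pool of candidate replacement values (illustrative choice only).
    pool = [cell for row in table for cell in row]
    corrupted, labels = [], []
    for row in table:
        new_row, row_labels = [], []
        for cell in row:
            if rng.random() < corrupt_prob:
                # Only accept replacements that differ from the original value,
                # so a label of 1 always marks a cell that really changed.
                candidates = [c for c in pool if c != cell]
                if candidates:
                    new_row.append(rng.choice(candidates))
                    row_labels.append(1)
                    continue
            new_row.append(cell)
            row_labels.append(0)
        corrupted.append(new_row)
        labels.append(row_labels)
    return corrupted, labels


if __name__ == "__main__":
    table = [["France", "Paris", "67"], ["Japan", "Tokyo", "125"]]
    corrupted, labels = corrupt_cells(table, corrupt_prob=0.5, seed=1)
    for row, row_labels in zip(corrupted, labels):
        print(row, row_labels)
```

A classifier over cell embeddings would then be trained to predict `labels` from `corrupted`, which requires no text paired with the table.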

Related papers

Table Foundation Models: on knowledge pre-training for tabular learning [47.485516405457595]
TARTE is a foundation model that transforms tables to knowledge-enhanced vector representations using the string. Pre-trained on large relational data, TARTE yields representations that facilitate subsequent learning with little additional cost.
arXiv Detail & Related papers (2025-05-20T14:27:51Z)
OmniTab: Pretraining with Natural and Synthetic Data for Few-shot Table-based Question Answering [106.73213656603453]
We develop a simple table-based QA model with minimal annotation effort. We propose an omnivorous pretraining approach that consumes both natural and synthetic data.
arXiv Detail & Related papers (2022-07-08T01:23:45Z)
Table Retrieval May Not Necessitate Table-specific Model Design [83.27735758203089]
We focus on the task of table retrieval, and ask: "is table-specific model design necessary for table retrieval?" Based on an analysis on a table-based portion of the Natural Questions dataset (NQ-table), we find that structure plays a negligible role in more than 70% of the cases. We then experiment with three modules to explicitly encode table structures, namely auxiliary row/column embeddings, hard attention masks, and soft relation-based attention biases. None of these yielded significant improvements, suggesting that table-specific model design may not be necessary for table retrieval.
arXiv Detail & Related papers (2022-05-19T20:35:23Z)
Table Pre-training: A Survey on Model Architectures, Pretraining Objectives, and Downstream Tasks [37.35651138851127]
A flurry of table pre-training frameworks have been proposed following the success of text and images. Table pre-training usually takes the form of table-text joint pre-training. This survey aims to provide a comprehensive review of different model designs, pre-training objectives, and downstream tasks for table pre-training.
arXiv Detail & Related papers (2022-01-24T15:22:24Z)
TGRNet: A Table Graph Reconstruction Network for Table Structure Recognition [76.06530816349763]
We propose an end-to-end trainable table graph reconstruction network (TGRNet) for table structure recognition. Specifically, the proposed method has two main branches, a cell detection branch and a cell logical location branch, to jointly predict the spatial location and the logical location of different cells.
arXiv Detail & Related papers (2021-06-20T01:57:05Z)
TCN: Table Convolutional Network for Web Table Interpretation [52.32515851633981]
We propose a novel table representation learning approach considering both the intra- and inter-table contextual information. Our method can outperform competitive baselines by +4.8% of F1 for column type prediction and by +4.1% of F1 for column pairwise relation prediction.
arXiv Detail & Related papers (2021-02-17T02:18:10Z)
Learning Better Representation for Tables by Self-Supervised Tasks [23.69766883380125]
We propose two self-supervised tasks, Number Ordering and Significance Ordering, to help learn better table representations. We test our methods on the widely used ROTOWIRE dataset, which consists of NBA game statistics and related news.
arXiv Detail & Related papers (2020-10-15T09:03:38Z)
Understanding tables with intermediate pre-training [11.96734018295146]
We adapt TAPAS, a table-based BERT model, to recognize entailment. We evaluate table pruning techniques as a pre-processing step to drastically improve the training and prediction efficiency.
arXiv Detail & Related papers (2020-10-01T17:43:27Z)
GraPPa: Grammar-Augmented Pre-Training for Table Semantic Parsing [117.98107557103877]
We present GraPPa, an effective pre-training approach for table semantic parsing. We construct synthetic question-SQL pairs over high-quality tables via a synchronous context-free grammar. To maintain the model's ability to represent real-world data, we also include masked language modeling.
arXiv Detail & Related papers (2020-09-29T08:17:58Z)
ToTTo: A Controlled Table-To-Text Generation Dataset [61.83159452483026]
ToTTo is an open-domain English table-to-text dataset with over 120,000 training examples. We introduce a dataset construction process where annotators directly revise existing candidate sentences from Wikipedia. While usually fluent, existing methods often hallucinate phrases that are not supported by the table.
arXiv Detail & Related papers (2020-04-29T17:53:45Z)

This list is automatically generated from the titles and abstracts of the papers on this site.