TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data
- URL: http://arxiv.org/abs/2005.08314v1
- Date: Sun, 17 May 2020 17:26:40 GMT
- Title: TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data
- Authors: Pengcheng Yin, Graham Neubig, Wen-tau Yih, Sebastian Riedel
- Abstract summary: We present TaBERT, a pretrained LM that jointly learns representations for NL sentences and tables.
TaBERT is trained on a large corpus of 26 million tables and their English contexts.
Implementation of the model will be available at http://fburl.com/TaBERT.
- Score: 113.29476656550342
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent years have witnessed the burgeoning of pretrained language models
(LMs) for text-based natural language (NL) understanding tasks. Such models are
typically trained on free-form NL text, hence may not be suitable for tasks
like semantic parsing over structured data, which require reasoning over both
free-form NL questions and structured tabular data (e.g., database tables). In
this paper we present TaBERT, a pretrained LM that jointly learns
representations for NL sentences and (semi-)structured tables. TaBERT is
trained on a large corpus of 26 million tables and their English contexts. In
experiments, neural semantic parsers using TaBERT as feature representation
layers achieve new best results on the challenging weakly-supervised semantic
parsing benchmark WikiTableQuestions, while performing competitively on the
text-to-SQL dataset Spider. Implementation of the model will be available at
http://fburl.com/TaBERT .
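To make the joint text/table input concrete, here is a minimal sketch of how a question and a single table row can be linearized into one BERT-style sequence, loosely following the column name / type / cell value layout described in the paper. The Hugging Face `transformers` tokenizer, the example table, and the separator choices are stand-ins, not the released TaBERT implementation (which also selects content snapshots and pools row encodings with vertical self-attention).

```python
# Minimal sketch (not the released TaBERT code): linearize an NL question
# together with one table row in a "column name | type | cell value" layout
# and encode the pair with a standard BERT tokenizer from `transformers`.
from transformers import BertTokenizer


def linearize_row(header, row):
    """Join one table row into a 'name | type | value' string per column."""
    return " ; ".join(
        f"{name} | {col_type} | {value}"
        for (name, col_type), value in zip(header, row)
    )


tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

question = "which nation has the highest gdp?"
header = [("Nation", "text"), ("GDP (million USD)", "real")]   # (name, type)
row = ["United States", "21,439,453"]                          # one table row

encoded = tokenizer(question, linearize_row(header, row))
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
```

In a semantic parser, the contextual encodings produced from such joint sequences would serve as the feature representation layer mentioned in the abstract; the sketch only shows the input construction.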
Related papers
- Making Pre-trained Language Models Great on Tabular Prediction [50.70574370855663]
The transferability of deep neural networks (DNNs) has driven significant progress in image and language processing.
We present TP-BERTa, a specifically pre-trained LM for tabular data prediction.
A novel relative magnitude tokenization converts scalar numerical feature values to finely discrete, high-dimensional tokens, and an intra-feature attention approach integrates feature values with the corresponding feature names.
arXiv Detail & Related papers (2024-03-04T08:38:56Z)
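The relative magnitude tokenization described in the TP-BERTa entry above can be pictured as mapping each scalar feature value to a discrete magnitude token. Below is a simplified sketch under that reading; the quantile binning, bin count, and token naming are illustrative assumptions rather than the paper's actual procedure, which further combines these tokens with feature names through intra-feature attention.

```python
# Simplified sketch of the idea behind relative magnitude tokenization:
# map scalar feature values to discrete magnitude tokens via quantile bins.
# Bin count, token names, and the use of quantiles are illustrative choices.
import numpy as np


def build_bins(train_values, num_bins=8):
    """Estimate quantile bin edges for one numerical feature."""
    quantiles = np.linspace(0.0, 1.0, num_bins + 1)[1:-1]
    return np.quantile(train_values, quantiles)


def magnitude_token(value, bin_edges, feature_name):
    """Return a discrete token such as 'age_bin_3' for a scalar value."""
    bin_index = int(np.digitize(value, bin_edges))
    return f"{feature_name}_bin_{bin_index}"


ages = np.array([18, 22, 25, 31, 37, 44, 52, 61, 70, 83], dtype=float)
edges = build_bins(ages)
print([magnitude_token(a, edges, "age") for a in (19.0, 40.0, 79.0)])
```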
- QTSumm: Query-Focused Summarization over Tabular Data [58.62152746690958]
People primarily consult tables to conduct data analysis or answer specific questions.
We define a new query-focused table summarization task, where text generation models have to perform human-like reasoning over the given table to produce a tailored summary.
We introduce a new benchmark named QTSumm for this task, which contains 7,111 human-annotated query-summary pairs over 2,934 tables.
arXiv Detail & Related papers (2023-05-23T17:43:51Z)
- Assessing Linguistic Generalisation in Language Models: A Dataset for Brazilian Portuguese [4.941630596191806]
We propose a set of intrinsic evaluation tasks that inspect the linguistic information encoded in models developed for Brazilian Portuguese.
These tasks are designed to evaluate how different language models generalise information related to grammatical structures and multiword expressions.
arXiv Detail & Related papers (2023-05-23T13:49:14Z)
- XRICL: Cross-lingual Retrieval-Augmented In-Context Learning for Cross-lingual Text-to-SQL Semantic Parsing [70.40401197026925]
In-context learning using large language models has recently shown surprising results for semantic parsing tasks.
This work introduces the XRICL framework, which learns to retrieve relevant English exemplars for a given query.
We also include global translation exemplars for a target language to facilitate the translation process for large language models.
arXiv Detail & Related papers (2022-10-25T01:33:49Z)
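The exemplar retrieval step in the XRICL entry above can be approximated with any off-the-shelf text-similarity method. The sketch below uses TF-IDF cosine similarity from scikit-learn as a stand-in for XRICL's learned retriever; the exemplar pool, query, and prompt layout are illustrative assumptions.

```python
# Illustrative stand-in for XRICL-style exemplar retrieval: rank English
# question/SQL exemplars by TF-IDF cosine similarity to the input query,
# then place the top matches into an in-context learning prompt.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

exemplars = [
    ("How many singers are there?", "SELECT count(*) FROM singer"),
    ("List the names of all students.", "SELECT name FROM student"),
    ("What is the average age of dogs?", "SELECT avg(age) FROM dogs"),
]

query = "How many students are enrolled?"  # non-English in the real setting

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform([q for q, _ in exemplars] + [query])
scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
top = scores.argsort()[::-1][:2]

prompt = "\n\n".join(f"Q: {exemplars[i][0]}\nSQL: {exemplars[i][1]}" for i in top)
prompt += f"\n\nQ: {query}\nSQL:"
print(prompt)
```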
- TunBERT: Pretrained Contextualized Text Representation for Tunisian Dialect [0.0]
We investigate the feasibility of training monolingual Transformer-based language models for under-represented languages.
We show that the use of noisy web-crawled data instead of structured data is more suitable for such a non-standardized language.
Our best performing TunBERT model reaches or improves the state-of-the-art in all three downstream tasks.
arXiv Detail & Related papers (2021-11-25T15:49:50Z)
- Learning Which Features Matter: RoBERTa Acquires a Preference for Linguistic Generalizations (Eventually) [25.696099563130517]
We introduce a new English-language diagnostic set called MSGS (the Mixed Signals Generalization Set).
MSGS consists of 20 ambiguous binary classification tasks that we use to test whether a pretrained model prefers linguistic or surface generalizations during fine-tuning.
We pretrain RoBERTa models from scratch on quantities of data ranging from 1M to 1B words and compare their performance on MSGS to the publicly available RoBERTa-base.
We find that models can learn to represent linguistic features with little pretraining data, but require far more data to learn to prefer linguistic generalizations over surface ones.
arXiv Detail & Related papers (2020-10-11T22:09:27Z)
- GraPPa: Grammar-Augmented Pre-Training for Table Semantic Parsing [117.98107557103877]
We present GraPPa, an effective pre-training approach for table semantic parsing.
We construct synthetic question-SQL pairs over high-quality tables via a synchronous context-free grammar (SCFG).
To maintain the model's ability to represent real-world data, we also include masked language modeling.
arXiv Detail & Related papers (2020-09-29T08:17:58Z)
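The synchronous context-free grammar in the GraPPa entry above generates aligned question-SQL pairs by expanding natural-language and SQL templates in lockstep. Here is a toy sketch with a single hand-written rule; the rule, schema, and vocabulary are made-up illustrations, not the grammar or data used in the paper.

```python
# Toy sketch of synchronous question/SQL generation: the same random choices
# fill slots in an aligned natural-language template and SQL template, so each
# sample is a paired (question, SQL) example. The rule and schema are made up.
import random

SCHEMA = {"city": ["population", "area"], "singer": ["age", "net_worth"]}

# One synchronous rule: aligned NL and SQL templates sharing the same
# aggregate, column, and table slots.
NL_TEMPLATE = "what is the {agg_word} {column} of all {table}s ?"
SQL_TEMPLATE = "SELECT {agg}({column}) FROM {table}"
AGGS = {"max": "maximum", "min": "minimum", "avg": "average"}


def sample_pair(rng):
    table = rng.choice(list(SCHEMA))
    column = rng.choice(SCHEMA[table])
    agg = rng.choice(list(AGGS))
    question = NL_TEMPLATE.format(agg_word=AGGS[agg], column=column, table=table)
    sql = SQL_TEMPLATE.format(agg=agg, column=column, table=table)
    return question, sql


rng = random.Random(0)
for question, sql in (sample_pair(rng) for _ in range(3)):
    print(question, "=>", sql)
```

The sketch only covers the synthetic data generation side; the pre-training objectives applied to these pairs are described in the entry above.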
- Assessing the Bilingual Knowledge Learned by Neural Machine Translation Models [72.56058378313963]
We bridge the gap by assessing the bilingual knowledge learned by NMT models with phrase tables.
We find that NMT models learn patterns from simple to complex and distill essential bilingual knowledge from the training examples.
arXiv Detail & Related papers (2020-04-28T03:44:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.