OmniTab: Pretraining with Natural and Synthetic Data for Few-shot
Table-based Question Answering
- URL: http://arxiv.org/abs/2207.03637v1
- Date: Fri, 8 Jul 2022 01:23:45 GMT
- Authors: Zhengbao Jiang, Yi Mao, Pengcheng He, Graham Neubig, Weizhu Chen
- Abstract summary: We develop a simple table-based QA model with minimal annotation effort.
We propose an omnivorous pretraining approach that consumes both natural and synthetic data.
- Score: 106.73213656603453
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The information in tables can be an important complement to text, making
table-based question answering (QA) systems of great value. The intrinsic
complexity of handling tables often adds an extra burden to both model design
and data annotation. In this paper, we aim to develop a simple table-based QA
model with minimal annotation effort. Motivated by the fact that table-based QA
requires both alignment between questions and tables and the ability to perform
complicated reasoning over multiple table elements, we propose an omnivorous
pretraining approach that consumes both natural and synthetic data to endow
models with these respective abilities. Specifically, given freely available
tables, we leverage retrieval to pair them with relevant natural sentences for
mask-based pretraining, and synthesize NL questions by converting SQL sampled
from tables for pretraining with a QA loss. We perform extensive experiments in
both few-shot and full settings, and the results clearly demonstrate the
superiority of our model OmniTab, with the best multitasking approach achieving
an absolute gain of 16.2% and 2.7% in 128-shot and full settings respectively,
also establishing a new state-of-the-art on WikiTableQuestions. Detailed
ablations and analyses reveal different characteristics of natural and
synthetic data, shedding light on future directions in omnivorous pretraining.
Code, pretraining data, and pretrained models are available at
https://github.com/jzbjyb/OmniTab.
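The two data sources described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The toy table, the function names, the token-overlap retriever, and the template in `sql_to_question` are all assumptions made for illustration. In particular, the template is only a stand-in for whatever SQL-to-NL conversion the paper uses, and the masking step of the natural-data branch is omitted.

```python
import random
import sqlite3

# Toy table; column names and rows are illustrative, not from the paper.
COLUMNS = ["city", "country", "population"]
ROWS = [
    ("Oslo", "Norway", 700000),
    ("Bergen", "Norway", 285000),
    ("Lyon", "France", 520000),
]


def load_table(columns, rows):
    """Load the toy table into an in-memory SQLite database as table `t`."""
    conn = sqlite3.connect(":memory:")
    conn.execute(f"CREATE TABLE t ({', '.join(columns)})")
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(f"INSERT INTO t VALUES ({placeholders})", rows)
    return conn


def sample_sql(columns, rows, rng):
    """Sample a 'SELECT a FROM t WHERE b = value' query grounded in a real row."""
    row = rng.choice(rows)
    select_col, where_col = rng.sample(columns, 2)
    value = row[columns.index(where_col)]
    sql = f"SELECT {select_col} FROM t WHERE {where_col} = ?"
    return sql, value, select_col, where_col


def sql_to_question(select_col, where_col, value):
    """Hypothetical template standing in for a learned SQL-to-NL converter."""
    return f"What is the {select_col} when the {where_col} is {value}?"


def synthesize_example(conn, columns, rows, rng):
    """Synthetic branch: sample SQL, execute it for the answer, verbalize it."""
    sql, value, select_col, where_col = sample_sql(columns, rows, rng)
    answer = [r[0] for r in conn.execute(sql, (value,)).fetchall()]
    question = sql_to_question(select_col, where_col, value)
    return {"question": question, "sql": sql, "answer": answer}


def retrieve_sentence(columns, rows, sentences):
    """Natural branch: pick the sentence sharing the most tokens with the table.

    Plain token overlap is an assumed stand-in for a real retriever.
    """
    table_tokens = {str(cell).lower() for row in rows for cell in row}
    table_tokens |= {c.lower() for c in columns}

    def overlap(sentence):
        tokens = set(sentence.lower().replace(".", "").split())
        return len(table_tokens & tokens)

    return max(sentences, key=overlap)
```

Calling `synthesize_example(load_table(COLUMNS, ROWS), COLUMNS, ROWS, random.Random(0))` returns a question, the sampled SQL, and the answer obtained by executing that SQL, which is the kind of triple a QA loss could be trained on. The sentence returned by `retrieve_sentence` would then have table-related tokens masked for mask-based pretraining.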
Related papers
- KET-QA: A Dataset for Knowledge Enhanced Table Question Answering [63.56707527868466]
We propose to use a knowledge base (KB) as the external knowledge source for TableQA.
Every question requires the integration of information from both the table and the sub-graph to be answered.
We design a retriever-reasoner structured pipeline model to extract pertinent information from the vast knowledge sub-graph.
arXiv Detail & Related papers (2024-05-13T18:26:32Z)
- Testing the Limits of Unified Sequence to Sequence LLM Pretraining on
Diverse Table Data Tasks [2.690048852269647]
Our work is the first attempt at studying the advantages of a unified approach to table-specific pretraining when scaled from 770M to 11B sequence-to-sequence models.
arXiv Detail & Related papers (2023-10-01T21:06:15Z)
- MultiTabQA: Generating Tabular Answers for Multi-Table Question
Answering [61.48881995121938]
Real-world queries are complex in nature, often over multiple tables in a relational database or web page.
Our model, MultiTabQA, not only answers questions over multiple tables, but also generalizes to generate tabular answers.
arXiv Detail & Related papers (2023-05-22T08:25:15Z)
- Generative Table Pre-training Empowers Models for Tabular Prediction [71.76829961276032]
We propose TapTap, the first attempt that leverages table pre-training to empower models for tabular prediction.
TapTap can generate high-quality synthetic tables to support various applications, including privacy protection, low resource regime, missing value imputation, and imbalanced classification.
It can be easily combined with various backbone models, including LightGBM, Multilayer Perceptron (MLP) and Transformer.
arXiv Detail & Related papers (2023-05-16T06:37:38Z)
- ReasTAP: Injecting Table Reasoning Skills During Pre-training via
Synthetic Reasoning Examples [15.212332890570869]
We develop ReasTAP to show that high-level table reasoning skills can be injected into models during pre-training without a complex table-specific architecture design.
ReasTAP achieves new state-of-the-art performance on all benchmarks and delivers a significant improvement in low-resource settings.
arXiv Detail & Related papers (2022-10-22T07:04:02Z)
- Table Retrieval May Not Necessitate Table-specific Model Design [83.27735758203089]
We focus on the task of table retrieval, and ask: "is table-specific model design necessary for table retrieval?"
Based on an analysis of a table-based portion of the Natural Questions dataset (NQ-table), we find that structure plays a negligible role in more than 70% of the cases.
We then experiment with three modules to explicitly encode table structures, namely auxiliary row/column embeddings, hard attention masks, and soft relation-based attention biases.
None of these yielded significant improvements, suggesting that table-specific model design may not be necessary for table retrieval.
arXiv Detail & Related papers (2022-05-19T20:35:23Z)
- Understanding tables with intermediate pre-training [11.96734018295146]
We adapt TAPAS, a table-based BERT model, to recognize entailment.
We evaluate table pruning techniques as a pre-processing step to drastically improve the training and prediction efficiency.
arXiv Detail & Related papers (2020-10-01T17:43:27Z)
- GraPPa: Grammar-Augmented Pre-Training for Table Semantic Parsing [117.98107557103877]
We present GraPPa, an effective pre-training approach for table semantic parsing.
We construct synthetic question-SQL pairs over high-quality tables via a synchronous context-free grammar.
To maintain the model's ability to represent real-world data, we also include masked language modeling.
arXiv Detail & Related papers (2020-09-29T08:17:58Z)