Korean-Specific Dataset for Table Question Answering
- URL: http://arxiv.org/abs/2201.06223v1
- Date: Mon, 17 Jan 2022 05:47:44 GMT
- Title: Korean-Specific Dataset for Table Question Answering
- Authors: Changwook Jun, Jooyoung Choi, Myoseop Sim, Hyun Kim, Hansol Jang,
Kyungkoo Min
- Abstract summary: We build Korean-specific datasets for table question answering.
The Korean table question answering corpus consists of 70k question-answer pairs created by crowd-sourced workers.
We make our datasets publicly available via our GitHub repository.
- Score: 3.7056358801102682
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Existing question answering systems mainly focus on dealing with text data.
However, much of the data produced daily is stored in the form of tables that
can be found in documents and relational databases, or on the web. To solve the
task of question answering over tables, there exist many datasets for table
question answering written in English, but few Korean datasets. In this paper,
we demonstrate how we construct Korean-specific datasets for table question
answering: the Korean tabular dataset is a collection of 1.4M tables with
corresponding descriptions for unsupervised pre-training of language models.
Korean table question answering corpus consists of 70k pairs of questions and
answers created by crowd-sourced workers. We then build a pre-trained language
model based on the Transformer architecture, fine-tune it for table question
answering with these datasets, and report the evaluation
results of our model. We make our datasets publicly available via our GitHub
repository, and hope that those datasets will help further studies for question
answering over tables, and for transformation of table formats.
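The pipeline the abstract describes (pre-train on tables, then fine-tune for table question answering) presupposes some way of feeding a table into a Transformer. Below is a minimal, hypothetical sketch of one common approach, linearizing the table and question into a single input sequence. This is not the authors' actual preprocessing; the separator tokens and layout are assumptions for illustration only.

```python
def linearize_table(question, header, rows):
    """Flatten a question and a table into one input sequence.

    The separator tokens ([HEADER], [ROW]) are illustrative
    placeholders, not the tokens used by the paper's actual model.
    """
    parts = [question, "[HEADER]", " | ".join(header)]
    for row in rows:
        parts.append("[ROW]")
        parts.append(" | ".join(str(cell) for cell in row))
    return " ".join(parts)

question = "Which city has the largest population?"
header = ["city", "population"]
rows = [["Seoul", 9500000], ["Busan", 3400000]]

sequence = linearize_table(question, header, rows)
print(sequence)
```

The resulting string would then be tokenized and passed to the model; real table QA systems typically add positional signals (row/column ids) on top of such a flat sequence.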
Related papers
- TANQ: An open domain dataset of table answered questions [15.323690523538572]
TANQ is the first open domain question answering dataset where the answers require building tables from information across multiple sources.
We release the full source attribution for every cell in the resulting table and benchmark state-of-the-art language models in open, oracle, and closed book setups.
Our best-performing baseline, GPT4, reaches an overall F1 score of 29.1, lagging behind human performance by 19.7 points.
arXiv Detail & Related papers (2024-05-13T14:07:20Z)
- WikiTableEdit: A Benchmark for Table Editing by Natural Language Instruction [56.196512595940334]
This paper investigates the performance of Large Language Models (LLMs) in the context of table editing tasks.
We leverage 26,531 tables from the Wiki dataset to generate natural language instructions for six distinct basic operations.
We evaluate several representative large language models on the WikiTableEdit dataset to demonstrate the challenge of this task.
arXiv Detail & Related papers (2024-03-05T13:33:12Z)
- Augment before You Try: Knowledge-Enhanced Table Question Answering via Table Expansion [57.53174887650989]
Table question answering is a popular task that assesses a model's ability to understand and interact with structured data.
Existing methods convert both the table and external knowledge into text, which neglects the structured nature of the table.
We propose a simple yet effective method to integrate external information in a given table.
arXiv Detail & Related papers (2024-01-28T03:37:11Z)
- QTSumm: Query-Focused Summarization over Tabular Data [58.62152746690958]
People primarily consult tables to conduct data analysis or answer specific questions.
We define a new query-focused table summarization task, where text generation models have to perform human-like reasoning.
We introduce a new benchmark named QTSumm for this task, which contains 7,111 human-annotated query-summary pairs over 2,934 tables.
arXiv Detail & Related papers (2023-05-23T17:43:51Z)
- MultiTabQA: Generating Tabular Answers for Multi-Table Question Answering [61.48881995121938]
Real-world queries are often complex, spanning multiple tables in a relational database or web page.
Our model, MultiTabQA, not only answers questions over multiple tables, but also generalizes to generate tabular answers.
arXiv Detail & Related papers (2023-05-22T08:25:15Z)
- XRICL: Cross-lingual Retrieval-Augmented In-Context Learning for Cross-lingual Text-to-SQL Semantic Parsing [70.40401197026925]
In-context learning using large language models has recently shown surprising results for semantic parsing tasks.
This work introduces the XRICL framework, which learns to retrieve relevant English exemplars for a given query.
We also include global translation exemplars for a target language to facilitate the translation process for large language models.
arXiv Detail & Related papers (2022-10-25T01:33:49Z)
- TableQuery: Querying tabular data with natural language [0.0]
In TableQuery, we use deep learning models pre-trained for question answering on free text to convert natural language queries to structured queries.
Deep learning models pre-trained for question answering on free text are readily available on platforms such as HuggingFace Model Hub.
TableQuery does not require re-training; when a newly trained model for question answering with better performance is available, it can replace the existing model in TableQuery.
arXiv Detail & Related papers (2022-01-27T17:26:25Z)
- PeCoQ: A Dataset for Persian Complex Question Answering over Knowledge Graph [0.0]
This paper introduces PeCoQ, a dataset for Persian question answering.
This dataset contains 10,000 complex questions and answers extracted from the Persian knowledge graph, FarsBase.
There are different types of complexities in the dataset, such as multi-relation, multi-entity, ordinal, and temporal constraints.
arXiv Detail & Related papers (2021-06-27T08:21:23Z)
- Summarizing and Exploring Tabular Data in Conversational Search [36.14882974814593]
We build a new conversation-oriented, open-domain table summarization dataset.
It includes annotated table summaries, which not only answer questions but also help people explore other information in the table.
We utilize this dataset to develop automatic table summarization systems as SOTA baselines.
arXiv Detail & Related papers (2020-05-23T08:29:51Z)
- TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data [113.29476656550342]
We present TaBERT, a pretrained LM that jointly learns representations for NL sentences and tables.
TaBERT is trained on a large corpus of 26 million tables and their English contexts.
Implementation of the model will be available at http://fburl.com/TaBERT.
arXiv Detail & Related papers (2020-05-17T17:26:40Z)
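The TableQuery entry above describes converting natural-language questions into structured queries using models pre-trained for free-text question answering. As a toy stand-in for that idea (not TableQuery's actual implementation), the sketch below maps a question onto a SQL query with a simple keyword heuristic; the function name, keyword table, and rules are illustrative assumptions.

```python
AGGREGATES = {"average": "AVG", "maximum": "MAX", "minimum": "MIN", "total": "SUM"}

def to_sql(question, table, columns):
    """Map a question to a SQL string with keyword heuristics.

    A real system (like the one TableQuery describes) would use a
    pre-trained QA model here instead of keyword matching.
    """
    q = question.lower()
    for word, func in AGGREGATES.items():
        if word in q:
            # Use the first column name that appears in the question.
            for col in columns:
                if col in q:
                    return f"SELECT {func}({col}) FROM {table}"
    return f"SELECT * FROM {table}"

print(to_sql("What is the average salary of employees?", "employees", ["name", "salary"]))
# -> SELECT AVG(salary) FROM employees
```

Because the structured query is produced as a separate step, such a system can swap in a stronger QA model without re-training, which is the property the TableQuery summary highlights.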
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.