ToTTo: A Controlled Table-To-Text Generation Dataset
- URL: http://arxiv.org/abs/2004.14373v3
- Date: Tue, 6 Oct 2020 06:07:06 GMT
- Title: ToTTo: A Controlled Table-To-Text Generation Dataset
- Authors: Ankur P. Parikh, Xuezhi Wang, Sebastian Gehrmann, Manaal Faruqui,
Bhuwan Dhingra, Diyi Yang, Dipanjan Das
- Abstract summary: ToTTo is an open-domain English table-to-text dataset with over 120,000 training examples.
We introduce a dataset construction process where annotators directly revise existing candidate sentences from Wikipedia.
While usually fluent, existing methods often hallucinate phrases that are not supported by the table.
- Score: 61.83159452483026
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present ToTTo, an open-domain English table-to-text dataset with over
120,000 training examples that proposes a controlled generation task: given a
Wikipedia table and a set of highlighted table cells, produce a one-sentence
description. To obtain generated targets that are natural but also faithful to
the source table, we introduce a dataset construction process where annotators
directly revise existing candidate sentences from Wikipedia. We present
systematic analyses of our dataset and annotation process as well as results
achieved by several state-of-the-art baselines. While usually fluent, existing
methods often hallucinate phrases that are not supported by the table,
suggesting that this dataset can serve as a useful research benchmark for
high-precision conditional text generation.
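As described in the abstract, each example pairs a Wikipedia table and a set of highlighted cells with an annotator-revised one-sentence target. A minimal sketch of reading the public JSONL release is below; the field names follow the dataset's GitHub repository (github.com/google-research-datasets/ToTTo) and should be treated as assumptions if your copy differs.

```python
import json

def read_totto_examples(path):
    """Sketch: iterate over a ToTTo JSONL split and pull out the pieces
    the controlled generation task conditions on. Field names are taken
    from the public release and are assumptions, not guarantees."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            ex = json.loads(line)
            # "table" is a list of rows; each cell is a dict with keys
            # like "value", "is_header", "row_span", "column_span".
            table = ex["table"]
            # Highlighted cells are (row_index, column_index) pairs into the table.
            highlighted = [tuple(rc) for rc in ex["highlighted_cells"]]
            # The target is the annotator-revised one-sentence description.
            target = ex["sentence_annotations"][0]["final_sentence"]
            yield table, highlighted, target

for table, highlighted, target in read_totto_examples("totto_train_data.jsonl"):
    cells = [table[r][c]["value"] for r, c in highlighted]
    print(cells, "->", target)
    break
```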
Related papers
- ArxivDIGESTables: Synthesizing Scientific Literature into Tables using Language Models [58.34560740973768]
We introduce a framework that leverages language models (LMs) to generate literature review tables.
We release a new dataset of 2,228 literature review tables, extracted from arXiv papers, that together synthesize 7,542 research papers.
We evaluate LMs' abilities to reconstruct reference tables, finding this task benefits from additional context.
arXiv Detail & Related papers (2024-10-25T18:31:50Z)
- Is This a Bad Table? A Closer Look at the Evaluation of Table Generation from Text [21.699434525769586]
Existing measures for table quality evaluation fail to capture the overall semantics of the tables.
We propose TabEval, a novel table evaluation strategy that captures table semantics.
To validate our approach, we curate a dataset comprising text descriptions of 1,250 diverse Wikipedia tables.
arXiv Detail & Related papers (2024-06-21T02:18:03Z)
- Text-Tuple-Table: Towards Information Integration in Text-to-Table Generation via Global Tuple Extraction [36.915250638481986]
We introduce LiveSum, a new benchmark dataset for generating summary tables of competitions based on real-time commentary texts.
We evaluate the performances of state-of-the-art Large Language Models on this task in both fine-tuning and zero-shot settings.
We additionally propose a novel pipeline called $T^3$ (Text-Tuple-Table) to improve their performance; a toy sketch of the text-to-tuple-to-table idea follows below.
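The pipeline's details are not given in this summary, so the following is only a rough illustration of the underlying idea: extract tuples from text, then pivot them into a table. All names are hypothetical.

```python
from collections import defaultdict

def tuples_to_table(tuples):
    """Pivot (entity, attribute, value) tuples into a row-per-entity table.
    Assumes tuples were already extracted from the commentary text by
    some upstream model (not shown here)."""
    columns, rows = [], defaultdict(dict)
    for entity, attribute, value in tuples:
        if attribute not in columns:
            columns.append(attribute)
        rows[entity][attribute] = value
    header = ["entity"] + columns
    body = [[e] + [rows[e].get(a, "") for a in columns] for e in rows]
    return [header] + body

extracted = [("Team A", "goals", "2"), ("Team B", "goals", "1"),
             ("Team A", "shots", "11")]
for row in tuples_to_table(extracted):
    print(row)
```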
arXiv Detail & Related papers (2024-04-22T14:31:28Z)
- WikiTableEdit: A Benchmark for Table Editing by Natural Language Instruction [56.196512595940334]
This paper investigates the performance of Large Language Models (LLMs) in the context of table editing tasks.
We leverage 26,531 tables from the Wiki dataset to generate natural language instructions for six distinct basic operations.
We evaluate several representative large language models on the WikiTableEdit dataset to demonstrate the challenge of this task.
arXiv Detail & Related papers (2024-03-05T13:33:12Z)
- QTSumm: Query-Focused Summarization over Tabular Data [58.62152746690958]
People primarily consult tables to conduct data analysis or answer specific questions.
We define a new query-focused table summarization task, where text generation models have to perform human-like reasoning.
We introduce a new benchmark named QTSumm for this task, which contains 7,111 human-annotated query-summary pairs over 2,934 tables.
arXiv Detail & Related papers (2023-05-23T17:43:51Z)
- CTE: A Dataset for Contextualized Table Extraction [1.1859913430860336]
The dataset comprises 75k fully annotated pages of scientific papers, including more than 35k tables.
Data are gathered from PubMed Central, merging the information provided by annotations in the PubTables-1M and PubLayNet datasets.
The generated annotations can be used to develop end-to-end pipelines for various tasks, including document layout analysis, table detection, structure recognition, and functional analysis.
arXiv Detail & Related papers (2023-02-02T22:38:23Z)
- Partially-Aligned Data-to-Text Generation with Distant Supervision [69.15410325679635]
We propose a new generation task called Partially-Aligned Data-to-Text Generation (PADTG).
It is more practical since it utilizes automatically annotated data for training and thus considerably expands the application domains.
Our framework outperforms all baseline models, and our results verify the feasibility of utilizing partially-aligned data.
arXiv Detail & Related papers (2020-10-03T03:18:52Z)
- GraPPa: Grammar-Augmented Pre-Training for Table Semantic Parsing [117.98107557103877]
We present GraPPa, an effective pre-training approach for table semantic parsing.
We construct synthetic question-SQL pairs over high-quality tables via a synchronous context-free grammar.
To maintain the model's ability to represent real-world data, we also include masked language modeling.
arXiv Detail & Related papers (2020-09-29T08:17:58Z)
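As a rough illustration of the grammar-augmented synthesis described in the GraPPa entry above: a synchronous grammar pairs a question template with an SQL template, and both sides are instantiated with the same column/value drawn from a table. The real SCFG and sampling procedure are far richer; this is a toy sketch with illustrative names only.

```python
import random

# Tiny "synchronous" grammar: each entry aligns a natural-language
# template with an SQL template sharing the same slots.
TEMPLATES = [
    ("what is the maximum {col} ?", "SELECT MAX({col}) FROM t"),
    ("show rows where {col} is {val}", "SELECT * FROM t WHERE {col} = '{val}'"),
]

def sample_pair(columns, values):
    """Instantiate one question-SQL pair from a table's columns/values."""
    q_tpl, sql_tpl = random.choice(TEMPLATES)
    col, val = random.choice(columns), random.choice(values)
    # Both sides are filled with the same bindings, keeping them aligned.
    return q_tpl.format(col=col, val=val), sql_tpl.format(col=col, val=val)

question, sql = sample_pair(["year", "points"], ["2019", "42"])
print(question, "|", sql)
```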
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.