Generating Wikipedia Article Sections from Diverse Data Sources
- URL: http://arxiv.org/abs/2012.14919v1
- Date: Tue, 29 Dec 2020 19:35:34 GMT
- Title: Generating Wikipedia Article Sections from Diverse Data Sources
- Authors: Mingda Chen, Sam Wiseman, Kevin Gimpel
- Abstract summary: We benchmark several training and decoding strategies on WikiTableT.
Our qualitative analysis shows that the best approaches can generate fluent and high quality texts but they sometimes struggle with coherence.
- Score: 57.23574577984244
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Datasets for data-to-text generation typically focus either on multi-domain,
single-sentence generation or on single-domain, long-form generation. In this
work, we create a large-scale dataset, WikiTableT, that pairs Wikipedia
sections with their corresponding tabular data and various metadata. WikiTableT
contains millions of instances, covering a broad range of topics, as well as a
variety of flavors of generation tasks with different levels of flexibility. We
benchmark several training and decoding strategies on WikiTableT. Our
qualitative analysis shows that the best approaches can generate fluent and
high quality texts but they sometimes struggle with coherence.
Related papers
- XWikiGen: Cross-lingual Summarization for Encyclopedic Text Generation
in Low Resource Languages [11.581072296148031]
Existing work on Wikipedia text generation has focused on English only where English reference articles are summarized to generate English Wikipedia pages.
We propose XWikiGen, which is the task of cross-lingual multi-document summarization of text from multiple reference articles, written in various languages, to generate Wikipedia-style text.
arXiv Detail & Related papers (2023-03-22T04:52:43Z) - Mapping Process for the Task: Wikidata Statements to Text as Wikipedia
Sentences [68.8204255655161]
We propose our mapping process for the task of converting Wikidata statements to natural language text (WS2T) for Wikipedia projects at the sentence level.
The main step is to organize statements, represented as a group of quadruples and triples, and then to map them to corresponding sentences in English Wikipedia.
We evaluate the output corpus in various aspects: sentence structure analysis, noise filtering, and relationships between sentence components based on word embedding models.
arXiv Detail & Related papers (2022-10-23T08:34:33Z) - WikiDes: A Wikipedia-Based Dataset for Generating Short Descriptions
from Paragraphs [66.88232442007062]
We introduce WikiDes, a dataset to generate short descriptions of Wikipedia articles.
The dataset consists of over 80k English samples on 6987 topics.
Our paper shows a practical impact on Wikipedia and Wikidata since there are thousands of missing descriptions.
arXiv Detail & Related papers (2022-09-27T01:28:02Z) - Extending Multi-Text Sentence Fusion Resources via Pyramid Annotations [12.394777121890925]
This paper revisits and substantially extends previous dataset creation efforts.
We show that our extended version uses more representative texts for multi-document tasks and provides a larger and more diverse training set.
arXiv Detail & Related papers (2021-10-09T09:15:05Z) - WikiAsp: A Dataset for Multi-domain Aspect-based Summarization [69.13865812754058]
We propose WikiAsp, a large-scale dataset for multi-domain aspect-based summarization.
Specifically, we build the dataset using Wikipedia articles from 20 different domains, using the section titles and boundaries of each article as a proxy for aspect annotation.
Results highlight key challenges that existing summarization models face in this setting, such as proper pronoun handling of quoted sources and consistent explanation of time-sensitive events.
arXiv Detail & Related papers (2020-11-16T10:02:52Z) - Partially-Aligned Data-to-Text Generation with Distant Supervision [69.15410325679635]
We propose a new generation task called Partially-Aligned Data-to-Text Generation (PADTG)
It is more practical since it utilizes automatically annotated data for training and thus considerably expands the application domains.
Our framework outperforms all baseline models as well as verify the feasibility of utilizing partially-aligned data.
arXiv Detail & Related papers (2020-10-03T03:18:52Z) - ToTTo: A Controlled Table-To-Text Generation Dataset [61.83159452483026]
ToTTo is an open-domain English table-to-text dataset with over 120,000 training examples.
We introduce a dataset construction process where annotators directly revise existing candidate sentences from Wikipedia.
While usually fluent, existing methods often hallucinate phrases that are not supported by the table.
arXiv Detail & Related papers (2020-04-29T17:53:45Z) - Variational Template Machine for Data-to-Text Generation [37.03488881357614]
We claim that an open set of templates is crucial for enriching the phrase constructions and realizing varied generations.
This paper explores the problem of automatically learning reusable "templates" from paired and non-paired data.
We propose the variational template machine (VTM), a novel method to generate text descriptions from data tables.
arXiv Detail & Related papers (2020-02-04T04:53:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.