Related papers: Generating Wikipedia Article Sections from Diverse Data Sources

URL: http://arxiv.org/abs/2012.14919v1
Date: Tue, 29 Dec 2020 19:35:34 GMT
Title: Generating Wikipedia Article Sections from Diverse Data Sources
Authors: Mingda Chen, Sam Wiseman, Kevin Gimpel
Abstract summary: We benchmark several training and decoding strategies on WikiTableT. Our qualitative analysis shows that the best approaches can generate fluent and high quality texts but they sometimes struggle with coherence.
Score: 57.23574577984244
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Datasets for data-to-text generation typically focus either on multi-domain, single-sentence generation or on single-domain, long-form generation. In this work, we create a large-scale dataset, WikiTableT, that pairs Wikipedia sections with their corresponding tabular data and various metadata. WikiTableT contains millions of instances, covering a broad range of topics, as well as a variety of flavors of generation tasks with different levels of flexibility. We benchmark several training and decoding strategies on WikiTableT. Our qualitative analysis shows that the best approaches can generate fluent and high quality texts but they sometimes struggle with coherence.

Related papers

Multi-Scale Heterogeneous Text-Attributed Graph Datasets From Diverse Domains [25.61868709829681]
We introduce a collection of challenging and diverse benchmark datasets for realistic and reproducible evaluation of machine learning models on HTAGs. Our HTAG datasets are multi-scale, span years in duration, and cover a wide range of domains, including movie, community question answering, academic, literature, and patent networks. All source data, dataset construction code, processed HTAGs, data loaders, benchmark code, and evaluation setup are publicly available on GitHub and Hugging Face.
arXiv Detail & Related papers (2024-12-12T04:58:32Z)
Integrating Planning into Single-Turn Long-Form Text Generation [66.08871753377055]
We propose to use planning to generate long-form content. Our main novelty lies in a single auxiliary task that does not require multiple rounds of prompting or planning. Our experiments on two datasets from different domains demonstrate that LLMs fine-tuned with the auxiliary task generate higher-quality documents.
arXiv Detail & Related papers (2024-10-08T17:02:40Z)
XWikiGen: Cross-lingual Summarization for Encyclopedic Text Generation in Low Resource Languages [11.581072296148031]
Existing work on Wikipedia text generation has focused only on English, where English reference articles are summarized to generate English Wikipedia pages. We propose XWikiGen, the task of cross-lingual multi-document summarization of text from multiple reference articles, written in various languages, to generate Wikipedia-style text.
arXiv Detail & Related papers (2023-03-22T04:52:43Z)
WikiDes: A Wikipedia-Based Dataset for Generating Short Descriptions from Paragraphs [66.88232442007062]
We introduce WikiDes, a dataset to generate short descriptions of Wikipedia articles. The dataset consists of over 80k English samples on 6987 topics. Our paper shows a practical impact on Wikipedia and Wikidata since there are thousands of missing descriptions.
arXiv Detail & Related papers (2022-09-27T01:28:02Z)
Extending Multi-Text Sentence Fusion Resources via Pyramid Annotations [12.394777121890925]
This paper revisits and substantially extends previous dataset creation efforts. We show that our extended version uses more representative texts for multi-document tasks and provides a larger and more diverse training set.
arXiv Detail & Related papers (2021-10-09T09:15:05Z)
WikiAsp: A Dataset for Multi-domain Aspect-based Summarization [69.13865812754058]
We propose WikiAsp, a large-scale dataset for multi-domain aspect-based summarization. Specifically, we build the dataset using Wikipedia articles from 20 different domains, using the section titles and boundaries of each article as a proxy for aspect annotation. Results highlight key challenges that existing summarization models face in this setting, such as proper pronoun handling of quoted sources and consistent explanation of time-sensitive events.
arXiv Detail & Related papers (2020-11-16T10:02:52Z)
Partially-Aligned Data-to-Text Generation with Distant Supervision [69.15410325679635]
We propose a new generation task called Partially-Aligned Data-to-Text Generation (PADTG). It is more practical since it utilizes automatically annotated data for training and thus considerably expands the application domains. Our framework outperforms all baseline models, and the results verify the feasibility of utilizing partially-aligned data.
arXiv Detail & Related papers (2020-10-03T03:18:52Z)
ToTTo: A Controlled Table-To-Text Generation Dataset [61.83159452483026]
ToTTo is an open-domain English table-to-text dataset with over 120,000 training examples. We introduce a dataset construction process where annotators directly revise existing candidate sentences from Wikipedia. While usually fluent, existing methods often hallucinate phrases that are not supported by the table.
arXiv Detail & Related papers (2020-04-29T17:53:45Z)
Variational Template Machine for Data-to-Text Generation [37.03488881357614]
We claim that an open set of templates is crucial for enriching the phrase constructions and realizing varied generations. This paper explores the problem of automatically learning reusable "templates" from paired and non-paired data. We propose the variational template machine (VTM), a novel method to generate text descriptions from data tables.
arXiv Detail & Related papers (2020-02-04T04:53:45Z)

This list is automatically generated from the titles and abstracts of the papers on this site.