Surfer100: Generating Surveys From Web Resources on Wikipedia-style
- URL: http://arxiv.org/abs/2112.06377v1
- Date: Mon, 13 Dec 2021 02:18:01 GMT
- Title: Surfer100: Generating Surveys From Web Resources on Wikipedia-style
- Authors: Irene Li, Alexander Fabbri, Rina Kawamura, Yixin Liu, Xiangru Tang,
Jaesung Tae, Chang Shen, Sally Ma, Tomoe Mizutani, Dragomir Radev
- Abstract summary: We show that recent advances in pretrained language modeling can be combined for a two-stage extractive and abstractive approach for Wikipedia lead paragraph generation.
We extend this approach to generate longer Wikipedia-style summaries with sections and examine how such methods struggle in this application through detailed studies with 100 reference human-collected surveys.
- Score: 49.23675182917996
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Fast-developing fields such as Artificial Intelligence (AI) often outpace the
efforts of encyclopedic sources such as Wikipedia, which either do not
completely cover recently-introduced topics or lack such content entirely. As a
result, methods for automatically producing content are valuable tools to
address this information overload. We show that recent advances in pretrained
language modeling can be combined for a two-stage extractive and abstractive
approach for Wikipedia lead paragraph generation. We extend this approach to
generate longer Wikipedia-style summaries with sections and examine how such
methods struggle in this application through detailed studies with 100
reference human-collected surveys. This is the first study on utilizing web
resources for long Wikipedia-style summaries to the best of our knowledge.
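
The two-stage design described in the abstract (first extract relevant sentences from web resources, then abstractively rewrite them into a lead paragraph) can be made concrete with a minimal sketch. The model choices below (a sentence-transformer for the extractive stage, BART for the abstractive stage) and the cosine-similarity ranking are illustrative assumptions, not the authors' exact setup.

```python
# Minimal sketch of a two-stage extract-then-abstract pipeline.
# Model names and the ranking heuristic are assumptions for illustration.
from sentence_transformers import SentenceTransformer, util
from transformers import pipeline

encoder = SentenceTransformer("all-MiniLM-L6-v2")
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def extract_sentences(topic: str, sentences: list[str], k: int = 10) -> list[str]:
    """Stage 1: rank web-scraped sentences by relevance to the topic."""
    topic_emb = encoder.encode(topic, convert_to_tensor=True)
    sent_embs = encoder.encode(sentences, convert_to_tensor=True)
    scores = util.cos_sim(topic_emb, sent_embs)[0]
    top = scores.topk(min(k, len(sentences)))
    return [sentences[i] for i in top.indices.tolist()]

def generate_lead(topic: str, sentences: list[str]) -> str:
    """Stage 2: abstractively rewrite the extracted evidence into a lead paragraph."""
    context = " ".join(extract_sentences(topic, sentences))
    out = summarizer(context, max_length=160, min_length=60,
                     do_sample=False, truncation=True)
    return out[0]["summary_text"]
```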
Related papers
- Retrieval-based Full-length Wikipedia Generation for Emergent Events [33.81630908675804]
We simulate a real-world scenario where structured full-length Wikipedia documents are generated for emergent events using input retrieved from web sources (see the retrieve-then-generate sketch after this list).
To ensure that the Large Language Models (LLMs) under study have not been trained on corpora covering the events in question, we select events that took place recently and introduce a new benchmark, Wiki-GenBen.
We design a comprehensive set of systematic evaluation metrics and baseline methods to evaluate the capability of LLMs in generating factual full-length Wikipedia documents.
arXiv Detail & Related papers (2024-02-28T11:51:56Z)
- Curious Rhythms: Temporal Regularities of Wikipedia Consumption [15.686850035802667]
We show that even after removing the global pattern of day-night alternation, the consumption habits of individual articles maintain strong diurnal regularities.
We investigate topical and contextual correlates of Wikipedia articles' access rhythms, finding that article topic, reader country, and access device (mobile vs. desktop) are all important predictors of daily attention patterns.
arXiv Detail & Related papers (2023-05-16T14:48:08Z)
- Mapping Process for the Task: Wikidata Statements to Text as Wikipedia Sentences [68.8204255655161]
We propose our mapping process for the task of converting Wikidata statements to natural language text (WS2T) for Wikipedia projects at the sentence level.
The main step is to organize statements, represented as groups of quadruples and triples, and then map them to corresponding sentences in English Wikipedia (see the sketch after this list).
We evaluate the output corpus in various aspects: sentence structure analysis, noise filtering, and relationships between sentence components based on word embedding models.
arXiv Detail & Related papers (2022-10-23T08:34:33Z)
- WikiDes: A Wikipedia-Based Dataset for Generating Short Descriptions from Paragraphs [66.88232442007062]
We introduce WikiDes, a dataset to generate short descriptions of Wikipedia articles.
The dataset consists of over 80k English samples on 6987 topics.
Our paper shows a practical impact on Wikipedia and Wikidata since there are thousands of missing descriptions.
arXiv Detail & Related papers (2022-09-27T01:28:02Z)
- Embedding Knowledge for Document Summarization: A Survey [66.76415502727802]
Previous works proved that knowledge-embedded document summarizers excel at generating superior digests.
We propose novel taxonomies to recapitulate knowledge and knowledge embeddings in the context of document summarization.
arXiv Detail & Related papers (2022-04-24T04:36:07Z)
- Tracking Knowledge Propagation Across Wikipedia Languages [1.8447697408534176]
We present a dataset of inter-language knowledge propagation in Wikipedia.
The dataset covers all 309 language editions and 33M articles.
We find that the size of language editions is associated with the speed of propagation.
arXiv Detail & Related papers (2021-03-30T18:36:13Z)
- Generating Wikipedia Article Sections from Diverse Data Sources [57.23574577984244]
We benchmark several training and decoding strategies on WikiTableT.
Our qualitative analysis shows that the best approaches can generate fluent, high-quality text but sometimes struggle with coherence.
arXiv Detail & Related papers (2020-12-29T19:35:34Z)
- Multiple Texts as a Limiting Factor in Online Learning: Quantifying (Dis-)similarities of Knowledge Networks across Languages [60.00219873112454]
We investigate the hypothesis that the extent to which one obtains information on a given topic through Wikipedia depends on the language in which it is consulted.
Since Wikipedia is a central part of the web-based information landscape, this indicates a language-related bias.
The article builds a bridge between reading research, educational science, Wikipedia research and computational linguistics.
arXiv Detail & Related papers (2020-08-05T11:11:55Z)
- Entity Extraction from Wikipedia List Pages [2.3605348648054463]
We build a large taxonomy from categories and list pages with DBpedia as a backbone.
With distant supervision, we extract training data for identifying new entities in list pages (see the sketch after this list).
We extend DBpedia with 7.5M new type statements and 3.8M new facts of high precision.
arXiv Detail & Related papers (2020-03-11T07:48:46Z)
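
As referenced above, the retrieval-based full-length generation setup for emergent events can be sketched as a retrieve-then-generate loop over sections. Here `search_web` and `llm_generate` are hypothetical stand-ins for a web retriever and an LLM completion call; the section list and prompt wording are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sketch of retrieve-then-generate for an emergent event.
# `search_web` and `llm_generate` are hypothetical stand-ins.
from typing import Callable

SECTIONS = ["Background", "Event", "Reactions", "Aftermath"]  # assumed outline

def generate_event_page(
    event: str,
    search_web: Callable[[str, int], list[str]],
    llm_generate: Callable[[str], str],
) -> str:
    """Generate a structured, full-length page section by section."""
    parts = [f"# {event}"]
    for section in SECTIONS:
        passages = search_web(f"{event} {section}", 5)  # retrieve fresh web evidence
        prompt = (
            f"Write the '{section}' section of a Wikipedia article about "
            f"{event}, using only the sources below.\n\n" + "\n---\n".join(passages)
        )
        parts.append(f"## {section}\n{llm_generate(prompt)}")
    return "\n\n".join(parts)
```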
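For the WS2T entry above, the core data mapping (Wikidata statements as quadruples/triples verbalized into English sentences) can be sketched as follows. The quadruple layout and the template rule are illustrative assumptions, not the paper's actual mapping process.

```python
# Minimal sketch of verbalizing a Wikidata statement (triple or quadruple)
# into an English sentence. Template and field names are assumptions.
from dataclasses import dataclass

@dataclass
class Statement:
    subject: str          # e.g. "Douglas Adams"
    prop: str             # e.g. "educated at"
    value: str            # e.g. "St John's College"
    qualifier: str = ""   # optional fourth element, e.g. "from 1971 to 1974"

def statement_to_sentence(s: Statement) -> str:
    """Verbalize one statement with a naive template."""
    sentence = f"{s.subject} was {s.prop} {s.value}"
    if s.qualifier:
        sentence += f" {s.qualifier}"
    return sentence + "."

print(statement_to_sentence(
    Statement("Douglas Adams", "educated at", "St John's College", "from 1971 to 1974")
))
# -> Douglas Adams was educated at St John's College from 1971 to 1974.
```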
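Finally, the distant-supervision step in the entity-extraction entry can be sketched as labeling list-page entries against an existing knowledge base. The `known_types` lookup stands in for a DBpedia-backed taxonomy, and the labeling rule is an illustrative assumption, not the paper's method.

```python
# Minimal sketch of distant supervision for entity extraction from list pages.
# `known_types` stands in for a DBpedia-backed taxonomy lookup.
def label_list_page(entries: list[str],
                    page_type: str,
                    known_types: dict[str, set[str]]) -> list[tuple[str, int]]:
    """Label an entry positive (1) if the KB already assigns it the type
    implied by the list page, negative (0) if the KB knows it under other
    types only, and skip entries the KB does not know."""
    labeled = []
    for entry in entries:
        types = known_types.get(entry)
        if types is None:
            continue  # unknown entity: left unlabeled, to be predicted later
        labeled.append((entry, 1 if page_type in types else 0))
    return labeled

# Toy usage: a "List of physicists" page against a tiny KB snapshot.
kb = {"Marie Curie": {"Physicist", "Chemist"}, "Paris": {"City"}}
print(label_list_page(["Marie Curie", "Paris", "J. Doe"], "Physicist", kb))
# -> [('Marie Curie', 1), ('Paris', 0)]
```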