TWAG: A Topic-Guided Wikipedia Abstract Generator
- URL: http://arxiv.org/abs/2106.15135v1
- Date: Tue, 29 Jun 2021 07:42:08 GMT
- Title: TWAG: A Topic-Guided Wikipedia Abstract Generator
- Authors: Fangwei Zhu, Shangqing Tu, Jiaxin Shi, Juanzi Li, Lei Hou, Tong Cui
- Abstract summary: Wikipedia abstract generation aims to distill a Wikipedia abstract from web sources and has met significant success.
Previous works generally view the abstract as plain text, ignoring the fact that it is a description of a certain entity and can be decomposed into different topics.
We propose a two-stage model TWAG that guides the abstract generation with topical information.
- Score: 23.937804531845938
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Wikipedia abstract generation aims to distill a Wikipedia abstract from web
sources and has met significant success by adopting multi-document
summarization techniques. However, previous works generally view the abstract
as plain text, ignoring the fact that it is a description of a certain entity
and can be decomposed into different topics. In this paper, we propose a
two-stage model TWAG that guides the abstract generation with topical
information. First, we detect the topic of each input paragraph with a
classifier trained on existing Wikipedia articles to divide input documents
into different topics. Then, we predict the topic distribution of each abstract
sentence, and decode the sentence from topic-aware representations with a
Pointer-Generator network. We evaluate our model on the WikiCatSum dataset, and
the results show that TWAG outperforms various existing baselines and is
capable of generating comprehensive abstracts. Our code and dataset can be
accessed at https://github.com/THU-KEG/TWAG.
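To make the two-stage design concrete, below is a minimal sketch of the topic-guided pipeline described in the abstract: a classifier assigns each input paragraph to a topic, and a decoder predicts a topic distribution for each abstract sentence and conditions on a topic-weighted context. This is an illustrative reconstruction, not the authors' released code (see https://github.com/THU-KEG/TWAG for that); the topic labels, hidden size, pooling, and the GRU cell standing in for the paper's Pointer-Generator decoder are all assumptions.

```python
# Illustrative sketch only: topic labels, hidden size, and the GRUCell used as a
# stand-in for the paper's Pointer-Generator decoder are assumptions, not TWAG itself.
import torch
import torch.nn as nn

TOPICS = ["intro", "career", "works"]  # assumed topic set for one Wikipedia domain


class ParagraphTopicClassifier(nn.Module):
    """Stage 1: assign each encoded input paragraph to one of the predefined topics."""

    def __init__(self, hidden: int = 128, num_topics: int = len(TOPICS)):
        super().__init__()
        self.scorer = nn.Linear(hidden, num_topics)

    def forward(self, paragraph_vecs: torch.Tensor) -> torch.Tensor:
        # paragraph_vecs: (num_paragraphs, hidden) pre-encoded paragraph vectors
        return self.scorer(paragraph_vecs).argmax(dim=-1)  # one topic id per paragraph


class TopicAwareDecoderStep(nn.Module):
    """Stage 2: predict a topic distribution for the current abstract sentence and
    decode from the resulting topic-weighted context (a Pointer-Generator in the paper)."""

    def __init__(self, hidden: int = 128, num_topics: int = len(TOPICS)):
        super().__init__()
        self.topic_predictor = nn.Linear(hidden, num_topics)
        self.decoder_cell = nn.GRUCell(hidden, hidden)  # stand-in decoder cell

    def forward(self, sent_state: torch.Tensor, topic_reps: torch.Tensor) -> torch.Tensor:
        # sent_state: (hidden,) decoder state for the current abstract sentence
        # topic_reps: (num_topics, hidden) pooled representation of each topic's paragraphs
        topic_dist = torch.softmax(self.topic_predictor(sent_state), dim=-1)
        context = topic_dist @ topic_reps  # topic-aware context vector
        return self.decoder_cell(context.unsqueeze(0), sent_state.unsqueeze(0)).squeeze(0)


# Toy usage: classify paragraphs, pool them per topic, run one decoding step.
paragraphs = torch.randn(5, 128)                    # 5 encoded input paragraphs
topic_ids = ParagraphTopicClassifier()(paragraphs)  # stage 1
topic_reps = torch.stack([
    paragraphs[topic_ids == t].mean(dim=0) if (topic_ids == t).any() else torch.zeros(128)
    for t in range(len(TOPICS))
])
next_state = TopicAwareDecoderStep()(torch.zeros(128), topic_reps)  # stage 2
```

In the paper, the stage-1 classifier is trained on sections of existing Wikipedia articles and the decoder is a Pointer-Generator network; the mean-pooled topic representations and the single GRU step above are simplifications meant only to show how the two stages connect.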
Related papers
- GoSum: Extractive Summarization of Long Documents by Reinforcement
Learning and Graph Organized Discourse State [6.4805900740861]
We propose GoSum, a reinforcement-learning-based extractive model for long-paper summarization.
GoSum encodes states by building a heterogeneous graph from different discourse levels for each input document.
We evaluate the model on two datasets of scientific articles summarization: PubMed and arXiv.
arXiv Detail & Related papers (2022-11-18T14:07:29Z) - Mapping Process for the Task: Wikidata Statements to Text as Wikipedia
Sentences [68.8204255655161]
We propose our mapping process for the task of converting Wikidata statements to natural language text (WS2T) for Wikipedia projects at the sentence level.
The main step is to organize statements, represented as a group of quadruples and triples, and then to map them to corresponding sentences in English Wikipedia.
We evaluate the output corpus in various aspects: sentence structure analysis, noise filtering, and relationships between sentence components based on word embedding models.
arXiv Detail & Related papers (2022-10-23T08:34:33Z) - WikiDes: A Wikipedia-Based Dataset for Generating Short Descriptions
from Paragraphs [66.88232442007062]
We introduce WikiDes, a dataset to generate short descriptions of Wikipedia articles.
The dataset consists of over 80k English samples on 6987 topics.
Our paper shows a practical impact on Wikipedia and Wikidata since there are thousands of missing descriptions.
arXiv Detail & Related papers (2022-09-27T01:28:02Z) - Keyphrase Generation Beyond the Boundaries of Title and Abstract [28.56508031460787]
Keyphrase generation aims at generating phrases (keyphrases) that best describe a given document.
In this work, we explore whether the integration of additional data from semantically similar articles or from the full text of the given article can be helpful for a neural keyphrase generation model.
We discover that adding sentences from the full text, particularly in the form of a summary of the article, can significantly improve the generation of both types of keyphrases.
arXiv Detail & Related papers (2021-12-13T16:33:01Z) - DESCGEN: A Distantly Supervised Dataset for Generating Abstractive Entity
Descriptions [41.80938919728834]
We introduce DESCGEN: given mentions spread over multiple documents, the goal is to generate an entity summary description.
DESCGEN consists of 37K entity descriptions from Wikipedia and Fandom, each paired with nine evidence documents on average.
The resulting summaries are more abstractive than those found in existing datasets and provide a better proxy for the challenge of describing new and emerging entities.
arXiv Detail & Related papers (2021-06-09T20:10:48Z) - WikiAsp: A Dataset for Multi-domain Aspect-based Summarization [69.13865812754058]
We propose WikiAsp, a large-scale dataset for multi-domain aspect-based summarization.
Specifically, we build the dataset using Wikipedia articles from 20 different domains, using the section titles and boundaries of each article as a proxy for aspect annotation.
Results highlight key challenges that existing summarization models face in this setting, such as proper pronoun handling of quoted sources and consistent explanation of time-sensitive events.
arXiv Detail & Related papers (2020-11-16T10:02:52Z) - Topic-Guided Abstractive Text Summarization: a Joint Learning Approach [19.623946402970933]
We introduce a new approach for abstractive text summarization, Topic-Guided Abstractive Summarization.
The idea is to incorporate neural topic modeling with a Transformer-based sequence-to-sequence (seq2seq) model in a joint learning framework.
arXiv Detail & Related papers (2020-10-20T14:45:25Z) - Abstractive Summarization of Spoken and Written Instructions with BERT [66.14755043607776]
We present the first application of the BERTSum model to conversational language.
We generate abstractive summaries of narrated instructional videos across a wide variety of topics.
We envision this integrated as a feature in intelligent virtual assistants, enabling them to summarize both written and spoken instructional content upon request.
arXiv Detail & Related papers (2020-08-21T20:59:34Z) - Few-Shot Learning for Opinion Summarization [117.70510762845338]
Opinion summarization is the automatic creation of text reflecting subjective information expressed in multiple documents.
In this work, we show that even a handful of summaries is sufficient to bootstrap generation of the summary text.
Our approach substantially outperforms previous extractive and abstractive methods in automatic and human evaluation.
arXiv Detail & Related papers (2020-04-30T15:37:38Z) - Pre-training for Abstractive Document Summarization by Reinstating
Source Text [105.77348528847337]
This paper presents three pre-training objectives which allow us to pre-train a Seq2Seq-based abstractive summarization model on unlabeled text.
Experiments on two benchmark summarization datasets show that all three objectives can improve performance upon baselines.
arXiv Detail & Related papers (2020-04-04T05:06:26Z)