Learning to Summarize Passages: Mining Passage-Summary Pairs from
Wikipedia Revision Histories
- URL: http://arxiv.org/abs/2004.02592v1
- Date: Mon, 6 Apr 2020 12:11:50 GMT
- Authors: Qingyu Zhou, Furu Wei, Ming Zhou
- Score: 110.54963847339775
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose a method for automatically constructing a
passage-to-summary dataset by mining the Wikipedia page revision histories. In
particular, the method mines the main body passages and the introduction
sentences which are added to the pages simultaneously. The constructed dataset
contains more than one hundred thousand passage-summary pairs. The quality
analysis shows that the dataset is a promising training and validation set for
passage summarization. We validate and analyze the
performance of various summarization systems on the proposed dataset. The
dataset will be available online at https://res.qyzhou.me.
Related papers
- Distantly Supervised Morpho-Syntactic Model for Relation Extraction [0.27195102129094995]
We present a method for the extraction and categorisation of an unrestricted set of relationships from text.
We evaluate our approach on six datasets built on Wikidata and Wikipedia.
arXiv Detail & Related papers (2024-01-18T14:17:40Z)
- One-Shot Learning as Instruction Data Prospector for Large Language Models [108.81681547472138]
Nuggets uses one-shot learning to select high-quality instruction data from extensive datasets.
We show that instruction tuning with the top 1% of examples curated by Nuggets substantially outperforms conventional methods employing the entire dataset.
arXiv Detail & Related papers (2023-12-16T03:33:12Z)
- WikiDes: A Wikipedia-Based Dataset for Generating Short Descriptions from Paragraphs [66.88232442007062]
We introduce WikiDes, a dataset to generate short descriptions of Wikipedia articles.
The dataset consists of over 80k English samples on 6987 topics.
Our paper shows a practical impact on Wikipedia and Wikidata since there are thousands of missing descriptions.
arXiv Detail & Related papers (2022-09-27T01:28:02Z)
- Podcast Summary Assessment: A Resource for Evaluating Summary Assessment Methods [42.08097583183816]
We describe a new dataset, the podcast summary assessment corpus.
This dataset has two unique aspects: (i) long, speech-based podcast documents; and (ii) an opportunity to detect inappropriate reference summaries in the podcast corpus.
arXiv Detail & Related papers (2022-08-28T18:24:41Z)
- Efficient Few-Shot Fine-Tuning for Opinion Summarization [83.76460801568092]
Abstractive summarization models are typically pre-trained on large amounts of generic texts, then fine-tuned on tens or hundreds of thousands of annotated samples.
We show that a few-shot method based on adapters can easily store in-domain knowledge.
We show that this self-supervised adapter pre-training improves summary quality over standard fine-tuning by 2.0 and 1.3 ROUGE-L points on the Amazon and Yelp datasets.
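Several entries in this list report gains in ROUGE-L points. ROUGE-L scores a candidate summary against a reference by the length of their longest common subsequence (LCS) of words; a minimal F1 variant can be sketched as follows (a generic illustration of the metric, not the exact scorer used in any of these papers):

```python
def lcs_len(a: list[str], b: list[str]) -> int:
    """Longest common subsequence length via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l_f1(candidate: str, reference: str) -> float:
    """F1 over the word-level LCS of candidate and reference."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)
```

A "2.0 ROUGE-L point" improvement refers to a 0.02 absolute gain in this score expressed on a 0-100 scale.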
arXiv Detail & Related papers (2022-05-04T16:38:37Z)
- Robust Text Line Detection in Historical Documents: Learning and Evaluation Methods [1.9938405188113029]
We present a study conducted using three state-of-the-art systems: Doc-UFCN, dhSegment and ARU-Net.
We show that it is possible to build generic models trained on a wide variety of historical document datasets that can correctly segment diverse unseen pages.
arXiv Detail & Related papers (2022-03-23T11:56:25Z)
- Improving Zero and Few-Shot Abstractive Summarization with Intermediate Fine-tuning and Data Augmentation [101.26235068460551]
Models pretrained with self-supervised objectives on large text corpora achieve state-of-the-art performance on English text summarization tasks.
Models are typically fine-tuned on hundreds of thousands of data points, an infeasible requirement when applying summarization to new, niche domains.
We introduce a novel and generalizable method, called WikiTransfer, for fine-tuning pretrained models for summarization in an unsupervised, dataset-specific manner.
arXiv Detail & Related papers (2020-10-24T08:36:49Z)
- CDEvalSumm: An Empirical Study of Cross-Dataset Evaluation for Neural Summarization Systems [121.78477833009671]
We investigate the performance of different summarization models under a cross-dataset setting.
A comprehensive study of 11 representative summarization systems on 5 datasets from different domains reveals the effect of model architectures and generation approaches.
arXiv Detail & Related papers (2020-10-11T02:19:15Z)
- WikiLingua: A New Benchmark Dataset for Cross-Lingual Abstractive Summarization [41.578594261746055]
We introduce WikiLingua, a large-scale, multilingual dataset for the evaluation of cross-lingual abstractive summarization systems.
We extract article and summary pairs in 18 languages from WikiHow, a high quality, collaborative resource of how-to guides on a diverse set of topics written by human authors.
We create gold-standard article-summary alignments across languages by aligning the images that are used to describe each how-to step in an article.
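The image-based alignment idea — steps in different language versions of a WikiHow article that share an illustration describe the same content — can be sketched with a simple lookup. The `image`/`text` step fields and the exact-URL match are assumptions for this illustration, not WikiLingua's actual construction code:

```python
def align_steps(steps_a: list[dict], steps_b: list[dict]) -> list[tuple[str, str]]:
    """Pair up how-to steps across two languages that share an image URL."""
    by_image = {step["image"]: step["text"] for step in steps_a}
    return [(by_image[step["image"]], step["text"])
            for step in steps_b if step["image"] in by_image]
```

Steps whose image appears in only one language version simply produce no pair, which is what makes the shared images a conservative gold-standard signal.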
arXiv Detail & Related papers (2020-10-07T00:28:05Z)
- Exploring Content Selection in Summarization of Novel Chapters [19.11830806780343]
We present a new summarization task, generating summaries of novel chapters using summary/chapter pairs from online study guides.
This is a harder task than the news summarization task, given the chapter length as well as the extreme paraphrasing and generalization found in the summaries.
We focus on extractive summarization, which requires the creation of a gold-standard set of extractive summaries.
arXiv Detail & Related papers (2020-05-04T20:45:39Z)
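Creating a gold-standard extractive set, as the novel-chapter paper requires, is commonly done with a greedy oracle: repeatedly pick the chapter sentence that most increases coverage of the reference summary's words. The sketch below uses plain unigram coverage as the gain signal; the scoring function and the `k`-sentence budget are assumptions, not the paper's actual procedure:

```python
def greedy_oracle(chapter_sents: list[str], summary: str, k: int = 3) -> list[str]:
    """Greedily select chapter sentences whose union best covers the
    reference summary's vocabulary (a common extractive-label heuristic)."""
    target = set(summary.lower().split())
    chosen: list[str] = []
    covered: set[str] = set()
    for _ in range(k):
        best, best_gain = None, 0
        for sent in chapter_sents:
            if sent in chosen:
                continue
            words = set(sent.lower().split())
            # how many new target words this sentence would cover
            gain = len((covered | words) & target) - len(covered & target)
            if gain > best_gain:
                best, best_gain = sent, gain
        if best is None:  # no sentence adds coverage; stop early
            break
        chosen.append(best)
        covered |= set(best.lower().split())
    return chosen
```

With extreme paraphrasing in the summaries, as the abstract notes, word-overlap oracles like this degrade, which is part of what makes the chapter task harder than news.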
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.