Related papers: A Large-Scale Multi-Document Summarization Dataset from the Wikipedia Current Events Portal

URL: http://arxiv.org/abs/2005.10070v1
Date: Wed, 20 May 2020 14:33:33 GMT
Title: A Large-Scale Multi-Document Summarization Dataset from the Wikipedia Current Events Portal
Authors: Demian Gholipour Ghalandari, Chris Hokamp, Nghia The Pham, John Glover, Georgiana Ifrim
Abstract summary: Multi-document summarization (MDS) aims to compress the content in large document collections into short summaries. This work presents a new dataset for MDS that is large both in the total number of document clusters and in the size of individual clusters.
Score: 10.553314461761968
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multi-document summarization (MDS) aims to compress the content in large document collections into short summaries and has important applications in story clustering for newsfeeds, presentation of search results, and timeline generation. However, there is a lack of datasets that realistically address such use cases at a scale large enough for training supervised models for this task. This work presents a new dataset for MDS that is large both in the total number of document clusters and in the size of individual clusters. We build this dataset by leveraging the Wikipedia Current Events Portal (WCEP), which provides concise and neutral human-written summaries of news events, with links to external source articles. We also automatically extend these source articles by looking for related articles in the Common Crawl archive. We provide a quantitative analysis of the dataset and empirical results for several state-of-the-art MDS techniques.

Related papers

Multi-Record Web Page Information Extraction From News Websites [83.88591755871734]
In this paper, we focus on the problem of extracting information from web pages containing many records. To address this gap, we created a large-scale, open-access dataset specifically designed for list pages. Our dataset contains 13,120 web pages with news lists, significantly exceeding existing datasets in both scale and complexity.
arXiv Detail & Related papers (2025-02-20T15:05:00Z)
CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation [51.2289822267563]
We propose Corpus Retrieval and Augmentation for Fine-Tuning (CRAFT), a method for generating synthetic datasets. We use large-scale public web-crawled corpora and similarity-based document retrieval to find other relevant human-written documents. We demonstrate that CRAFT can efficiently generate large-scale task-specific training datasets for four diverse tasks.
arXiv Detail & Related papers (2024-09-03T17:54:40Z)
The Power of Summary-Source Alignments [62.76959473193149]
Multi-document summarization (MDS) is a challenging task, often decomposed into subtasks of salience and redundancy detection. Alignment of corresponding sentences between a reference summary and its source documents has been leveraged to generate training data. This paper proposes extending the summary-source alignment framework by applying it at the more fine-grained proposition span level.
arXiv Detail & Related papers (2024-06-02T19:35:19Z)
Embrace Divergence for Richer Insights: A Multi-document Summarization Benchmark and a Case Study on Summarizing Diverse Information from News Articles [136.84278943588652]
We propose a new task of summarizing diverse information encountered in multiple news articles encompassing the same event. To facilitate this task, we outlined a data collection schema for identifying diverse information and curated a dataset named DiverseSumm. The dataset includes 245 news stories, with each story comprising 10 news articles and paired with a human-validated reference.
arXiv Detail & Related papers (2023-09-17T20:28:17Z)
DocumentNet: Bridging the Data Gap in Document Pre-Training [78.01647768018485]
We propose a method to collect massive-scale and weakly labeled data from the web to benefit the training of Visually-rich Document Entity Retrieval (VDER) models. The collected dataset, named DocumentNet, does not depend on specific document types or entity sets. Experiments on a set of broadly adopted VDER tasks show significant improvements when DocumentNet is incorporated into the pre-training.
arXiv Detail & Related papers (2023-06-15T08:21:15Z)
Generating a Structured Summary of Numerous Academic Papers: Dataset and Method [20.90939310713561]
We propose BigSurvey, the first large-scale dataset for generating comprehensive summaries of numerous academic papers on each topic. We collect target summaries from more than seven thousand survey papers and utilize their 430 thousand reference papers' abstracts as input documents. To organize the diverse content from dozens of input documents, we propose a summarization method named category-based alignment and sparse transformer (CAST).
arXiv Detail & Related papers (2023-02-09T11:42:07Z)
How "Multi" is Multi-Document Summarization? [15.574673241564932]
It is expected that both reference summaries in MDS datasets and system summaries would indeed be based on dispersed information. We propose an automated measure for evaluating the degree to which a summary is "disperse". Our results show that certain MDS datasets barely require combining information from multiple documents, where a single document often covers the full summary content.
arXiv Detail & Related papers (2022-10-23T10:20:09Z)
Unsupervised Summarization with Customized Granularities [76.26899748972423]
We propose the first unsupervised multi-granularity summarization framework, GranuSum. By inputting different numbers of events, GranuSum is capable of producing multi-granular summaries in an unsupervised manner.
arXiv Detail & Related papers (2022-01-29T05:56:35Z)
HowSumm: A Multi-Document Summarization Dataset Derived from WikiHow Articles [8.53502615629675]
We present HowSumm, a novel large-scale dataset for the task of query-focused multi-document summarization (qMDS). This use-case is different from the use-cases covered in existing multi-document summarization (MDS) datasets and is applicable to educational and industrial scenarios. We describe the creation of the dataset and discuss the unique features that distinguish it from other summarization corpora.
arXiv Detail & Related papers (2021-10-07T04:44:32Z)
Data Augmentation for Abstractive Query-Focused Multi-Document Summarization [129.96147867496205]
We present two QMDS training datasets, which we construct using two data augmentation methods. These two datasets have complementary properties, i.e., QMDSCNN has real summaries but queries are simulated, while QMDSIR has real queries but simulated summaries. We build end-to-end neural network models on the combined datasets that yield new state-of-the-art transfer results on DUC datasets.
arXiv Detail & Related papers (2021-03-02T16:57:01Z)
WSL-DS: Weakly Supervised Learning with Distant Supervision for Query Focused Multi-Document Abstractive Summarization [16.048329028104643]
In the Query Focused Multi-Document Summarization (QF-MDS) task, a set of documents and a query are given where the goal is to generate a summary from these documents. One major challenge for this task is the lack of availability of labeled training datasets. We propose a novel weakly supervised learning approach via utilizing distant supervision.
arXiv Detail & Related papers (2020-11-03T02:02:55Z)
AQuaMuSe: Automatically Generating Datasets for Query-Based Multi-Document Summarization [17.098075160558576]
We propose a scalable approach called AQuaMuSe to automatically mine qMDS examples from question answering datasets and large document corpora. We publicly release a specific instance of an AQuaMuSe dataset with 5,519 query-based summaries, each associated with an average of 6 input documents selected from an index of 355M documents from Common Crawl.
arXiv Detail & Related papers (2020-10-23T22:38:18Z)

This list is automatically generated from the titles and abstracts of the papers on this site.