GameWikiSum: a Novel Large Multi-Document Summarization Dataset
- URL: http://arxiv.org/abs/2002.06851v1
- Date: Mon, 17 Feb 2020 09:25:19 GMT
- Title: GameWikiSum: a Novel Large Multi-Document Summarization Dataset
- Authors: Diego Antognini, Boi Faltings
- Abstract summary: GameWikiSum is a new domain-specific dataset for multi-document summarization.
It is one hundred times larger than commonly used datasets and covers a domain other than news.
We analyze the proposed dataset and show that both abstractive and extractive models can be trained on it.
- Score: 39.38032088973816
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Today's research progress in the field of multi-document summarization is
obstructed by the small number of available datasets. Since the acquisition of
reference summaries is costly, existing datasets contain only hundreds of
samples at most, resulting in heavy reliance on hand-crafted features or
necessitating additional, manually annotated data. The lack of large corpora
therefore hinders the development of sophisticated models. Additionally, most
publicly available multi-document summarization corpora are in the news domain,
and no analogous dataset exists in the video game domain. In this paper, we
propose GameWikiSum, a new domain-specific dataset for multi-document
summarization, which is one hundred times larger than commonly used datasets,
and covers a domain other than news. Input documents consist of long professional
video game reviews as well as references of their gameplay sections in
Wikipedia pages. We analyze the proposed dataset and show that both abstractive
and extractive models can be trained on it. We release GameWikiSum for further
research: https://github.com/Diego999/GameWikiSum.
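As a concrete illustration of how such a dataset could be used, here is a minimal, hedged sketch of a lead-style extractive baseline over GameWikiSum-like samples. It assumes each sample has already been parsed into a list of professional review texts (the multi-document input) paired with the Wikipedia gameplay section as the reference summary; the function name and data layout are illustrative and not taken from the released repository.

```python
# Minimal sketch of a lead-style extractive baseline over GameWikiSum-like input.
# Assumption (not from the paper): each sample is a list of review texts plus a
# reference gameplay summary; the file layout in the repository may differ.
from typing import List


def lead_baseline(review_texts: List[str], max_sentences: int = 5) -> str:
    """Concatenate the reviews and return the first few sentences as the summary."""
    sentences = []
    for review in review_texts:
        # Naive sentence splitting; a real system would use a proper tokenizer.
        for sent in review.split(". "):
            sent = sent.strip().rstrip(".")
            if sent:
                sentences.append(sent)
    return ". ".join(sentences[:max_sentences]) + "."


# Toy usage with placeholder reviews (not real GameWikiSum data):
reviews = [
    "The combat system rewards careful positioning. Exploration is open-ended.",
    "Players manage resources between missions. The campaign spans thirty hours.",
]
print(lead_baseline(reviews, max_sentences=3))
```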
Related papers
- MMSum: A Dataset for Multimodal Summarization and Thumbnail Generation
of Videos [106.06278332186106]
Multimodal summarization with multimodal output (MSMO) has emerged as a promising research direction.
Numerous limitations exist within existing public MSMO datasets.
We have meticulously curated the MMSum dataset.
arXiv Detail & Related papers (2023-06-07T07:43:11Z)
- How "Multi" is Multi-Document Summarization? [15.574673241564932]
It is expected that both reference summaries in MDS datasets, as well as system summaries, would indeed be based on dispersed information.
We propose an automated measure for evaluating the degree to which a summary is "dispersed".
Our results show that certain MDS datasets barely require combining information from multiple documents, where a single document often covers the full summary content.
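For illustration, here is a hedged sketch of one simple way such a dispersion measure could be approximated: align each summary sentence to the source document with the highest token overlap, then report the fraction of documents the summary draws on. This is an assumed proxy, not the measure proposed in the paper.

```python
# Illustrative dispersion proxy (an assumption, not the paper's actual measure):
# align each summary sentence to its best-matching document by token overlap,
# then report the fraction of source documents the summary draws on.
from collections import Counter


def token_overlap(a: str, b: str) -> int:
    return sum((Counter(a.lower().split()) & Counter(b.lower().split())).values())


def dispersion(summary_sentences, documents):
    """Fraction of source documents that at least one summary sentence aligns to."""
    aligned = set()
    for sent in summary_sentences:
        best_doc = max(range(len(documents)),
                       key=lambda i: token_overlap(sent, documents[i]))
        aligned.add(best_doc)
    return len(aligned) / len(documents)


# Example: a two-document cluster whose summary only uses the first document.
docs = ["the game features turn-based combat and a long campaign",
        "the soundtrack was composed and recorded by a full orchestra"]
summary = ["the game features turn-based combat"]
print(dispersion(summary, docs))  # 0.5: only half of the documents contribute
```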
arXiv Detail & Related papers (2022-10-23T10:20:09Z)
- HowSumm: A Multi-Document Summarization Dataset Derived from WikiHow
Articles [8.53502615629675]
We present HowSumm, a novel large-scale dataset for the task of query-focused multi-document summarization (qMDS).
This use-case is different from the use-cases covered in existing multi-document summarization (MDS) datasets and is applicable to educational and industrial scenarios.
We describe the creation of the dataset and discuss the unique features that distinguish it from other summarization corpora.
arXiv Detail & Related papers (2021-10-07T04:44:32Z)
- MiRANews: Dataset and Benchmarks for Multi-Resource-Assisted News
Summarization [19.062996443574047]
We present a new dataset MiRANews and benchmark existing summarization models.
We show via data analysis that it is not only the models that are to blame.
Assisted summarization reduces 55% of hallucinations when compared to single-document summarization models trained on the main article only.
arXiv Detail & Related papers (2021-09-22T10:58:40Z)
- DESCGEN: A Distantly Supervised Dataset for Generating Abstractive Entity
Descriptions [41.80938919728834]
We introduce DESCGEN: given mentions spread over multiple documents, the goal is to generate an entity summary description.
DESCGEN consists of 37K entity descriptions from Wikipedia and Fandom, each paired with nine evidence documents on average.
The resulting summaries are more abstractive than those found in existing datasets and provide a better proxy for the challenge of describing new and emerging entities.
arXiv Detail & Related papers (2021-06-09T20:10:48Z)
- SummScreen: A Dataset for Abstractive Screenplay Summarization [52.56760815805357]
SummScreen is a dataset consisting of pairs of TV series transcripts and human-written recaps.
Plot details are often expressed indirectly in character dialogues and may be scattered across the entirety of the transcript.
Since characters are fundamental to TV series, we also propose two entity-centric evaluation metrics.
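As an illustration only, here is a simplified sketch in the spirit of an entity-centric metric: recall of character names from the reference recap that also appear in the generated recap. The two metrics actually proposed in the paper are not detailed in the snippet above, so the function below is an assumption, not their definition.

```python
# Assumed, simplified entity-centric score (not the paper's metrics): recall of
# character names mentioned in the reference recap that also appear in the
# generated recap.
from typing import Iterable


def character_recall(generated: str, reference: str,
                     character_names: Iterable[str]) -> float:
    gen, ref = generated.lower(), reference.lower()
    in_reference = [name for name in character_names if name.lower() in ref]
    if not in_reference:
        return 0.0
    recovered = sum(1 for name in in_reference if name.lower() in gen)
    return recovered / len(in_reference)


# Toy usage with placeholder character names:
score = character_recall(
    generated="Alice confronts Bob about the missing files.",
    reference="Alice and Carol discover that Bob hid the files.",
    character_names=["Alice", "Bob", "Carol"],
)
print(round(score, 2))  # 0.67: two of the three reference characters are recovered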
arXiv Detail & Related papers (2021-04-14T19:37:40Z)
- WikiAsp: A Dataset for Multi-domain Aspect-based Summarization [69.13865812754058]
We propose WikiAsp, a large-scale dataset for multi-domain aspect-based summarization.
Specifically, we build the dataset using Wikipedia articles from 20 different domains, using the section titles and boundaries of each article as a proxy for aspect annotation.
Results highlight key challenges that existing summarization models face in this setting, such as proper pronoun handling of quoted sources and consistent explanation of time-sensitive events.
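Here is a minimal sketch of the "section titles as aspect labels" idea: split an article into (section title, section text) pairs. It assumes wikitext-style "== Heading ==" markers; the actual WikiAsp preprocessing pipeline may differ.

```python
# Sketch of using section titles as proxy aspect labels. Assumes wikitext-style
# "== Heading ==" markers; the real WikiAsp preprocessing may differ.
import re


def sections_as_aspects(wikitext: str):
    # re.split with one capture group yields [lead, title1, body1, title2, body2, ...]
    parts = re.split(r"^==\s*(.+?)\s*==\s*$", wikitext, flags=re.MULTILINE)
    pairs = []
    for i in range(1, len(parts) - 1, 2):
        title, body = parts[i], parts[i + 1].strip()
        if body:
            pairs.append((title, body))
    return pairs


# Toy article (placeholder text, not taken from the dataset):
article = """Lead paragraph about the game.
== Gameplay ==
The player explores an open world.
== Reception ==
Critics praised the soundtrack."""

for aspect, text in sections_as_aspects(article):
    print(aspect, "->", text)
```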
arXiv Detail & Related papers (2020-11-16T10:02:52Z)
- A Large-Scale Multi-Document Summarization Dataset from the Wikipedia
Current Events Portal [10.553314461761968]
Multi-document summarization (MDS) aims to compress the content in large document collections into short summaries.
This work presents a new dataset for MDS that is large both in the total number of document clusters and in the size of individual clusters.
arXiv Detail & Related papers (2020-05-20T14:33:33Z)
- From Standard Summarization to New Tasks and Beyond: Summarization with
Manifold Information [77.89755281215079]
Text summarization is the research area that aims to create a short, condensed version of an original document.
In real-world applications, most of the data is not in a plain text format.
This paper surveys these new summarization tasks and approaches as they arise in real-world applications.
arXiv Detail & Related papers (2020-05-10T14:59:36Z)
- SciREX: A Challenge Dataset for Document-Level Information Extraction [56.83748634747753]
It is challenging to create a large-scale information extraction dataset at the document level.
We introduce SciREX, a document-level IE dataset that encompasses multiple IE tasks.
We develop a neural model as a strong baseline that extends previous state-of-the-art IE models to document-level IE.
arXiv Detail & Related papers (2020-05-01T17:30:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.