MiRANews: Dataset and Benchmarks for Multi-Resource-Assisted News
Summarization
- URL: http://arxiv.org/abs/2109.10650v1
- Date: Wed, 22 Sep 2021 10:58:40 GMT
- Title: MiRANews: Dataset and Benchmarks for Multi-Resource-Assisted News
Summarization
- Authors: Xinnuo Xu, Ondřej Dušek, Shashi Narayan, Verena Rieser and
Ioannis Konstas
- Abstract summary: We present a new dataset, MiRANews, and benchmark existing summarization models on it.
We show via data analysis that the models are not solely to blame.
Assisted summarization reduces hallucinations by 55% compared to single-document summarization models trained on the main article only.
- Score: 19.062996443574047
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: One of the most challenging aspects of current single-document news
summarization is that the summary often contains 'extrinsic hallucinations',
i.e., facts that are not present in the source document, which are often
derived via world knowledge. This causes summarization systems to act more like
open-ended language models that tend to hallucinate erroneous facts. In
this paper, we mitigate this problem with the help of multiple supplementary
resource documents assisting the task. We present a new dataset MiRANews and
benchmark existing summarization models. In contrast to multi-document
summarization, which addresses multiple events from several source documents,
we still aim at generating a summary for a single document. We show via data
analysis that the models are not solely to blame: more than 27% of
facts mentioned in the gold summaries of MiRANews are better grounded on
assisting documents than in the main source articles. An error analysis of
generated summaries from pretrained models fine-tuned on MiRANews reveals that
this has an even bigger effect on models: assisted summarization reduces
hallucinations by 55% compared to single-document summarization models trained
on the main article only. Our code and data are available at
https://github.com/XinnuoXu/MiRANews.
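As a rough illustration of the assisted setup described above (one main article plus supplementary resource documents, still summarizing only the main article), here is a minimal sketch using an off-the-shelf Hugging Face BART summarizer. The document separator and truncation strategy are illustrative assumptions, not the paper's actual input format; see the linked repository for the real code and data.

```python
# Minimal sketch of multi-resource-assisted summarization input construction.
# The "</s>"-separated concatenation and truncation order are assumptions for
# illustration, not MiRANews' actual preprocessing.
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

def assisted_summarize(main_article: str, assisting_docs: list[str]) -> str:
    # Main article first, so truncation drops supplementary material before
    # it drops any of the article actually being summarized.
    source = " </s> ".join([main_article] + assisting_docs)
    inputs = tokenizer(source, truncation=True, max_length=1024,
                       return_tensors="pt")
    ids = model.generate(**inputs, num_beams=4, max_length=142)
    return tokenizer.decode(ids[0], skip_special_tokens=True)
```

Concatenation is only the simplest way to condition on assisting documents; architectures with separate encoders per resource are another option.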
Related papers
- Shaping Political Discourse using multi-source News Summarization [0.46040036610482665]
We have developed a machine learning model that generates a concise summary of a topic from multiple news documents.
The model is designed to be unbiased by sampling its input equally from all the different aspects of the topic.
arXiv Detail & Related papers (2023-12-18T21:03:46Z)
- Embrace Divergence for Richer Insights: A Multi-document Summarization Benchmark and a Case Study on Summarizing Diverse Information from News Articles [136.84278943588652]
We propose a new task of summarizing diverse information encountered in multiple news articles encompassing the same event.
To facilitate this task, we outline a data collection schema for identifying diverse information and curate a dataset named DiverseSumm.
The dataset includes 245 news stories, with each story comprising 10 news articles and paired with a human-validated reference.
arXiv Detail & Related papers (2023-09-17T20:28:17Z)
- Correcting Diverse Factual Errors in Abstractive Summarization via Post-Editing and Language Model Infilling [56.70682379371534]
We show that our approach vastly outperforms prior methods in correcting erroneous summaries.
Our model -- FactEdit -- improves factuality scores by over 11 points on CNN/DM and over 31 points on XSum.
arXiv Detail & Related papers (2022-10-22T07:16:19Z)
- Unsupervised Summarization with Customized Granularities [76.26899748972423]
We propose the first unsupervised multi-granularity summarization framework, GranuSum.
Given different numbers of events as input, GranuSum produces summaries at multiple granularities in an unsupervised manner.
arXiv Detail & Related papers (2022-01-29T05:56:35Z)
- Modeling Endorsement for Multi-Document Abstractive Summarization [10.166639983949887]
A crucial difference between single- and multi-document summarization is how salient content manifests itself in the document(s).
In this paper, we model the cross-document endorsement effect and its utilization in multiple document summarization.
Our method generates a synopsis from each document, which serves as an endorser to identify salient content from other documents.
arXiv Detail & Related papers (2021-10-15T03:55:42Z)
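The endorsement mechanism is only sketched at a high level above; the toy example below conveys the idea with lead-sentence synopses and unigram overlap standing in for the paper's learned, abstractive components.

```python
# Toy sketch of cross-document endorsement: a synopsis of each document
# "endorses" overlapping content in the other documents. Lead-sentence
# synopses and unigram overlap are illustrative stand-ins for the paper's
# actual abstractive components.
import re

def split_sentences(doc: str) -> list[str]:
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", doc) if s.strip()]

def synopsis_words(doc: str, k: int = 2) -> set[str]:
    # Crude synopsis: the first k sentences, reduced to a bag of words.
    return set(" ".join(split_sentences(doc)[:k]).lower().split())

def endorsed_summary(docs: list[str], budget: int = 3) -> list[str]:
    synopses = [synopsis_words(d) for d in docs]
    scored = []
    for i, doc in enumerate(docs):
        for sent in split_sentences(doc):
            words = set(sent.lower().split())
            # Endorsement score: overlap with synopses of the *other* docs.
            score = sum(len(words & syn)
                        for j, syn in enumerate(synopses) if j != i)
            scored.append((score, sent))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sent for _, sent in scored[:budget]]
```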
- Topic Modeling Based Extractive Text Summarization [0.0]
We propose a novel method to summarize a text document by clustering its contents based on latent topics.
We utilize the lesser-used and challenging WikiHow dataset in our approach to text summarization.
arXiv Detail & Related papers (2021-06-29T12:28:19Z)
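As a hedged sketch of this family of methods (not this paper's exact system), one can fit LDA over a document's sentences and extract the sentence most strongly associated with each latent topic:

```python
# Generic sketch of topic-based extractive summarization: fit LDA over the
# document's sentences, then keep the sentence most strongly assigned to
# each latent topic. Illustrates the general approach, not this paper's
# exact pipeline.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def topic_extractive_summary(sents: list[str], n_topics: int = 3) -> list[str]:
    counts = CountVectorizer(stop_words="english").fit_transform(sents)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    doc_topic = lda.fit_transform(counts)  # sentence-by-topic weights
    picks = {int(np.argmax(doc_topic[:, t])) for t in range(n_topics)}
    return [sents[i] for i in sorted(picks)]  # preserve original order
```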
- Hidden Biases in Unreliable News Detection Datasets [60.71991809782698]
We show that selection bias during data collection leads to undesired artifacts in the datasets.
We observed a significant drop (>10%) in accuracy for all models tested in a clean split with no train/test source overlap.
We suggest that future dataset creation include a simple model as a difficulty/bias probe, and that future model development use a clean, non-overlapping site and date split.
arXiv Detail & Related papers (2021-04-20T17:16:41Z)
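A minimal sketch of the recommended clean split follows: hold out entire source sites so that none appears in both train and test (a date split would analogously hold out the latest dates). The `site` column name is an assumption for illustration.

```python
# Hold out whole source sites so no site appears in both train and test,
# avoiding the source-overlap artifact described above. The "site" column
# name is an assumed schema, not from the paper.
import pandas as pd

def site_disjoint_split(df: pd.DataFrame, test_frac: float = 0.2, seed: int = 0):
    sites = df["site"].drop_duplicates().sample(frac=1.0, random_state=seed)
    test_sites = set(sites.iloc[: max(1, int(len(sites) * test_frac))])
    mask = df["site"].isin(test_sites)
    return df[~mask], df[mask]  # (train, test)
```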
- Corpora Evaluation and System Bias Detection in Multi-document Summarization [25.131744693121508]
Multi-document summarization (MDS) is the task of reflecting key points from any set of documents into a concise text paragraph.
Because the task has no standard definition, we encounter a plethora of datasets with varying levels of overlap and conflict between participating documents.
New systems report results on a set of chosen datasets, which might not correlate with their performance on the other datasets.
arXiv Detail & Related papers (2020-10-05T05:25:43Z)
- Leveraging Graph to Improve Abstractive Multi-Document Summarization [50.62418656177642]
We develop a neural abstractive multi-document summarization (MDS) model which can leverage well-known graph representations of documents.
Our model utilizes graphs to encode documents in order to capture cross-document relations, which is crucial to summarizing long documents.
Our model can also take advantage of graphs to guide the summary generation process, which is beneficial for generating coherent and concise summaries.
arXiv Detail & Related papers (2020-05-20T13:39:47Z)
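The paper encodes graph representations inside a neural MDS model; as a small, hedged illustration of one such well-known representation, the sketch below builds a cross-document sentence-similarity graph from TF-IDF cosine similarity (the graph type and threshold are assumptions, not the paper's specific choice).

```python
# Build a sentence-similarity graph over all input documents: nodes are
# sentences, edges connect pairs whose TF-IDF cosine similarity clears a
# threshold. One common graph representation for MDS; construction only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def similarity_graph(sents: list[str], threshold: float = 0.2):
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(sents)
    sim = cosine_similarity(tfidf)
    return [(i, j, float(sim[i, j]))
            for i in range(len(sents))
            for j in range(i + 1, len(sents))
            if sim[i, j] >= threshold]
```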
- Pre-training for Abstractive Document Summarization by Reinstating Source Text [105.77348528847337]
This paper presents three pre-training objectives that allow us to pre-train a Seq2Seq-based abstractive summarization model on unlabeled text.
Experiments on two benchmark summarization datasets show that all three objectives can improve performance upon baselines.
arXiv Detail & Related papers (2020-04-04T05:06:26Z)
- GameWikiSum: a Novel Large Multi-Document Summarization Dataset [39.38032088973816]
GameWikiSum is a new domain-specific dataset for multi-document summarization.
It is one hundred times larger than commonly used datasets, and it comes from a different domain than news.
We analyze the proposed dataset and show that both abstractive and extractive models can be trained on it.
arXiv Detail & Related papers (2020-02-17T09:25:19Z)