Multi-Document Summarization with Centroid-Based Pretraining
- URL: http://arxiv.org/abs/2208.01006v2
- Date: Wed, 31 May 2023 14:37:32 GMT
- Title: Multi-Document Summarization with Centroid-Based Pretraining
- Authors: Ratish Puduppully and Parag Jain and Nancy F. Chen and Mark Steedman
- Abstract summary: In Multi-Document Summarization (MDS), the input can be modeled as a set of documents, and the output is its summary.
We introduce a novel pretraining objective, which involves selecting the ROUGE-based centroid of each document cluster as a proxy for its summary.
Our objective thus does not require human written summaries and can be utilized for pretraining on a dataset consisting solely of document sets.
- Score: 35.8335939654861
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In Multi-Document Summarization (MDS), the input can be modeled as a set of
documents, and the output is its summary. In this paper, we focus on
pretraining objectives for MDS. Specifically, we introduce a novel pretraining
objective, which involves selecting the ROUGE-based centroid of each document
cluster as a proxy for its summary. Our objective thus does not require human
written summaries and can be utilized for pretraining on a dataset consisting
solely of document sets. Through zero-shot, few-shot, and fully supervised
experiments on multiple MDS datasets, we show that our model Centrum is better
or comparable to a state-of-the-art model. We make the pretrained and
fine-tuned models freely available to the research community at
https://github.com/ratishsp/centrum.
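The centroid selection described in the abstract reduces to a simple procedure: within each document cluster, score every document by its average ROUGE against the remaining documents and take the highest-scoring one as the pseudo-summary target. A minimal sketch follows, using unigram-overlap F1 as a stand-in for ROUGE; the exact ROUGE variant, preprocessing, and input formatting used by Centrum are assumptions here, not taken from the paper.
```python
# Hedged sketch of centroid-based pseudo-summary selection for pretraining.
from collections import Counter
from typing import List

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1, a simple stand-in for ROUGE-1."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def select_centroid(cluster: List[str]) -> str:
    """Return the document with the highest mean ROUGE against the rest of its cluster."""
    best_doc, best_score = cluster[0], float("-inf")
    for i, doc in enumerate(cluster):
        others = [d for j, d in enumerate(cluster) if j != i]
        score = sum(rouge1_f1(doc, o) for o in others) / max(len(others), 1)
        if score > best_score:
            best_doc, best_score = doc, score
    return best_doc

# Assumed pretraining setup: the selected centroid serves as the target summary,
# and the remaining documents in the cluster are concatenated as the model input.
```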
Related papers
- Federated Document Visual Question Answering: A Pilot Study [11.157766332838877]
Documents tend to be copyrighted or contain private information, which prohibits their open publication.
In this work, we explore the use of a federated learning scheme as a way to train a shared model on decentralised private document data.
We show that our pretraining strategies can effectively learn and scale up under federated training with diverse DocVQA datasets.
arXiv Detail & Related papers (2024-05-10T17:53:05Z) - PELMS: Pre-training for Effective Low-Shot Multi-Document Summarization [4.6493060043204535]
We present PELMS, a pre-trained model that generates concise, fluent, and faithful summaries.
We compile MultiPT, a multi-document pre-training corpus containing over 93 million documents to form more than 3 million unlabeled topic-centric document clusters.
Our approach consistently outperforms competitive comparisons with respect to overall informativeness, abstractiveness, coherence, and faithfulness.
arXiv Detail & Related papers (2023-11-16T12:05:23Z) - Peek Across: Improving Multi-Document Modeling via Cross-Document Question-Answering [49.85790367128085]
We pre-train a generic multi-document model with a novel cross-document question answering pre-training objective.
This novel multi-document QA formulation directs the model to better recover cross-text informational relations.
Unlike prior multi-document models that focus on either classification or summarization tasks, our pre-training objective formulation enables the model to perform tasks that involve both short text generation and long text generation.
arXiv Detail & Related papers (2023-05-24T17:48:40Z) - How "Multi" is Multi-Document Summarization? [15.574673241564932]
It is expected that both reference summaries in MDS datasets and system summaries would indeed be based on dispersed information.
We propose an automated measure for evaluating the degree to which a summary is 'disperse'.
Our results show that certain MDS datasets barely require combining information from multiple documents, where a single document often covers the full summary content.
arXiv Detail & Related papers (2022-10-23T10:20:09Z) - Unsupervised Summarization with Customized Granularities [76.26899748972423]
We propose the first unsupervised multi-granularity summarization framework, GranuSum.
By inputting different numbers of events, GranuSum is capable of producing multi-granular summaries in an unsupervised manner.
arXiv Detail & Related papers (2022-01-29T05:56:35Z) - PRIMER: Pyramid-based Masked Sentence Pre-training for Multi-document Summarization [16.830963601598242]
We propose PRIMER, a pre-trained model for multi-document representation with focus on summarization.
Specifically, we adopt the Longformer architecture with proper input transformation and global attention to fit multi-document inputs.
Our model, PRIMER, outperforms current state-of-the-art models on most of these settings with large margins.
arXiv Detail & Related papers (2021-10-16T07:22:24Z) - WSL-DS: Weakly Supervised Learning with Distant Supervision for Query Focused Multi-Document Abstractive Summarization [16.048329028104643]
In the Query Focused Multi-Document Summarization (QF-MDS) task, a set of documents and a query are given where the goal is to generate a summary from these documents.
One major challenge for this task is the lack of availability of labeled training datasets.
We propose a novel weakly supervised learning approach that utilizes distant supervision.
arXiv Detail & Related papers (2020-11-03T02:02:55Z) - SupMMD: A Sentence Importance Model for Extractive Summarization using Maximum Mean Discrepancy [92.5683788430012]
SupMMD is a novel technique for generic and update summarization based on the maximum mean discrepancy from kernel two-sample testing.
We show the efficacy of SupMMD in both generic and update summarization tasks by meeting or exceeding the current state-of-the-art on the DUC-2004 and TAC-2009 datasets.
arXiv Detail & Related papers (2020-10-06T09:26:55Z) - SummPip: Unsupervised Multi-Document Summarization with Sentence Graph Compression [61.97200991151141]
SummPip is an unsupervised method for multi-document summarization.
We convert the original documents to a sentence graph, taking both linguistic and deep representations into account.
We then apply spectral clustering to obtain multiple clusters of sentences, and finally compress each cluster to generate the final summary; a simplified sketch of this pipeline follows the related-papers list below.
arXiv Detail & Related papers (2020-07-17T13:01:15Z) - Pre-training for Abstractive Document Summarization by Reinstating Source Text [105.77348528847337]
This paper presents three pre-training objectives which allow us to pre-train a Seq2Seq based abstractive summarization model on unlabeled text.
Experiments on two benchmark summarization datasets show that all three objectives can improve performance upon baselines.
arXiv Detail & Related papers (2020-04-04T05:06:26Z)
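As referenced in the SummPip entry above, the pipeline there (sentence graph, spectral clustering, per-cluster compression) can be illustrated with a simplified sketch. TF-IDF cosine similarity stands in for the paper's combined linguistic and deep representations, and picking the most central sentence stands in for its multi-sentence compression step; the function name and defaults below are illustrative assumptions, not the authors' implementation.
```python
# Hedged, simplified sketch of a SummPip-style summarization pipeline.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import SpectralClustering
from sklearn.metrics.pairwise import cosine_similarity

def summpip_like_summary(sentences, n_clusters=3):
    # Sentence graph: cosine similarity over TF-IDF vectors (a stand-in for the
    # paper's richer linguistic and deep sentence representations).
    tfidf = TfidfVectorizer().fit_transform(sentences)
    sim = cosine_similarity(tfidf)
    np.fill_diagonal(sim, 0.0)

    # Spectral clustering on the precomputed affinity matrix.
    labels = SpectralClustering(
        n_clusters=n_clusters, affinity="precomputed", random_state=0
    ).fit_predict(sim)

    # Naive "compression": keep the sentence most similar to the rest of its
    # cluster (the actual method fuses sentences rather than selecting one).
    summary = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        central = idx[np.argmax(sim[np.ix_(idx, idx)].sum(axis=1))]
        summary.append(sentences[central])
    return " ".join(summary)
```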
This list is automatically generated from the titles and abstracts of the papers on this site.