PDSum: Prototype-driven Continuous Summarization of Evolving
Multi-document Sets Stream
- URL: http://arxiv.org/abs/2302.05550v1
- Date: Fri, 10 Feb 2023 23:43:46 GMT
- Title: PDSum: Prototype-driven Continuous Summarization of Evolving
Multi-document Sets Stream
- Authors: Susik Yoon, Hou Pong Chan, Jiawei Han
- Abstract summary: We propose a new summarization problem, Evolving Multi-Document sets stream Summarization (EMDS).
We introduce a novel unsupervised algorithm PDSum with the idea of prototype-driven continuous summarization.
PDSum builds a lightweight prototype of each multi-document set and exploits it to adapt to new documents.
- Score: 33.68263291948121
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Summarizing text-rich documents has been long studied in the literature, but
most of the existing efforts have been made to summarize a static and
predefined multi-document set. With the rapid development of online platforms
for generating and distributing text-rich documents, there arises an urgent
need for continuously summarizing dynamically evolving multi-document sets
where the composition of documents and sets is changing over time. This is
especially challenging as the summarization should be not only effective in
incorporating relevant, novel, and distinctive information from each concurrent
multi-document set, but also efficient in serving online applications. In this
work, we propose a new summarization problem, Evolving Multi-Document sets
stream Summarization (EMDS), and introduce a novel unsupervised algorithm PDSum
with the idea of prototype-driven continuous summarization. PDSum builds a
lightweight prototype of each multi-document set and exploits it to adapt to
new documents while preserving accumulated knowledge from previous documents.
To update new summaries, the most representative sentences for each
multi-document set are extracted by measuring their similarities to the
prototypes. A thorough evaluation with real multi-document sets streams
demonstrates that PDSum outperforms state-of-the-art unsupervised
multi-document summarization algorithms in EMDS in terms of relevance, novelty,
and distinctiveness and is also robust to various evaluation settings.
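The mechanism the abstract describes (a lightweight per-set prototype, incremental adaptation to new documents, and extraction of the sentences most similar to the prototype) can be illustrated with a minimal sketch. Everything below is an illustrative assumption, not PDSum's actual construction: the hashing embedder stands in for a real sentence encoder, and the exponential-moving-average update and top-k selection stand in for the paper's prototype learning and scoring.

```python
import numpy as np
from collections import defaultdict

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy hashing bag-of-words embedder; a stand-in for a real
    sentence encoder (an assumption, not part of PDSum)."""
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

class PrototypeStream:
    """Keeps one lightweight prototype vector per multi-document set
    and updates it as new documents stream in (EMA blending is an
    assumed update rule for illustration)."""

    def __init__(self, decay: float = 0.9):
        self.decay = decay
        self.prototypes: dict[str, np.ndarray] = {}
        self.sentences: dict[str, list[str]] = defaultdict(list)

    def update(self, set_id: str, document: list[str]) -> None:
        doc_vec = np.mean([embed(s) for s in document], axis=0)
        if set_id in self.prototypes:
            # Adapt to the new document while preserving accumulated knowledge.
            self.prototypes[set_id] = (self.decay * self.prototypes[set_id]
                                       + (1 - self.decay) * doc_vec)
        else:
            self.prototypes[set_id] = doc_vec
        self.sentences[set_id].extend(document)

    def summarize(self, set_id: str, k: int = 3) -> list[str]:
        # Extract the sentences most similar to the set's prototype.
        proto = self.prototypes[set_id]
        ranked = sorted(self.sentences[set_id],
                        key=lambda s: float(embed(s) @ proto),
                        reverse=True)
        return ranked[:k]
```

Calling update() per arriving document and summarize() on demand mirrors the continuous setting; PDSum's real prototypes additionally account for novelty and distinctiveness across the concurrent sets.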
Related papers
- Unified Multi-Modal Interleaved Document Representation for Information Retrieval [57.65409208879344]
We produce more comprehensive and nuanced document representations by holistically embedding documents interleaved with different modalities.
Specifically, we achieve this by leveraging the capability of recent vision-language models that enable the processing and integration of text, images, and tables into a unified format and representation.
arXiv Detail & Related papers (2024-10-03T17:49:09Z)
- Peek Across: Improving Multi-Document Modeling via Cross-Document Question-Answering [49.85790367128085]
We pre-train a generic multi-document model with a novel cross-document question-answering pre-training objective.
This novel multi-document QA formulation directs the model to better recover cross-text informational relations.
Unlike prior multi-document models that focus on either classification or summarization tasks, our pre-training objective formulation enables the model to perform tasks that involve both short text generation and long text generation.
arXiv Detail & Related papers (2023-05-24T17:48:40Z)
- A Hierarchical Encoding-Decoding Scheme for Abstractive Multi-document Summarization [66.08074487429477]
Pre-trained language models (PLMs) have achieved outstanding results in abstractive single-document summarization (SDS).
We propose a new method to better utilize a PLM to facilitate multi-document interactions for the multi-document summarization (MDS) task.
Our method outperforms its corresponding PLM backbone by up to 3 Rouge-L points and is favored by humans.
arXiv Detail & Related papers (2023-05-15T10:03:31Z)
- Mining both Commonality and Specificity from Multiple Documents for Multi-Document Summarization [1.4629756274247374]
The multi-document summarization task requires the summarizer to generate a short text that covers the important information of the original documents.
This paper proposes a multi-document summarization approach based on hierarchical clustering of documents.
arXiv Detail & Related papers (2023-03-05T14:25:05Z)
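The document-grouping step this entry describes can be mocked up in a few lines of scipy; the average linkage and cosine metric are assumed settings, and the paper's mining of commonality within clusters and specificity per document happens after this grouping.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def group_documents(doc_vecs: np.ndarray, n_clusters: int = 2) -> np.ndarray:
    """Hierarchical (agglomerative) clustering of document embeddings.
    Linkage method and distance metric are illustrative assumptions."""
    Z = linkage(doc_vecs, method="average", metric="cosine")
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```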
- Large-Scale Multi-Document Summarization with Information Extraction and Compression [31.601707033466766]
We develop an abstractive summarization framework independent of labeled data for multiple heterogeneous documents.
Our framework processes documents telling different stories instead of documents on the same topic.
Our experiments demonstrate that our framework outperforms current state-of-the-art methods in this more generic setting.
arXiv Detail & Related papers (2022-05-01T19:49:15Z)
- Unified Pretraining Framework for Document Understanding [52.224359498792836]
We present UDoc, a new unified pretraining framework for document understanding.
UDoc is designed to support most document understanding tasks, extending the Transformer to take multimodal embeddings as input.
An important feature of UDoc is that it learns a generic representation by making use of three self-supervised losses.
arXiv Detail & Related papers (2022-04-22T21:47:04Z)
- Modeling Endorsement for Multi-Document Abstractive Summarization [10.166639983949887]
A crucial difference between single- and multi-document summarization is how salient content manifests itself in the document(s).
In this paper, we model the cross-document endorsement effect and its utilization in multiple document summarization.
Our method generates a synopsis from each document, which serves as an endorser to identify salient content from other documents.
arXiv Detail & Related papers (2021-10-15T03:55:42Z)
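The endorsement idea in this entry (each document's synopsis votes for salient content in the other documents) can be caricatured with embeddings; the similarity-sum scoring below is an illustrative assumption, not the paper's abstractive model.

```python
import numpy as np

def endorsement_scores(doc_sent_vecs: list[np.ndarray],
                       synopsis_vecs: np.ndarray) -> list[np.ndarray]:
    """Each document's synopsis embedding 'endorses' sentences of the
    *other* documents; a sentence's salience is the sum of the
    endorsements it receives. Vectors are assumed unit-normalized."""
    scores = []
    for i, sents in enumerate(doc_sent_vecs):
        others = np.delete(synopsis_vecs, i, axis=0)  # exclude own synopsis
        scores.append((sents @ others.T).sum(axis=1))
    return scores
```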
- WSL-DS: Weakly Supervised Learning with Distant Supervision for Query Focused Multi-Document Abstractive Summarization [16.048329028104643]
In the Query Focused Multi-Document Summarization (QF-MDS) task, a set of documents and a query are given where the goal is to generate a summary from these documents.
One major challenge for this task is the lack of availability of labeled training datasets.
We propose a novel weakly supervised learning approach that utilizes distant supervision.
arXiv Detail & Related papers (2020-11-03T02:02:55Z)
- SupMMD: A Sentence Importance Model for Extractive Summarization using Maximum Mean Discrepancy [92.5683788430012]
SupMMD is a novel technique for generic and update summarization based on the maximum mean discrepancy from kernel two-sample testing.
We show the efficacy of SupMMD in both generic and update summarization tasks by meeting or exceeding the current state-of-the-art on the DUC-2004 and TAC-2009 datasets.
arXiv Detail & Related papers (2020-10-06T09:26:55Z)
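The maximum mean discrepancy statistic this entry builds on is easy to illustrate in numpy; the RBF kernel and gamma below are assumed choices, and SupMMD itself learns a supervised sentence-importance model on top of this statistic rather than using the raw estimate.

```python
import numpy as np

def rbf(A: np.ndarray, B: np.ndarray, gamma: float = 1.0) -> np.ndarray:
    """RBF kernel matrix between row-vector samples A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

def mmd2(X: np.ndarray, Y: np.ndarray, gamma: float = 1.0) -> float:
    """Biased empirical estimate of squared MMD between samples X and Y:
    E[k(x,x')] - 2 E[k(x,y)] + E[k(y,y')]."""
    return float(rbf(X, X, gamma).mean()
                 - 2.0 * rbf(X, Y, gamma).mean()
                 + rbf(Y, Y, gamma).mean())
```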
- Towards a Multi-modal, Multi-task Learning based Pre-training Framework for Document Representation Learning [5.109216329453963]
We introduce Document Topic Modelling and Document Shuffle Prediction as novel pre-training tasks.
We utilize the Longformer network architecture as the backbone to encode the multi-modal information from multi-page documents in an end-to-end fashion.
arXiv Detail & Related papers (2020-09-30T05:39:04Z)
- SummPip: Unsupervised Multi-Document Summarization with Sentence Graph Compression [61.97200991151141]
SummPip is an unsupervised method for multi-document summarization.
We convert the original documents to a sentence graph, taking both linguistic and deep representations into account.
We then apply spectral clustering to obtain multiple clusters of sentences, and finally compress each cluster to generate the final summary.
arXiv Detail & Related papers (2020-07-17T13:01:15Z)
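The three pipeline stages this entry names (sentence graph, spectral clustering, per-cluster compression) can be sketched briefly. The cosine-similarity graph (sentence vectors assumed unit-normalized) and the "shortest sentence per cluster" compression are naive stand-ins for SummPip's richer graph construction and compression steps.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def summpip_like(sent_vecs: np.ndarray, sentences: list[str],
                 n_clusters: int = 3) -> list[str]:
    """Sentence graph -> spectral clustering -> per-cluster compression,
    with naive stand-ins for the paper's graph and compression."""
    sim = np.clip(sent_vecs @ sent_vecs.T, 0.0, None)  # non-negative affinity
    labels = SpectralClustering(n_clusters=n_clusters,
                                affinity="precomputed",
                                random_state=0).fit_predict(sim)
    summary = []
    for c in range(n_clusters):
        members = [s for s, l in zip(sentences, labels) if l == c]
        if members:
            summary.append(min(members, key=len))  # stand-in for compression
    return summary
```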