Open Domain Multi-document Summarization: A Comprehensive Study of Model
Brittleness under Retrieval
- URL: http://arxiv.org/abs/2212.10526v3
- Date: Wed, 25 Oct 2023 13:25:20 GMT
- Title: Open Domain Multi-document Summarization: A Comprehensive Study of Model
Brittleness under Retrieval
- Authors: John Giorgi, Luca Soldaini, Bo Wang, Gary Bader, Kyle Lo, Lucy Lu
Wang, Arman Cohan
- Abstract summary: Multi-document summarization (MDS) assumes a set of topic-related documents are provided as input.
We study the more challenging open-domain setting, in which the document set must first be retrieved, by formalizing the task and bootstrapping it using existing datasets, retrievers and summarizers.
- Score: 42.73076855699184
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-document summarization (MDS) assumes a set of topic-related documents
are provided as input. In practice, this document set is not always available;
it would need to be retrieved given an information need, i.e. a question or
topic statement, a setting we dub "open-domain" MDS. We study this more
challenging setting by formalizing the task and bootstrapping it using existing
datasets, retrievers and summarizers. Via extensive automatic and human
evaluation, we determine: (1) state-of-the-art summarizers suffer large
reductions in performance when applied to open-domain MDS, (2) additional
training in the open-domain setting can reduce this sensitivity to imperfect
retrieval, and (3) summarizers are insensitive to the retrieval of duplicate
documents and the order of retrieved documents, but highly sensitive to other
errors, like the retrieval of irrelevant documents. Based on our results, we
provide practical guidelines to enable future work on open-domain MDS, e.g. how
to choose the number of retrieved documents to summarize. Our results suggest
that new retrieval and summarization methods and annotated resources for
training and evaluation are necessary for further progress in the open-domain
setting.
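To make the open-domain setting concrete, the sketch below shows the two-stage pipeline the abstract describes: a retriever selects the top-k documents for a question or topic statement, and a summarizer condenses the retrieved set. BM25 (via rank_bm25) and a BART summarizer from Hugging Face are illustrative stand-ins, not the specific retrievers or summarizers evaluated in the paper; choosing k is exactly the kind of decision the paper's guidelines address.

```python
# Minimal open-domain MDS sketch: retrieve a document set for a query, then
# summarize it. BM25 and BART are illustrative stand-ins (assumptions), not the
# retrievers/summarizers studied in the paper.
from rank_bm25 import BM25Okapi
from transformers import pipeline

corpus = [
    "Document about the first subtopic of the information need ...",
    "Document about a second, related subtopic ...",
    "An off-topic document that a noisy retriever might surface ...",
]
query = "question or topic statement describing the information need"

# Stage 1: retrieval. Rank the corpus against the query and keep the top-k.
# Choosing k trades off recall of relevant documents against irrelevant noise.
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
top_k = 2
retrieved = bm25.get_top_n(query.lower().split(), corpus, n=top_k)

# Stage 2: summarization. Concatenate the retrieved set and summarize it.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
summary = summarizer(" ".join(retrieved), max_length=130, min_length=30, do_sample=False)
print(summary[0]["summary_text"])
```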
Related papers
- Unified Multi-Modal Interleaved Document Representation for Information Retrieval [57.65409208879344]
We produce more comprehensive and nuanced document representations by holistically embedding documents interleaved with different modalities.
Specifically, we achieve this by leveraging the capability of recent vision-language models that enable the processing and integration of text, images, and tables into a unified format and representation.
arXiv Detail & Related papers (2024-10-03T17:49:09Z)
- ODSum: New Benchmarks for Open Domain Multi-Document Summarization [30.875191848268347]
Open-domain Multi-Document Summarization (ODMDS) is a critical tool for condensing vast arrays of documents into coherent, concise summaries.
We propose a rule-based method to process query-based document summarization datasets into ODMDS datasets.
arXiv Detail & Related papers (2023-09-16T11:27:34Z)
- Questions Are All You Need to Train a Dense Passage Retriever [123.13872383489172]
ART is a new corpus-level autoencoding approach for training dense retrieval models that does not require any labeled training data.
It uses a new document-retrieval autoencoding scheme, where (1) an input question is used to retrieve a set of evidence documents, and (2) the documents are then used to compute the probability of reconstructing the original question (see the sketch after this list).
arXiv Detail & Related papers (2022-06-21T18:16:31Z)
- Unified Pretraining Framework for Document Understanding [52.224359498792836]
We present UDoc, a new unified pretraining framework for document understanding.
UDoc is designed to support most document understanding tasks, extending the Transformer to take multimodal embeddings as input.
An important feature of UDoc is that it learns a generic representation by making use of three self-supervised losses.
arXiv Detail & Related papers (2022-04-22T21:47:04Z)
- GERE: Generative Evidence Retrieval for Fact Verification [57.78768817972026]
We propose GERE, the first system that retrieves evidence in a generative fashion.
The experimental results on the FEVER dataset show that GERE achieves significant improvements over the state-of-the-art baselines.
arXiv Detail & Related papers (2022-04-12T03:49:35Z)
- Augmenting Document Representations for Dense Retrieval with Interpolation and Perturbation [49.940525611640346]
The Document Augmentation for dense Retrieval (DAR) framework augments document representations with their interpolations and perturbations.
We validate the performance of DAR on retrieval tasks with two benchmark datasets, showing that the proposed DAR significantly outperforms relevant baselines on the dense retrieval of both the labeled and unlabeled documents.
arXiv Detail & Related papers (2022-03-15T09:07:38Z)
- WSL-DS: Weakly Supervised Learning with Distant Supervision for Query Focused Multi-Document Abstractive Summarization [16.048329028104643]
In the Query Focused Multi-Document Summarization (QF-MDS) task, a set of documents and a query are given, and the goal is to generate a summary from these documents.
One major challenge for this task is the lack of availability of labeled training datasets.
We propose a novel weakly supervised learning approach that leverages distant supervision.
arXiv Detail & Related papers (2020-11-03T02:02:55Z)
- AQuaMuSe: Automatically Generating Datasets for Query-Based Multi-Document Summarization [17.098075160558576]
We propose a scalable approach called AQuaMuSe to automatically mine qMDS examples from question answering datasets and large document corpora.
We publicly release a specific instance of an AQuaMuSe dataset with 5,519 query-based summaries, each associated with an average of 6 input documents selected from an index of 355M documents from Common Crawl.
arXiv Detail & Related papers (2020-10-23T22:38:18Z)
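The question-reconstruction idea in ART admits a small illustration. Below is a hedged sketch (assumptions: t5-small as the scoring model, toy candidate documents) that ranks documents by the log-likelihood a seq2seq model assigns to regenerating the question from each of them; it is not ART's actual retriever or training procedure, which uses this signal to train a dense retriever without labeled data.

```python
# Hedged sketch of ART-style scoring: rank candidate documents by how well a
# generative model reconstructs the original question from each document.
# t5-small and the toy documents are illustrative assumptions, not ART's setup.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
model.eval()

def question_log_likelihood(document: str, question: str) -> float:
    """Total log-likelihood of the question given the document under the model."""
    inputs = tokenizer(document, return_tensors="pt", truncation=True)
    labels = tokenizer(question, return_tensors="pt", truncation=True).input_ids
    with torch.no_grad():
        # The seq2seq loss is the mean token-level negative log-likelihood.
        mean_nll = model(**inputs, labels=labels).loss.item()
    return -mean_nll * labels.shape[1]

question = "what setting does open-domain multi-document summarization describe?"
candidates = [
    "Open-domain MDS retrieves topic-related documents given a question or topic statement.",
    "Dense passage retrievers encode questions and passages into a shared vector space.",
]
# Documents from which the question is easier to reconstruct rank higher.
ranked = sorted(candidates, key=lambda d: question_log_likelihood(d, question), reverse=True)
print(ranked[0])
```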